CN116723077A - Distributed IT automatic operation and maintenance system - Google Patents

Info

Publication number
CN116723077A
Authority
CN
China
Prior art keywords
module
service
instance
automatic
automation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211646280.4A
Other languages
Chinese (zh)
Inventor
方宇炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DIGITAL CHINA ADVANCED SYSTEMS SERVICES CO LTD
Original Assignee
DIGITAL CHINA ADVANCED SYSTEMS SERVICES CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DIGITAL CHINA ADVANCED SYSTEMS SERVICES CO LTD
Priority to CN202211646280.4A
Publication of CN116723077A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • H04L41/046 Network management architectures or arrangements comprising network management agents or mobile agents therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • H04L41/042 Network management architectures or arrangements comprising distributed management centres cooperatively managing the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0876 Aspects of the degree of configuration automation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0893 Assignment of logical groups to network elements
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention provides a distributed IT automatic operation and maintenance system comprising an agent module, an operation service module, a service orchestration module, a timing scheduling module, a scene module, an API gateway, a service registration configuration module, an operation and maintenance management module, a cache module, and a database module. It is a distributed architecture suited to large-scale network environments and able to meet the IT automation requirements of various service scenarios. Each module can be scaled out horizontally and independently, which provides both performance scaling and high availability. The system addresses the present situation in which multiple IT automation systems and tools are built in a scattered way, unified operation management and self-service capability for IT infrastructure are lacking, and the agility that the digital era demands of IT architecture cannot be met.

Description

Distributed IT automatic operation and maintenance system
Technical Field
The invention relates to the field of large-scale network environments, in particular to a distributed IT (information technology) automatic operation and maintenance system.
Background
Digital transformation is profoundly changing how enterprises operate, including their organizational models, internal processes, and upstream and downstream cooperation, so that they can cope with internal and external environments that are increasingly uncertain, complex, and personalized. An agile approach to enterprise management requires an equally agile IT architecture to support it. Dual-mode IT architecture, distributed microservice application architecture, the DevOps management philosophy, and cloud native technology play an increasingly important role in building IT systems suited to agile management in the digital era. IT automation is a catalyst for the quality and efficiency of these enabling technologies and an engine that drives digital technology to create value.
At present, enterprises build multiple IT automation systems and tools in a scattered way. They lack unified operation management of IT infrastructure, cannot provide ITSM, DevOps, and AI/MLOps systems with secure, reliable, and flexible enterprise-level self-service capabilities for operating IT infrastructure, and cannot meet the digital era's demand for an agile IT architecture.
Disclosure of Invention
The present invention has been made in view of the above problems, and its object is to provide a distributed IT automatic operation and maintenance system that overcomes or at least partially solves them.
According to one aspect of the present invention, there is provided a distributed IT automatic operation and maintenance system. The system is organized into an operation layer, a control layer, a coordination layer, an application layer, and a capability opening layer, and specifically comprises:
the agent module, an operation layer component that implements the specific automated operation functions;
the operation service module, a control layer component that implements agent control and group management;
the service orchestration module, a coordination layer component that orchestrates automated operation flows, receives instructions from the scene modules, and executes the flows;
the timing scheduling module, a coordination layer component that triggers the execution of all timed tasks in the whole system;
the scene module, an application layer component that implements a specific application scenario function;
the API gateway, a capability opening layer component that exposes automation service capabilities;
the service registration configuration module, a global management component that provides service registration and centralized configuration management for every module instance except the agent;
the operation and maintenance management module, a global management component that monitors the health status of all module instances;
the data storage module, which stores data and comprises a cache module and a database module.
The distributed IT automatic operation and maintenance system provided by the invention has the following features:
(1) The agent module establishes a long-lived connection to the operation service module as a socket client. Two or more operation service module addresses are configured to provide primary/backup high availability, and the agent automatically switches to a backup operation service module when communication with the current one becomes abnormal.
(2) For IT resource objects operated through a remote protocol, two or more agent modules can be configured to operate the same objects, which ensures high reliability of agent operations. The operation service module receives an operation instruction for the target IT resource object from the service orchestration module or a scene module and selects an available agent to execute it.
(3) For operations on the host server where an agent resides, several servers that require high availability or load sharing can be placed in one group. The service orchestration module or scene module sends the target device group to the operation service module, which allocates tasks among the servers of that group according to a policy, so that operations are highly available.
(4) The three points above ensure high availability of the communication link from the operation service module through the agent to the IT resource.
(5) Multiple operation service modules manage the agent modules by domain, which enlarges the scale of automated operation. Each operation service module maintains the IT resource objects and agent modules within its management domain, and maintains the communication relationships among IT resource objects, agent modules, and operation service modules in the cache module. When an agent starts, it registers itself with the operation service module and reports the online state of the IT resources it operates. When communication between the agent and its currently connected operation service module becomes abnormal, the agent automatically switches to a backup operation service module, and the connection relationship between the agent and the operation service module is automatically updated in the cache module.
(6) The operation and maintenance management module periodically checks the online state of each operation service module through a heartbeat mechanism. When an operation service module goes offline, it and the communication relationships of all agents and IT resources under it are deleted from the cache module.
(7) Multiple service orchestration modules execute automated flows in parallel. Each service orchestration module periodically reports its task load to the service registration configuration module; before calling a service orchestration module to execute an automated flow, the scene module, the API gateway module, and the timing scheduling module request the instance with the lowest load. When executing each automation task of a flow, the service orchestration module finds an operation service module according to the target IT resource and issues the execution instruction to it. The service orchestration module writes the flow instance information, execution state, and result information to the database and also caches them in the cache module. Normally the operation service module returns execution information to the service orchestration module that sent the task. If that service orchestration module has failed, the operation service module obtains a backup service orchestration module through the service registration configuration module and returns the execution information to it. The service orchestration module that takes over obtains the flow instance information from the cache module and continues driving the flow instance.
(8) The timing scheduling module is highly available. A global timed-task scheduling mechanism is adopted: each timed task is started by the timing scheduling module according to the configured scheduling policy, and the actual task execution is carried out by the operation service module, the service orchestration module, and the scene modules. Structurally, one master with one or more standbys ensures high availability, and the invention provides a simplified master selection algorithm that works with the service registration configuration module to select the master among multiple timing scheduling modules in real time.
(9) The service registration configuration module provides service registration and centralized configuration management for all modules of the distributed IT automatic operation and maintenance system except the agent module, and is deployed as a cluster of multiple instances.
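The master selection described in item (8) can be illustrated with a minimal sketch. The algorithm actually used by the invention is the one shown in FIG. 5 and is not reproduced here; the rule below (the online instance with the smallest instance ID becomes master) and the names `Registry` and `elect_master` are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical stand-in for the service registration configuration module."""
    # instance_id -> online flag for timing scheduling module instances
    schedulers: dict = field(default_factory=dict)

    def register(self, instance_id: str) -> None:
        self.schedulers[instance_id] = True

    def mark_offline(self, instance_id: str) -> None:
        self.schedulers[instance_id] = False

    def online(self) -> list:
        return sorted(i for i, up in self.schedulers.items() if up)


def elect_master(registry: Registry):
    """Assumed rule: the online instance with the smallest ID is the master."""
    candidates = registry.online()
    return candidates[0] if candidates else None


registry = Registry()
for instance in ("sched-b", "sched-a", "sched-c"):
    registry.register(instance)

first_master = elect_master(registry)      # smallest online ID
registry.mark_offline(first_master)        # simulate a master failure
second_master = elect_master(registry)     # a standby takes over
```

Because every instance reads the same registry state, each one can evaluate the rule locally and reach the same answer without a separate voting round, which is one way a "simplified" selection could work.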
The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments are set out below.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a system structure according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a distributed operation provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the addressing information carried by data passed between layers according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a distributed orchestration service provided by an embodiment of the present invention;
FIG. 5 is a flowchart of the master selection algorithm provided in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terms "comprising" and "having", and any variations thereof, in the description, claims, and drawings of the embodiments of the invention are intended to cover a non-exclusive inclusion, for example of a series of steps or elements.
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings and the examples.
1. Fig. 1 is a schematic diagram of a system structure according to an embodiment of the invention.
The agent module 1 is deployed on a physical server or a virtual machine to realize a specific automatic operation function.
The agent module 1 receives target IT resource information (address, account, password, etc.), parameters, and commands or scripts, and executes shell commands or scripts or invokes APIs to operate the hardware environment, OS, databases, middleware, and applications of its host server. It can also operate remote servers, network devices, storage devices, cloud environments, databases, middleware, and application systems through access protocols (e.g., ssh, http, jdbc). It returns operation process data and result data in an agreed unified format. When communication with the operation service module is abnormal, the process data and result data are cached in a local file and reported again after communication recovers. The agent also periodically checks and reports the online state of the remote devices it is responsible for.
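The local-file caching of results during a communication failure can be sketched as follows. This is an illustrative sketch only: the class name `ResultSpool`, the JSON-lines file format, and the use of `ConnectionError` to signal a broken link are assumptions, not details given in the description.

```python
import json
import os
import tempfile


class ResultSpool:
    """Buffers result records in a local file while the upstream link is down."""

    def __init__(self, path: str):
        self.path = path

    def report(self, record: dict, send) -> bool:
        """Try to send; on failure append the record to the local spool file."""
        try:
            send(record)
            return True
        except ConnectionError:
            with open(self.path, "a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")
            return False

    def flush(self, send) -> int:
        """Re-report spooled records after recovery; returns how many were sent."""
        if not os.path.exists(self.path):
            return 0
        with open(self.path, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        for record in records:
            send(record)          # raises if the link is still down
        os.remove(self.path)      # only cleared once every record was sent
        return len(records)


def broken_link(record):
    raise ConnectionError("operation service module unreachable")


delivered = []
spool = ResultSpool(os.path.join(tempfile.mkdtemp(), "agent_results.jsonl"))
sent_now = spool.report({"task": "J1", "status": "ok"}, broken_link)
resent = spool.flush(delivered.append)
```

Since the spool file is removed only after every record has been sent, a failure part-way through `flush` can cause some records to be sent twice on the next attempt; a receiver would need to tolerate duplicates, for example by keying on the task instance ID.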
The operation service module 2 is deployed on a physical server or a virtual machine and implements agent control and group management. One operation service module manages N agents, and deploying multiple operation service modules provides horizontal performance scaling and high availability. Deploying an operation service module on a server that spans network domains enables communication with IT resources in multiple network domains.
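Group management can be illustrated with a small sketch of task allocation inside a device group. The description says only that tasks are allocated among the servers of a group "according to a policy"; round-robin is assumed here as one possible policy, and `DeviceGroup` and the server names are hypothetical.

```python
import itertools


class DeviceGroup:
    def __init__(self, name: str, servers):
        self.name = name
        self.servers = list(servers)
        self.online = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def set_online(self, server: str, is_online: bool) -> None:
        if is_online:
            self.online.add(server)
        else:
            self.online.discard(server)

    def next_server(self):
        """Round-robin over group members, skipping servers that are offline."""
        for _ in range(len(self.servers)):
            candidate = next(self._cycle)
            if candidate in self.online:
                return candidate
        return None               # no member of the group is online


group = DeviceGroup("web-tier", ["web-1", "web-2", "web-3"])
group.set_online("web-2", False)
assigned = [group.next_server() for _ in range(4)]
```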
The operation service module mainly realizes the following functions:
(1) Remote management of agent modules. Agent modules can be remotely deployed, started, stopped, and restarted; their parameters can be set centrally; and they can be upgraded online.
(2) Data communication. The module provides a unified API for agent operations and acts as the communication bridge between the agent modules and the service orchestration module and the various scene modules, hiding operation details from them.
(3) Flow control. Flow control is applied according to job priority and how busy the target IT resource is.
(4) IT resource management. Managed devices are administered in a unified way, which includes synchronizing managed devices and their operation accounts from a configuration management database (CMDB) and setting the operation relationships between agents and managed devices.
(5) Centralized script management. Scripts are centrally managed, edited online, security-checked, and tested. A script can be encapsulated as an atomic task with input and output parameters and a specific function, which makes it convenient for users.
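The flow control function in item (3) can be sketched as a priority queue combined with a per-target concurrency limit. This is a minimal sketch under stated assumptions: the description does not say how priority or busyness is measured, so the numeric priority (lower means more urgent), the `max_per_target` limit, and the `FlowController` class are all hypothetical.

```python
import heapq
import itertools


class FlowController:
    def __init__(self, max_per_target: int = 1):
        self.max_per_target = max_per_target
        self._queue = []                    # entries: (priority, seq, job_id, target)
        self._seq = itertools.count()       # tie-breaker keeps FIFO order
        self._running = {}                  # target -> number of jobs in progress

    def submit(self, job_id: str, target: str, priority: int) -> None:
        """Lower number means higher priority (an assumption of this sketch)."""
        heapq.heappush(self._queue, (priority, next(self._seq), job_id, target))

    def next_job(self):
        """Pop the highest-priority job whose target is not saturated."""
        skipped, chosen = [], None
        while self._queue:
            entry = heapq.heappop(self._queue)
            target = entry[3]
            if self._running.get(target, 0) < self.max_per_target:
                self._running[target] = self._running.get(target, 0) + 1
                chosen = entry[2]
                break
            skipped.append(entry)
        for entry in skipped:               # put busy-target jobs back
            heapq.heappush(self._queue, entry)
        return chosen

    def finish(self, target: str) -> None:
        self._running[target] -= 1


fc = FlowController(max_per_target=1)
fc.submit("backup-db", target="device-111", priority=5)
fc.submit("patch-os", target="device-111", priority=1)
fc.submit("check-disk", target="device-12n", priority=3)

order = [fc.next_job(), fc.next_job(), fc.next_job()]
fc.finish("device-111")
order.append(fc.next_job())
```

In this run `backup-db` is held back while `device-111` is busy with `patch-os` and is released only after `finish` is called for that device.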
The service orchestration module 3 is deployed on a physical server or a virtual machine and implements the orchestration and execution of automated job flows. It uses a cluster architecture in which N instances can be deployed, providing performance scaling and high availability for flow execution.
The service orchestration module mainly implements the following functions:
(1) Graphical flow orchestration. A graphical tool is provided for orchestrating automated operation flows. Serial, parallel, conditional branch, concurrent, aggregation, and loop flow semantics are supported, as is sub-flow nesting. Manual jobs, timed jobs, and automated jobs are supported. Flow variable values can be passed between jobs and between parent and child flows throughout the flow. For each job, automatic handling actions linked to the job state can be defined, such as notification or redo when the job fails, and flow control settings such as priority and concurrency can be set.
(2) Flow execution. The module receives instructions from the scene modules, executes the automated operation flow, and drives manual, timed, and automated jobs according to the flow semantics. A manual job lets an operator enter information through the interface. A timed job implements a delay or a scheduled time between the preceding and following jobs. An automated job calls the operation service module to send parameters, commands, or scripts to the target IT resource and receives the execution progress and result information of the automation task.
(3) Flow monitoring. The execution information and results of the flow and of each automated job are monitored, and notification or control actions are performed automatically according to the result status. When a job fails, an operator can intervene manually, for example to pause, stop, redo, or skip it.
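The serial and parallel flow semantics and the passing of flow variables between jobs can be sketched with a small recursive executor. The tuple-based flow representation, the `run_flow` function, and the job names are assumptions for illustration; branching, loops, manual jobs, and timed jobs are left out to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor


def run_flow(node, variables: dict) -> dict:
    """Execute a flow node and return the updated flow variables."""
    kind = node[0]
    if kind == "job":
        _, name, action = node
        result = dict(variables)
        result.update(action(variables))     # a job returns new variable values
        return result
    if kind == "serial":
        for child in node[1]:
            variables = run_flow(child, variables)
        return variables
    if kind == "parallel":
        with ThreadPoolExecutor() as pool:
            branches = list(pool.map(lambda c: run_flow(c, variables), node[1]))
        merged = dict(variables)
        for branch in branches:              # aggregation: merge in declared order
            merged.update(branch)
        return merged
    raise ValueError(f"unknown node kind: {kind}")


flow = ("serial", [
    ("job", "prepare", lambda v: {"hosts": ["a", "b"]}),
    ("parallel", [
        ("job", "check_a", lambda v: {"a_ok": "a" in v["hosts"]}),
        ("job", "check_b", lambda v: {"b_ok": "b" in v["hosts"]}),
    ]),
    ("job", "summary", lambda v: {"all_ok": v["a_ok"] and v["b_ok"]}),
])

final_vars = run_flow(flow, {})
```

Each job receives the current flow variables and returns only the values it adds or changes, so parallel branches never write to shared state; their outputs are merged when the branches join.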
The timing scheduling module 4 is deployed on a physical server or a virtual machine and runs in master-slave mode; it triggers the execution of all timed tasks across the whole system.
The scene module 5 implements specific application scenario functions, for example a job scheduling module, an inspection module, a compliance checking module, an application deployment module, a disaster recovery switchover module, a configuration management module, an environment preparation module, and a fault self-healing module. A scene module completes its automation tasks by calling the automated flow service provided by the service orchestration module and the automated operation service provided by the operation service module.
The scene modules are deployed on physical servers or virtual machines, and each scene module can be deployed as multiple instances to form a cluster.
The API gateway module 6 exposes automation service capabilities externally through RESTful interfaces, including:
(1) An automated flow information query interface;
(2) An automated flow execution interface;
(3) An automated process execution result query interface;
(4) A job information query interface;
(5) A job information execution interface;
(6) A job information execution result query interface;
(7) Agent information query interface.
The API gateway module is deployed on a physical server or a virtual machine and can be deployed as multiple instances to form a cluster.
The service registration configuration module 7 provides service registration and centralized configuration management for every module instance except the agent module. It is deployed on a physical server or a virtual machine and can be deployed as multiple instances to form a cluster.
The operation and maintenance management module 8 monitors the health status of all module instances. It is deployed on a physical server or a virtual machine and uses a one-master, multi-slave mode for high availability.
The cache module 9 provides global in-memory storage for the state data of automated flows and jobs, the connection relationships between agents and operation service modules, the timed scheduling plans and scheduling states, and the running state of each module instance, and gives multiple instances of the same module shared access to this global data.
The cache module provides a publish-subscribe mechanism through which clients can exchange messages asynchronously.
The cache module uses a cluster architecture with multiple instances and is deployed on physical servers or virtual machines.
The database module 10 uses an SQL database to persist the configuration parameters, process data, and result data of automated flows and jobs, as well as information such as timed task execution status. It uses a cluster architecture and is deployed on physical servers or virtual machines.
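The publish-subscribe mechanism of the cache module can be illustrated with an in-memory sketch. The description does not name the cache product or its API, so the `PubSub` class and the channel names are hypothetical; the sketch also delivers messages synchronously for simplicity, whereas a real cache service would deliver them asynchronously across processes.

```python
from collections import defaultdict


class PubSub:
    def __init__(self):
        self._subscribers = defaultdict(list)   # channel -> list of callbacks

    def subscribe(self, channel: str, callback) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, message: dict) -> int:
        """Deliver to every subscriber of the channel; returns the receiver count."""
        for callback in self._subscribers[channel]:
            callback(message)
        return len(self._subscribers[channel])


bus = PubSub()
received = []
bus.subscribe("task-status", received.append)
count = bus.publish("task-status", {"task": "J1", "state": "done"})
unheard = bus.publish("agent-events", {"agent": "211", "state": "offline"})
```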
2. Fig. 2 is a schematic diagram of a distributed operation according to an embodiment of the present invention.
2.1 Agent module registration
(1) Connection 1 in FIG. 2: when an agent module (211..2nn in FIG. 2) starts, it establishes a TCP connection to an operation service module (31..3n in FIG. 2) as a socket client. Each agent module has a unique ID and is configured with two operation service module addresses (primary and backup).
(2) The operation service module provides a graphical management interface for configuring which agent module operates each target device (devices 111..1nn in FIG. 2), together with the operation protocol, account, and password information; alternatively, target device and account information can be synchronized automatically from a third-party configuration management database (CMDB).
As shown in FIG. 2, the agent module 211 can operate the devices 111 to 11n, and the agent module 2n1 operates the devices 1n1 to 1nn. For high availability, the same device can be assigned two or more agent modules. For example, the device 11n in the figure can be operated by both the agent module 211 and the agent module 221, and the system sends an operation command to one of the online agent modules according to the configured order.
(3) The operation service module stores its connection relationships with agent modules, and the devices each agent module is responsible for operating, in the cache module, forming a global tree structure.
(4) The information in the cache module is accessible to all service orchestration modules and operation service modules.
(5) The agent module periodically monitors the online state of its target devices and actively reports it to the operation service module, which deletes offline devices from the cache module.
(6) The operation service module periodically sends a heartbeat packet to the agent module to detect whether the agent is online, and the agent module uses the heartbeat packets to judge whether its connection to the operation service module is normal.
(7) After the operation service module judges that an agent module is offline, it deletes the agent module and all device information under it from the cache module.
(8) After judging that communication with the current operation service module is abnormal, the agent module actively switches to another operation service module. The new operation service module updates the relationship between operation service modules and the agent in the cache module: it deletes the agent module and the devices under it from the original operation service module and adds them under itself.
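The global tree kept in the cache module, and the way registration, offline detection, and switchover update it, can be sketched as follows. This is an in-memory illustration only; the `Topology` class, its method names, and the identifiers such as `svc-31` are assumptions rather than the data layout used by the invention.

```python
class Topology:
    """In-memory stand-in for the tree kept in the cache module (names assumed)."""

    def __init__(self):
        self.tree = {}   # operation service id -> {agent id -> set of device ids}

    def register_agent(self, service: str, agent: str, devices) -> None:
        # An agent belongs to one operation service at a time: remove the old link.
        for agents in self.tree.values():
            agents.pop(agent, None)
        self.tree.setdefault(service, {})[agent] = set(devices)

    def device_offline(self, agent: str, device: str) -> None:
        for agents in self.tree.values():
            if agent in agents:
                agents[agent].discard(device)

    def agent_offline(self, agent: str) -> None:
        for agents in self.tree.values():
            agents.pop(agent, None)

    def service_offline(self, service: str) -> None:
        self.tree.pop(service, None)

    def paths(self, device: str) -> list:
        """All (operation service, agent) pairs that can currently reach a device."""
        return sorted(
            (service, agent)
            for service, agents in self.tree.items()
            for agent, devices in agents.items()
            if device in devices
        )


topo = Topology()
topo.register_agent("svc-31", "agent-211", ["dev-111", "dev-11n"])
topo.register_agent("svc-32", "agent-221", ["dev-11n", "dev-12n"])
before = topo.paths("dev-11n")

topo.register_agent("svc-32", "agent-211", ["dev-111", "dev-11n"])  # switchover
after = topo.paths("dev-111")
```

Re-registering an agent under a new operation service first removes it from the old one, which mirrors the delete-then-add update described in step (8).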
2.2 instruction execution
All automation scenarios can be converted into automation operation flow, and for convenience of description, the instruction execution flow of the present embodiment will be described below by taking "example automation operation flow" in fig. 2 as an example.
(1) The operation service module (31 to 3n in fig. 2) and the orchestration service module (41 to 4n in fig. 2) register service information (service name, service instance IP, service instance port, service instance state) with the service registration configuration module at startup.
(2) The operation and maintenance management module acquires information of all operation service modules and arrangement service modules from the service registration configuration module, the heartbeat interfaces of the operation service modules and the arrangement service modules are called at regular time to detect the online state of the operation service modules and the arrangement service modules, the operation and maintenance management module is updated to the service registration configuration module, and the service registration configuration module stores the online state information of the operation service modules and the arrangement service modules.
(3) Example automated workflow there are 3 automated task nodes connected in series: task J1 operates device 111 (via agent module 211), task J2 operates device 11n (both agent modules 211 and 221 can operate device 11n, both of which are mutually available) and task J3 operates device 12n (via agent module 221).
(4) The scenario module (any of 51 to 5n in fig. 2) performs an automation operation, calls the service registration configuration module to acquire the service orchestration module with the lightest current load, and sends a request to the service orchestration module 41 to perform an example automation operation flow, assuming the service orchestration module 41.
(5) Service orchestration module 41 instantiates an example automation operation flow (creates an execution instance of an automation flow in a database, loads an entire flow model and execution parameter information, etc.). The requestor context module records its service ID and service instance ID, as well as automation operational flow instance information (e.g., flow instance ID, instance state, etc.) in a cache module and database.
(6) The service orchestration module 41 executes J1, queries the operation path of the target device 111 from the cache module as the operation service module 31- > the proxy module 211- > the device 111, and the service orchestration module 41 sends the flow instance ID, the J1 task instance ID, the J1 execution command and parameter information, the target device 111 address information, the target proxy module 221 address information, and the service instance ID number registered by the service orchestration module 41 in the service registration configuration module 6 to the operation service 31. The service orchestration module 41 saves the task instance ID, the execution state of task J1 into the caching module 6 and the database.
(7) The operation service module 31 performs flow control according to the policy configured by the system, and sends the J1 execution command to the proxy module 221 after the flow control condition is satisfied.
(8) The agent module 221 performs an automation operation on the device 111, and returns an execution result.
(9) The normal case agent module 221 returns the execution result to the operation service module 31, which returns to the service orchestration module 41.
(10) The service orchestration module 41 saves the execution state and result information of the J1 task into the caching module 6 and database system.
(11) Service orchestration module 41 performs the J2 task. Querying the cache module (8 in fig. 2) yields two operation paths for the target device 11n: operation service module 31 -> proxy module 211 -> device 11n, and operation service module 32 -> proxy module 221 -> device 11n. Service orchestration module 41 may select either one for execution. The remaining steps are the same as steps (6) to (10).
(12) Service orchestration module 41 performs the J3 task in the same way as steps (6) to (10).
(13) The service registration configuration module and the operation and maintenance management module in fig. 2 are both deployed as clusters, which avoids single points of failure.
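The path lookup in steps (6) and (11) can be illustrated with a minimal Python sketch. It assumes an in-memory dictionary in place of the cache module; the names OperationPath, PathCache and select_path are hypothetical and do not appear in the patent. The random choice reflects only the statement that the service orchestration module may select either path.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationPath:
    """One operation path: operation service module -> agent -> device."""
    operation_service: str
    agent: str
    device: str


class PathCache:
    """Hypothetical in-memory stand-in for the global operation paths in the cache module."""

    def __init__(self):
        self._paths = {}

    def register(self, path):
        # An operation service module would publish each path it maintains.
        self._paths.setdefault(path.device, []).append(path)

    def paths_for(self, device):
        return list(self._paths.get(device, []))


def select_path(cache, device, rng=random):
    """Pick any one of the registered operation paths for the target device."""
    paths = cache.paths_for(device)
    if not paths:
        raise LookupError(f"no operation path registered for {device}")
    return rng.choice(paths)


cache = PathCache()
cache.register(OperationPath("operation-service-31", "agent-211", "device-11n"))
cache.register(OperationPath("operation-service-32", "agent-221", "device-11n"))
chosen = select_path(cache, "device-11n")
print(chosen.operation_service, "->", chosen.agent, "->", chosen.device)
```

A device with several registered paths stays reachable as long as at least one path remains in the cache, which is the property the two-path example for device 11n relies on.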
2.3 Operation service module failure during instruction execution
If the operation service module fails in steps (6) and (9) of 2.2 above, the embodiment of the present invention handles the failure as follows:
(1) In step (6) of 2.2, the service orchestration module 41 sends a task execution command to the operation service module 31 and finds that the operation service module 31 has failed. The proxy modules connected to the operation service module 31 need a period of time (e.g., 3 heartbeat intervals) to determine that it has failed and to switch to the backup operation service module, so the operation path that the service orchestration module 41 obtains from the cache module 8 is still operation service module 31 -> proxy module 211 -> device 111. The service orchestration module 41 detects a communication anomaly when it sends the command to the operation service module 31 and returns the anomaly information to the scene module, which passes it back to the user. The user then intervenes manually through the graphical management interface, for example by re-running the J1 task after waiting for a period of time. By then, the proxy modules originally connected to the operation service module 31 will have reconnected to the backup operation service module, and a new operation path will have been built in the cache module 8.
(2) In step (8) of 2.2, the agent module 221 returns the execution result of the automation task to the operation service module 31 and finds that the operation service module 31 has failed. The agent module 221 caches the task execution result information in a local file, reconnects to a backup operation service module, such as the operation service module 32, and returns the task execution result to the operation service module 32.
(3) The operation service module 32 returns the result to the service orchestration module 41 according to the service instance ID of the service orchestration module carried in the result response message returned by the proxy module.
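The agent-side behaviour in (2) can be sketched as follows. This is only an illustration under stated assumptions: the Agent class, the spool file layout (one JSON object per line) and the send callback are hypothetical, and the real transport is the TCP connection between agent and operation service module, which is not modelled here.

```python
import json
import os


class Agent:
    """Hypothetical agent that spools results locally while its operation service is down."""

    def __init__(self, services, spool_path):
        self.services = services      # primary first, then backups (assumed ordering)
        self.spool_path = spool_path  # local file used as the result cache

    def report(self, result, send):
        """Send a result; on failure, cache it locally and resend through a backup."""
        primary, *backups = self.services
        try:
            send(primary, result)
            return primary
        except ConnectionError:
            # Primary operation service is unreachable: keep the result on disk.
            with open(self.spool_path, "a", encoding="utf-8") as spool:
                spool.write(json.dumps(result) + "\n")
        for backup in backups:
            try:
                self._flush(backup, send)
                return backup
            except ConnectionError:
                continue  # try the next backup; the spool file is kept
        return None

    def _flush(self, service, send):
        """Resend every spooled result, then delete the spool file."""
        with open(self.spool_path, encoding="utf-8") as spool:
            pending = [json.loads(line) for line in spool if line.strip()]
        for item in pending:
            send(service, item)
        os.remove(self.spool_path)
```

The result dictionary is expected to carry the service orchestration instance ID so that the backup operation service module can route it as described in (3). Because the spool file is removed only after every item has been sent, a failure in the middle of a flush leads to results being sent again later, so the receiver should tolerate duplicates.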
2.4 Service orchestration module failure during instruction execution
In step (9) of 2.2 above, the service orchestration module may have failed by the time the operation service module returns the task result information to it; the processing logic for this case is described in detail at point 5 below.
3. As shown in FIG. 3, addressing information carried when data is transferred between layers according to the embodiment of the present invention
Each scene module has a scene service ID and a scene service instance ID, which identify the type of service provided and the specific service instance, respectively.
The scene module executes an operation and maintenance scene function, invokes the service orchestration module, and instantiates the corresponding automation flow instance. For each flow instance, the upper-level flow instance ID and the service orchestration instance ID of the currently executing flow instance are identified. The correspondence between this information and the scene service ID and scene service instance ID is established and stored in the cache module.
When the service orchestration module executes a task in the automation flow, a task instance ID is generated, and the task instance ID, the flow instance ID, the service orchestration instance ID, the scene service instance ID and the scene service ID are packaged and sent to the operation service module. The service orchestration module saves this information into a cache module so that other service orchestration modules can access it.
The operation service module forwards the task instance ID, the flow instance ID, the service orchestration instance ID, the scene service ID, the task related parameters and the target equipment information to the proxy module.
The agent returns the task execution information together with the task instance ID, flow instance ID, service orchestration instance ID and scene service ID that were sent with the task request.
The operation service module returns the task execution information to the corresponding service orchestration module according to the service orchestration instance ID in the message returned by the agent. If the original service orchestration instance has failed at that time, the operation service module applies to the service registration configuration module for the backup instance of that service orchestration instance and returns the task execution information to the backup.
When execution of the flow instance is completed, the service orchestration module returns the result according to the scene module instance ID in the return message. If the original scene module instance has failed at that time, the service orchestration module applies to the service registration configuration module for the backup instance of that scene module instance and returns the execution information to the backup.
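The addressing information described in this point can be pictured as a small record that travels with each task, plus a routing step on the way back. The sketch below is an assumption-laden illustration: TaskAddress and route_result are hypothetical names, and the backup_of dictionary stands in for asking the service registration configuration module for a backup instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskAddress:
    """IDs carried with a task so that its result can find the way back."""
    scene_service_id: str
    scene_service_instance_id: str
    orchestration_instance_id: str
    flow_instance_id: str
    task_instance_id: str


def route_result(address, alive_instances, backup_of):
    """Return the service orchestration instance that should receive the result.

    alive_instances: set of instance IDs that are currently reachable.
    backup_of: mapping instance ID -> backup instance ID (hypothetical stand-in
    for a query to the service registration configuration module).
    """
    target = address.orchestration_instance_id
    visited = set()
    while target not in alive_instances:
        visited.add(target)
        target = backup_of.get(target)
        if target is None or target in visited:
            raise RuntimeError("no live service orchestration instance")
    return target


addr = TaskAddress("scene-svc", "scene-51", "orch-41", "flow-7", "J1-1")
print(route_result(addr, {"orch-4n"}, {"orch-41": "orch-4n"}))  # orch-4n
```

Following the backup chain beyond one hop is a design choice of this sketch; the text above only states that the backup instance of the failed instance is requested.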
4. As shown in FIG. 4, a distributed orchestration service is provided according to an embodiment of the present invention
(1) The scene module 51 performs a scene function and obtains a service orchestration instance 401 from the service registration configuration module, assumed here to be the service orchestration module 41. The scene module 51 sends an automation flow execution request 402 to the service orchestration module 41.
(2) Service orchestration module 41 creates an automation flow instance, saving instance information to caching module 403 and database module 404.
(3) The service orchestration module 41 instantiates task information in the automation flow, saves the task information to the caching module 403 and the database module 404, and issues the task information to the operation service module 31 on the target device path.
(4) The operation service module 31 receives the task execution result information from the proxy module and, in the normal case, returns it to the service orchestration module that issued the request (identification 406, i.e., service orchestration module 41 in fig. 4; for the specific addressing procedure, see point 3).
(5) If the operation service module 31 finds that the originally requesting service orchestration module 41 is abnormal, it obtains a backup of the service orchestration module 41 from the service registration configuration module (the specific selection algorithm is described at point 5 below), assumed here to be the service orchestration module 4n.
(6) The operation service module 31 returns the task execution result information 408 to the service orchestration module 4n. After receiving the task instance result information, the service orchestration module 4n updates the state information of the automation flow instance in the caching module 409 and the database module 410, and drives the next flow node to execute according to the flow semantics.
(7) If the current automation task is the last node of the process, meaning that the automation process instance execution is complete, the service orchestration instance 4n returns automation process instance result information to the scene module 51.
5. Backup relationship among multiple instances of a service module
Steps (6) and (7) of point 4 involve the problem of selecting a backup instance when the original service instance is abnormal and the service has N service instances. When a service has N service instances and one of them goes down, the tasks originally allocated to that instance must be taken over by another service instance. To make the takeover orderly, the following backup relationship between service instances is agreed: the numeric value of a service instance's IP address plus its port number serves as the unique identifier of the service instance; all available service instances of the same service are arranged into a ring in ascending order of unique identifier; and by default the backup of each service instance is the instance that follows it on the ring. For example, if id1 < id2 < id3, then id2 is the backup of id1, id3 is the backup of id2, and id1 is the backup of id3.
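The ring rule can be sketched in a few lines of Python. One assumption is made explicit here: the text says the numeric IP value "plus" the port number forms the identifier, and this sketch orders instances by the pair (numeric IP, port) so that two different endpoints never collide; summing the two values would be another reading. The function names are hypothetical.

```python
import ipaddress


def instance_key(endpoint):
    """Ordering key for a service instance given as an (ip, port) tuple.

    Assumption: compare by numeric IP first, then by port.
    """
    ip, port = endpoint
    return (int(ipaddress.ip_address(ip)), port)


def backup_of(failed, available):
    """Return the next available instance after `failed` on the ascending ring."""
    candidates = sorted((e for e in available if e != failed), key=instance_key)
    if not candidates:
        return None  # no other instance can take over
    failed_key = instance_key(failed)
    for endpoint in candidates:
        if instance_key(endpoint) > failed_key:
            return endpoint
    return candidates[0]  # wrap around: the smallest backs up the largest


id1 = ("10.0.0.1", 8080)
id2 = ("10.0.0.2", 8080)
id3 = ("10.0.0.10", 8080)
ring = [id3, id1, id2]
print(backup_of(id1, ring))  # ('10.0.0.2', 8080)
print(backup_of(id3, ring))  # ('10.0.0.1', 8080)
```

Because every module can compute the same ordering from the registered instances alone, no backup pairs need to be configured in advance. Comparing numeric values rather than strings matters: as strings, "10.0.0.10" would sort before "10.0.0.2".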
6. FIG. 5 is a flow chart of the leader election algorithm provided by an embodiment of the present invention
For the timing scheduling module 4, only the master module instance may execute global scheduling of timing tasks at any given time; the other instances remain in a hot standby state. A leader must therefore be elected among the multiple instances.
Likewise, for the operation and maintenance management module 8, only the master module instance may monitor the health condition of each module at any given time; the other instances remain in a hot standby state. A leader must therefore be elected among the multiple instances.
Common leader election algorithms include Paxos, Raft and the like. These algorithms not only elect a leader among multiple instances but also synchronize state among them to ensure consistency. In the embodiment of the invention, state synchronization among the multiple instances of the timing scheduling module and of the operation and maintenance management module can be achieved through the cache module and the database system, so the simplified leader election algorithm of fig. 5 is adopted.
The leader election flow is triggered when a service module (specifically, the timing scheduling module or the operation and maintenance management module in the embodiment of the invention) initializes, or when the service instances of that service in the service registration configuration module change (a service instance goes online or offline).
Step 1. Read all instance data of the current service from the service registration configuration module.
If the role of at least one online service instance of the current service is "leader", then:
Step 2. Register itself with the service registration configuration module, setting role = "follower" and registerTimeStamp = "current time".
The flow ends.
If the current service has no online service instances, or has online service instances but none whose role is "leader", then:
Step 3. Register itself with the service registration configuration module, setting role = "leader" and registerTimeStamp = "current time".
Step 4. Read all instance data of the current service from the service registration configuration module.
If exactly one online service instance of the current service has the role "leader", the flow ends.
If several online service instances of the current service have the role "leader", then:
If its own registerTimeStamp is the smallest among all service instances whose role is "leader" and no other service instance has an equal registerTimeStamp, the flow ends.
If its own registerTimeStamp is not the smallest among all service instances whose role is "leader", it updates itself in the service registration configuration module, setting role = "follower" and registerTimeStamp = "current time", and the flow ends.
If its own registerTimeStamp is the smallest among all service instances whose role is "leader" but another service instance has an equal registerTimeStamp, then:
Wait for a random period of time (e.g., 100 milliseconds).
Update itself in the service registration configuration module, setting registerTimeStamp = "current time".
Jump to step 4.
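Steps 1 to 4 can be traced with the following Python sketch. It is an illustration only: the Registry class is a hypothetical in-memory stand-in for the service registration configuration module, and the clock and sleep parameters exist so the flow can be exercised deterministically. One assumption goes beyond the text: in step 1 the instance ignores its own earlier registration, so that a re-triggered election does not demote a sitting leader because of its own record.

```python
import random
import time


class Registry:
    """Hypothetical in-memory stand-in for the service registration configuration module."""

    def __init__(self):
        self.instances = {}

    def read_all(self):
        return {name: dict(data) for name, data in self.instances.items()}

    def register(self, name, role, timestamp):
        self.instances[name] = {"role": role, "registerTimeStamp": timestamp}


def elect(registry, me, clock=time.time, sleep=time.sleep, max_rounds=100):
    """Run the simplified election for instance `me`; return "leader" or "follower"."""
    # Step 1: read all instances (own record ignored, see the assumption above).
    others = registry.read_all()
    others.pop(me, None)
    if any(data["role"] == "leader" for data in others.values()):
        registry.register(me, "follower", clock())  # Step 2
        return "follower"
    registry.register(me, "leader", clock())  # Step 3
    for _ in range(max_rounds):
        # Step 4: re-read and resolve concurrent leader registrations.
        leaders = {
            name: data["registerTimeStamp"]
            for name, data in registry.read_all().items()
            if data["role"] == "leader"
        }
        if len(leaders) == 1:
            return "leader"
        mine = leaders[me]
        if mine > min(leaders.values()):
            registry.register(me, "follower", clock())
            return "follower"
        if sum(1 for ts in leaders.values() if ts == mine) == 1:
            return "leader"  # strictly the earliest registration wins
        # Tie on the smallest timestamp: back off randomly and re-register.
        sleep(random.uniform(0.0, 0.1))
        registry.register(me, "leader", clock())
    raise RuntimeError("election did not converge")


registry = Registry()
print(elect(registry, "scheduler-a"))  # leader
print(elect(registry, "scheduler-b"))  # follower
```

The max_rounds bound is an addition of the sketch; the flow in the text simply jumps back to step 4 until the role is determined.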
Beneficial effects: the system is highly reliable and horizontally scalable, provides a unified automation operation platform for IT systems in large-scale network environments, and meets the IT automation requirements of various business scenarios.
The foregoing detailed description of the invention has been presented for purposes of illustration and description, and it should be understood that the invention is not limited to the particular embodiments disclosed, but is intended to cover all modifications, equivalents, alternatives, and improvements within the spirit and principles of the invention.

Claims (4)

1. A distributed IT automation operation and maintenance system, comprising an operation layer, a control layer, a coordination layer, an application layer and a capability opening layer, and specifically comprising:
the agent module, which is an operation layer component and is used for realizing specific automation operation functions;
the operation service module, which is a control layer component and is used for realizing agent control and grouping management;
the service orchestration module, which is a coordination layer component and is used for orchestrating automation operation flows, receiving instructions from each scene module and executing the automation operation flows;
the timing scheduling module, which is a coordination layer component and is used for triggering the execution of all timing tasks of the whole system;
the scene module, which is an application layer component and realizes a specific application scene function;
the API gateway, which is a capability opening layer component and provides automation service capabilities;
the service registration configuration module, which is a global management component and is used for the registration of all module instances other than agents;
the operation and maintenance management module, which is a global management component and is used for monitoring the health states of all module instances;
the cache module, which is used for in-memory storage of state data related to automation flows and operations;
and the database module, which is used for storing the configuration parameters, process data and result data related to automation flows and operations.
2. The distributed IT automation operation and maintenance system according to claim 1, wherein the service orchestration module, which supports the various scene functions of the system, is distributed; a plurality of automation flow instances are executed in parallel in a plurality of service orchestration module instances, and the execution information of the automation flow instances is shared globally through the cache module;
the scene module can initiate an automation flow execution request to any one of the service orchestration modules;
during the execution of an automation flow instance, when the service orchestration module executing the flow instance goes down, a backup service orchestration module automatically takes over the execution of the automation flow instance;
the mutual failover mechanism among the plurality of service orchestration modules is orderly and is completed automatically without prior configuration.
3. The distributed IT automation operation and maintenance system according to claim 1, wherein the automation task operation subsystem of the system, composed of operation service modules and agent modules, is distributed;
the distributed IT automation operation and maintenance system consists of a plurality of automation task operation subsystems, so that the operation scale can be expanded horizontally;
each automation task operation subsystem comprises at least two operation service modules and a plurality of agent modules;
each IT resource is operated by one or more agent modules simultaneously;
the operation service module, the agent module and the IT resource form an operation path, which the operation service module is responsible for maintaining and caching in the cache module to construct the global operation path of the whole system;
the service orchestration module sends a task execution command to the target operation service module according to the global operation path;
high availability of agent operations is achieved by setting multiple operation paths for one IT resource;
the agent module, acting as a socket client, establishes a TCP connection with the operation service module, and the agent module and the operation service module detect each other's heartbeat;
when the operation service module is abnormal, the agent module actively switches to a backup operation service module, and the backup operation service module automatically updates the global operation tree in the cache module;
and the agent module first caches the task execution result information produced during the failover to a local file system, and then reports the task execution result information again after the switch is completed.
4. The distributed IT automation operation and maintenance system according to claim 1, wherein the timing scheduling module and the operation and maintenance management module of the system adopt a master-slave mode in which only the master module instance can execute at any given time, so that a leader must be elected among the plurality of instances;
state synchronization among the plurality of instances is realized through the cache module and the database module, and the distributed IT automation operation and maintenance system adopts a simplified leader election algorithm by means of the service registration configuration module;
when an instance joins or exits, the roles of all current instances are obtained from the service registration configuration module, the role of the current instance is then determined according to the leader and follower roles of the other instances, and a proposal is submitted to the service registration configuration module; when instances conflict, the proposal is submitted again after waiting for a random period of time, until the roles can be determined.
CN202211646280.4A 2022-12-21 2022-12-21 Distributed IT automatic operation and maintenance system Pending CN116723077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211646280.4A CN116723077A (en) 2022-12-21 2022-12-21 Distributed IT automatic operation and maintenance system

Publications (1)

Publication Number Publication Date
CN116723077A true CN116723077A (en) 2023-09-08

Family

ID=87863749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211646280.4A Pending CN116723077A (en) 2022-12-21 2022-12-21 Distributed IT automatic operation and maintenance system

Country Status (1)

Country Link
CN (1) CN116723077A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117579700A (en) * 2024-01-11 2024-02-20 中国人民解放军国防科技大学 General micro-service processing method, system and equipment based on message queue
CN117579700B (en) * 2024-01-11 2024-04-02 中国人民解放军国防科技大学 General micro-service processing method, system and equipment based on message queue

Similar Documents

Publication Publication Date Title
CN112000448B (en) Application management method based on micro-service architecture
US20210042266A1 (en) Geographically-distributed file system using coordinated namespace replication over a wide area network
CN106790595B (en) Docker container active load balancing device and method
Botelho et al. On the design of practical fault-tolerant SDN controllers
CN105607954B (en) A kind of method and apparatus that stateful container migrates online
JP4342441B2 (en) OPC server redirection manager
CN109936622B (en) Unmanned aerial vehicle cluster control method and system based on distributed resource sharing
CN108604202A (en) The working node of parallel processing system (PPS) is rebuild
JP2008210412A (en) Method of controlling remotely accessible resource in multi-node distributed data processing system
Botelho et al. Smartlight: A practical fault-tolerant SDN controller
CN110391940B (en) Service address response method, device, system, equipment and storage medium
WO2021043124A1 (en) Kbroker distributed operating system, storage medium, and electronic device
Dustdar et al. Dynamic replication and synchronization of web services for high availability in mobile ad-hoc networks
CN111935244B (en) Service request processing system and super-integration all-in-one machine
Spalla et al. AR2C2: Actively replicated controllers for SDN resilient control plane
CN113515316A (en) Novel edge cloud operating system
CN116841705A (en) Distributed scheduling monitoring system based on cloud protogenesis and deployment method thereof
CN114116912A (en) Method for realizing high availability of database based on Keepalived
CN116723077A (en) Distributed IT automatic operation and maintenance system
CN112199178A (en) Cloud service dynamic scheduling method and system based on lightweight container
WO2015196692A1 (en) Cloud computing system and processing method and apparatus for cloud computing system
CN114564340B (en) High availability method for distributed software of aerospace ground system
Kitamura et al. Development of a Server Management System Incorporating a Peer-to-Peer Method for Constructing a High-availability Server System
CN113329102B (en) Ambari Server system and network request response method
CN105468446A (en) Linux based method for realizing high availability of HPC job scheduling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination