CN113296840A - Cluster operation and maintenance method and device - Google Patents
Cluster operation and maintenance method and device
- Publication number
- CN113296840A (application number CN202010103813.9A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- preset
- data
- configuration
- maintenance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44505—Configuring for program initiating, e.g. using registry, configuration files
- G06F9/4451—User profiles; Roaming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
- G06F8/65—Updates
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a cluster operation and maintenance method and a device, wherein the method comprises the following steps: collecting preset operation and maintenance data of a cluster; analyzing preset operation and maintenance data of the cluster according to a preset analysis strategy, and determining a first configuration updating instruction of the cluster; and executing the first configuration updating instruction so as to update the configuration of the cluster.
Description
Technical Field
The invention relates to the field of cluster operation and maintenance, in particular to a cluster operation and maintenance method and device.
Background
With the continuous development of computer technology, deploying a single server can no longer meet the requirements of enterprises, and clusters have emerged in response. A cluster includes a plurality of servers that share a data storage space and communicate with each other via an internal local area network. As cluster sizes grow, the operation and maintenance of clusters face greater challenges.
During the operation and maintenance of a cluster, the configuration of the cluster often needs to be adjusted. In current schemes, however, the running status of the cluster is observed manually and the configuration of the cluster is then updated manually. Manually updating the configuration of the cluster is inefficient, but no method for automatically updating the configuration of the cluster exists at present; this is a problem to be solved urgently.
Disclosure of Invention
The embodiment of the application provides a cluster operation and maintenance method and device, and solves the problem that no method for automatically updating cluster configuration exists in the prior art.
In a first aspect, an embodiment of the present application provides a cluster operation and maintenance method, including: collecting preset operation and maintenance data of a cluster; analyzing preset operation and maintenance data of the cluster according to a preset analysis strategy, and determining a first configuration updating instruction of the cluster; and executing the first configuration updating instruction so as to update the configuration of the cluster.
In the method, after the preset operation and maintenance data of the cluster are obtained, the preset operation and maintenance data of the cluster can be automatically analyzed through a preset analysis strategy, so that a first configuration updating instruction of the cluster can be automatically determined; further, the first configuration update instruction may be executed, thereby providing a method of automatically updating the configuration of the cluster.
In an optional embodiment, the preset analysis strategy includes: state values corresponding to different value ranges of the preset operation and maintenance data of the cluster. Analyzing the preset operation and maintenance data of the cluster according to the preset analysis strategy and determining the first configuration update instruction of the cluster includes: determining, according to the preset analysis strategy, the state value corresponding to the value range in which the preset operation and maintenance data of the cluster falls; and if that state value meets a trigger condition for triggering a configuration update of the cluster, taking the instruction corresponding to the trigger condition as the first configuration update instruction.
In the above method, because the preset analysis strategy includes the state values corresponding to different value ranges of the preset operation and maintenance data of the cluster, the state value can be determined according to the preset analysis strategy, and whether a trigger condition for triggering a configuration update of the cluster is met can be detected automatically, so that the first configuration update instruction corresponding to the trigger condition can be generated.
In an optional embodiment, the trigger condition is that the state value is a first preset value or a second preset value; executing the first configuration update instruction includes: if the state value is the first preset value, outputting alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restarting the cluster job corresponding to the preset operation and maintenance data of the cluster.
In the above method, when the state value is the first preset value or the second preset value, the first configuration update instruction is executed: depending on the state value, either alarm information is output according to a preset alarm mechanism, or the cluster job corresponding to the preset operation and maintenance data of the cluster is restarted, so that the availability of the cluster is improved.
In an optional implementation, after determining the first configuration update instruction of the cluster and before executing the first configuration update instruction, the method further includes: acquiring execution restriction information of the first configuration update instruction from a preset operation and maintenance database of the cluster, where the execution restriction information is used for restricting the execution range of the first configuration update instruction. Executing the first configuration update instruction includes: executing the first configuration update instruction within the execution range indicated by the execution restriction information.
In the above manner, before the first configuration update instruction is executed, its execution restriction information is acquired from the preset operation and maintenance database of the cluster. Because the execution restriction information restricts the execution range of the first configuration update instruction, the instruction can be executed within the indicated execution range, so that execution of the first configuration update instruction is restricted to a certain extent, which increases the flexibility of executing it.
In an optional embodiment, the preset operation and maintenance data of the cluster includes: operating data of each node in the cluster; execution data of each job in the cluster; storage capacity data of the cluster.
In an optional implementation, the operation data of each node in the cluster corresponds to a first sub-analysis strategy in the preset analysis strategy; the execution data of each job in the cluster corresponds to a second sub-analysis strategy in the preset analysis strategy; the storage capacity data of the cluster corresponds to a third sub-analysis strategy in the preset analysis strategy. The first sub-analysis strategy includes: a mapping between the value range of the operation data of each node in the cluster and the running state of each node in the cluster. The second sub-analysis strategy includes: a mapping between the value range of the execution data of each job in the cluster and the execution state of each job in the cluster. The third sub-analysis strategy includes: a mapping between the value range of the storage capacity of the cluster and the storage state of the cluster.
In the above manner, a first, a second and a third sub-analysis strategy are preset for the respective kinds of preset operation and maintenance data, and the preset operation and maintenance data of the cluster are compared against the corresponding sub-analysis strategy, so that the first configuration update instruction of the cluster can be obtained automatically.
In an optional embodiment, obtaining historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster; according to a preset optimization strategy, performing optimization analysis on historical operation and maintenance data of the cluster, and determining a second configuration updating instruction of the cluster; and executing the second configuration updating instruction so as to update the configuration of the cluster.
In the above manner, the historical operation and maintenance data of the cluster can be optimized and analyzed according to a preset optimization strategy through the acquired historical operation and maintenance data of the cluster, and a second configuration updating instruction of the cluster is determined; and executing the second configuration updating instruction so as to update the configuration of the cluster, thereby realizing the automatic optimization of the cluster.
In a second aspect, the present application provides a cluster operation and maintenance device, including: the acquisition module is used for acquiring preset operation and maintenance data of the cluster; the analysis module is used for analyzing the preset operation and maintenance data of the cluster according to a preset analysis strategy and determining a first configuration updating instruction of the cluster; and the execution module is used for executing the first configuration updating instruction so as to update the configuration of the cluster.
In an optional embodiment, the preset analysis strategy includes: the preset operation and maintenance data of the cluster correspond to state values in different value ranges; the analysis module is specifically configured to: determining a state value corresponding to a value range of preset operation and maintenance data of the cluster according to the preset analysis strategy; and if the state value corresponding to the value range of the preset operation and maintenance data of the cluster meets the trigger condition for triggering the configuration update of the cluster, taking the instruction corresponding to the trigger condition as the first configuration update instruction.
In an optional embodiment, the trigger condition is that the state value is a first preset value or a second preset value; the execution module is specifically configured to: if the state value is the first preset value, output alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restart the cluster job corresponding to the preset operation and maintenance data of the cluster.
In an optional embodiment, the acquisition module is further configured to: acquiring execution limit information of the first configuration updating instruction from a preset operation and maintenance database of the cluster; the execution restriction information is used for restricting the execution range of the first configuration updating instruction; the execution module is specifically configured to: executing the first configuration update instruction within the execution range indicated by the execution restriction information.
In an optional embodiment, the preset operation and maintenance data of the cluster includes: operating data of each node in the cluster; execution data of each job in the cluster; storage capacity data of the cluster.
In an optional implementation, the operation data of each node in the cluster corresponds to a first sub-analysis strategy in the preset analysis strategy; the execution data of each job in the cluster corresponds to a second sub-analysis strategy in the preset analysis strategy; the storage capacity data of the cluster corresponds to a third sub-analysis strategy in the preset analysis strategy. The first sub-analysis strategy includes: a mapping between the value range of the operation data of each node in the cluster and the running state of each node in the cluster. The second sub-analysis strategy includes: a mapping between the value range of the execution data of each job in the cluster and the execution state of each job in the cluster. The third sub-analysis strategy includes: a mapping between the value range of the storage capacity of the cluster and the storage state of the cluster.
In an optional embodiment, the acquisition module is further configured to: acquiring historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster; the apparatus further comprises an optimization module; the optimization module is configured to: according to a preset optimization strategy, performing optimization analysis on historical operation and maintenance data of the cluster, and determining a second configuration updating instruction of the cluster; and executing the second configuration updating instruction so as to update the configuration of the cluster.
For the advantages of the second aspect and the embodiments of the second aspect, reference may be made to the advantages of the first aspect and the embodiments of the first aspect, which are not described herein again.
In a third aspect, an embodiment of the present application provides a computer device, which includes a program or instructions; when the program or instructions are executed, the computer device performs the method of the first aspect and of each embodiment of the first aspect.
In a fourth aspect, an embodiment of the present application provides a storage medium, which includes a program or instructions, and when the program or instructions are executed, the program or instructions are configured to perform the method of the first aspect and the embodiments of the first aspect.
Drawings
Fig. 1 is a schematic diagram of an architecture to which a cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 2 is a schematic flowchart of the steps of a cluster operation and maintenance method provided in an embodiment of the present application;
Fig. 3 is a schematic operation diagram of the collector in an architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 4 is a schematic operation diagram of the analyzer in an architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 5 is a schematic operation diagram of the decision processor in an architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 6 is a schematic operation diagram of the memory in an architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 7 is a schematic operation diagram of the optimizer in an architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 8 is a schematic structural diagram of a cluster operation and maintenance device provided in an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions, the technical solutions will be described in detail below with reference to the drawings and the specific embodiments of the specification, and it should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application, but not limitations of the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
At present, the configuration of a cluster is adjusted by manually observing the running status of the cluster and then manually updating its configuration. This approach is inefficient; the present application therefore provides a cluster operation and maintenance method that updates the cluster configuration automatically.
Fig. 1 is a schematic diagram of an architecture to which the cluster operation and maintenance method provided by the present application is applicable. The architecture shown in Fig. 1 includes a cluster environment, a collector, an analyzer, a decision processor, an optimizer, and a memory. The operation flow of the architecture may be as follows:
The collector acquires operation and maintenance data in the cluster environment. For example, the operation and maintenance data may include the execution status of cluster jobs and the running status of the cluster's nodes. The collector transmits the operation and maintenance data to the analyzer; the analyzer reads the analysis strategy and the analysis configuration from the memory, analyzes and processes the data transmitted by the collector, and transmits the analysis results to the memory and the decision processor. The decision processor performs corresponding processing according to the analyzer's results, for example raising an alarm or restarting the cluster job when a cluster job fails, or raising an alarm for a cluster job that has timed out. The memory stores historical operation and maintenance data of the cluster, and the optimizer can derive a suitable optimized configuration from these data, such as the historical execution of cluster jobs and the resource conditions of the cluster. The optimizer may also produce periodic summary reports, such as daily, weekly and monthly reports.
A cluster operation and maintenance method provided in the embodiment of the present application is described in detail below with reference to fig. 2.
Step 201: Collect preset operation and maintenance data of the cluster.
Step 202: Analyze the preset operation and maintenance data of the cluster according to a preset analysis strategy, and determine a first configuration update instruction of the cluster.
Step 203: Execute the first configuration update instruction so as to update the configuration of the cluster.
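As an illustration only, steps 201 to 203 can be sketched as one pass through three pluggable callables. This is a minimal sketch, not the patented implementation: the function name `run_cycle`, the metric key `disk_occupancy_pct`, and the instruction label are hypothetical, and a null instruction is modelled as `None`.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical types: metrics are a name -> value mapping, and an
# instruction is a plain string label; None models a null instruction.
Metrics = Dict[str, float]
Instruction = Optional[str]


def run_cycle(
    collect: Callable[[], Metrics],
    analyze: Callable[[Metrics], Instruction],
    execute: Callable[[str], None],
) -> Instruction:
    """Run one collect -> analyze -> execute pass."""
    metrics = collect()               # step 201: collect preset O&M data
    instruction = analyze(metrics)    # step 202: derive the instruction
    if instruction is not None:       # a null instruction means no update
        execute(instruction)          # step 203: apply the update
    return instruction


# Stand-in callables for demonstration; a real collector and executor
# would talk to the cluster.
executed: List[str] = []


def fake_collect() -> Metrics:
    return {"disk_occupancy_pct": 85.0}


def fake_analyze(metrics: Metrics) -> Instruction:
    if metrics["disk_occupancy_pct"] >= 80:
        return "delete_specified_disk_data"
    return None


result = run_cycle(fake_collect, fake_analyze, executed.append)
```

Passing the three stages as callables keeps the sketch independent of how the collector, analyzer and decision processor are actually deployed.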
Through a collection agent, the collector can collect the health state of each node in the cluster, such as the running conditions of the CPU, memory, IO and other machine resources. Through a computing resource interface, it obtains the running conditions of cluster jobs, such as the task running state (running, failed, succeeded), the resources occupied by tasks, the resource usage of the current cluster, and the amount of remaining resources. Through a storage resource interface, it obtains the usage of cluster storage resources, the storage capacity of the current cluster, the health of the cluster's storage nodes, and the like. The collector is responsible for collecting the cluster data, preliminarily assembling them, and transmitting them onward.
It should be noted that, in an optional embodiment, the preset operation and maintenance data of the cluster includes: operation data of each node in the cluster; execution data of each job in the cluster; and storage capacity data of the cluster. Specifically, the operation data of each node in the cluster may include the operating rate of the node, the central processing unit (CPU) occupancy of the node, and so on. The execution data of each job in the cluster may include the execution state and the execution duration of the cluster job. The storage capacity data of the cluster may include the hard disk storage occupancy, the hard disk storage space, and so on.
In the above optional embodiment, the preset analysis policy may be set as follows:
The operation data of each node in the cluster corresponds to a first sub-analysis strategy in the preset analysis strategy; the execution data of each job in the cluster corresponds to a second sub-analysis strategy in the preset analysis strategy; the storage capacity data of the cluster corresponds to a third sub-analysis strategy in the preset analysis strategy. The first sub-analysis strategy includes: a mapping between the value range of the operation data of each node in the cluster and the running state of each node in the cluster. The second sub-analysis strategy includes: a mapping between the value range of the execution data of each job in the cluster and the execution state of each job in the cluster. The third sub-analysis strategy includes: a mapping between the value range of the storage capacity of the cluster and the storage state of the cluster.
For example, the operation data of each node in the cluster is the CPU occupancy of the node, and the first sub-analysis strategy is as follows. When the CPU occupancy is 0% to 20%, the running state of the node is lightly loaded and the first configuration update instruction is null. When the CPU occupancy is 20% to 60%, the running state of the node is moderately loaded and the first configuration update instruction is null. When the CPU occupancy is 60% to 95%, the running state of the node is heavily loaded and the first configuration update instruction is: end the specified process. When the CPU occupancy is 95% to 100%, the running state of the node is severely loaded and the first configuration update instruction is: end the specified process.
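The CPU example above can be written as a small range table. The following is a hedged sketch: the names `Rule`, `CPU_RULES` and `classify` and the instruction label are hypothetical, and because the text does not say which range a boundary value such as 20% belongs to, the sketch assumes each range includes its lower bound and excludes its upper bound, with 100% falling into the last range.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    high: float                  # exclusive upper bound, in percent
    state: str                   # running state of the node
    instruction: Optional[str]   # None models the null instruction


# Ranges and states follow the CPU example; rules must be sorted by
# ascending upper bound and together cover 0% to 100%.
CPU_RULES: List[Rule] = [
    Rule(20, "light", None),
    Rule(60, "moderate", None),
    Rule(95, "heavy", "end_specified_process"),
    Rule(100, "severe", "end_specified_process"),
]


def classify(value: float, rules: List[Rule]) -> Rule:
    """Return the rule whose range contains `value` (a percentage)."""
    if not 0 <= value <= 100:
        raise ValueError(f"occupancy out of range: {value}")
    for rule in rules:
        if value < rule.high:
            return rule
    return rules[-1]             # value == 100 belongs to the last range


rule = classify(72.0, CPU_RULES)
print(rule.state, rule.instruction)
```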
For example, the execution data of each job in the cluster is the execution duration of the cluster job, and the second sub-analysis strategy is as follows. When the execution duration is less than or equal to a preset duration, the execution state of the cluster job is normal and the first configuration update instruction is null. When the execution duration is longer than the preset duration, the execution state of the cluster job is abnormal and the first configuration update instruction is: restart the cluster job.
For example, the storage capacity data of the cluster is the hard disk storage occupancy of a node, and the third sub-analysis strategy is as follows. When the hard disk storage occupancy is 0% to 20%, the hard disk storage state of the node is idle and the first configuration update instruction is null. When the hard disk storage occupancy is 20% to 80%, the hard disk storage state of the node is moderately occupied and the first configuration update instruction is null. When the hard disk storage occupancy is 80% to 100%, the hard disk storage state of the node is tight and the first configuration update instruction is: delete the specified hard disk data.
In an optional embodiment, the preset analysis strategy includes: state values corresponding to different value ranges of the preset operation and maintenance data of the cluster. Step 202 may specifically be performed in the following manner:
Determine, according to the preset analysis strategy, the state value corresponding to the value range in which the preset operation and maintenance data of the cluster falls; if that state value meets a trigger condition for triggering a configuration update of the cluster, take the instruction corresponding to the trigger condition as the first configuration update instruction.
For example, if the hard disk storage occupancy is 85%, it lies in the interval of 80% to 100% and the trigger condition for deleting the specified hard disk data is satisfied, so the first configuration update instruction is: delete the specified hard disk data.
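The two stages described here, mapping a value to a state value and then checking the state value against trigger conditions, can be sketched separately. This is an illustrative sketch under stated assumptions: the names `DISK_BOUNDS`, `DISK_STATES`, `TRIGGERS`, `state_for` and `first_update_instruction` are hypothetical, as are the state and instruction labels, and boundary values are assigned to the higher range.

```python
from bisect import bisect_right
from typing import Dict, List, Optional

# Upper bounds (percent) of the first two ranges and the state value for
# each of the three ranges in the hard disk example.
DISK_BOUNDS: List[float] = [20.0, 80.0]
DISK_STATES: List[str] = ["idle", "moderate", "tight"]

# Trigger conditions: state value -> instruction. States without an
# entry produce a null instruction.
TRIGGERS: Dict[str, str] = {"tight": "delete_specified_disk_data"}


def state_for(occupancy_pct: float) -> str:
    """Stage 1: map an occupancy value to its state value."""
    return DISK_STATES[bisect_right(DISK_BOUNDS, occupancy_pct)]


def first_update_instruction(occupancy_pct: float) -> Optional[str]:
    """Stage 2: return the instruction whose trigger the state meets."""
    return TRIGGERS.get(state_for(occupancy_pct))


print(first_update_instruction(85.0))
```

Keeping the trigger table apart from the range table means a state can be reported without necessarily causing a configuration update.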
Specifically, as shown in Fig. 4, the analyzer has three main functions: 1. health analysis of cluster nodes; 2. monitoring of storage resources; 3. monitoring of computing resources. These three functions correspond respectively to the three kinds of cluster data gathered by the collector. The analyzer also reads configuration data (daily task configuration, node health thresholds and other configuration information) from the memory. After analysis and processing, it outputs decision data and processed cluster data: the decision data are submitted to the decision processor, and the cluster data are submitted to the memory.
In the above optional embodiment, the trigger condition may be that the state value is a first preset value or a second preset value; on this basis, step 203 may specifically be performed in the following manner:
If the state value is the first preset value, output alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restart the cluster job corresponding to the preset operation and maintenance data of the cluster.
For example, if the state value for the storage capacity data of the cluster indicates that storage is tight, alarm information about the cluster's storage shortage is output; if the state value for the execution data of a cluster job is abnormal, the cluster job is restarted.
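A minimal sketch of this dispatch follows, assuming the two preset values are encoded as the hypothetical strings below and that alarming and restarting are supplied as callables; none of these names come from the source.

```python
from typing import Callable, List

# Hypothetical encodings of the two preset state values.
FIRST_PRESET = "storage_tight"    # triggers an alarm
SECOND_PRESET = "job_abnormal"    # triggers a job restart


def execute_instruction(
    state_value: str,
    target: str,
    alarm: Callable[[str], None],
    restart: Callable[[str], None],
) -> str:
    """Dispatch on the state value and report which action was taken."""
    if state_value == FIRST_PRESET:
        alarm(f"storage shortage: {target}")
        return "alarm"
    if state_value == SECOND_PRESET:
        restart(target)
        return "restart"
    return "none"                 # no trigger condition is met


# Lists stand in for a real alarm channel and job scheduler.
alarms: List[str] = []
restarts: List[str] = []
execute_instruction(FIRST_PRESET, "cluster-a", alarms.append, restarts.append)
execute_instruction(SECOND_PRESET, "job-7", alarms.append, restarts.append)
```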
After step 202 and before step 203, an optional embodiment is as follows:
Acquire execution restriction information of the first configuration update instruction from a preset operation and maintenance database of the cluster, where the execution restriction information is used for restricting the execution range of the first configuration update instruction. On this basis, step 203 may specifically be performed in the following manner: execute the first configuration update instruction within the execution range indicated by the execution restriction information.
For example, the first configuration update instruction is to end process one, process two, and process three. If the execution restriction information specifies that process one is to be retained, then executing the first configuration update instruction ends only process two and process three.
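The process example above amounts to filtering the instruction's targets against the retained set before execution. A minimal sketch, with the hypothetical function name `apply_restriction` and the restriction information modelled as a set of retained process names:

```python
from typing import Iterable, List, Set


def apply_restriction(targets: Iterable[str], retained: Set[str]) -> List[str]:
    """Keep only the targets the restriction information allows to end."""
    return [t for t in targets if t not in retained]


# Instruction: end processes one, two and three; process one is retained.
to_end = apply_restriction(
    ["process one", "process two", "process three"],
    {"process one"},
)
print(to_end)  # ['process two', 'process three']
```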
It should be noted that the operation process of the decision processor may be as shown in Fig. 5. The decision processor mainly acquires decision data from the analyzer and acquires from the memory the configuration information it needs for decision processing, specifically resource limitation information and authority information, and performs different decision processing on different decision data. For example, it first outputs alarm information, then recovers the cluster job, and shows the processing flow in a task display bar, so that staff can follow the current automatic operation and maintenance status and, for jobs whose recovery failed, investigate the cause, find a corresponding solution, and intervene manually. Meanwhile, the decision information of the decision processor is transmitted to the memory in real time for storage.
It should be noted that the operation process of the memory may be as shown in Fig. 6. The memory comprises two parts, a relational database and ELK, so that components such as the analyzer can acquire data promptly. ELK is used to store cluster operation information, processing strategies of the decision processor, decision information of the analyzer, task running conditions, and machine node health information; these data can be called by the optimizer and used as analysis data of the optimization center.
In addition to steps 201 to 203, an optional implementation is as follows:
Acquire historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster; perform optimization analysis on the historical operation and maintenance data of the cluster according to a preset optimization strategy, and determine a second configuration update instruction of the cluster; and execute the second configuration update instruction so as to update the configuration of the cluster.
For example, in the historical operation and maintenance data, the estimated execution time period of cluster job one overlaps with that of cluster job two, and the CPU occupancy required to run cluster job one and cluster job two is greater than a CPU occupancy threshold, such as 40%. If the preset optimization strategy is to stagger the execution of cluster jobs whose CPU occupancy is greater than the CPU occupancy threshold, the second configuration update instruction is: modify the execution start time of cluster job two to any time after the execution of cluster job one is finished.
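The staggering example can be sketched as follows. Several details are assumptions rather than statements from the source: the `Job` record and the `stagger` function are hypothetical, times are minutes from midnight, the threshold is applied to each job individually, and the new start time is chosen as the earliest permitted one (the moment job one ends), although the text allows any later time.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Job:
    name: str
    start: int       # estimated start, minutes from midnight (assumed unit)
    end: int         # estimated end, same unit
    cpu_pct: float   # CPU occupancy the job requires, in percent


def overlaps(a: Job, b: Job) -> bool:
    """True if the two estimated execution periods intersect."""
    return a.start < b.end and b.start < a.end


def stagger(first: Job, second: Job, threshold: float = 40.0) -> Optional[Job]:
    """Return a rescheduled copy of `second`, or None if no change is needed."""
    heavy = first.cpu_pct > threshold and second.cpu_pct > threshold
    if overlaps(first, second) and heavy:
        duration = second.end - second.start
        # Start job two when job one is estimated to finish.
        return replace(second, start=first.end, end=first.end + duration)
    return None


job_one = Job("job one", start=60, end=120, cpu_pct=55.0)
job_two = Job("job two", start=90, end=150, cpu_pct=50.0)
moved = stagger(job_one, job_two)
```

Here `moved` plays the role of the second configuration update instruction: it carries the new start time for job two and leaves job one unchanged.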
Specifically, as shown in Fig. 7, the optimization center obtains stored information from the memory and, through a series of analyses and calculations covering resource matching, energy consumption, job resources, time consumption and the like, obtains an optimized configuration of the cluster, thereby providing optimized configuration data for the cluster environment.
As shown in Fig. 8, the present application provides a cluster operation and maintenance device, including: an acquisition module 801, configured to collect preset operation and maintenance data of the cluster; an analysis module 802, configured to analyze the preset operation and maintenance data of the cluster according to a preset analysis strategy and determine a first configuration update instruction of the cluster; and an execution module 803, configured to execute the first configuration update instruction so as to update the configuration of the cluster.
In an optional embodiment, the preset analysis strategy includes: the state values to which the preset operation and maintenance data of the cluster correspond in different value ranges; the analysis module 802 is specifically configured to: determine, according to the preset analysis strategy, the state value corresponding to the value range in which the preset operation and maintenance data of the cluster fall; and, if that state value meets a trigger condition for triggering a configuration update of the cluster, take the instruction corresponding to the trigger condition as the first configuration update instruction.
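A minimal sketch of such an analysis strategy follows, assuming a single numeric metric. The range boundaries, the state values 0, 1, and 2, and the instruction names are all hypothetical placeholders; the application does not fix any of them.

```python
from bisect import bisect_right

# Hypothetical strategy: range boundaries and the state value of each range.
# [0, 0.7) -> 0, [0.7, 0.9) -> 1, [0.9, infinity) -> 2
BOUNDS = [0.7, 0.9]
STATE_VALUES = [0, 1, 2]

# Hypothetical trigger conditions: state value -> instruction name.
TRIGGERS = {1: "output_alarm", 2: "restart_job"}


def state_value(metric: float) -> int:
    """Map a metric to the state value of the range it falls in."""
    return STATE_VALUES[bisect_right(BOUNDS, metric)]


def first_update_instruction(metric: float):
    """Return the instruction for a triggered state value, else None."""
    return TRIGGERS.get(state_value(metric))
```

A state value with no entry in `TRIGGERS` (here 0) means that no trigger condition is met and no instruction is produced.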
In an optional embodiment, the trigger condition is that the state value is a first preset value or a second preset value; the execution module 803 is specifically configured to: if the state value is the first preset value, output alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restart the cluster job corresponding to the preset operation and maintenance data of the cluster.
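The two branches can be sketched as a small dispatch function. The concrete preset values 1 and 2, and the `alarm` and `restart` callables standing in for the alarm mechanism and the job restart, are assumptions for illustration.

```python
from typing import Callable, Optional

FIRST_PRESET_VALUE = 1   # hypothetical value that triggers an alarm
SECOND_PRESET_VALUE = 2  # hypothetical value that triggers a job restart


def execute_first_instruction(state_value: int, job_id: str,
                              alarm: Callable[[str], None],
                              restart: Callable[[str], None]) -> Optional[str]:
    """Carry out the action tied to the state value and report which one ran."""
    if state_value == FIRST_PRESET_VALUE:
        alarm(f"job {job_id}: state value {state_value}")
        return "alarm"
    if state_value == SECOND_PRESET_VALUE:
        restart(job_id)
        return "restart"
    return None  # neither trigger condition is met
```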
In an optional implementation, the acquisition module 801 is further configured to: acquire execution restriction information of the first configuration update instruction from a preset operation and maintenance database of the cluster, the execution restriction information being used to restrict the execution range of the first configuration update instruction; the execution module 803 is specifically configured to: execute the first configuration update instruction within the execution range indicated by the execution restriction information.
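One way to picture the execution restriction is as a whitelist of nodes on which the instruction may run. That representation is an assumption; the application does not say how the execution range is encoded, and the names below are hypothetical.

```python
from typing import Callable, Iterable, List, Set


def execute_within_range(instruction: str,
                         targets: Iterable[str],
                         allowed_nodes: Set[str],
                         apply: Callable[[str, str], None]) -> List[str]:
    """Apply the instruction only to targets inside the allowed range.

    `allowed_nodes` stands in for the execution restriction information
    read from the operation and maintenance database (assumed form).
    """
    executed = []
    for node in targets:
        if node in allowed_nodes:
            apply(instruction, node)
            executed.append(node)
    return executed
```

Targets outside the allowed range are skipped silently here; a real system might instead log or report them.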
In an optional embodiment, the preset operation and maintenance data of the cluster includes: operation data of each node in the cluster; execution data of each job in the cluster; and storage capacity data of the cluster.
In an optional implementation, the operation data of each node in the cluster corresponds to a first sub-analysis strategy in the preset analysis strategy; the execution data of each job in the cluster corresponds to a second sub-analysis strategy in the preset analysis strategy; the storage capacity data of the cluster corresponds to a third sub-analysis strategy in the preset analysis strategy; the first sub-analysis strategy comprises: a mapping relation between the value range of the operation data of each node in the cluster and the operation state of each node in the cluster; the second sub-analysis strategy comprises: a mapping relation between the value range of the execution data of each job in the cluster and the execution state of each job in the cluster; the third sub-analysis strategy comprises: a mapping relation between the value range of the storage capacity of the cluster and the storage state of the cluster.
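The three sub-analysis strategies can be pictured as three range-to-state tables selected by data category. In the sketch below, the category keys, the range boundaries, the units, and the state names are all hypothetical examples.

```python
# Hypothetical sub-analysis strategies: (low, high, state) with low <= value < high.
SUB_STRATEGIES = {
    # first sub-strategy: node operation data (e.g. CPU load ratio)
    "node": [(0.0, 0.8, "running_normally"), (0.8, float("inf"), "overloaded")],
    # second sub-strategy: job execution data (e.g. run time in seconds)
    "job": [(0.0, 3600.0, "on_schedule"), (3600.0, float("inf"), "overdue")],
    # third sub-strategy: storage capacity data (e.g. used fraction)
    "storage": [(0.0, 0.9, "sufficient"), (0.9, float("inf"), "nearly_full")],
}


def classify(kind: str, value: float) -> str:
    """Return the state mapped to the range that contains the value."""
    for low, high, state in SUB_STRATEGIES[kind]:
        if low <= value < high:
            return state
    raise ValueError(f"no range covers {value!r} for kind {kind!r}")
```

Keeping one table per data category lets each sub-strategy be adjusted without touching the others.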
In an optional implementation, the acquisition module 801 is further configured to: acquire historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster; the device further comprises an optimization module, configured to: perform optimization analysis on the historical operation and maintenance data of the cluster according to a preset optimization strategy, and determine a second configuration update instruction of the cluster; and execute the second configuration update instruction so as to update the configuration of the cluster.
An embodiment of the present application provides a computer device including a program or instructions which, when executed, perform the cluster operation and maintenance method and any optional method provided in the embodiments of the present application.
An embodiment of the present application provides a storage medium including a program or instructions which, when executed, perform the cluster operation and maintenance method and any optional method provided in the embodiments of the present application.
Finally, it should be noted that, as will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (10)
1. A cluster operation and maintenance method is characterized by comprising the following steps:
collecting preset operation and maintenance data of a cluster;
analyzing preset operation and maintenance data of the cluster according to a preset analysis strategy, and determining a first configuration updating instruction of the cluster;
and executing the first configuration updating instruction so as to update the configuration of the cluster.
2. The method of claim 1, wherein the preset analysis strategy comprises: the state values to which the preset operation and maintenance data of the cluster correspond in different value ranges; and the analyzing preset operation and maintenance data of the cluster according to a preset analysis strategy and determining a first configuration updating instruction of the cluster comprises:
determining a state value corresponding to a value range of preset operation and maintenance data of the cluster according to the preset analysis strategy;
and if the state value corresponding to the value range of the preset operation and maintenance data of the cluster meets the trigger condition for triggering the configuration update of the cluster, taking the instruction corresponding to the trigger condition as the first configuration update instruction.
3. The method of claim 2, wherein the trigger condition is that the state value is a first preset value or a second preset value; the executing the first configuration update instruction comprises:
if the state value is the first preset value, outputting alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restarting the cluster job corresponding to the preset operation and maintenance data of the cluster.
4. The method of any of claims 1-3, wherein after the determining the first configuration update instruction for the cluster and before the executing the first configuration update instruction, further comprising:
acquiring execution restriction information of the first configuration updating instruction from a preset operation and maintenance database of the cluster; the execution restriction information is used for restricting the execution range of the first configuration updating instruction;
the executing the first configuration update instruction comprises:
executing the first configuration update instruction within the execution range indicated by the execution restriction information.
5. The method of any of claims 1-3, wherein the preset operation and maintenance data of the cluster comprises: operation data of each node in the cluster; execution data of each job in the cluster; and storage capacity data of the cluster.
6. The method of claim 5, wherein the operation data of each node in the cluster corresponds to a first sub-analysis strategy in the preset analysis strategy; the execution data of each job in the cluster corresponds to a second sub-analysis strategy in the preset analysis strategy; the storage capacity data of the cluster corresponds to a third sub-analysis strategy in the preset analysis strategy; the first sub-analysis strategy comprises: a mapping relation between the value range of the operation data of each node in the cluster and the operation state of each node in the cluster; the second sub-analysis strategy comprises: a mapping relation between the value range of the execution data of each job in the cluster and the execution state of each job in the cluster; the third sub-analysis strategy comprises: a mapping relation between the value range of the storage capacity of the cluster and the storage state of the cluster.
7. The method of any of claims 1-3, further comprising:
acquiring historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster;
according to a preset optimization strategy, performing optimization analysis on historical operation and maintenance data of the cluster, and determining a second configuration updating instruction of the cluster;
and executing the second configuration updating instruction so as to update the configuration of the cluster.
8. A cluster operation and maintenance device, comprising:
the acquisition module is used for acquiring preset operation and maintenance data of the cluster;
the analysis module is used for analyzing the preset operation and maintenance data of the cluster according to a preset analysis strategy and determining a first configuration updating instruction of the cluster;
and the execution module is used for executing the first configuration updating instruction so as to update the configuration of the cluster.
9. A computer device comprising a program or instructions that, when executed, perform the method of any of claims 1 to 7.
10. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010103813.9A CN113296840B (en) | 2020-02-20 | 2020-02-20 | Cluster operation and maintenance method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113296840A (en) | 2021-08-24 |
CN113296840B (en) | 2023-04-14 |
Family
ID=77317473
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010103813.9A Active CN113296840B (en) | 2020-02-20 | 2020-02-20 | Cluster operation and maintenance method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113296840B (en) |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101820384A (en) * | 2010-02-05 | 2010-09-01 | 浪潮(北京)电子信息产业有限公司 | Method and device for dynamically distributing cluster services |
CN102394901A (en) * | 2011-06-23 | 2012-03-28 | 北京新媒传信科技有限公司 | Server cluster system and method for updating monitoring policies in the same |
US20120317274A1 (en) * | 2011-06-13 | 2012-12-13 | Richter Owen E | Distributed metering and monitoring system |
CN106100894A (en) * | 2016-07-11 | 2016-11-09 | 华南理工大学 | Highly reliable cluster operation management method |
CN106227469A (en) * | 2016-07-28 | 2016-12-14 | 乐视控股(北京)有限公司 | Data deletion method and system for distributed storage cluster |
CN106656533A (en) * | 2015-10-29 | 2017-05-10 | 大唐移动通信设备有限公司 | Method and device for monitoring load processing of cluster system |
CN107734035A (en) * | 2017-10-17 | 2018-02-23 | 华南理工大学 | Automatic scaling method for virtual clusters in a cloud computing environment |
CN107896175A (en) * | 2017-11-30 | 2018-04-10 | 北京小度信息科技有限公司 | Data collection method and device |
CN108197251A (en) * | 2017-12-29 | 2018-06-22 | 百度在线网络技术(北京)有限公司 | Big data operation and maintenance analysis method, device and server |
CN108566287A (en) * | 2018-01-08 | 2018-09-21 | 福建星瑞格软件有限公司 | Cluster server operation and maintenance optimization method based on deep learning |
CN109189575A (en) * | 2018-08-20 | 2019-01-11 | 北京奇虎科技有限公司 | Unified management method and device for multiple OpenStack clusters |
CN109413125A (en) * | 2017-08-18 | 2019-03-01 | 北京京东尚科信息技术有限公司 | Method and apparatus for dynamically adjusting distributed system resources |
CN109614236A (en) * | 2018-12-07 | 2019-04-12 | 深圳前海微众银行股份有限公司 | Cluster resource dynamic adjustment method, device, equipment and readable storage medium |
CN109960690A (en) * | 2019-03-18 | 2019-07-02 | 新华三大数据技术有限公司 | Operation and maintenance method and device for a big data cluster |
CN110543410A (en) * | 2019-09-05 | 2019-12-06 | 曙光信息产业(北京)有限公司 | Method for processing cluster index, method and device for inquiring cluster index |
Non-Patent Citations (1)
Title |
---|
Xiao Haiqin: "Application of Zabbix performance monitoring software on high-performance clusters", China Management Informationization *
Also Published As
Publication number | Publication date |
---|---|
CN113296840B (en) | 2023-04-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3180695B1 (en) | Systems and methods for auto-scaling a big data system | |
CN109412870B (en) | Alarm monitoring method and platform, server and storage medium | |
WO2016188100A1 (en) | Information system fault scenario information collection method and system | |
US20060117059A1 (en) | System and method for monitoring and managing performance and availability data from multiple data providers using a plurality of executable decision trees to consolidate, correlate, and diagnose said data | |
CN112751726B (en) | Data processing method and device, electronic equipment and storage medium | |
EP1527395A2 (en) | Method and system for monitoring performance of application in a distributed environment | |
WO2019223174A1 (en) | Automatic task rerunning method and system, computer device and storage medium | |
CN111901405B (en) | Multi-node monitoring method and device, electronic equipment and storage medium | |
US20190171446A1 (en) | Value stream graphs across heterogeneous software development platforms | |
CN110363381B (en) | Information processing method and device | |
CN111970151A (en) | Flow fault positioning method and system for virtual and container network | |
CN116560893B (en) | Computer application program operation data fault processing system | |
CN113886130A (en) | Method, device and medium for processing database fault | |
CN113296840B (en) | Cluster operation and maintenance method and device | |
US11758021B2 (en) | System for processing coherent data | |
CN115934487A (en) | Log monitoring and alarming method and device, computer equipment and storage medium | |
CN105630580A (en) | Scheduling platform based data summarizing method and data summarizing apparatus | |
CN113824601A (en) | Electric power marketing monitored control system based on service log | |
CN112037017A (en) | Method, device and equipment for determining batch processing job evaluation result | |
WO2011005073A2 (en) | Job status monitoring method | |
CN117389841B (en) | Method and device for monitoring accelerator resources, cluster equipment and storage medium | |
CN115296976B (en) | Internet of things equipment fault detection method, device, equipment and storage medium | |
CN109631280B (en) | Device management system and method | |
CN117632625A (en) | Abnormality reporting method, device, equipment and storage medium | |
CN115617817A (en) | Full-link-based global asset report generation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||