CN113296840A - Cluster operation and maintenance method and device - Google Patents

Info

Publication number
CN113296840A
Authority
CN
China
Prior art keywords
cluster
preset
data
configuration
maintenance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010103813.9A
Other languages
Chinese (zh)
Other versions
CN113296840B (en
Inventor
于振国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Data Services Co ltd
Original Assignee
China Unionpay Data Services Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Data Services Co ltd filed Critical China Unionpay Data Services Co ltd
Priority to CN202010103813.9A priority Critical patent/CN113296840B/en
Publication of CN113296840A publication Critical patent/CN113296840A/en
Application granted granted Critical
Publication of CN113296840B publication Critical patent/CN113296840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44505Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451User profiles; Roaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • G06F8/65Updates

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a cluster operation and maintenance method and a device, wherein the method comprises the following steps: collecting preset operation and maintenance data of a cluster; analyzing preset operation and maintenance data of the cluster according to a preset analysis strategy, and determining a first configuration updating instruction of the cluster; and executing the first configuration updating instruction so as to update the configuration of the cluster.

Description

Cluster operation and maintenance method and device
Technical Field
The invention relates to the field of cluster operation and maintenance, in particular to a cluster operation and maintenance method and device.
Background
With the continuous development of computer technology, single-server deployments can no longer meet the needs of enterprises, and clusters have emerged in response. A cluster includes a plurality of servers that share a data storage space and communicate with each other via an internal local area network. As cluster sizes grow, the operation and maintenance of clusters face greater challenges.
In the current operation and maintenance of a cluster, the configuration of the cluster often needs to be adjusted. However, in current schemes the running status of the cluster is observed manually, and the configuration of the cluster is then updated manually. Manually updating the configuration of the cluster is inefficient, yet no method for automatically updating the configuration of a cluster currently exists; this is a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a cluster operation and maintenance method and device, and solves the problem that no method for automatically updating cluster configuration exists in the prior art.
In a first aspect, an embodiment of the present application provides a cluster operation and maintenance method, including: collecting preset operation and maintenance data of a cluster; analyzing preset operation and maintenance data of the cluster according to a preset analysis strategy, and determining a first configuration updating instruction of the cluster; and executing the first configuration updating instruction so as to update the configuration of the cluster.
In the method, after the preset operation and maintenance data of the cluster are obtained, the preset operation and maintenance data of the cluster can be automatically analyzed through a preset analysis strategy, so that a first configuration updating instruction of the cluster can be automatically determined; further, the first configuration update instruction may be executed, thereby providing a method of automatically updating the configuration of the cluster.
In an optional embodiment, the preset analysis strategy includes state values corresponding to different value ranges of the preset operation and maintenance data of the cluster. Analyzing the preset operation and maintenance data of the cluster according to the preset analysis strategy and determining the first configuration update instruction of the cluster includes: determining, according to the preset analysis strategy, the state value corresponding to the value range in which the preset operation and maintenance data of the cluster falls; and, if that state value satisfies a trigger condition for triggering a configuration update of the cluster, taking the instruction corresponding to the trigger condition as the first configuration update instruction.
In the above method, because the preset analysis strategy includes state values corresponding to different value ranges of the preset operation and maintenance data of the cluster, the state value of the collected data can be determined according to the strategy, and whether a trigger condition for triggering a configuration update of the cluster is met can be detected automatically, so that the first configuration update instruction corresponding to the trigger condition can be generated.
In an optional embodiment, the trigger condition is that the state value is a first preset value or a second preset value. Executing the first configuration update instruction includes: if the state value is the first preset value, outputting alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restarting the cluster job corresponding to the preset operation and maintenance data of the cluster.
In the above method, the first configuration update instruction is executed when the state value is the first preset value or the second preset value: depending on the state value, alarm information is output according to the preset alarm mechanism, or the cluster job corresponding to the preset operation and maintenance data of the cluster is restarted, which improves the availability of the cluster.
In an optional implementation, after determining the first configuration update instruction of the cluster and before executing it, the method further includes: acquiring execution restriction information of the first configuration update instruction from a preset operation and maintenance database of the cluster, where the execution restriction information is used to restrict the execution range of the first configuration update instruction. Executing the first configuration update instruction includes: executing the first configuration update instruction within the execution range indicated by the execution restriction information.
In the above manner, the execution restriction information of the first configuration update instruction is acquired from the preset operation and maintenance database of the cluster before the instruction is executed. Because the execution restriction information restricts the execution range of the instruction, the instruction is executed only within that range; part of the instruction can thus be withheld from execution, which makes executing the first configuration update instruction more flexible.
In an optional embodiment, the preset operation and maintenance data of the cluster includes: operating data of each node in the cluster; execution data of each job in the cluster; storage capacity data of the cluster.
In an optional implementation, the operation data of each node in the cluster corresponds to a first sub-analysis strategy of the preset analysis strategy; the execution data of each job in the cluster corresponds to a second sub-analysis strategy of the preset analysis strategy; and the storage capacity data of the cluster corresponds to a third sub-analysis strategy of the preset analysis strategy. The first sub-analysis strategy includes a mapping between the value ranges of the operation data of each node in the cluster and the running state of that node; the second sub-analysis strategy includes a mapping between the value ranges of the execution data of each job in the cluster and the execution state of that job; and the third sub-analysis strategy includes a mapping between the value ranges of the storage capacity of the cluster and the storage state of the cluster.
In the above manner, the first, second, and third sub-analysis strategies are preset for the corresponding kinds of preset operation and maintenance data, and the data is compared against them, so that the first configuration update instruction of the cluster can be obtained automatically.
In an optional embodiment, the method further includes: obtaining historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster; performing optimization analysis on the historical operation and maintenance data of the cluster according to a preset optimization strategy, and determining a second configuration update instruction of the cluster; and executing the second configuration update instruction so as to update the configuration of the cluster.
In the above manner, optimization analysis is performed on the acquired historical operation and maintenance data of the cluster according to the preset optimization strategy to determine the second configuration update instruction of the cluster, and the second configuration update instruction is executed to update the configuration of the cluster, thereby realizing automatic optimization of the cluster.
In a second aspect, the present application provides a cluster operation and maintenance device, including: an acquisition module, configured to collect preset operation and maintenance data of the cluster; an analysis module, configured to analyze the preset operation and maintenance data of the cluster according to a preset analysis strategy and determine a first configuration update instruction of the cluster; and an execution module, configured to execute the first configuration update instruction so as to update the configuration of the cluster.
In an optional embodiment, the preset analysis strategy includes state values corresponding to different value ranges of the preset operation and maintenance data of the cluster. The analysis module is specifically configured to: determine, according to the preset analysis strategy, the state value corresponding to the value range in which the preset operation and maintenance data of the cluster falls; and, if that state value satisfies a trigger condition for triggering a configuration update of the cluster, take the instruction corresponding to the trigger condition as the first configuration update instruction.
In an optional embodiment, the trigger condition is that the state value is a first preset value or a second preset value. The execution module is specifically configured to: if the state value is the first preset value, output alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restart the cluster job corresponding to the preset operation and maintenance data of the cluster.
In an optional embodiment, the acquisition module is further configured to: acquiring execution limit information of the first configuration updating instruction from a preset operation and maintenance database of the cluster; the execution restriction information is used for restricting the execution range of the first configuration updating instruction; the execution module is specifically configured to: executing the first configuration update instruction within the execution range indicated by the execution restriction information.
In an optional embodiment, the preset operation and maintenance data of the cluster includes: operating data of each node in the cluster; execution data of each job in the cluster; storage capacity data of the cluster.
In an optional implementation, the operation data of each node in the cluster corresponds to a first sub-analysis strategy of the preset analysis strategy; the execution data of each job in the cluster corresponds to a second sub-analysis strategy of the preset analysis strategy; and the storage capacity data of the cluster corresponds to a third sub-analysis strategy of the preset analysis strategy. The first sub-analysis strategy includes a mapping between the value ranges of the operation data of each node in the cluster and the running state of that node; the second sub-analysis strategy includes a mapping between the value ranges of the execution data of each job in the cluster and the execution state of that job; and the third sub-analysis strategy includes a mapping between the value ranges of the storage capacity of the cluster and the storage state of the cluster.
In an optional embodiment, the acquisition module is further configured to: acquiring historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster; the apparatus further comprises an optimization module; the optimization module is configured to: according to a preset optimization strategy, performing optimization analysis on historical operation and maintenance data of the cluster, and determining a second configuration updating instruction of the cluster; and executing the second configuration updating instruction so as to update the configuration of the cluster.
For the advantages of the second aspect and the embodiments of the second aspect, reference may be made to the advantages of the first aspect and the embodiments of the first aspect, which are not described herein again.
In a third aspect, an embodiment of the present application provides a computer device, which includes a program or instructions; when the program or instructions are executed, the computer device is configured to perform the method of the first aspect and of each embodiment of the first aspect.
In a fourth aspect, an embodiment of the present application provides a storage medium, which includes a program or instructions, and when the program or instructions are executed, the program or instructions are configured to perform the method of the first aspect and the embodiments of the first aspect.
Drawings
Fig. 1 is a schematic diagram of an architecture to which a cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 2 is a schematic flowchart of the steps of a cluster operation and maintenance method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the operation of the collector in the architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 4 is a schematic diagram of the operation of the analyzer in the architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 5 is a schematic diagram of the operation of the decision processor in the architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 6 is a schematic diagram of the operation of the memory in the architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 7 is a schematic diagram of the operation of the optimizer in the architecture to which the cluster operation and maintenance method provided in the embodiment of the present application is applicable;
Fig. 8 is a schematic structural diagram of a cluster operation and maintenance device according to an embodiment of the present application.
Detailed Description
For a better understanding of the technical solutions, they are described in detail below with reference to the drawings and the specific embodiments. It should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application rather than limitations on them, and that the technical features in the embodiments and examples may be combined with each other where no conflict arises.
At present, the configuration of a cluster is adjusted by manually observing the running condition of the cluster and then manually updating its configuration. This approach is inefficient; the present application therefore provides a cluster operation and maintenance method that automatically updates the cluster configuration.
Fig. 1 is a schematic diagram of an architecture to which the cluster operation and maintenance method provided by the present application is applicable. The architecture shown in Fig. 1 includes a cluster environment, a collector, an analyzer, a decision processor, an optimizer, and a memory. The operation flow of the architecture may be as follows:
The collector acquires operation and maintenance data from the cluster environment. For example, the operation and maintenance data may include the execution of cluster jobs and the running condition of the nodes of the cluster. The collector transmits the operation and maintenance data to the analyzer; the analyzer reads the analysis strategy and the analysis configuration from the memory, analyzes and processes the data transmitted by the collector, and transmits the analysis result to the memory and the decision processor. The decision processor performs corresponding processing according to the result from the analyzer, for example raising an alarm or restarting the cluster job when a cluster job fails to execute, or raising an alarm for a cluster job that has timed out. The memory stores historical operation and maintenance data of the cluster, and the optimizer can derive a suitable optimized configuration from that data, such as the historical execution of cluster jobs and the resource condition of the cluster. The optimizer may also produce periodic summary reports, such as daily, weekly, and monthly reports.
A cluster operation and maintenance method provided in the embodiment of the present application is described in detail below with reference to fig. 2.
Step 201: Collect preset operation and maintenance data of the cluster.
Step 202: Analyze the preset operation and maintenance data of the cluster according to a preset analysis strategy, and determine a first configuration update instruction of the cluster.
Step 203: Execute the first configuration update instruction so as to update the configuration of the cluster.
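Steps 201 to 203 can be read as a simple collect, analyze, execute pass. The following is a minimal sketch of that reading, not the patented implementation: the function name `run_cycle`, the type aliases, and the metric key `disk_occupancy` are hypothetical names introduced for illustration only.

```python
from typing import Callable, Dict, Optional

# Hypothetical type aliases; none of these names come from the patent text.
Metrics = Dict[str, float]
Instruction = Optional[str]


def run_cycle(
    collect: Callable[[], Metrics],
    analyze: Callable[[Metrics], Instruction],
    execute: Callable[[str], None],
) -> Instruction:
    """Run one collect -> analyze -> execute pass (steps 201 to 203)."""
    metrics = collect()              # step 201: collect preset O&M data
    instruction = analyze(metrics)   # step 202: derive the first configuration update instruction
    if instruction is not None:      # a null instruction means no update is needed
        execute(instruction)         # step 203: apply the update to the cluster
    return instruction


# Usage with stand-in callables and illustrative values only.
executed = []
result = run_cycle(
    collect=lambda: {"disk_occupancy": 0.85},
    analyze=lambda m: "delete specified hard disk data" if m["disk_occupancy"] >= 0.80 else None,
    execute=executed.append,
)
```

Passing the three stages as callables keeps the sketch independent of how a particular collector, analyzer, or decision processor is built.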
Step 201 may be implemented according to the process shown in Fig. 3, as follows:
Through a collection agent, the collector can collect the health state of each node in the cluster, such as the running condition of the machine's CPU, memory, and IO. Through a computing resource interface, it obtains the running condition of cluster jobs, such as the task running state (running, failure, or success), the resources occupied by tasks, the resource usage of the current cluster, and the amount of remaining resources. Through a storage resource interface, it obtains the usage of the cluster's storage resources, the storage capacity of the current cluster, the health of the cluster's storage nodes, and the like. The collector is responsible for collecting the cluster data, preliminarily assembling it, and transmitting it onward.
It should be noted that, in an optional embodiment, the preset operation and maintenance data of the cluster includes: operation data of each node in the cluster; execution data of each job in the cluster; and storage capacity data of the cluster. Specifically, the operation data of each node in the cluster may include the operating rate of the node, the central processing unit (CPU) occupancy rate of the node, and the like. The execution data of each job in the cluster may include the execution state and the execution duration of the cluster job. The storage capacity data of the cluster may include the hard disk storage occupancy rate, the hard disk storage space, and the like.
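One way to picture the three kinds of preset operation and maintenance data is as a single assembled record. The sketch below is an assumption about layout only; every class and field name (`NodeData`, `JobData`, `StorageData`, `OpsSnapshot`, and their fields) is hypothetical and chosen to mirror the examples in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeData:
    """Operation data of one node (hypothetical field names)."""
    name: str
    cpu_occupancy_pct: float


@dataclass
class JobData:
    """Execution data of one cluster job."""
    name: str
    state: str          # "running", "failure", or "success"
    duration_s: float   # execution duration in seconds


@dataclass
class StorageData:
    """Storage capacity data of the cluster."""
    disk_occupancy_pct: float
    disk_space_gb: float


@dataclass
class OpsSnapshot:
    """One assembled batch of preset O&M data, as a collector might pass it on."""
    nodes: List[NodeData] = field(default_factory=list)
    jobs: List[JobData] = field(default_factory=list)
    storage: Optional[StorageData] = None


# Illustrative values only.
snapshot = OpsSnapshot(
    nodes=[NodeData("node-1", 35.0)],
    jobs=[JobData("job-1", "running", 120.0)],
    storage=StorageData(85.0, 2000.0),
)
```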
In the above optional embodiment, the preset analysis strategy may be set as follows:
The operation data of each node in the cluster corresponds to a first sub-analysis strategy of the preset analysis strategy; the execution data of each job in the cluster corresponds to a second sub-analysis strategy of the preset analysis strategy; and the storage capacity data of the cluster corresponds to a third sub-analysis strategy of the preset analysis strategy. The first sub-analysis strategy includes a mapping between the value ranges of the operation data of each node in the cluster and the running state of that node; the second sub-analysis strategy includes a mapping between the value ranges of the execution data of each job in the cluster and the execution state of that job; and the third sub-analysis strategy includes a mapping between the value ranges of the storage capacity of the cluster and the storage state of the cluster.
For example, when the operation data of each node in the cluster is the CPU occupancy rate of the node, the first sub-analysis strategy may be: for a CPU occupancy rate of 0 to 20%, the running state of the node is light load and the first configuration update instruction is null; for 20% to 60%, the running state is moderate load and the first configuration update instruction is null; for 60% to 95%, the running state is heavy load and the first configuration update instruction is to end a specified process; for 95% to 100%, the running state is severe load and the first configuration update instruction is to end a specified process.
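The CPU example amounts to a lookup from a value range to a state and an instruction. The sketch below encodes the four bands from the example; the names `CPU_POLICY` and `analyze_cpu` are hypothetical. The text does not say which band a boundary value such as 20% belongs to, so treating each upper bound as exclusive (except 100%) is an assumption.

```python
from typing import List, Optional, Tuple

# (upper bound in percent, running state, first configuration update instruction)
# Bands mirror the example in the text. Assumption: upper bounds are exclusive,
# except that exactly 100% belongs to the last band.
CPU_POLICY: List[Tuple[float, str, Optional[str]]] = [
    (20.0, "light load", None),
    (60.0, "moderate load", None),
    (95.0, "heavy load", "end specified process"),
    (100.0, "severe load", "end specified process"),
]


def analyze_cpu(occupancy_pct: float) -> Tuple[str, Optional[str]]:
    """Map a CPU occupancy rate to (running state, instruction or None)."""
    if not 0.0 <= occupancy_pct <= 100.0:
        raise ValueError("occupancy must be between 0 and 100")
    for upper, state, instruction in CPU_POLICY:
        if occupancy_pct < upper:
            return state, instruction
    # Only exactly 100% reaches this point; it falls in the last band.
    _, state, instruction = CPU_POLICY[-1]
    return state, instruction
```

Keeping the bands in a table rather than in nested conditionals means the thresholds can be read from stored configuration without changing the lookup code.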
For example, when the execution data of each job in the cluster is the execution duration of the cluster job, the second sub-analysis strategy may be: if the execution duration is less than or equal to a preset duration, the execution state of the cluster job is normal and the first configuration update instruction is null; if the execution duration is longer than the preset duration, the execution state of the cluster job is abnormal and the first configuration update instruction is to restart the cluster job.
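The duration check reduces to a single comparison against the preset duration. A minimal sketch follows; the function name `analyze_job_duration` and the use of seconds as the unit are assumptions made for illustration.

```python
from typing import Optional, Tuple

RESTART_JOB = "restart cluster job"  # hypothetical instruction label


def analyze_job_duration(duration_s: float, preset_s: float) -> Tuple[str, Optional[str]]:
    """Return (execution state, instruction or None) for one cluster job."""
    if duration_s <= preset_s:
        return "normal", None         # within the preset duration: null instruction
    return "abnormal", RESTART_JOB    # timed out: restart the cluster job
```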
For example, when the storage capacity data of the cluster is the hard disk storage occupancy rate of a node, the third sub-analysis strategy may be: for a hard disk storage occupancy rate of 0 to 20%, the hard disk storage state of the node is idle and the first configuration update instruction is null; for 20% to 80%, the hard disk storage state is moderate occupancy and the first configuration update instruction is null; for 80% to 100%, the hard disk storage state is storage shortage and the first configuration update instruction is to delete specified hard disk data.
In an optional embodiment, the preset analysis strategy includes state values corresponding to different value ranges of the preset operation and maintenance data of the cluster. Step 202 may then be performed as follows:
Determine, according to the preset analysis strategy, the state value corresponding to the value range in which the preset operation and maintenance data of the cluster falls; if that state value satisfies a trigger condition for triggering a configuration update of the cluster, take the instruction corresponding to the trigger condition as the first configuration update instruction.
For example, if the hard disk storage occupancy rate is 85%, it falls within the interval of 80% to 100% and satisfies the trigger condition for deleting specified hard disk data, so the first configuration update instruction is: delete the specified hard disk data.
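The two stages of step 202, finding the state value for a value range and then checking it against trigger conditions, can be sketched separately. The names `BOUNDS`, `STATES`, `TRIGGERS`, `storage_state`, and `first_instruction` are hypothetical, as are the state labels; assigning a boundary value such as 80% to the higher band is an assumption, since the text leaves boundaries open.

```python
from bisect import bisect_right
from typing import Dict, List, Optional

# Band boundaries in percent, taken from the hard disk occupancy example.
BOUNDS: List[float] = [20.0, 80.0]
# State value for each band: [0, 20), [20, 80), [80, 100].
STATES: List[str] = ["idle", "moderate occupancy", "storage shortage"]

# Trigger conditions: state values that trigger a configuration update,
# mapped to the instruction that corresponds to each one.
TRIGGERS: Dict[str, str] = {"storage shortage": "delete specified hard disk data"}


def storage_state(occupancy_pct: float) -> str:
    """Stage 1: find the state value for the range the data falls in."""
    return STATES[bisect_right(BOUNDS, occupancy_pct)]


def first_instruction(occupancy_pct: float) -> Optional[str]:
    """Stage 2: return the instruction if a trigger condition is met, else None."""
    return TRIGGERS.get(storage_state(occupancy_pct))
```

With these definitions, an occupancy rate of 85 maps to the storage shortage state, which is the only state listed as a trigger in this sketch.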
Specifically, as shown in Fig. 4, the analyzer has three main functions: (1) health analysis of cluster nodes; (2) monitoring of storage resources; and (3) monitoring of computing resources. These three functions correspond respectively to the three kinds of cluster data from the collector. Meanwhile, the analyzer reads configuration data (daily task configuration, node health thresholds, and other configuration information) from the memory and, after analysis and processing, outputs decision data and processed cluster data: the decision data is submitted to the decision processor, and the cluster data is submitted to the memory.
In the above optional embodiment, the trigger condition may be that the state value is a first preset value or a second preset value; on this basis, step 203 may be specifically performed in the following manner:
If the state value is the first preset value, output alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restart the cluster job corresponding to the preset operation and maintenance data of the cluster.
For example, if the state value of the storage capacity data of the cluster is storage shortage, alarm information indicating that the storage capacity of the cluster is short is output; if the state value of the job execution data of the cluster is abnormal, the cluster job is restarted.
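The choice between raising an alarm and restarting a job can be sketched as a dispatch on the state value. In the sketch below the two preset values are taken from the example (storage shortage and abnormal); the handler names, the `log` list standing in for a real alarm channel and job scheduler, and the `target` argument are all hypothetical.

```python
from typing import Callable, Dict, List

FIRST_PRESET = "storage shortage"  # state value that outputs alarm information
SECOND_PRESET = "abnormal"         # state value that restarts the cluster job

# Stand-in for a real alarm mechanism and job scheduler (assumption).
log: List[str] = []


def output_alarm(target: str) -> None:
    log.append(f"ALARM: {target}")


def restart_job(target: str) -> None:
    log.append(f"RESTART: {target}")


HANDLERS: Dict[str, Callable[[str], None]] = {
    FIRST_PRESET: output_alarm,
    SECOND_PRESET: restart_job,
}


def execute(state_value: str, target: str) -> bool:
    """Run the handler for a state value; return False if nothing is triggered."""
    handler = HANDLERS.get(state_value)
    if handler is None:
        return False
    handler(target)
    return True
```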
After step 202, and before step 203, an alternative embodiment is as follows:
acquiring execution limit information of the first configuration updating instruction from a preset operation and maintenance database of the cluster; the execution restriction information is used for restricting the execution range of the first configuration updating instruction; on this basis, step 203 may be specifically performed in the following manner: executing the first configuration update instruction within the execution range indicated by the execution restriction information.
For example, the first configuration update instruction is to end process one, process two, and process three. If the execution restriction information specifies that process one is to be retained, then executing the first configuration update instruction ends only process two and process three.
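Applying the execution restriction information can be viewed as filtering the instruction's targets before execution. A minimal sketch, assuming the restriction is expressed as a set of retained process names; the function name `apply_restriction` is hypothetical.

```python
from typing import Iterable, List


def apply_restriction(targets: Iterable[str], retained: Iterable[str]) -> List[str]:
    """Return the targets that may still be ended after removing retained ones."""
    keep = set(retained)
    return [t for t in targets if t not in keep]


# The example from the text: end processes one to three, but retain process one.
to_end = apply_restriction(
    ["process one", "process two", "process three"],
    retained=["process one"],
)
```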
It should be noted that the operation of the decision processor may be as shown in Fig. 5. The decision processor obtains decision data from the analyzer and obtains from the memory the configuration information it needs for decision processing, specifically resource restriction information and authority information, and it processes different decision data differently. For example, it first outputs alarm information and then recovers the cluster job, and it presents the processed workflow tasks on a display board, so that staff can see the current state of automatic operation and maintenance and, for jobs whose recovery failed, investigate the cause, find a corresponding solution, and intervene manually. Meanwhile, the decision information of the decision processor is transmitted to the memory in real time and stored there.
It should be noted that the operation of the memory may be as shown in Fig. 6. The memory comprises two parts, a relational database and an ELK, so that components such as the analyzer can acquire data promptly. The ELK is used to store cluster operation information, the processing strategies of the decision processor, the decision information of the analyzer, task running conditions, and machine node health information; this data can be called by the optimizer and used as analysis data of the optimization center.
In addition to steps 201 to 203, an optional implementation is as follows:
acquiring historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster; according to a preset optimization strategy, performing optimization analysis on historical operation and maintenance data of the cluster, and determining a second configuration updating instruction of the cluster; and executing the second configuration updating instruction so as to update the configuration of the cluster.
For example, in the historical operation and maintenance data, the estimated execution time period of a first cluster job overlaps with that of a second cluster job, and the CPU occupancy rate required to run the first cluster job and the second cluster job is greater than a CPU occupancy rate threshold, such as 40%. If the preset optimization strategy is to stagger the execution of cluster jobs whose CPU occupancy rate is greater than the CPU occupancy rate threshold, the second configuration update instruction is: modify the execution start time of the second cluster job to any time after the first cluster job finishes executing.
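The staggering example can be sketched as an overlap test followed by a rescheduling step. Several points below are assumptions rather than statements from the text: the threshold is applied to each job individually, times are plain numbers of hours, and the second job is moved to start exactly when the first one ends (the text allows any later time). The names `JobPlan` and `stagger` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class JobPlan:
    name: str
    start: float    # estimated start time, in hours
    end: float      # estimated end time, in hours
    cpu_pct: float  # CPU occupancy rate the job requires


def stagger(first: JobPlan, second: JobPlan, cpu_threshold_pct: float = 40.0) -> Optional[JobPlan]:
    """Return a rescheduled copy of `second`, or None if no change is needed."""
    overlaps = first.start < second.end and second.start < first.end
    # Assumption: each job must exceed the threshold on its own.
    both_heavy = first.cpu_pct > cpu_threshold_pct and second.cpu_pct > cpu_threshold_pct
    if not (overlaps and both_heavy):
        return None
    duration = second.end - second.start
    # Assumption: pick the earliest allowed start, the moment the first job ends.
    return replace(second, start=first.end, end=first.end + duration)
```

The returned `JobPlan` plays the role of the second configuration update instruction in this sketch: it carries the new start time for the second cluster job.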
Specifically, as shown in fig. 7, the optimization center obtains storage information from the storage, and obtains an optimized configuration of the cluster through a series of analysis and calculation of resource matching, energy consumption, job resource, time consumption, and the like, so as to provide optimized configuration data of the cluster environment.
As shown in fig. 8, the present application provides a cluster operation and maintenance device, including: the acquisition module 801 is used for acquiring preset operation and maintenance data of the cluster; an analysis module 802, configured to analyze preset operation and maintenance data of the cluster according to a preset analysis policy, and determine a first configuration update instruction of the cluster; an executing module 803, configured to execute the first configuration updating instruction, so as to update the configuration of the cluster.
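As a rough illustration of how the three modules cooperate, the following sketch wires an acquisition step, an analysis step and an execution step into one pass. The class name `ClusterOpsDevice` and the callables passed to it are hypothetical; the application does not prescribe any particular interface.

```python
from typing import Any, Callable, Optional


class ClusterOpsDevice:
    """Acquisition, analysis and execution modules chained into one pass."""

    def __init__(self,
                 collect: Callable[[], Any],
                 analyze: Callable[[Any], Optional[Any]],
                 execute: Callable[[Any], Any]) -> None:
        self.collect = collect    # role of acquisition module 801
        self.analyze = analyze    # role of analysis module 802
        self.execute = execute    # role of executing module 803

    def run_once(self) -> Optional[Any]:
        data = self.collect()
        instruction = self.analyze(data)
        if instruction is None:
            # No trigger condition was met, so the configuration is left unchanged.
            return None
        return self.execute(instruction)
```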
In an optional embodiment, the preset analysis strategy includes: the preset operation and maintenance data of the cluster correspond to state values in different value ranges; the analysis module 802 is specifically configured to: determining a state value corresponding to a value range of preset operation and maintenance data of the cluster according to the preset analysis strategy; and if the state value corresponding to the value range of the preset operation and maintenance data of the cluster meets the trigger condition for triggering the configuration update of the cluster, taking the instruction corresponding to the trigger condition as the first configuration update instruction.
In an optional embodiment, the trigger condition is that the state value is a first preset value or a second preset value; the executing module 803 is specifically configured to: if the state value is the first preset value, outputting alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restarting cluster operation corresponding to preset operation and maintenance data of the cluster.
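The two embodiments above (value range to state value, then state value to instruction) can be illustrated with a small lookup. The concrete ranges, the state values 0, 1 and 2, and the instruction names are assumptions chosen for the example; the application only requires that a first and a second preset value exist.

```python
from typing import List, Optional, Tuple

NORMAL_STATE = 0
ALARM_STATE = 1     # hypothetical "first preset value"
RESTART_STATE = 2   # hypothetical "second preset value"

# Hypothetical analysis strategy: half-open ranges [low, high) mapped to state values.
STRATEGY: List[Tuple[float, float, int]] = [
    (0.0, 0.8, NORMAL_STATE),
    (0.8, 0.95, ALARM_STATE),
    (0.95, float("inf"), RESTART_STATE),
]


def state_for(value: float) -> int:
    """Look up the state value whose range contains the collected data value."""
    for low, high, state in STRATEGY:
        if low <= value < high:
            return state
    raise ValueError(f"value {value} is outside every configured range")


def first_update_instruction(value: float) -> Optional[str]:
    """Return the instruction tied to the trigger condition, if one is met."""
    state = state_for(value)
    if state == ALARM_STATE:
        return "output_alarm"
    if state == RESTART_STATE:
        return "restart_job"
    return None  # state does not meet any trigger condition
```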
In an optional implementation, the acquisition module 801 is further configured to: acquiring execution limit information of the first configuration updating instruction from a preset operation and maintenance database of the cluster; the execution restriction information is used for restricting the execution range of the first configuration updating instruction; the executing module 803 is specifically configured to: executing the first configuration update instruction within the execution range indicated by the execution restriction information.
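One way to picture the execution restriction is as a filter applied to the targets of an instruction before anything runs. In the sketch below, the restriction carries a permission list and a resource cap, loosely mirroring the authority information and resource limitation information mentioned earlier; the keys `permitted_targets` and `max_targets` and the `run` callback are hypothetical.

```python
from typing import Callable, Dict, List


def execute_within_range(instruction: Dict,
                         restriction: Dict,
                         run: Callable[[str, str], str]) -> List[str]:
    """Run the instruction only on targets allowed by the restriction."""
    permitted = set(restriction["permitted_targets"])      # authority information
    targets = [t for t in instruction["targets"] if t in permitted]
    targets = targets[: restriction["max_targets"]]        # resource limitation
    return [run(instruction["action"], t) for t in targets]
```

Targets outside the permitted set, or beyond the cap, are simply skipped; a real system would likely also record them for the operator.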
In an optional embodiment, the preset operation and maintenance data of the cluster includes: operation data of each node in the cluster; execution data of each job in the cluster; storage capacity data of the cluster.
In an optional implementation manner, the operation data of each node in the cluster corresponds to a first sub-analysis strategy in the preset analysis strategies; the execution data of each job in the cluster corresponds to a second sub-analysis strategy in the preset analysis strategies; the storage capacity data of the cluster corresponds to a third sub-analysis strategy in the preset analysis strategies; the first sub-analysis strategy comprises: a mapping relation between the value range of the operation data of each node in the cluster and the operation state of each node in the cluster; the second sub-analysis strategy comprises: a mapping relation between the value range of the execution data of each job in the cluster and the execution state of each job in the cluster; the third sub-analysis strategy comprises: a mapping relation between the value range of the storage capacity of the cluster and the storage state of the cluster.
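The three sub-analysis strategies can be pictured as one range table per kind of data. The sketch below is an assumption-laden illustration: the boundaries, the state labels, and the choice of metric for each kind of data (CPU load for nodes, runtime in seconds for jobs, used fraction for storage) are invented for the example, and `bisect` is simply a convenient way to find the range a value falls into.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# kind of data -> (ascending range boundaries, one state label per range)
SUB_STRATEGIES: Dict[str, Tuple[List[float], List[str]]] = {
    "node": ([0.7, 0.9], ["healthy", "busy", "overloaded"]),       # first sub-strategy
    "job": ([3600.0], ["on_time", "overdue"]),                     # second sub-strategy
    "storage": ([0.8, 0.95], ["sufficient", "low", "critical"]),   # third sub-strategy
}


def classify(kind: str, value: float) -> str:
    """Map a collected value to a state using the sub-strategy for its kind."""
    bounds, labels = SUB_STRATEGIES[kind]
    # bisect_right counts the boundaries that are <= value, which is the range index.
    return labels[bisect_right(bounds, value)]
```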
In an optional implementation, the acquisition module 801 is further configured to: acquiring historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster; the apparatus further comprises an optimization module; the optimization module is configured to: according to a preset optimization strategy, performing optimization analysis on historical operation and maintenance data of the cluster, and determining a second configuration updating instruction of the cluster; and executing the second configuration updating instruction so as to update the configuration of the cluster.
The embodiment of the present application provides a computer device, which includes a program or an instruction, and when the program or the instruction is executed, the program or the instruction is used to execute a cluster operation and maintenance method and any optional method provided in the embodiment of the present application.
The embodiment of the present application provides a storage medium, which includes a program or an instruction, and when the program or the instruction is executed, the program or the instruction is used to execute a cluster operation and maintenance method and any optional method provided in the embodiment of the present application.
Finally, it should be noted that: as will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A cluster operation and maintenance method is characterized by comprising the following steps:
collecting preset operation and maintenance data of a cluster;
analyzing preset operation and maintenance data of the cluster according to a preset analysis strategy, and determining a first configuration updating instruction of the cluster;
and executing the first configuration updating instruction so as to update the configuration of the cluster.
2. The method of claim 1, wherein the preset analysis strategy comprises: state values corresponding to the preset operation and maintenance data of the cluster in different value ranges; and the analyzing preset operation and maintenance data of the cluster according to a preset analysis strategy and determining a first configuration updating instruction of the cluster comprises:
determining a state value corresponding to a value range of preset operation and maintenance data of the cluster according to the preset analysis strategy;
and if the state value corresponding to the value range of the preset operation and maintenance data of the cluster meets the trigger condition for triggering the configuration update of the cluster, taking the instruction corresponding to the trigger condition as the first configuration update instruction.
3. The method of claim 2, wherein the trigger condition is that the state value is a first preset value or a second preset value; the executing the first configuration update instruction comprises:
if the state value is the first preset value, outputting alarm information according to a preset alarm mechanism; or, if the state value is the second preset value, restarting cluster operation corresponding to preset operation and maintenance data of the cluster.
4. The method of any of claims 1-3, wherein after the determining the first configuration update instruction for the cluster and before the executing the first configuration update instruction, further comprising:
acquiring execution limit information of the first configuration updating instruction from a preset operation and maintenance database of the cluster; the execution restriction information is used for restricting the execution range of the first configuration updating instruction;
the executing the first configuration update instruction comprises:
executing the first configuration update instruction within the execution range indicated by the execution restriction information.
5. The method of any of claims 1-3, wherein the preset operation and maintenance data of the cluster comprises: operation data of each node in the cluster; execution data of each job in the cluster; storage capacity data of the cluster.
6. The method of claim 5, wherein the operation data of each node in the cluster corresponds to a first sub-analysis strategy in the preset analysis strategies; the execution data of each job in the cluster corresponds to a second sub-analysis strategy in the preset analysis strategies; the storage capacity data of the cluster corresponds to a third sub-analysis strategy in the preset analysis strategies; the first sub-analysis strategy comprises: a mapping relation between the value range of the operation data of each node in the cluster and the operation state of each node in the cluster; the second sub-analysis strategy comprises: a mapping relation between the value range of the execution data of each job in the cluster and the execution state of each job in the cluster; the third sub-analysis strategy comprises: a mapping relation between the value range of the storage capacity of the cluster and the storage state of the cluster.
7. The method of any of claims 1-3, further comprising:
acquiring historical operation and maintenance data of the cluster from an operation and maintenance database of the cluster;
according to a preset optimization strategy, performing optimization analysis on historical operation and maintenance data of the cluster, and determining a second configuration updating instruction of the cluster;
and executing the second configuration updating instruction so as to update the configuration of the cluster.
8. A cluster operation and maintenance device, comprising:
the acquisition module is used for acquiring preset operation and maintenance data of the cluster;
the analysis module is used for analyzing the preset operation and maintenance data of the cluster according to a preset analysis strategy and determining a first configuration updating instruction of the cluster;
and the execution module is used for executing the first configuration updating instruction so as to update the configuration of the cluster.
9. A computer device comprising a program or instructions that, when executed, perform the method of any of claims 1 to 7.
10. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 7.
CN202010103813.9A 2020-02-20 2020-02-20 Cluster operation and maintenance method and device Active CN113296840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010103813.9A CN113296840B (en) 2020-02-20 2020-02-20 Cluster operation and maintenance method and device

Publications (2)

Publication Number Publication Date
CN113296840A (en) 2021-08-24
CN113296840B (en) 2023-04-14

Family

ID=77317473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010103813.9A Active CN113296840B (en) 2020-02-20 2020-02-20 Cluster operation and maintenance method and device

Country Status (1)

Country Link
CN (1) CN113296840B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101820384A (en) * 2010-02-05 2010-09-01 浪潮(北京)电子信息产业有限公司 Method and device for dynamically distributing cluster services
CN102394901A (en) * 2011-06-23 2012-03-28 北京新媒传信科技有限公司 Server cluster system and updating method of monitoring policies in same
US20120317274A1 (en) * 2011-06-13 2012-12-13 Richter Owen E Distributed metering and monitoring system
CN106100894A (en) * 2016-07-11 2016-11-09 华南理工大学 A kind of highly reliable cluster operation management method
CN106227469A (en) * 2016-07-28 2016-12-14 乐视控股(北京)有限公司 Data-erasure method and system for distributed storage cluster
CN106656533A (en) * 2015-10-29 2017-05-10 大唐移动通信设备有限公司 Method and device for monitoring load processing of cluster system
CN107734035A (en) * 2017-10-17 2018-02-23 华南理工大学 A kind of Virtual Cluster automatic telescopic method under cloud computing environment
CN107896175A (en) * 2017-11-30 2018-04-10 北京小度信息科技有限公司 Collecting method and device
CN108197251A (en) * 2017-12-29 2018-06-22 百度在线网络技术(北京)有限公司 A kind of big data operation and maintenance analysis method, device and server
CN108566287A (en) * 2018-01-08 2018-09-21 福建星瑞格软件有限公司 A kind of cluster server O&M optimization method based on deep learning
CN109189575A (en) * 2018-08-20 2019-01-11 北京奇虎科技有限公司 A kind of Explore of Unified Management Ideas and device of more OpenStack clusters
CN109413125A (en) * 2017-08-18 2019-03-01 北京京东尚科信息技术有限公司 The method and apparatus of dynamic regulation distributed system resource
CN109614236A (en) * 2018-12-07 2019-04-12 深圳前海微众银行股份有限公司 Cluster resource dynamic adjusting method, device, equipment and readable storage medium storing program for executing
CN109960690A (en) * 2019-03-18 2019-07-02 新华三大数据技术有限公司 A kind of operation and maintenance method and device of big data cluster
CN110543410A (en) * 2019-09-05 2019-12-06 曙光信息产业(北京)有限公司 Method for processing cluster index, method and device for inquiring cluster index

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖海琴: "Zabbix性能监控软件在高性能集群上的应用", 《中国管理信息化》 *

Also Published As

Publication number Publication date
CN113296840B (en) 2023-04-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant