CN114020509A - Method, device and equipment for repairing work load cluster and readable storage medium - Google Patents

Info

Publication number
CN114020509A
Authority
CN
China
Prior art keywords
cluster
abnormal
workload
backup data
workload cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111274663.9A
Other languages
Chinese (zh)
Inventor
周国伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Jinan data Technology Co ltd
Original Assignee
Inspur Jinan data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Jinan data Technology Co ltd
Priority to CN202111274663.9A
Publication of CN114020509A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793: Remedial or corrective actions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14: Error detection or correction of the data by redundancy in operation
    • G06F11/1402: Saving, restoring, recovering or retrying
    • G06F11/1446: Point-in-time backing up or restoration of persistent data
    • G06F11/1448: Management of the data involved in backup or backup restore

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of computers and discloses a method, a device, and equipment for repairing a workload cluster, and a readable storage medium. The method comprises the following steps: obtaining backup data corresponding to the workload cluster, wherein the backup data is data from the normal operation of the workload cluster; detecting whether the running state of the workload cluster is normal; when the running state of the workload cluster is abnormal, determining the abnormal information of the workload cluster; and repairing the abnormality indicated by the abnormal information based on the backup data. By implementing the invention, cluster abnormalities are repaired automatically without manual intervention by operation and maintenance personnel, the operation and maintenance time spent on cluster abnormalities is reduced, and the efficiency of repairing cluster abnormalities is improved.

Description

Method, device and equipment for repairing work load cluster and readable storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a method, a device and equipment for repairing a workload cluster and a readable storage medium.
Background
During operation, a cluster frequently encounters various abnormal conditions, such as network disconnection, insufficient storage, and memory overflow, which make the cluster state abnormal (the normal cluster state is READY, and any state other than READY is abnormal). Once the cluster is abnormal, the container platform cannot use the cluster's resources, cannot perform any operation on the cluster, cannot use cluster functions such as plug-in management and application management, and cannot use the installed plug-ins and applications. The container platform cluster refers to a Kubernetes (k8s) cluster, i.e., a cluster created through k8s, but the resources used by the container platform are virtualized, i.e., managed through OpenStack. The container platform creates a k8s cluster from OpenStack resources: OpenStack is responsible for managing virtual machines, k8s is responsible for managing containers, and the containers run in the virtual machines, as shown in the resource relationship diagram of a k8s cluster and OpenStack in fig. 1. A container may run on a virtual machine or a physical machine, but in the field of cloud computing, since physical machines cannot satisfy scenarios such as high availability and load balancing and are extremely resource-consuming, containers are usually run on virtual machines. Running OpenStack and k8s together, on the one hand, guarantees the operation of the virtual machines, handles computing, storage, and networking, and provides secure isolation; on the other hand, it guarantees the operating environment of the containers, handles cluster resource scheduling and resource orchestration, and supports the running of applications.
Usually, the container platform is repairable when the cluster becomes abnormal, for example when OpenStack is deleted. Although k8s has strong repair capability, that capability only covers the containers inside the k8s cluster; if the cluster itself becomes abnormal, it must be checked and repaired manually. Manual troubleshooting is inefficient and lengthens cluster operation and maintenance time, resulting in low cluster repair efficiency.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, a device and a readable storage medium for repairing a workload cluster, so as to solve the problem of low efficiency of repairing a cluster due to low efficiency of manual troubleshooting and long operation and maintenance time of the cluster.
According to a first aspect, an embodiment of the present invention provides a method for repairing a workload cluster, including: obtaining backup data corresponding to a workload cluster, wherein the backup data is data of normal operation of the workload cluster; detecting whether the running state of the workload cluster is normal or not; when the running state of the working load cluster is abnormal, determining the abnormal information of the working load cluster; and performing exception recovery on the exception information of the workload cluster based on the backup data.
In the method for repairing a workload cluster provided by the embodiment of the invention, backup data corresponding to the workload cluster is obtained, the backup data being data backed up while the workload cluster is in a normal operation state; when an abnormal operation state of the workload cluster is detected, abnormal information of the workload cluster is obtained, and the abnormality is repaired according to the data backed up in the normal operation state. With this method, when the workload cluster becomes abnormal, the cluster is processed automatically according to the backup data, so that cluster abnormalities are repaired automatically, manual intervention by operation and maintenance personnel is reduced, the recoverability of the cluster is improved, the operation and maintenance time spent on abnormalities is reduced, the cluster can be restored to a normal operation state in time, the efficiency of abnormality repair is improved, the continuous availability of the cluster is ensured, and the reliability of the cluster is improved.
With reference to the first aspect, in a first implementation manner of the first aspect, the performing exception repair on the exception information of the workload cluster based on the backup data includes: acquiring an abnormal reason corresponding to the abnormal information; and performing exception repair on the workload cluster based on the exception reason and the backup data.
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, when the anomaly cause is a network anomaly, the performing anomaly repair on the workload cluster based on the anomaly cause and the backup data includes: acquiring the number of network repairs of the workload cluster; judging whether the number of network repairs exceeds a preset number; when the number of network repairs exceeds the preset number and the network is still not repaired, regenerating the workload cluster according to the backup data; and configuring the regenerated workload cluster based on the configuration files in the backup data.
With reference to the first implementation manner of the first aspect, in a third implementation manner of the first aspect, when the anomaly cause is a service anomaly, the performing anomaly repair on the workload cluster based on the anomaly cause and the backup data includes: acquiring abnormal service data corresponding to the workload cluster; determining backup service data corresponding to the abnormal service data from the backup data; and replacing and repairing the abnormal service data by the backup service data.
According to the method for repairing the workload cluster, the abnormal reason corresponding to the abnormal information is obtained, and different abnormal repairing modes are determined according to different abnormal reasons, so that the abnormal information of the cluster can be repaired in a targeted manner, the abnormal repairing efficiency of the cluster is further improved, and the working stability of the cluster is improved.
With reference to the first aspect, in a fourth implementation manner of the first aspect, before performing exception repair on the exception information of the workload cluster based on the backup data, the method further includes: acquiring the duration of the working load cluster in an abnormal operation state; and when the duration reaches a preset duration, performing exception recovery on the exception information of the workload cluster based on the backup data.
With reference to the fourth embodiment of the first aspect, in a fifth embodiment of the first aspect, the method further comprises: obtaining an abnormal level corresponding to the workload cluster, wherein the abnormal level is used for representing the abnormal degree of the workload cluster; and performing exception repair on the exception information of the workload cluster based on the exception level and the backup data.
According to the method for repairing the workload cluster, provided by the embodiment of the invention, before abnormal information of the workload cluster is repaired, the duration of the abnormal operation state of the workload cluster is obtained, and when the duration reaches the preset duration, the abnormal information of the workload cluster is repaired based on the backup data, so that the self-healing capability of the cluster can be fully exerted. By acquiring the abnormal level corresponding to the workload cluster and performing abnormal restoration on the abnormal information of the workload cluster based on the abnormal level, the cluster can be restored according to the abnormal degree, and the recoverability and the troubleshooting efficiency of the cluster are improved.
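As an illustrative, non-limiting sketch, the preset-duration gate and the level-based choice of repair described above might look as follows in Python. The level names, the `ClusterAnomaly` type, and the mapping from level to action are assumptions made for illustration only; the embodiment states only that an abnormal level represents the degree of abnormality.

```python
from dataclasses import dataclass
from enum import IntEnum


class AnomalyLevel(IntEnum):
    # Hypothetical levels; the source does not enumerate them.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class ClusterAnomaly:
    abnormal_since: float  # timestamp (seconds) when the cluster left READY
    level: AnomalyLevel


def should_repair(anomaly: ClusterAnomaly, now: float, preset_seconds: float) -> bool:
    """Repair only once the cluster has stayed abnormal for the preset duration,
    leaving the cluster time to heal itself first."""
    return (now - anomaly.abnormal_since) >= preset_seconds


def choose_repair_action(level: AnomalyLevel) -> str:
    """Map an abnormal level to a repair action (illustrative mapping only)."""
    if level is AnomalyLevel.CRITICAL:
        return "regenerate_cluster_from_backup"
    if level is AnomalyLevel.MAJOR:
        return "replace_abnormal_resources_from_backup"
    return "restart_affected_service"
```

Waiting for the preset duration before repairing keeps short-lived faults from triggering a restore that the cluster would not have needed.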
With reference to the first aspect, in a sixth implementation manner of the first aspect, the detecting whether an operation state of the workload cluster is normal includes: detecting whether the workload cluster and the management cluster can communicate with each other over the network; when the workload cluster can communicate with the management cluster, detecting whether each service state of the workload cluster is normal; and when the service states are normal, detecting whether the network states among the services are normal.
The method for repairing a workload cluster provided in the embodiments of the present invention detects whether the workload cluster and the management cluster can communicate with each other over the network, detects whether each service state of the workload cluster is normal when they can, and continues to detect whether the network state between the services is normal when each service state is normal. The working state of the workload cluster can thus be detected comprehensively, so that when an abnormality is detected, the corresponding exception handling is performed according to the detection result, the cluster is restored to a normal operation state in time, and the cluster remains continuously available.
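The three-stage detection order described above can be sketched, purely by way of example, as a function that stops at the first failing stage. The three probe callables and the returned description strings are hypothetical placeholders; how each probe is actually performed is not specified here.

```python
from typing import Callable, Dict, Optional


def detect_cluster_state(
    mgmt_can_reach_workload: Callable[[], bool],
    service_states: Callable[[], Dict[str, bool]],
    inter_service_network_ok: Callable[[], bool],
) -> Optional[str]:
    """Return None when every stage passes, else a short anomaly description."""
    # Stage 1: network intercommunication between management and workload cluster.
    if not mgmt_can_reach_workload():
        return "network: management cluster cannot reach workload cluster"
    # Stage 2: state of each service, checked only when stage 1 passed.
    failed = sorted(name for name, ok in service_states().items() if not ok)
    if failed:
        return "service: " + ", ".join(failed)
    # Stage 3: network state between services, checked only when stage 2 passed.
    if not inter_service_network_ok():
        return "network: services cannot reach each other"
    return None
```

Ordering the probes this way means a later stage is never evaluated when an earlier one has already failed, which matches the conditional wording of the implementation manner.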
According to a second aspect, an embodiment of the present invention provides a repair apparatus for a workload cluster, including: the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring backup data corresponding to a workload cluster, and the backup data is data of normal operation of the workload cluster; the detection module is used for detecting whether the running state of the working load cluster is normal or not; the determining module is used for determining the abnormal information of the working load cluster when the running state of the working load cluster is abnormal; and the repairing module is used for performing exception repairing on the exception information of the working load cluster based on the backup data.
According to the restoration device for the working load cluster, provided by the embodiment of the invention, when the working load cluster is abnormal, the cluster is automatically processed according to the backup data, so that the abnormal automatic restoration of the cluster is realized, the manual intervention of operation and maintenance personnel on the cluster is reduced, the restoration performance of the cluster is improved, the abnormal operation and maintenance time of the cluster is reduced, the cluster can be timely restored to a normal operation state, the abnormal restoration efficiency of the cluster is improved, the continuous availability of the cluster is ensured, and the cluster reliability is improved.
According to a third aspect, an embodiment of the present invention provides an electronic device, including: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing therein computer instructions, and the processor executing the computer instructions to perform the method for repairing a workload cluster according to the first aspect or any embodiment of the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause a computer to execute the method for repairing a workload cluster according to the first aspect or any implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 shows a resource relationship diagram of a k8s cluster and OpenStack in an embodiment of the present invention;
FIG. 2 is a flow diagram of a method of repairing a workload cluster according to an embodiment of the invention;
FIG. 3 is another flow diagram of a method of repairing a workload cluster according to an embodiment of the invention;
FIG. 4 is another flow diagram of a method of repairing a workload cluster according to an embodiment of the invention;
FIG. 5 is a block diagram of a repair device for a workload cluster according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Usually, the container platform is repairable when the cluster becomes abnormal, for example when OpenStack is deleted. Although k8s has strong repair capability, that capability only covers the containers inside the k8s cluster; if the cluster itself becomes abnormal, it must be checked and repaired manually. Manual troubleshooting is inefficient and lengthens cluster operation and maintenance time, resulting in low cluster repair efficiency.
Based on the above, according to the technical scheme of the invention, the data of the working load cluster in the normal operation state is backed up, and the cluster is automatically subjected to exception handling according to the backed-up data when the working load cluster is abnormal, so that the automatic recovery of the cluster from exception is realized, the manual intervention of operation and maintenance personnel on the cluster is reduced, the abnormal operation and maintenance time of the cluster is reduced, and the abnormal recovery efficiency of the cluster is improved.
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for repairing a workload cluster, it is noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer executable instructions and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than that herein.
This embodiment provides a method for repairing a workload cluster, which may be used in an electronic device such as a tablet computer or a server. Fig. 2 is a flowchart of the method for repairing a workload cluster according to an embodiment of the present invention. As shown in fig. 2, the flow includes the following steps:
S11, obtaining backup data corresponding to the workload cluster, wherein the backup data is data from the normal operation of the workload cluster.
The backup data is key data captured while the workload cluster is in a normal operation state. Specifically, the backup data may include the basic configuration of the workload cluster, the configuration information set (ConfigMap) of the workload cluster, the configuration file corresponding to each resource, and the like; the configuration files corresponding to the resources include those for Deployment, DaemonSet, Service, and the like.
Specifically, for each workload cluster, the electronic device may start a timed backup to backup the operating data of the workload cluster, so as to ensure the accuracy of the backup data. Considering that the workload cluster is generally configured with multiple cores, a large memory and a large amount of storage, and using full backup consumes resources, when data backup is performed, only the key data of the workload cluster can be backed up without using full backup. Specifically, the electronic device may collect all the configuration files related to the cluster operation when the workload cluster is in a normal operation state, and store the configuration files in the storage of the container platform.
Alternatively, to prevent the backup data from occupying too much space, only one or more backups closest to the current time may be generally retained, and the older backup data may be deleted after each backup.
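By way of a minimal sketch only, the timed backup of key data with pruning of older backups could be written as below. The JSON file layout, the `back_up_cluster` name, and the `keep` parameter are assumptions for illustration; the embodiment states only that key configuration data is stored in the container platform's storage and that older backups may be deleted after each backup.

```python
import json
import time
from pathlib import Path
from typing import Dict, Optional


def back_up_cluster(backup_dir: Path, cluster_name: str, key_data: Dict[str, object],
                    keep: int = 3, now: Optional[float] = None) -> Path:
    """Write one timestamped backup of the key data, then delete all but the
    `keep` most recent backups of this cluster."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = int((time.time() if now is None else now) * 1000)
    # Zero-padded millisecond timestamps make lexical order chronological.
    path = backup_dir / f"{cluster_name}-{stamp:016d}.json"
    path.write_text(json.dumps(key_data, sort_keys=True))
    backups = sorted(backup_dir.glob(f"{cluster_name}-*.json"))
    for old in backups[:-keep]:
        old.unlink()
    return path
```

A scheduler (for example a periodic job) would call `back_up_cluster` only while the cluster is in the READY state, so that every retained backup reflects normal operation.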
S12, detecting whether the operation state of the workload cluster is normal.
An abnormality makes the workload cluster unavailable. The normal state of the workload cluster is READY; in any state other than READY, the workload cluster can be judged to be abnormal. The attributes of the workload cluster are defined as a cluster custom resource (CRD). When the workload cluster operates normally, a tuning (reconcile) method is started that continuously monitors the resources of the workload cluster; as soon as a workload resource changes, a notification is issued and the tuning method is started to handle the change. Specifically, the tuning method may start a Controller, and the Controller monitors resource changes, such as the creation, deletion, and update of resources, through the API-Server.
Once the cluster becomes abnormal, it cannot be used, so resource monitoring also fails and the upper layer can no longer monitor the resources. To guarantee normal use of the workload cluster's functions, the workload cluster must be kept in the READY state. Therefore, the electronic device can detect the operation state of the workload cluster in real time to determine whether the operation state is normal. When the operation state of the workload cluster is abnormal, step S13 is executed; otherwise, another operation is executed. The other operation may be to continue detecting the operation state of the workload cluster or to back up the normal operation data of the workload cluster, and is not specifically limited here.
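The branching in step S12 can be illustrated with the following small sketch. The function names and the returned step labels are hypothetical; only the rule that READY is normal and every other state is abnormal comes from the description.

```python
READY = "READY"


def is_normal(state: str) -> bool:
    """The cluster is normal only in the READY state; any other value is abnormal."""
    return state == READY


def next_step(state: str) -> str:
    """Decide what follows S12: keep monitoring (or back up) when normal,
    otherwise proceed to S13 to determine the abnormal information."""
    return "continue_monitoring_or_backup" if is_normal(state) else "S13_determine_abnormal_information"
```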
S13, determining the abnormal information of the workload cluster.
When the operating state of the workload cluster is abnormal, the workload cluster is unavailable. To repair the workload cluster in a targeted manner so that it can be restored to a normal operating state in time, the electronic device may analyze the operating state of the workload cluster to determine its abnormal information. The abnormal information may include internal service exceptions (e.g., exceptions of services such as APIServer, kubelet, and kube-proxy), network exceptions (e.g., network loss, IP change), and so on.
S14, performing exception repair on the exception information of the workload cluster based on the backup data.
According to the abnormal information of the workload cluster, the electronic device selects the backup information corresponding to the abnormal information from the latest backup data. Specifically, based on the time point at which the abnormal information occurred, a time point at which the workload cluster was not yet in the abnormal state can be found in the backup data; the backup data corresponding to that time point is the backup information corresponding to the abnormal information. The abnormal data is then replaced with this backup information, that is, the workload cluster is restored from that time point, so that the workload cluster can be repaired successfully.
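One possible way to locate the backup taken at a time point before the abnormality, offered as a sketch under assumptions, is a binary search over backup timestamps. Representing a backup as a `(timestamp, data)` pair and the `backup_before` name are illustrative choices, not part of the embodiment.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Backup = Tuple[float, dict]  # (backup timestamp, backed-up data)


def backup_before(backups: List[Backup], anomaly_time: float) -> Optional[Backup]:
    """Return the newest backup taken strictly before anomaly_time, or None
    when no backup predates the abnormality."""
    ordered = sorted(backups, key=lambda b: b[0])
    times = [t for t, _ in ordered]
    # bisect_left gives the count of backups strictly older than anomaly_time.
    idx = bisect_left(times, anomaly_time)
    return ordered[idx - 1] if idx > 0 else None
```

Choosing a backup strictly older than the abnormality avoids restoring data that was captured at the moment the cluster was already failing.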
According to the method for repairing the workload cluster, the data of the workload cluster in the normal operation state is backed up, the backup data corresponding to the workload cluster is obtained, and when the workload cluster is abnormal, the cluster is automatically subjected to abnormal processing according to the backup data, so that the abnormal automatic repair of the cluster is realized, the manual intervention of operation and maintenance personnel on the cluster is reduced, the cluster recovery is improved, the abnormal operation and maintenance time of the cluster is reduced, the cluster can be timely restored to the normal operation state, the abnormal repair efficiency of the cluster is improved, the continuous availability of the cluster is ensured, and the cluster reliability is improved.
This embodiment provides a method for repairing a workload cluster, which may be used in an electronic device such as a tablet computer or a server. Fig. 3 is a flowchart of the method for repairing a workload cluster according to an embodiment of the present invention. As shown in fig. 3, the flow includes the following steps:
S21, obtaining backup data corresponding to the workload cluster, wherein the backup data is data from the normal operation of the workload cluster. For a detailed description, refer to the corresponding description of the above embodiment, which is not repeated here.
S22, detecting whether the operation state of the workload cluster is normal. For a detailed description, refer to the corresponding description of the above embodiment, which is not repeated here.
S23, when the operation state of the workload cluster is abnormal, determining the abnormal information of the workload cluster. For a detailed description, refer to the corresponding description of the above embodiment, which is not repeated here.
S24, performing exception repair on the exception information of the workload cluster based on the backup data.
Specifically, the step S24 may include:
S241, acquiring an abnormal reason corresponding to the abnormal information.
The abnormal reason is the reason causing the abnormal operation of the workload cluster. The electronic equipment can monitor the running state of the working load cluster in real time so as to capture abnormal information in time when the working load cluster is abnormal, and determine the corresponding abnormal reason by analyzing the abnormal information.
S242, performing exception repair on the workload cluster based on the abnormal reason and the backup data.
According to the abnormal reason, the backup information corresponding to that reason is determined from the backup data, and the workload cluster is repaired with this backup information by replacing the abnormal resources corresponding to the abnormal reason.
Specifically, when the abnormal reason is a network abnormality, the step S242 may include:
(1) Acquiring the number of network repairs of the workload cluster.
The number of network repairs is the number of times the workload cluster attempts to connect to the cluster network. When the operation state of the workload cluster is changed from READY state to non-READY state due to a network anomaly, the electronic device may attempt to automatically connect the cluster network again to attempt to recover the cluster, and may record the number of times the workload cluster attempts to connect the cluster network.
(2) Judging whether the number of network repairs exceeds the preset number.
The preset number is the preset maximum number of times the workload cluster may attempt to connect to the cluster network without the network abnormality being repaired. The electronic device compares the number of network repairs with the preset number to determine whether the preset number has been exceeded. When the number of network repairs exceeds the preset number, step (3) is executed; otherwise, the network has been connected and the network abnormality has been repaired.
(3) Regenerating the workload cluster according to the backup data.
When the number of network repairs exceeds the preset number and the network is still not repaired, automatic repair of the workload cluster is triggered, and the workload cluster is regenerated according to the most recent backup data.
(4) Configuring the regenerated workload cluster based on the configuration files in the backup data.
The workload cluster that the electronic device regenerates from the backup data does not yet contain the configuration files required for normal operation. The electronic device may therefore copy each configuration file in the backup data to the regenerated workload cluster to reconfigure it and restore it to a normal operation state.
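Sub-steps (1) to (4) above can be sketched, as one non-limiting example, in the following way. The callables `try_reconnect` and `regenerate_cluster`, the backup keys `basic_config` and `config_files`, and the modelling of a cluster as a dictionary of configuration files are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


def repair_network_anomaly(
    try_reconnect: Callable[[], bool],
    regenerate_cluster: Callable[[Dict[str, str]], Dict[str, str]],
    backup: Dict[str, Dict[str, str]],
    max_attempts: int,
) -> Tuple[str, Optional[Dict[str, str]]]:
    """Retry the network connection up to max_attempts times; when every
    attempt fails, regenerate the cluster from backup and reconfigure it."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1  # (1) count each reconnection attempt
        if try_reconnect():
            return "reconnected", None  # network abnormality repaired
    # (2)/(3) attempts exhausted without repair: regenerate from the backup.
    cluster_files = regenerate_cluster(backup["basic_config"])
    # (4) copy the backed-up configuration files into the regenerated cluster.
    cluster_files.update(backup["config_files"])
    return "regenerated", cluster_files
```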
Specifically, when the anomaly reason is service anomaly, the step S242 may include:
(1) Acquiring abnormal service data corresponding to the workload cluster.
The abnormal service data is service data with an abnormality in the workload cluster. If the operation state of the workload cluster is changed from READY state to non-READY state due to some abnormal services, the electronic device may obtain the abnormal service of the workload cluster at this time to determine abnormal service data.
(2) Determining backup service data corresponding to the abnormal service data from the backup data.
The backup service data is service data that was backed up before the workload cluster became abnormal, namely normal service data. By analyzing the abnormal service data, the electronic device can determine the time point of the abnormal service data, and according to that time point it can determine the backup service data corresponding to the abnormal service data from the backup data.
(3) Replacing and repairing the abnormal service data with the backup service data.
The backup service data is copied over the abnormal service data at the corresponding time point, so that the backup service data replaces the abnormal service data; after the replacement, the service is restarted to complete the repair of the abnormal service.
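The replace-and-restart repair for service abnormalities might be sketched as follows. Service data is modelled here as a name-to-content mapping, and the `restart` callable is a hypothetical hook; both are assumptions for illustration rather than a description of any particular implementation.

```python
from typing import Callable, Dict, List


def repair_services(
    live: Dict[str, str],
    backup: Dict[str, str],
    abnormal: List[str],
    restart: Callable[[str], None],
) -> List[str]:
    """Replace each abnormal service's data with its backup copy, restart the
    service, and return the names of the services that were repaired."""
    repaired = []
    for name in abnormal:
        if name not in backup:
            continue  # no backup copy available; left for other handling
        live[name] = backup[name]  # replace abnormal data with backup data
        restart(name)              # restart the service to complete the repair
        repaired.append(name)
    return repaired
```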
In the method for repairing a workload cluster provided by this embodiment, the anomaly cause corresponding to the anomaly information is obtained and a repair mode is chosen according to that cause, so that cluster anomalies are repaired in a targeted manner. This further improves the efficiency of anomaly repair and the operating stability of the cluster.
In this embodiment, a method for repairing a workload cluster is provided, which may be used for an electronic device such as a tablet computer or a server. Fig. 4 is a flowchart of a method for repairing a workload cluster according to an embodiment of the present invention. As shown in Fig. 4, the flow includes the following steps:
S31, obtaining backup data corresponding to the workload cluster, wherein the backup data is data of the workload cluster in normal operation. For a detailed description, refer to the corresponding description of the above embodiments, which is not repeated herein.
S32, detecting whether the operating state of the workload cluster is normal.
Specifically, step S32 may include:
S321, detecting whether the workload cluster and the management cluster have network connectivity with each other.
After the workload cluster starts running, the electronic device can check in real time whether the cluster itself and its operating environment are normal. First, the electronic device may test the network connectivity of the workload cluster to determine whether the management cluster and the workload cluster can reach each other. If they can, step S322 is executed next; otherwise, the workload cluster has a network anomaly, and network repair is needed to ensure its normal operation.
S322, detecting whether each service state of the workload cluster is normal.
When the workload cluster and the management cluster have network connectivity, the electronic device needs to check whether the internal services of the workload cluster are normal, because normal operation of the workload cluster depends on them; examples include Deployment, DaemonSet, and ConfigMap. If every service state is normal, step S323 is executed; otherwise, the workload cluster has a service anomaly, and the abnormal service must be repaired to ensure normal operation of the workload cluster.
S323, detecting whether the network state among the services is normal.
When every service state is normal, the electronic device goes on to check whether the network state among the services in the workload cluster is normal. The network state can be monitored with the detection mechanisms built into Kubernetes (k8s). If the network state among the services is normal, the workload cluster is operating normally; otherwise, the workload cluster has a network anomaly, and network repair is needed to ensure its normal operation.
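The ordering of steps S321 to S323 can be illustrated with a short Python sketch. The three probe callables are hypothetical placeholders for whatever connectivity and service checks a real deployment would use; `Health` and `detect_state` are likewise names invented for this example.

```python
from enum import Enum
from typing import Callable


class Health(Enum):
    NORMAL = "normal"
    NETWORK_ANOMALY = "network anomaly"
    SERVICE_ANOMALY = "service anomaly"


def detect_state(
    mgmt_reachable: Callable[[], bool],
    services_ok: Callable[[], bool],
    inter_service_network_ok: Callable[[], bool],
) -> Health:
    """Run the S321 -> S322 -> S323 checks in order, stopping at the first failure."""
    if not mgmt_reachable():            # S321: management <-> workload connectivity
        return Health.NETWORK_ANOMALY
    if not services_ok():               # S322: state of each internal service
        return Health.SERVICE_ANOMALY
    if not inter_service_network_ok():  # S323: network state among the services
        return Health.NETWORK_ANOMALY
    return Health.NORMAL
```

Because the checks short-circuit, a later probe is never run when an earlier one has already failed, which mirrors the "continue to the next step only if normal" wording of the steps.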
S33, when the operating state of the workload cluster is abnormal, determining the anomaly information of the workload cluster. For a detailed description, refer to the corresponding description of the above embodiments, which is not repeated herein.
S34, obtaining the duration for which the workload cluster has been in the abnormal operating state.
Because a workload cluster has strong self-healing capability, it can recover from ordinary abnormal states on its own. For example, the cluster can itself repair some network anomalies and restart resources, and a failed Pod of the cluster is restarted repeatedly. Therefore, automatic repair need not be triggered as soon as the workload cluster becomes abnormal; repair can instead be performed once the anomaly has lasted for a certain time, which prevents automatic repair from damaging existing data.
The duration is the time spent waiting before repair while the workload cluster is abnormal. The electronic device may record how long the workload cluster has been abnormal and determine whether that duration has reached a preset duration. When it has, step S35 is executed; otherwise, the electronic device continues timing the anomaly.
S35, when the duration reaches the preset duration, performing anomaly repair on the anomaly information of the workload cluster based on the backup data.
The preset duration is the maximum waiting time before anomaly repair. If the workload cluster is small, the preset duration may be set to 1 hour, for example; it is not limited here and can be set by a person skilled in the art according to actual requirements. Anomaly repair of the workload cluster is triggered once the cluster has remained in the abnormal state for the preset duration. If the state of the workload cluster changes in the meantime, timing starts over, and if the operating state of the workload cluster returns to normal, the timer is cleared.
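The timing rule of steps S34 and S35 (trigger after the preset duration, restart timing on a state change, clear the timer on recovery) can be sketched as a small state holder. `AnomalyTimer` and `observe` are hypothetical names, states are plain strings, and timestamps are passed in explicitly so the logic can be followed without a real clock.

```python
from typing import Optional


class AnomalyTimer:
    """Tracks how long the cluster has stayed in one abnormal state."""

    def __init__(self, preset_seconds: float) -> None:
        self.preset_seconds = preset_seconds
        self._state: Optional[str] = None    # current abnormal state, None if normal
        self._since: Optional[float] = None  # when that state was first observed

    def observe(self, state: Optional[str], now: float) -> bool:
        """Record an observation; return True when repair should be triggered."""
        if state is None:            # back to normal: clear the timer
            self._state = self._since = None
            return False
        if state != self._state:     # state changed midway: restart timing
            self._state, self._since = state, now
        return now - self._since >= self.preset_seconds
```

For example, with a preset duration of 3600 seconds, a cluster that is in a network anomaly for 30 minutes and then switches to a service anomaly is only repaired 3600 seconds after the switch, not 3600 seconds after the first anomaly.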
Optionally, the method may further include:
(1) While acquiring the duration for which the workload cluster has been in the abnormal operating state, acquiring the anomaly level corresponding to the workload cluster, wherein the anomaly level represents the degree of anomaly of the workload cluster.
The anomaly level characterizes the degree of anomaly of the workload cluster and may be severe, moderate, or mild. To repair the workload cluster more accurately and return it to a normal operating state in time, the electronic device may obtain the anomaly level corresponding to the workload cluster at the same time as it obtains the duration of the abnormal operating state.
(2) Performing anomaly repair on the anomaly information of the workload cluster based on the anomaly level and the backup data.
After the duration of the abnormal operating state reaches the preset duration, the electronic device may repair the workload cluster according to its current anomaly level. Specifically, for a mild-level repair, the electronic device may determine the configuration file of the abnormal resource from the backup data and replace only that configuration file, such as a ConfigMap. For a moderate-level repair, the electronic device may determine, from the backup data, the backup data corresponding to the abnormal resource and use it to replace the abnormal resource data, thereby repairing resources of the workload cluster such as a Deployment, DaemonSet, or Service. For a severe-level repair, the electronic device may rebuild a workload cluster from the backup data and replace the original workload cluster with the rebuilt one, thereby repairing the entire workload cluster.
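The three repair levels can be illustrated as a dispatch over a resource map. This is a sketch under assumptions: resources are keyed as `Kind/name` strings, only `ConfigMap` is treated as a configuration file, and `Level`, `CONFIG_KINDS`, and `repair_by_level` are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Dict, Set

CONFIG_KINDS = {"ConfigMap"}  # kinds treated as configuration in this sketch


class Level(Enum):
    MILD = 1
    MODERATE = 2
    SEVERE = 3


def repair_by_level(
    level: Level,
    live: Dict[str, str],
    backup: Dict[str, str],
    abnormal: Set[str],
) -> Dict[str, str]:
    """Return the repaired resource map; keys look like 'Kind/name'."""
    if level is Level.SEVERE:
        return dict(backup)  # rebuild the whole cluster from the backup
    repaired = dict(live)
    for key in abnormal:
        is_config = key.split("/", 1)[0] in CONFIG_KINDS
        # Mild: replace configuration only. Moderate: replace every abnormal resource.
        if level is Level.MODERATE or is_config:
            repaired[key] = backup[key]
    return repaired
```

The function returns a new map rather than mutating `live`, so a caller can compare the proposed repair with the current state before applying it.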
In the method for repairing a workload cluster provided by this embodiment, the duration for which the workload cluster has been in the abnormal operating state is obtained before the anomaly information is repaired, and repair based on the backup data is performed only when that duration reaches the preset duration, so that the self-healing capability of the cluster is used to the full. By obtaining the anomaly level corresponding to the workload cluster and repairing the anomaly information according to that level, the cluster is repaired in proportion to the degree of anomaly, which improves the recoverability of the cluster and the efficiency of troubleshooting. Because the working state of the workload cluster is checked comprehensively, the corresponding anomaly handling can be carried out as soon as an anomaly is detected, so that the cluster returns to a normal operating state in time and remains continuously available.
In this embodiment, an apparatus for repairing a workload cluster is further provided. The apparatus is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This embodiment provides a repair apparatus for a workload cluster, as shown in fig. 5, including:
The obtaining module 41 is configured to obtain backup data corresponding to the workload cluster, where the backup data is data of the workload cluster in normal operation. For a detailed description, refer to the corresponding description of the above method embodiments, which is not repeated herein.
The detection module 42 is configured to detect whether the operating state of the workload cluster is normal. For a detailed description, refer to the corresponding description of the above method embodiments, which is not repeated herein.
The determining module 43 is configured to determine the anomaly information of the workload cluster when the operating state of the workload cluster is abnormal. For a detailed description, refer to the corresponding description of the above method embodiments, which is not repeated herein.
The repair module 44 is configured to perform anomaly repair on the anomaly information of the workload cluster based on the backup data. For a detailed description, refer to the corresponding description of the above method embodiments, which is not repeated herein.
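As a rough illustration of how the four modules fit together when implemented in software, the sketch below wires them as pluggable callables. `RepairApparatus`, `run_once`, and the callable signatures are assumptions made for this example and are not taken from the embodiment; the cluster is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

Snapshot = Dict[str, str]  # hypothetical model: resource name -> resource data


@dataclass
class RepairApparatus:
    obtain: Callable[[], Snapshot]                           # obtaining module 41
    detect: Callable[[Snapshot], bool]                       # detection module 42
    determine: Callable[[Snapshot], Set[str]]                # determining module 43
    repair: Callable[[Snapshot, Set[str], Snapshot], Snapshot]  # repair module 44

    def run_once(self, live: Snapshot) -> Snapshot:
        """One pass: obtain backup, detect, determine anomalies, repair."""
        backup = self.obtain()
        if self.detect(live):
            return live  # operating state is normal, nothing to do
        anomalies = self.determine(live)
        return self.repair(live, anomalies, backup)
```

A caller supplies the four behaviors, for example a `determine` function that returns the names of resources whose data differs from the backup, and a `repair` function that copies those resources back from the backup.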
When the workload cluster becomes abnormal, the repair apparatus provided by this embodiment automatically processes the cluster according to the backup data, thereby repairing cluster anomalies automatically. This reduces manual intervention by operation and maintenance personnel, improves the recoverability of the cluster, and shortens the time spent on anomaly handling, so that the cluster returns to a normal operating state in time. The efficiency of anomaly repair is improved, the continuous availability of the cluster is ensured, and the reliability of the cluster is improved.
The repair apparatus for a workload cluster in this embodiment is presented in the form of functional units, where a unit refers to an ASIC, a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, which includes the repair apparatus for a workload cluster shown in fig. 5.
Referring to Fig. 6, which is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention, the electronic device may include: at least one processor 501, such as a CPU (Central Processing Unit), at least one communication interface 503, a memory 504, and at least one communication bus 502. The communication bus 502 is used to enable connection and communication between these components. The communication interface 503 may include a display and a keyboard; optionally, the communication interface 503 may also include a standard wired interface and a standard wireless interface. The memory 504 may be a random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory 504 may be at least one storage device located remotely from the processor 501. The processor 501 may be combined with the apparatus described in Fig. 5; an application program is stored in the memory 504, and the processor 501 calls the program code stored in the memory 504 to perform any of the above-mentioned method steps.
The communication bus 502 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 502 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The memory 504 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 504 may also comprise a combination of the above types of memory.
The processor 501 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of CPU and NP.
The processor 501 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
Optionally, the memory 504 is also used to store program instructions. Processor 501 may invoke program instructions to implement a method for repairing a workload cluster as shown in the embodiments of fig. 2-4 of the present application.
An embodiment of the present invention also provides a non-transitory computer storage medium storing computer-executable instructions that can execute the method for repairing a workload cluster in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method for repairing a workload cluster, comprising:
obtaining backup data corresponding to a workload cluster, wherein the backup data is data of normal operation of the workload cluster;
detecting whether the running state of the workload cluster is normal;
when the running state of the workload cluster is abnormal, determining the abnormal information of the workload cluster;
and performing exception repair on the exception information of the workload cluster based on the backup data.
2. The method of claim 1, wherein the performing exception repair on the exception information of the workload cluster based on the backup data comprises:
acquiring an abnormal reason corresponding to the abnormal information;
and performing exception repair on the workload cluster based on the exception reason and the backup data.
3. The method of claim 2, wherein when the anomaly cause is a network anomaly, the performing anomaly repair on the workload cluster based on the anomaly cause and the backup data comprises:
acquiring the network repair times of the workload cluster;
judging whether the network repairing times exceed preset times or not;
when the network repair times exceed the preset times and the network is still not repaired, regenerating the workload cluster according to the backup data;
and configuring the regenerated workload clusters based on the configuration files in the backup data.
4. The method of claim 2, wherein when the anomaly cause is a service anomaly, the performing an anomaly repair on the workload cluster based on the anomaly cause and the backup data comprises:
acquiring abnormal service data corresponding to the workload cluster;
determining backup service data corresponding to the abnormal service data from the backup data;
and replacing and repairing the abnormal service data by the backup service data.
5. The method of claim 1, further comprising, prior to the performing exception repair on the exception information of the workload cluster based on the backup data:
acquiring the duration of the workload cluster in an abnormal operation state;
and when the duration reaches a preset duration, performing exception repair on the exception information of the workload cluster based on the backup data.
6. The method of claim 5, further comprising:
obtaining an abnormal level corresponding to the workload cluster, wherein the abnormal level is used for representing the abnormal degree of the workload cluster;
and performing exception repair on the exception information of the workload cluster based on the exception level and the backup data.
7. The method of claim 1, wherein detecting whether the operational status of the workload cluster is normal comprises:
detecting whether the workload cluster and a management cluster have network connectivity with each other;
when the workload cluster and the management cluster have network connectivity, detecting whether each service state of the workload cluster is normal;
and when the service states are normal, detecting whether the network states among the services are normal.
8. A workload cluster repair apparatus, comprising:
an obtaining module, configured to obtain backup data corresponding to a workload cluster, wherein the backup data is data of normal operation of the workload cluster;
a detection module, configured to detect whether the running state of the workload cluster is normal;
a determining module, configured to determine the abnormal information of the workload cluster when the running state of the workload cluster is abnormal;
and a repair module, configured to perform exception repair on the exception information of the workload cluster based on the backup data.
9. An electronic device, comprising:
a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of repairing a workload cluster according to any of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of repairing a workload cluster of any one of claims 1 to 7.
CN202111274663.9A 2021-10-29 2021-10-29 Method, device and equipment for repairing work load cluster and readable storage medium Pending CN114020509A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111274663.9A CN114020509A (en) 2021-10-29 2021-10-29 Method, device and equipment for repairing work load cluster and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111274663.9A CN114020509A (en) 2021-10-29 2021-10-29 Method, device and equipment for repairing work load cluster and readable storage medium

Publications (1)

Publication Number Publication Date
CN114020509A true CN114020509A (en) 2022-02-08

Family

ID=80059078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111274663.9A Pending CN114020509A (en) 2021-10-29 2021-10-29 Method, device and equipment for repairing work load cluster and readable storage medium

Country Status (1)

Country Link
CN (1) CN114020509A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115473793A (en) * 2022-08-19 2022-12-13 苏州浪潮智能科技有限公司 Automatic recovery method, device, terminal and medium for cluster EI host environment
CN115473793B (en) * 2022-08-19 2023-08-08 苏州浪潮智能科技有限公司 Automatic recovery method, device, terminal and medium for cluster EI host environment
CN115396291A (en) * 2022-08-23 2022-11-25 度小满科技(北京)有限公司 Redis cluster fault self-healing method based on kubernets trustees

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination