WO2023011160A1 - Fault processing method and apparatus, device, and storage medium - Google Patents

Fault processing method and apparatus, device, and storage medium

Info

Publication number
WO2023011160A1
WO2023011160A1 PCT/CN2022/106444 CN2022106444W WO2023011160A1 WO 2023011160 A1 WO2023011160 A1 WO 2023011160A1 CN 2022106444 W CN2022106444 W CN 2022106444W WO 2023011160 A1 WO2023011160 A1 WO 2023011160A1
Authority
WO
WIPO (PCT)
Prior art keywords
healing
information
fault
self
fault self
Prior art date
Application number
PCT/CN2022/106444
Other languages
French (fr)
Chinese (zh)
Inventor
薛萍萍
王红玉
张亮
韩光耀
孔祥伟
王艺
许海洋
周玮
岳洪达
韩洋
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2023011160A1 publication Critical patent/WO2023011160A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Definitions

  • The present disclosure relates to the technical field of data processing, in particular to the technical field of fault processing, and more specifically to a fault handling method, apparatus, device, and storage medium.
  • A fault handling method, including:
  • obtaining alarm information generated by a business system; obtaining associated data related to the alarm information according to monitoring data of the business system; determining, according to the associated data, cause information of the fault that triggered the alarm information; obtaining, according to the cause information and the alarm information, a fault self-healing scheme that includes fault self-healing tasks; and performing fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
  • A fault handling device, including:
  • an information acquisition module, used to obtain the alarm information generated by the business system; a data acquisition module, used to obtain the associated data related to the alarm information according to the monitoring data of the business system; an information determination module, used to determine, according to the associated data, the cause information of the fault that triggered the alarm information; a scheme acquisition module, used to obtain a fault self-healing scheme including fault self-healing tasks according to the cause information and the alarm information; and a fault self-healing module, used to perform fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
  • An electronic device, including:
  • at least one processor, and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor is able to implement the fault handling method.
  • A computer program product, including a computer program that implements the fault handling method when executed by a processor.
  • A computer program that, when running on a computer, causes the computer to execute the steps of any fault handling method provided in the above aspect.
  • Because the fault self-healing task takes into account not only the alarm information itself but also the cause information of the fault that triggered the alarm information, and the cause information reflects the cause of that fault, the fault self-healing task can heal the fault both at the visible level presented by the alarm information and at the root level presented by the cause information. The fault is thereby resolved precisely, and the efficiency of fault stop loss is effectively improved.
  • Therefore, applying the solution provided by the embodiments of the present disclosure makes it possible to perform fault self-healing for faults generated by the business system, to resolve the fault accurately, and to effectively improve the efficiency of fault stop loss.
  • FIG. 1 is a schematic diagram of a whole process of fault management of a business system provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic flowchart of a first fault handling method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of a second fault handling method provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic flowchart of a third fault handling method provided by an embodiment of the present disclosure.
  • Fig. 5a is a schematic flowchart of a fourth fault handling method provided by an embodiment of the present disclosure.
  • Fig. 5b is a schematic flowchart of a fifth fault handling method provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic flowchart of a sixth fault handling method provided by an embodiment of the present disclosure.
  • FIG. 7 is a flowchart of a fault handling method provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a first fault handling device provided by an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of a second fault handling device provided by an embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of a third fault handling device provided by an embodiment of the present disclosure.
  • Fig. 11a is a schematic structural diagram of a fourth fault handling device provided by an embodiment of the present disclosure.
  • Fig. 11b is a schematic structural diagram of a fifth fault handling device provided by an embodiment of the present disclosure.
  • FIG. 12 is a schematic structural diagram of a sixth fault handling device provided by an embodiment of the present disclosure.
  • Fig. 13 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Embodiments of the present disclosure provide a failure handling method, device, equipment, and storage medium.
  • a fault handling method including:
  • the associated data related to the alarm information is obtained
  • The fault handling method provided in the present disclosure can be applied to electronic equipment. The electronic equipment can be a terminal or a server for managing a business system, and the present disclosure does not limit its specific form. The fault self-healing task takes into account not only the alarm information itself but also the cause information of the fault that triggered the alarm information, and the cause information reflects the cause of that fault.
  • FIG. 1 shows a schematic diagram of the whole process of fault management of the business system.
  • The fault self-healing process covers the whole span from responding to the alarm information to generating a fault self-healing scheme, from generating the scheme to starting to execute it, and from starting to execute it to finishing its execution.
  • the length of time for self-healing of faults is called the stop loss time. The shorter the stop loss time, the smaller the business loss.
  • Step S201 Obtain alarm information generated by the business system.
  • Step S202 According to the monitoring data of the business system, the associated data related to the alarm information is obtained.
  • the above alarm information is the alarm information generated by the business system obtained in step S201.
  • Various types of data may be generated during the operation of the business system, and these data are monitored to obtain monitoring data.
  • the above-mentioned monitoring data can be monitoring data from different functional subsystems configured by the business system.
  • the above-mentioned monitoring data can be: various alarm information generated by the business system recorded in the monitoring functional subsystem, the operating status of the business system, etc.
  • the above monitoring data may also be: business system change information recorded in the system change subsystem, available resources of the business system recorded in the capacity subsystem, and the like.
  • the associated data may include monitoring data associated with the alarm information, and may also include data associated with the alarm information obtained by analyzing the monitoring data.
  • the above alarm information is the alarm information generated by the business system obtained in step S201.
  • the above cause information represents the cause of the failure that triggers the alarm information.
  • For example, the fault of the business system is that the network link is faulty, and the alarm information triggered by this fault is that the network traffic of the business node drops.
  • The cause of the above fault is that the network link is disconnected, so the cause information of the fault that triggered the alarm information is: "the network link is disconnected".
  • The associated data is data related to the alarm information. When the business system fails, the fault may bring about a series of chain effects, and the fault may itself be caused by other problems.
  • The various pieces of information generated in this chain of effects are interrelated, and the cause information of the fault can be determined from this interrelated information. Therefore, the cause information of the fault that triggered the alarm information can be determined according to the associated data.
  • The correspondence between alarm information and cause information can be preset. Since multiple faults may cause the same alarm, one piece of alarm information may correspond to multiple pieces of cause information. On this basis, the candidate cause information corresponding to the current alarm information can first be determined from the correspondence, and the candidate cause information related to the associated data can then be selected as the cause information of the fault that triggered the alarm information.
  • For example, the current alarm information is: the traffic of business system A drops.
  • The three pieces of candidate cause information for this alarm information are "network failure", "system change", and "device A failure", and the obtained associated data includes: network traffic fluctuates greatly, and device B fails. The candidate cause information "network failure" is related to "network traffic fluctuates greatly" in the associated data, so "network failure" is determined as the cause information of the fault that triggered the alarm information.
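  • The candidate selection described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the tables `ALARM_TO_CAUSES` and `CAUSE_KEYWORDS`, the function `determine_causes`, and the keyword-based notion of "related" are all assumptions introduced for the example.

```python
from typing import Dict, List

# Hypothetical preset correspondence between alarm information and candidate causes.
ALARM_TO_CAUSES: Dict[str, List[str]] = {
    "traffic of business system A drops": [
        "network failure",
        "system change",
        "device A failure",
    ],
}

# Hypothetical keywords used to decide whether a candidate cause is related
# to an item of associated data (the source does not specify the matching rule).
CAUSE_KEYWORDS: Dict[str, List[str]] = {
    "network failure": ["network"],
    "system change": ["system change", "system update"],
    "device A failure": ["device A"],
}


def determine_causes(alarm: str, associated_data: List[str]) -> List[str]:
    """Select the candidate causes of `alarm` that relate to the associated data."""
    selected = []
    for cause in ALARM_TO_CAUSES.get(alarm, []):
        keywords = CAUSE_KEYWORDS.get(cause, [])
        if any(kw.lower() in item.lower()
               for kw in keywords for item in associated_data):
            selected.append(cause)
    return selected


associated = ["network traffic fluctuates greatly", "device B fails"]
causes = determine_causes("traffic of business system A drops", associated)
print(causes)  # ['network failure']
```

  • "device A failure" is not selected because the associated data only mentions device B, which mirrors the example in the text.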
  • Step S204 Obtain a fault self-healing solution including a fault self-healing task according to the cause information and the alarm information.
  • the above alarm information is the alarm information generated by the business system obtained in step S201.
  • the number of fault self-healing tasks included in the fault self-healing scheme may be one or multiple.
  • The fault self-healing scheme also includes the execution order of the fault self-healing tasks, which can be a parallel execution order or a serial execution order.
  • Step S205 Perform fault self-healing by executing the fault self-healing tasks included in the fault self-healing solution.
  • each fault self-healing task may be executed sequentially according to the execution order of the fault self-healing tasks included in the fault self-healing scheme, so as to realize fault self-healing. That is to say, when there are multiple fault self-healing tasks, each fault self-healing task can be executed according to the execution order of each fault self-healing task that is set.
  • The task execution tool matching the task type of each fault self-healing task can be determined; then, following the execution order of the fault self-healing tasks in the fault self-healing scheme, the task execution tool corresponding to each fault self-healing task is called to execute that task and perform fault self-healing.
  • Each task execution tool is used to execute tasks of a different task type.
  • The task execution tools may include: a network link closing tool, a restart tool, a health check tool, and the like.
  • A task can be executed by the task execution tool that matches its task type. Therefore, by calling the task execution tools, the fault self-healing tasks can be executed so as to realize fault self-healing. It should be noted that, when there is only one fault self-healing task, it can be executed independently or by means of a task execution tool.
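  • A minimal sketch of matching tasks to task execution tools by task type and running them in a serial execution order. The class `SelfHealingTask`, the registry `TOOLS`, and the three stand-in tool functions are hypothetical; real tools would act on the business system rather than return strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SelfHealingTask:
    task_type: str  # e.g. "restart" or "health_check" (hypothetical type names)
    target: str     # hypothetical identifier of the node or device to act on


# Stand-ins for the network link closing, restart, and health check tools.
def close_network_link(task: SelfHealingTask) -> str:
    return f"closed network link {task.target}"


def restart(task: SelfHealingTask) -> str:
    return f"restarted {task.target}"


def health_check(task: SelfHealingTask) -> str:
    return f"health check passed for {task.target}"


# Each task type maps to exactly one task execution tool.
TOOLS: Dict[str, Callable[[SelfHealingTask], str]] = {
    "close_network_link": close_network_link,
    "restart": restart,
    "health_check": health_check,
}


def execute_scheme(tasks: List[SelfHealingTask]) -> List[str]:
    """Execute the tasks one by one, in the order given by the scheme."""
    results = []
    for task in tasks:
        tool = TOOLS.get(task.task_type)
        if tool is None:
            raise ValueError(f"no task execution tool for type {task.task_type!r}")
        results.append(tool(task))
    return results


scheme = [SelfHealingTask("restart", "service-a"),
          SelfHealingTask("health_check", "service-a")]
results = execute_scheme(scheme)
print(results)
```

  • A parallel execution order could be handled by submitting the same tool calls to a thread pool instead of looping; that variant is omitted here.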
  • As can be seen from the above, the fault self-healing task takes into account not only the alarm information itself but also the cause information of the fault that triggered the alarm information, and the cause information reflects the cause of that fault. Therefore, the fault self-healing task can heal the fault both at the visible level presented by the alarm information and at the root level presented by the cause information, which resolves the fault precisely and effectively improves the efficiency of fault stop loss.
  • In addition, the associated data is data related to the alarm information. When the business system fails, the fault may bring about a series of chain effects, and the fault may itself be caused by other problems. The various pieces of information generated in this chain of effects are interrelated, and the cause information of the fault can be determined from this interrelated information. Therefore, the cause information of the fault that triggered the alarm information can be determined more accurately based on the associated data, and the fault self-healing scheme obtained from the cause information and the alarm information has a higher self-healing success rate for the fault corresponding to the cause information.
  • Fig. 3 is a schematic flowchart of a second fault handling method provided by an embodiment of the present disclosure.
  • information may be obtained according to at least one of the following steps S2021-S2024 as associated data associated with alarm information.
  • Step S2021 From the monitoring data, obtain other alarm information for the target service node within the first time period where the alarm time recorded in the alarm information falls.
  • the above alarm time is: the time when the alarm information is generated.
  • the alarm information refers to the alarm information generated by the business system obtained in step S201.
  • Other alarm information for the target service node within the first time period can be obtained from the monitoring data. The first time period is the time period in which the alarm time recorded in the alarm information falls; that is, it is a time period set based on the alarm time of the obtained alarm information, and it contains that alarm time. The alarm time recorded in each piece of other alarm information also falls within the first time period.
  • The first time period may be a time period obtained by extending the alarm time forward (toward earlier times) by a first preset duration, which may be set by staff based on experience.
  • For example, if the alarm time is 00:10:00 and the first preset duration is 5 minutes, the first time period is 00:05:00-00:10:00.
  • The first time period may also be a time period obtained by extending the alarm time both forward and backward by certain durations.
  • For example, if the alarm time is 00:10:00, the forward extension is 5 minutes, and the backward extension is 8 minutes, the first time period is 00:05:00-00:18:00.
  • The first time period may also be a time period obtained by extending the alarm time backward (toward later times) by a second preset duration, which may likewise be set by staff based on experience.
  • For example, if the alarm time is 00:10:00 and the second preset duration is 8 minutes, the first time period is 00:10:00-00:18:00.
  • The first preset duration and the second preset duration may be the same or different.
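  • The three ways of deriving the first time period can be reproduced with a small helper. This is only a sketch: the function name `time_window`, its parameters, and the calendar date are assumptions; the times and durations are the ones used in the examples above.

```python
from datetime import datetime, timedelta
from typing import Tuple


def time_window(alarm_time: datetime,
                forward: timedelta = timedelta(0),
                backward: timedelta = timedelta(0)) -> Tuple[datetime, datetime]:
    """Extend the alarm time earlier by `forward` and later by `backward`."""
    return alarm_time - forward, alarm_time + backward


def fmt(window: Tuple[datetime, datetime]) -> str:
    """Render a window as HH:MM:SS-HH:MM:SS."""
    return "-".join(t.strftime("%H:%M:%S") for t in window)


# The date is arbitrary; only the time of day matters for the examples.
alarm_time = datetime(2022, 1, 1, 0, 10, 0)

forward_only = time_window(alarm_time, forward=timedelta(minutes=5))
both_sides = time_window(alarm_time, forward=timedelta(minutes=5),
                         backward=timedelta(minutes=8))
backward_only = time_window(alarm_time, backward=timedelta(minutes=8))

print(fmt(forward_only))   # 00:05:00-00:10:00
print(fmt(both_sides))     # 00:05:00-00:18:00
print(fmt(backward_only))  # 00:10:00-00:18:00
```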
  • the above target service node is: the service node targeted by the alarm information.
  • the alarm information refers to the alarm information generated by the business system obtained in step S201.
  • the service node targeted by the alarm information refers to the service node that generates the above alarm information.
  • the above-mentioned target service node may be a service module, a computer room, or a device.
  • The alarm information generated in the first time period can be obtained from the monitoring data, and the other alarm information for the target service node can be determined from it.
  • Since alarm information generated for the same node within a period of time is likely to be correlated, the other alarm information for the target service node within the first time period is likely to be correlated with the alarm information. Determining the other alarm information as associated data is therefore highly accurate.
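  • Step S2021 amounts to filtering the monitored alarms by node and by time window. The sketch below assumes a hypothetical `Alarm` record with an identifier, a node name, and an alarm time; the actual structure of the monitoring data is not specified in the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Alarm:
    alarm_id: str   # hypothetical unique identifier
    node: str       # service node targeted by the alarm
    time: datetime  # alarm time
    message: str


def other_alarms_for_node(current: Alarm, monitored: List[Alarm],
                          start: datetime, end: datetime) -> List[Alarm]:
    """Return alarms on the same node within [start, end], excluding `current`."""
    return [
        a for a in monitored
        if a.node == current.node
        and a.alarm_id != current.alarm_id
        and start <= a.time <= end
    ]


current = Alarm("a1", "module-x", datetime(2022, 1, 1, 0, 10), "traffic drops")
monitored = [
    current,
    Alarm("a2", "module-x", datetime(2022, 1, 1, 0, 7), "latency rises"),
    Alarm("a3", "module-y", datetime(2022, 1, 1, 0, 8), "disk full"),
    Alarm("a4", "module-x", datetime(2022, 1, 1, 0, 2), "cpu high"),
]
related = other_alarms_for_node(
    current, monitored,
    start=datetime(2022, 1, 1, 0, 5), end=datetime(2022, 1, 1, 0, 10),
)
print([a.alarm_id for a in related])  # ['a2']
```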
  • Step S2022 Determine the influence factors that trigger faults of the target fault type, and obtain first information representing the fluctuation of the influence factors according to the values of the influence factors recorded in the monitoring data within the second time period in which the alarm time falls.
  • the above target fault type is: the fault type recorded in the alarm information.
  • the alarm information refers to the alarm information generated by the business system obtained in step S201.
  • The target fault type may be the type of the fault that triggered the alarm information.
  • The influence factors are the factors that cause faults of the target fault type.
  • The influence factors corresponding to the target fault type may be determined from a preset correspondence between fault types and influence factors, and used as the influence factors that trigger faults of the target fault type.
  • For example, the influence factors corresponding to the target fault type may include: the network traffic of the upstream business module, the quality of the external network link, and the quality of the internal network link.
  • The second time period may be a time period obtained by extending the alarm time forward by a third preset duration, by extending the alarm time both forward and backward by certain durations, or by extending the alarm time backward by a fourth preset duration.
  • the third preset duration and the fourth preset duration may be the same or different.
  • the alarm information refers to the alarm information generated by the business system obtained in step S201.
  • the above-mentioned first information represents the fluctuation of the influencing factors. Taking the influence factor as network traffic as an example, the above-mentioned first information represents fluctuations in network traffic.
  • The average of the values of the influence factor within the second time period may also be calculated and determined as the first information.
  • When the average is greater than a preset average threshold, the influence factor fluctuates greatly in the second time period; when the average is not greater than the preset average threshold, the influence factor is relatively stable in the second time period.
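  • A minimal sketch of the average-based first information, following the rule stated above (average greater than a preset threshold means large fluctuation). The function name, the returned dictionary keys, and the sample values are assumptions for illustration.

```python
from statistics import mean
from typing import Dict, Sequence, Union


def first_information(values: Sequence[float],
                      average_threshold: float) -> Dict[str, Union[float, bool]]:
    """Summarize an influence factor's values within the second time period."""
    average = mean(values)
    return {
        "average": average,
        # Per the rule in the text: above the threshold means large fluctuation.
        "fluctuates_greatly": average > average_threshold,
    }


# Hypothetical samples of one influence factor within the second time period.
samples = [120.0, 80.0, 100.0]
info = first_information(samples, average_threshold=90.0)
print(info)  # {'average': 100.0, 'fluctuates_greatly': True}
```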
  • Since the determined influence factors are the factors that trigger faults of the fault type recorded in the alarm information, the values of an influence factor within a certain period of time are related to the alarm information generated within that period. The first information is determined from the values of the influence factor within the second time period, and the second time period is determined from the alarm time of the alarm information. Therefore, the first information is associated with the alarm information, and determining the first information as associated data is highly accurate.
  • Step S2023 According to the latest system update time recorded in the monitoring data and the alarm time, obtain second information indicating whether a system update occurred in the business system within the third time period before the alarm information was generated.
  • The third time period is a time period obtained by extending the alarm time forward by a fifth preset duration.
  • The latest system update time is the most recent time at which the system was updated.
  • A system update refers to operations such as upgrading and repairing the system.
  • Information such as the update content, update object, and update time of the system update can also be obtained and determined as associated data related to the alarm information.
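  • The second information reduces to checking whether the latest system update time falls inside the third time period. The sketch below assumes the update time may be absent (no update recorded); the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def system_updated_before_alarm(latest_update_time: Optional[datetime],
                                alarm_time: datetime,
                                fifth_preset_duration: timedelta) -> bool:
    """True if a system update occurred within the third time period,
    i.e. between (alarm_time - fifth_preset_duration) and alarm_time."""
    if latest_update_time is None:  # assumption: no update has been recorded
        return False
    window_start = alarm_time - fifth_preset_duration
    return window_start <= latest_update_time <= alarm_time


alarm_time = datetime(2022, 1, 1, 0, 10)
half_hour = timedelta(minutes=30)

recent = system_updated_before_alarm(datetime(2021, 12, 31, 23, 50), alarm_time, half_hour)
stale = system_updated_before_alarm(datetime(2021, 12, 31, 22, 0), alarm_time, half_hour)
print(recent, stale)  # True False
```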
  • Step S2024 From the monitoring data, select the amount of available resources of the business system within the fourth time period where the alarm time is located.
  • the above-mentioned alarm time is the alarm time of the alarm information
  • the alarm information refers to the alarm information generated by the business system obtained in step S201.
  • The amount of available resources of the business system selected from the monitoring data is the amount of available resources within the fourth time period, and the fourth time period is the time period in which the alarm time falls.
  • The fourth time period may be obtained by extending the alarm time forward by a sixth preset duration, by extending the alarm time both forward and backward by certain durations, or by extending the alarm time backward by a seventh preset duration.
  • the sixth preset duration and the seventh preset duration may be the same or different.
  • the above-mentioned amount of available resources refers to the amount of available resources that the service system responds to user requests, and the above-mentioned available resources may include bandwidth resources, computing resources, and the like.
  • the amount of available resources at each moment in the fourth time period can be obtained from the monitoring data, statistical analysis is performed on each amount of available resources, and the statistical analysis value is determined as the amount of available resources of the business system in the fourth time period.
  • the above-mentioned statistical analysis may be in the form of calculating an average value, a median value, and the like.
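  • A minimal sketch of the statistical analysis in step S2024, using the average or the median as the statistic. The sample layout (timestamp, amount) and the function name are assumptions; the text does not fix how resource amounts are recorded.

```python
from datetime import datetime
from statistics import mean, median
from typing import Callable, Sequence, Tuple

Sample = Tuple[datetime, float]  # (moment, amount of available resources)


def available_resources_in_window(
    samples: Sequence[Sample],
    start: datetime,
    end: datetime,
    statistic: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Apply `statistic` to the amounts sampled within [start, end]."""
    values = [amount for moment, amount in samples if start <= moment <= end]
    if not values:
        raise ValueError("no samples within the fourth time period")
    return statistic(values)


# Hypothetical samples, e.g. available bandwidth in arbitrary units.
samples = [
    (datetime(2022, 1, 1, 0, 4), 10.0),   # outside the window
    (datetime(2022, 1, 1, 0, 6), 40.0),
    (datetime(2022, 1, 1, 0, 10), 60.0),
    (datetime(2022, 1, 1, 0, 15), 80.0),
]
start, end = datetime(2022, 1, 1, 0, 5), datetime(2022, 1, 1, 0, 18)

avg_amount = available_resources_in_window(samples, start, end)
med_amount = available_resources_in_window(samples, start, end, statistic=median)
print(avg_amount, med_amount)  # 60.0 60.0
```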
  • the first time period, the second time period, the third time period, and the fourth time period may be the same or different.
  • When determining the associated data, a single one of the above steps can be selected and the information it obtains used as the associated data, or several steps can be selected and the multiple pieces of information they obtain used together as the associated data.
  • FIG. 4 is a schematic flowchart of a third fault handling method provided by an embodiment of the present disclosure.
  • the above step S204 can be implemented according to the following steps S2041-S2042.
  • Step S2041 According to the alarm information, search the known fault self-healing schemes for a fault self-healing scheme that performs self-healing processing on the fault corresponding to the cause information.
  • the above-mentioned known fault self-healing scheme may be: a scheme for performing self-healing processing on a fault that has already occurred.
  • the above-mentioned scheme may be stored in a scheme library in the server.
  • A self-healing console can be configured. Experts or operation and maintenance personnel can enter fault self-healing schemes through the self-healing console, and the business system can store the entered schemes in a self-healing scheme rule base, so that each known fault self-healing scheme is stored in the self-healing scheme rule base.
  • The target field values of preset fields in the alarm information can be extracted; based on the target field values, the known fault self-healing schemes are searched for a fault self-healing scheme that performs self-healing processing on the fault corresponding to the cause information and that includes a target self-healing task.
  • The preset fields may include: the alarm time of the alarm information, the identification of the service node targeted by the alarm information, the identification of the device that generated the alarm information, the identification of the computer room where that device is located, the instance (for example, a program or an algorithm) that triggered the alarm information, and exception description information.
  • the identification of the service node targeted by the above-mentioned alarm information refers to the identification of the service node that triggers the occurrence of the fault that generates the alarm information, such as the number and name of the service node.
  • the above-mentioned service nodes may include service modules and the like.
  • the identification of the above-mentioned computer room refers to the identification of the computer room where the equipment generating the alarm information is located, and the above-mentioned identification of the computer room may be the location of the computer room, the number of the computer room, and the like.
  • the identification of the above-mentioned equipment refers to the identification of the equipment that generates the alarm information.
  • the identification of the above-mentioned equipment may be equipment IP address (Internet Protocol Address, Internet Protocol address), MAC address (Media Access Control Address, media access control address), etc.
  • the above target field value may be obtained by parsing and extracting the alarm information.
  • The cause information can be used as a keyword and matched against the cause information recorded in each known fault self-healing scheme, to obtain the known fault self-healing schemes that perform self-healing processing on the fault corresponding to the cause information.
  • The field values of the preset fields that a fault self-healing task targets can be recorded in that task. The target field values can therefore be used as keywords and matched against the field values recorded in the fault self-healing tasks of each known fault self-healing scheme obtained above, to determine the fault self-healing task set according to the target field values and, in turn, the fault self-healing scheme that includes that task.
  • The target self-healing task is a fault self-healing task set according to the target field values of the preset fields. The fault self-healing task that is found is thus set according to the target field values of the preset fields in the alarm information, which improves the success rate of fault self-healing performed by that task.
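  • The two-stage lookup (match on cause information, then on the field values recorded in the tasks) can be sketched as below. The classes `Task` and `Scheme`, the exact-match rule, and the field names "node", "room", and "device" are assumptions; the text only says that keyword matching is used.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    task_type: str
    field_values: Dict[str, str]  # preset-field values this task was set for


@dataclass
class Scheme:
    cause: str        # cause information recorded in the scheme
    tasks: List[Task]


def find_scheme(known: List[Scheme], cause: str,
                target_values: Dict[str, str]) -> Optional[Scheme]:
    """Return the first known scheme for `cause` that contains a task whose
    recorded field values all agree with the alarm's target field values."""
    for scheme in known:
        if scheme.cause != cause:  # stage 1: match on cause information
            continue
        for task in scheme.tasks:  # stage 2: match on recorded field values
            # Note: a task with no recorded field values matches any alarm.
            if all(target_values.get(k) == v for k, v in task.field_values.items()):
                return scheme
    return None  # nothing found; a fallback step would be needed


known = [
    Scheme("network failure", [Task("close_network_link", {"node": "module-x"})]),
    Scheme("network failure", [Task("close_network_link",
                                    {"node": "module-y", "room": "room-2"})]),
]
# Hypothetical target field values extracted from the alarm information.
target = {"node": "module-y", "room": "room-2", "device": "10.0.0.8"}

found = find_scheme(known, "network failure", target)
print(found is known[1])  # True
```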
  • Step S2042 Determine the found fault self-healing solution as a fault self-healing solution including a fault self-healing task.
  • The fault self-healing scheme here refers to a known fault self-healing scheme. It can be found among the known fault self-healing schemes, and fault self-healing can be performed directly based on the scheme found, which improves the efficiency of fault self-healing.
  • In addition, the fault self-healing scheme found can perform self-healing processing on the above fault, so fault self-healing is realized through this scheme.
  • the fault self-healing scheme may be determined referring to the embodiment shown in FIG. 5a.
  • Fig. 5a is a schematic flowchart of the fourth fault handling method provided by the embodiment of the present disclosure.
  • If no fault self-healing scheme that performs self-healing processing on the fault corresponding to the cause information is found in step S2041, the following steps S2043-S2047 may also be included.
  • Each piece of known operation and maintenance information above includes: description information of system exceptions and description information of system exception handling methods.
  • the description information may be information in text form.
  • The above known operation and maintenance information may be operation and maintenance information determined from relevant operation and maintenance documents, such as operation and maintenance manuals, operation and maintenance plans, and historical operation and maintenance documents.
  • the above-mentioned known operation and maintenance information may be stored in the operation and maintenance knowledge base.
  • Structured extraction can be performed on the above operation and maintenance documents to obtain the description information of system exceptions and the description information of system exception handling methods, and thereby obtain known operation and maintenance information that includes both.
  • Alternatively, after the description information is obtained by structured extraction from the operation and maintenance documents, staff may adjust the content of the description information and the order of the system exception handling methods it includes, to obtain known operation and maintenance information that includes the adjusted description information of system exceptions and of system exception handling methods.
  • A semantic extraction model can be used: the known operation and maintenance information is input into the semantic extraction model, and the semantic features of its description information output by the model are taken as the first semantics; the cause information is input into the semantic extraction model, and the semantic features of the cause information output by the model are taken as the second semantics.
  • The distance between the above first semantics and the second semantics, such as the Euclidean distance or the cosine distance, can be calculated, and the similarity between the first semantics and the second semantics can be determined from the calculated distance and used as the first similarity between each piece of known operation and maintenance information and the cause information.
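  • As a rough illustration of turning a distance into a similarity, the sketch below assumes the semantics are already available as numeric feature vectors. The vectors and the `1 / (1 + d)` mapping for the Euclidean case are assumptions chosen for the example, not something the disclosure specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity, i.e. 1 minus the cosine distance of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_similarity(a, b):
    """Map a Euclidean distance d to a similarity in (0, 1] via 1 / (1 + d).
    This particular mapping is an assumption made for illustration."""
    return 1.0 / (1.0 + math.dist(a, b))


# Hypothetical feature vectors standing in for the model outputs.
first_semantics = [0.9, 0.1, 0.0]   # description information of known O&M information
second_semantics = [0.8, 0.2, 0.0]  # cause information
first_similarity = cosine_similarity(first_semantics, second_semantics)
```

  • The same functions can be applied to the first semantics and the third semantics to obtain the second similarity.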
  • Step S2044 According to the first semantics and the third semantics of the alarm information, obtain the second similarity between each known operation and maintenance information and the alarm information.
  • the above-mentioned first semantics is the first semantics of the description information in each known operation and maintenance information.
  • the above-mentioned third semantics represents the semantics expressed by the alarm information.
  • the semantics of the target field value of the preset field of the alarm information may be identified, and the identification result is determined as the third semantics.
  • The distance between the above first semantics and the third semantics, such as the Euclidean distance or the cosine distance, can be calculated, and the similarity between the first semantics and the third semantics can be determined from the calculated distance and used as the second similarity between each piece of known operation and maintenance information and the alarm information.
  • Step S2045 According to the first similarity degree and the second similarity degree, select the description information of the candidate processing method from the description information of the system exception handling method included in each known operation and maintenance information.
  • the above-mentioned first similarity is the first similarity between each known operation and maintenance information and cause information.
  • the above-mentioned second similarity is the second similarity between each known operation and maintenance information and alarm information obtained in step S2044.
  • the description information of the candidate processing modes refers to information describing the candidate processing modes, and the description information may be information in text form.
  • the description information of the system exception handling method included in the known operation and maintenance information with the highest target value may be selected as the description information of the candidate processing method.
  • the description information of the system exception handling method included in the known operation and maintenance information whose target value is greater than the preset target threshold may also be selected as the description information of the candidate processing method.
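  • The two selection rules above can be sketched as follows. The way the target value is computed is not stated in this passage, so the weighted sum of the first and second similarity used here, the weights, and the sample entries are all assumptions made for illustration.

```python
def target_value(first_similarity, second_similarity, w_cause=0.5, w_alarm=0.5):
    """Assumed target value: a weighted sum of the two similarities."""
    return w_cause * first_similarity + w_alarm * second_similarity


def select_candidates(om_infos, threshold=None):
    """om_infos: list of (handling_description, first_sim, second_sim).

    With no threshold, return the description with the highest target value;
    otherwise return every description whose target value exceeds it."""
    scored = [(target_value(s1, s2), desc) for desc, s1, s2 in om_infos]
    if not scored:
        return []
    if threshold is None:
        return [max(scored, key=lambda pair: pair[0])[1]]
    return [desc for score, desc in scored if score > threshold]


# Hypothetical known operation and maintenance entries.
infos = [
    ("restart service", 0.9, 0.7),
    ("expand disk", 0.4, 0.5),
    ("roll back release", 0.8, 0.9),
]
best = select_candidates(infos)
above_threshold = select_candidates(infos, threshold=0.7)
```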
  • Step S2046 For each candidate processing method, based on the description information of the candidate processing method, obtain candidate processing tasks, so as to obtain a candidate fault self-healing solution including the candidate processing tasks.
  • the semantic features of the description information of the candidate processing modes may be obtained, and based on the obtained semantic features, the description information may be converted into executable commands to obtain candidate processing tasks including the above executable commands.
  • the operation and maintenance staff may calibrate the converted executable command's execution order, execution parameters, and other information to obtain candidate processing tasks including the calibrated executable command.
  • Step S2047 Determine the fault self-healing scheme from each candidate fault self-healing scheme.
  • a scheme may be randomly selected from candidate fault self-healing schemes as the fault self-healing scheme.
  • reference may also be made to the embodiment corresponding to FIG. 5b.
  • Because the first similarity is the similarity between the above first semantics and the second semantics of the cause information, and the second similarity is the similarity between the above first semantics and the third semantics of the alarm information, selecting by both similarities comprehensively considers the semantics of the cause information, the semantics of the alarm information, and the semantics of the description information in each piece of known operation and maintenance information.
  • As a result, the candidate processing methods corresponding to the selected description information handle the fault behind the alarm information more accurately, which makes the determined fault self-healing scheme more accurate.
  • FIG. 5b is a schematic flowchart of a fifth fault handling method provided by an embodiment of the present disclosure. After the above step S2045, the following step S2048 may also be included.
  • Step S2048 Obtain the first success probability of self-healing for the fault corresponding to the cause information by adopting each candidate processing method.
  • The first success probability corresponding to each candidate processing method can be obtained, wherein the first success probability corresponding to a candidate processing method is the probability of success when that candidate processing method is used to perform fault self-healing on the fault corresponding to the cause information.
  • the above-mentioned first success probability indicates the probability that the failure corresponding to the cause information can be successfully self-healed by adopting the candidate processing manner.
  • When the target value is higher, the probability that the candidate processing method can successfully self-heal the fault corresponding to the cause information is higher, that is, the first success probability is higher; when the target value is lower, the probability that the candidate processing method can successfully self-heal the fault is lower, that is, the first success probability is lower.
  • step S2047 can also be implemented according to the following steps S20471-S20472.
  • Step S20471 For each candidate fault self-healing scheme, according to the current network environment information of the business system and the candidate processing tasks included in the candidate fault self-healing scheme, estimate the second success probability of using the candidate fault self-healing scheme to perform fault self-healing on the fault corresponding to the cause information.
  • the current network environment information includes information such as current network traffic and available resources of the current network.
  • Because the above second success probability is estimated from the current network environment information of the business system and the candidate processing tasks included in the candidate fault self-healing scheme, it is related to the current network environment information of the business system. Since the current network environment of the business system affects the success probability of fault self-healing by a fault self-healing scheme, the calculated second success probability adapts to the current network environment information and is therefore highly accurate.
  • Step S20472 According to the first success probability and the second success probability, determine the fault self-healing scheme from each candidate fault self-healing scheme.
  • the candidate fault self-healing scheme with the highest fusion probability can be determined as the fault self-healing scheme, and the candidate fault self-healing scheme with the fusion probability greater than a preset probability threshold can also be determined as the fault self-healing scheme.
  • The first success probability and the second success probability characterize the success probability of each candidate fault self-healing scheme from two different angles, so determining the fault self-healing scheme based on both success probabilities improves the success probability of fault self-healing through the determined scheme.
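  • The selection in step S20472 can be sketched as follows. How the fusion probability is obtained is not stated in this passage, so the weighted average used here, the weight, and the sample probabilities are assumptions made for illustration.

```python
def fusion_probability(first_p, second_p, weight=0.5):
    """Assumed fusion: a weighted average of the two success probabilities."""
    return weight * first_p + (1.0 - weight) * second_p


def choose_schemes(candidates, threshold=None):
    """candidates: dict mapping scheme name -> (first_p, second_p).

    With no threshold, return the scheme with the highest fusion probability;
    otherwise return every scheme whose fusion probability exceeds it."""
    fused = {name: fusion_probability(p1, p2)
             for name, (p1, p2) in candidates.items()}
    if not fused:
        return []
    if threshold is None:
        return [max(fused, key=fused.get)]
    return [name for name, p in fused.items() if p > threshold]


# Hypothetical candidate schemes with (first, second) success probabilities.
candidates = {"P1": (0.9, 0.6), "P2": (0.7, 0.9)}
chosen = choose_schemes(candidates)
```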
  • The above steps S2043-S2045 may use a recommendation model to obtain the description information of the candidate processing methods, and the above step S2048 may also use the same recommendation model to obtain the first success probability.
  • The recommendation model calculates the first similarity between each piece of known operation and maintenance information and the cause information, and the second similarity between each piece of known operation and maintenance information and the alarm information. According to the first similarity and the second similarity, it determines the description information of the candidate processing methods from the description information of the system exception handling methods included in each piece of known operation and maintenance information, determines the first success probability of using each candidate processing method to perform fault self-healing on the fault corresponding to the cause information, and outputs the description information and the first success probability of the candidate processing methods.
  • FIG. 6 is a schematic flowchart of a sixth fault handling method provided by an embodiment of the present disclosure.
  • the above step S20471 can be implemented according to the following steps S204711-S204712.
  • Step S204711 According to the task parameters and inter-task dependencies of each candidate processing task included in the candidate fault self-healing solution, determine the time-consuming execution of each candidate processing task.
  • the foregoing task parameters include execution parameters required for executing candidate processing tasks, such as memory parameters, computing resource parameters, bandwidth resource parameters, and the like.
  • The dependencies between the above tasks can be determined based on the execution order of the candidate processing tasks. For example, if the candidate processing tasks are executed serially in the order task A1, task A2, task A3, then there are dependencies among task A1, task A2 and task A3, and the dependency between two adjacent tasks is the highest; if the candidate processing tasks are executed in parallel, the dependency between the tasks is the lowest.
  • The execution time consumption of each candidate processing task may be determined according to a preset correspondence among the task parameters of fault self-healing tasks, the dependencies between tasks, and execution time consumption.
  • the above-mentioned information correspondence can be determined by experts based on experience.
  • Step S204712 According to the execution time of each candidate processing task and the current network environment information of the business system, estimate the second success probability of using the candidate fault self-healing scheme to perform fault self-healing on the fault corresponding to the cause information.
  • The success probability of each candidate fault self-healing scheme for the above fault under the current network environment information of the business system can be determined according to a preset correspondence between network environment information and the success probabilities of fault self-healing schemes; that success probability is then adjusted based on the execution time consumption of each candidate processing task, and the adjusted success probability is determined as the second success probability.
  • For example, if the execution time consumption of the candidate processing tasks included in one candidate fault self-healing scheme P1 is less than the execution time consumption of the candidate processing tasks included in another candidate fault self-healing scheme P2, the success probability corresponding to the candidate fault self-healing scheme P1 can be increased and the success probability corresponding to the candidate fault self-healing scheme P2 can be reduced, and the adjusted success probabilities are determined as the second success probabilities.
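  • The adjustment in the P1/P2 example can be sketched as follows. The disclosure does not give an adjustment formula, so comparing each scheme's total execution time with the average, the `max_adjust` bound, and the sample numbers are assumptions made for illustration.

```python
def second_success_probabilities(base_probs, exec_times, max_adjust=0.1):
    """Adjust each scheme's base success probability by its execution time.

    base_probs: scheme name -> success probability under the current network
                environment (looked up from a preset correspondence).
    exec_times: scheme name -> total execution time of its tasks, in seconds.
    Schemes faster than the average are raised, slower ones are lowered,
    by at most max_adjust (an assumed bound)."""
    mean_time = sum(exec_times.values()) / len(exec_times)
    adjusted = {}
    for name, base in base_probs.items():
        relative = (mean_time - exec_times[name]) / mean_time if mean_time else 0.0
        relative = max(-1.0, min(1.0, relative))
        # Keep the result a valid probability.
        adjusted[name] = min(1.0, max(0.0, base + max_adjust * relative))
    return adjusted


# Hypothetical schemes: P1 takes less time than P2.
second = second_success_probabilities(
    base_probs={"P1": 0.8, "P2": 0.8},
    exec_times={"P1": 10.0, "P2": 30.0},
)
```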
  • The above steps S204711-S204712 can use an effect prediction model to obtain the second success probability: the candidate processing tasks included in the candidate fault self-healing scheme are used as the input of the effect prediction model, and the effect prediction model determines the execution time consumption of each candidate processing task based on the task parameters and inter-task dependencies of the candidate processing tasks, and estimates the second success probability according to the execution time consumption of each candidate processing task and the current network environment of the business system.
  • The estimated second success probability is related to the execution time consumption of each candidate processing task, and the execution time consumption of each candidate processing task affects the efficiency of fault self-healing, so the estimated second success probability takes the efficiency of fault self-healing into account; the fault self-healing scheme is then determined based on the first success probability and the second success probability.
  • The execution process of each fault self-healing task can also be monitored; when task execution is abnormal, the scheduling sequence of the fault self-healing tasks is adjusted, and/or the execution progress of the fault self-healing tasks is controlled.
  • the abnormality of the above task execution may include: task conflicts during task execution, slow progress of the current task execution, and the like.
  • Information such as the execution status, execution progress percentage, and execution description information of each fault self-healing task can be monitored. Based on this monitoring information, when task execution is abnormal, it is determined whether to adjust the scheduling sequence and/or control the execution progress of the fault self-healing tasks, and the determined operation is performed.
  • In this way, when task execution is abnormal, the scheduling sequence of the fault self-healing tasks is adjusted, and/or the execution progress of the fault self-healing tasks is controlled, so that possible problems can be handled in time and fault self-healing can be carried out successfully.
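  • One possible shape for this monitoring step is sketched below. The status values, the `slack` factor used to decide that a task is slow, and the policy of deferring conflicting tasks to the end of the queue are all assumptions made for illustration; the disclosure does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    name: str
    state: str         # "pending", "running", "conflict" or "done"
    progress: float    # execution progress percentage, 0-100
    elapsed_s: float   # time spent so far
    expected_s: float  # expected total execution time


def find_abnormal(tasks, slack=1.5):
    """Flag tasks that conflict or whose progress lags far behind schedule."""
    abnormal = {}
    for task in tasks:
        if task.state == "conflict":
            abnormal[task.name] = "conflict"
        elif task.state == "running" and task.expected_s > 0:
            expected_progress = min(100.0, 100.0 * task.elapsed_s / task.expected_s)
            if task.progress * slack < expected_progress:
                abnormal[task.name] = "slow"
    return abnormal


def reschedule(queue, abnormal):
    """Adjust the scheduling sequence: defer conflicting tasks to the end."""
    kept = [name for name in queue if abnormal.get(name) != "conflict"]
    deferred = [name for name in queue if abnormal.get(name) == "conflict"]
    return kept + deferred


# Hypothetical monitoring snapshot.
statuses = [
    TaskStatus("check-disk", "running", 10.0, 60.0, 100.0),
    TaskStatus("restart", "conflict", 0.0, 0.0, 30.0),
    TaskStatus("verify", "running", 50.0, 30.0, 60.0),
]
abnormal = find_abnormal(statuses)
new_queue = reschedule(["restart", "check-disk", "verify"], abnormal)
```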
  • The operation and maintenance staff can also monitor, adjust, and schedule the execution of the fault self-healing tasks while they are running.
  • the operation and maintenance staff can adjust each fault self-healing task included in the above fault self-healing plan, and control the progress of the start, end, pause, and continuation of the task in real time, and Confirm the execution results, and enter the above fault self-healing scheme into the scheme library through the self-healing console.
  • It is also possible to store the operations performed by the operation and maintenance staff during plan execution, eliminate invalid information from those operations, perform format conversion on them, and use the converted data as training samples of the recommendation model to retrain the recommendation model, so that the recommendation model can learn the rules and characteristics of the operations performed by the operation and maintenance staff.
  • The user interface of the user end of the business system is equipped with a fault self-healing console. Through the self-healing task management function module of the fault self-healing console, the user can browse the currently generated or executed fault self-healing scheme and edit the fault self-healing tasks it includes, such as adding, deleting, and modifying operation tasks, inspection tasks, and so on. The parameters of the execution tool corresponding to each task can also be configured.
  • Through the self-healing plan editing function module of the fault self-healing console, users can also edit the fault self-healing plan, such as adjusting the execution order of the fault self-healing tasks in the plan, and adding, deleting, and modifying the content of the plan.
  • Fig. 7 is a flowchart of a fault handling method provided by an embodiment of the present disclosure.
  • Figure 7 includes five functional modules, namely: perception engine, decision engine, execution engine, collaboration engine, and fault self-healing console, where the perception engine, decision engine, execution engine, and collaboration engine are functional modules installed on the server, and the fault self-healing console is a functional module installed on the client.
  • the perception engine obtains alarm information, and obtains the monitoring data of the business system through the perception engine, and the perception engine inputs the above alarm information and monitoring data into the decision engine;
  • the decision engine obtains the associated data related to the alarm information, and according to the above-mentioned associated data, determines the cause information of the fault that triggers the alarm information;
  • the decision engine also determines the fault self-healing plan including the fault self-healing task according to the above-mentioned cause information and the alarm information obtained by the perception engine, and inputs the above-mentioned fault self-healing plan to the execution engine;
  • the execution engine performs fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
  • the operation and maintenance staff can monitor the execution of the above fault self-healing tasks through the decision engine and the collaborative engine, and adjust the scheduling sequence of the fault self-healing tasks, and/or control the fault The execution progress of the self-healing task.
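  • The hand-off between the perception, decision, and execution engines can be pictured with the toy pipeline below. Every function body is a placeholder: the cause analysis is reduced to reading a hypothetical `symptom` field, and the scheme library is a plain dictionary, so this only illustrates the data flow and not the engines themselves.

```python
def perception_engine(event):
    """Return the alarm information and the monitoring data of the system."""
    return event["alarm"], event["monitoring"]


def decision_engine(alarm, monitoring, scheme_library):
    """Pick associated data, derive cause information, and choose a scheme."""
    # Placeholder association rule: monitoring records for the same node.
    associated = [rec for rec in monitoring if rec["node"] == alarm["node"]]
    # Placeholder cause analysis: take the first associated symptom.
    cause = associated[0]["symptom"] if associated else "unknown"
    return cause, scheme_library.get(cause, [])


def execution_engine(tasks):
    """Execute the fault self-healing tasks of the scheme (simulated here)."""
    return [f"executed {task}" for task in tasks]


def handle_fault(event, scheme_library):
    alarm, monitoring = perception_engine(event)
    cause, tasks = decision_engine(alarm, monitoring, scheme_library)
    return cause, execution_engine(tasks)


# Hypothetical event and scheme library.
event = {
    "alarm": {"node": "order-api"},
    "monitoring": [
        {"node": "order-api", "symptom": "memory leak"},
        {"node": "pay-api", "symptom": "disk full"},
    ],
}
library = {"memory leak": ["restart order-api"]}
result = handle_fault(event, library)
```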
  • The perception engine includes three functional units, namely: document data subscription, alarm data subscription, and index data extraction.
  • Document data subscription is used to subscribe to documents in the document platform, extract the obtained documents in a structured manner, and obtain known operation and maintenance information in the form of "question-answer", also known as operation and maintenance knowledge, and add the above operation and maintenance knowledge to In the operation and maintenance knowledge base.
  • Alarm data subscription is used to subscribe to the alarm information in the event platform, and extract the target field value of the preset field of the obtained alarm information.
  • Index data extraction is used to obtain monitoring indicators, change orders, and capacity data from the monitoring system, change system, and capacity system.
  • The monitoring indicators include other alarm information generated by the business system for the target node to which the alarm information relates, the running status of the business system, abnormal information, and so on; the change order includes information on whether the business system changed before the alarm information was generated and the content of the system change; and the capacity data includes the available resources of the business system.
  • The decision engine includes four functional units, namely: situation understanding, plan recommendation, plan generation, and self-healing plan controller.
  • the situation understanding obtains the alarm information, according to the target field value of the preset field recorded in the alarm information, extracts the associated data related to the alarm information from the index data extraction functional unit, and conducts multi-dimensional analysis on the associated data to obtain Information about the cause of the fault that triggered the alarm message.
  • The plan recommendation includes two functional subunits, namely rule matching and plan recommendation, together with a recommendation model.
  • The above recommendation model can be an NLP/KG (Natural Language Processing/Knowledge Graph) model.
  • the above rule matching is used to obtain the cause information, and the cause information and alarm information are used as keywords, and the keyword matching is performed from each known fault self-healing scheme stored in the scheme library corresponding to the self-healing configuration module to determine the successfully matched fault self-healing plan.
  • The above plan recommendation is used, after the above rule matching fails, to call the recommendation model according to the alarm information and the cause information, obtain the description information of candidate processing methods from the operation and maintenance knowledge base, and determine the confidence (the aforementioned first success probability) of each piece of description information, where the description information consists of several "answers" from the "question-answer" entries stored in the operation and maintenance knowledge base.
  • the plan generation includes a plan generator.
  • the plan generator obtains several "answers" output by the above-mentioned plan recommendation function subunits, and arranges them in order of confidence from high to low.
  • The plan generator calls the effect prediction algorithm to predict the effect of the above "answers" and obtain the second success probability. Based on the first success probability and the second success probability, combined with the adjustment and control of the operation and maintenance staff, a fault self-healing plan is generated.
  • The self-healing scheme controller is used to obtain the fault self-healing scheme generated by the scheme generator and input it into the execution engine function module, and is also used for risk control and progress control of the execution process while the fault self-healing scheme is being executed.
  • the operation and maintenance staff can confirm the knowledge entered in the operation and maintenance knowledge base, adjust the generated fault self-healing plan, and intervene in the process of executing the fault self-healing plan through the manual takeover module in the above function modules .
  • the collaborative engine function module is also used to collect the behavior data of the operation and maintenance staff, and use the above behavior data as the training samples of the recommendation model to iteratively update the above model.
  • The fault self-healing console includes a fault self-healing solution recommendation function module, an operation and maintenance knowledge base function module, an effect statistical analysis function module, a self-healing configuration function module, a login authentication function module, a self-healing task management function module, and a self-healing plan editing function module.
  • the fault self-healing scheme recommendation function module is used to display the generated fault self-healing scheme.
  • the operation and maintenance knowledge base can enable the operation and maintenance staff to enter and confirm the operation and maintenance knowledge based on this functional module.
  • Rights management used to manage user rights.
  • Self-healing task management used to display the currently generated or executed fault self-healing plan, and provide the function of editing the fault self-healing tasks included in the fault self-healing plan, such as adding, deleting, modifying operation tasks, checking tasks, etc. You can also configure the parameters of the execution tool corresponding to each task.
  • FIG. 8 is a schematic structural diagram of a first fault handling device provided by an embodiment of the present disclosure.
  • the above device includes the following modules 801 - 805 .
  • the information obtaining module 801 is configured to obtain the alarm information generated by the business system
  • the data obtaining module 802 is configured to obtain associated data related to the above-mentioned alarm information according to the monitoring data of the above-mentioned business system;
  • the plan obtaining module 804 is configured to obtain a fault self-healing plan including a fault self-healing task according to the above-mentioned cause information and the above-mentioned alarm information;
  • The associated data is data related to the alarm information. When a fault occurs in the business system, the fault may bring a series of chain effects, and the fault may itself be caused by other problems.
  • The various information generated in this series of chain effects is interrelated, and the cause information of the fault can be determined from this interrelated information. Therefore, the cause information of the fault that triggers the alarm information can be determined more accurately based on the associated data, and the fault self-healing scheme obtained according to the cause information and the alarm information then has a higher success rate of self-healing for the fault corresponding to the cause information.
  • the alarm information obtaining sub-module 8021 is configured to obtain, from the above monitoring data, other alarm information for the target service node within the first time period in which the alarm time recorded in the above alarm information falls, wherein the above target service node is the service node targeted by the above alarm information;
  • the first information obtaining sub-module 8022 is configured to determine the impact factor that triggers a fault of the target fault type, and to obtain, according to the values of the above impact factor recorded in the above monitoring data within the second time period in which the above alarm time falls, first information representing the fluctuation of the impact factor, wherein the above target fault type is the fault type recorded in the above alarm information;
  • the second information obtaining sub-module 8023 is configured to obtain, according to the latest system update time recorded in the above monitoring data and the above alarm time, second information indicating whether the above business system was updated within the third time period before the above alarm information was generated;
  • the resource amount selection sub-module 8024 is configured to select the available resource amount of the above-mentioned business system within the fourth time period where the above-mentioned alarm time is located from the above-mentioned monitoring data.
  • Because the impact factor is the factor that triggers faults of the fault type recorded in the alarm information, the values of the impact factor within a given period of time are associated with the alarm information generated within that period.
  • The above first information is determined according to the values of the impact factor in the second time period, which includes the alarm time of the alarm information, so the first information is associated with the above alarm information, and determining the first information as associated data is highly accurate.
  • The second information indicates whether a system update occurred in the third time period before the alarm information was generated. Because a system update can easily cause the business system to fail and thus generate alarm information, the second information is highly correlated with the alarm information, and determining the second information as associated data is highly accurate.
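  • The time-window lookups performed by sub-modules 8021-8023 can be sketched as follows. The record layout, the symmetric windows around the alarm time, and the use of max minus min as the "fluctuation" are assumptions made for illustration, not details given in the disclosure.

```python
from datetime import datetime, timedelta


def other_alarms(alarms, alarm, window):
    """Other alarms for the same service node within a window around the alarm time."""
    start, end = alarm["time"] - window, alarm["time"] + window
    return [a for a in alarms
            if a is not alarm and a["node"] == alarm["node"]
            and start <= a["time"] <= end]


def factor_fluctuation(samples, alarm_time, window):
    """First information: range of the impact factor values near the alarm time.
    samples is a list of (timestamp, value) pairs."""
    values = [v for t, v in samples
              if alarm_time - window <= t <= alarm_time + window]
    return max(values) - min(values) if values else 0.0


def updated_before(last_update, alarm_time, window):
    """Second information: was the system updated within the window before the alarm?"""
    return alarm_time - window <= last_update <= alarm_time


# Hypothetical monitoring data.
t0 = datetime(2022, 7, 19, 12, 0)
alarm = {"node": "order-api", "time": t0}
alarms = [
    alarm,
    {"node": "order-api", "time": t0 - timedelta(minutes=3)},
    {"node": "order-api", "time": t0 - timedelta(hours=2)},
    {"node": "pay-api", "time": t0},
]
cpu_samples = [
    (t0 - timedelta(minutes=5), 40.0),
    (t0, 95.0),
    (t0 - timedelta(days=1), 10.0),
]
related = other_alarms(alarms, alarm, timedelta(minutes=10))
fluctuation = factor_fluctuation(cpu_samples, t0, timedelta(minutes=10))
recently_updated = updated_before(t0 - timedelta(minutes=30), t0, timedelta(hours=1))
```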
  • FIG. 10 is a schematic structural diagram of a third fault handling device provided by an embodiment of the present disclosure.
  • the above solution obtaining module 804 includes the following submodules 8041-8042:
  • the scheme search sub-module 8041 is configured to search for a fault self-healing scheme that performs self-healing processing on the fault corresponding to the above cause information in the known fault self-healing scheme according to the above-mentioned alarm information;
  • the scheme determining submodule 8042 is configured to determine the found fault self-healing scheme as a fault self-healing scheme including a fault self-healing task.
  • Because the fault self-healing scheme is a known fault self-healing scheme, it can be found among the above known fault self-healing schemes and fault self-healing can be performed directly based on the found scheme, which improves the efficiency of fault self-healing.
  • The found fault self-healing scheme can perform self-healing processing on the above fault, so the fault can be self-healed through the found scheme.
  • the solution search sub-module 8041 is also configured to extract the target field value of the preset field in the above alarm information and, based on the above target field value, search the above known fault self-healing solutions for a fault self-healing solution that performs self-healing processing on the fault corresponding to the cause information and includes a target self-healing task, wherein the target self-healing task is a fault self-healing task set according to the target field value of the preset field.
  • the target self-healing task is set according to the target field value of the preset field
  • the found fault self-healing task is therefore set according to the target field value of the preset field in the alarm information, which improves the success rate of fault self-healing with that task.
  • the values of these fields include the alarm time of the alarm information, the identification of the service node targeted by the alarm information, the identification of the equipment generating the alarm information, the identification of the equipment room where the equipment is located, and the identification of the instance that triggered the generation of the alarm information.
  • the values of these fields express the specific situation of the alarm information from different aspects, so extracting the values of the above-mentioned preset fields reflects the alarm information more accurately.
  • Fig. 11a is a schematic structural diagram of a fourth fault handling device provided by an embodiment of the present disclosure.
  • the above solution obtaining module 804 further includes the following submodules 8043-8047.
  • the first similarity obtaining sub-module 8043 is configured to, when the scheme search sub-module 8041 does not find a fault self-healing scheme for self-healing the fault corresponding to the above cause information, obtain the first similarity between each piece of known operation and maintenance information and the cause information according to the first semantics of the description information in that piece of known operation and maintenance information and the second semantics of the cause information, wherein each piece of known operation and maintenance information includes: description information of a system abnormality and description information of a handling method for the system abnormality;
  • the second similarity obtaining submodule 8044 is configured to obtain the second similarity between each known operation and maintenance information and the above-mentioned alarm information according to the above-mentioned first semantics and the above-mentioned third semantics of the alarm information;
  • the candidate solution determination sub-module 8046 is configured to, for each candidate processing method, obtain candidate processing tasks based on the description information of the candidate processing method, so as to obtain a candidate fault self-healing solution including the above candidate processing tasks;
  • the self-healing scheme determination sub-module 8047 is configured to determine the fault self-healing scheme from each candidate fault self-healing scheme.
  • the second similarity is the difference between the above-mentioned first semantics and the third semantics of the alarm information
  • the semantics of the cause information, alarm information and the semantics of the description information in each known operation and maintenance information are considered comprehensively.
  • the similarity makes the determined candidate processing methods corresponding to the description information more accurately handle the faults of the alarm information, thereby making the determined fault self-healing scheme more accurate.
  • FIG. 11b is a schematic structural diagram of a fifth fault handling device provided by an embodiment of the present disclosure.
  • the above solution obtaining module 804 further includes:
  • the probability obtaining sub-module 8048 is configured to obtain, after the above-mentioned information selection sub-module 8045, the first success probability of self-healing the fault corresponding to the above cause information with each candidate processing method;
  • the above self-healing scheme determination sub-module 8047 includes:
  • the probability estimation unit 80471 is configured to, for each candidate fault self-healing scheme, estimate, according to the current network environment information of the above-mentioned business system and the candidate processing tasks included in the candidate fault self-healing scheme, the second success probability of self-healing the fault corresponding to the above cause information by using the candidate fault self-healing scheme;
  • the self-healing scheme determining unit 80472 is configured to determine a fault self-healing scheme from each candidate fault self-healing scheme according to the first success probability and the second success probability.
  • the first success probability and the second success probability assess each candidate fault self-healing scheme from two different angles, so determining the fault self-healing scheme based on both improves the probability that fault self-healing with the determined scheme succeeds.
  • Fig. 12 is a schematic structural diagram of a sixth fault handling device provided by an embodiment of the present disclosure.
  • the above probability estimation unit 80471 includes:
  • the time-consuming determination subunit 804711 is configured to determine the time-consuming execution of each candidate processing task according to the task parameters and inter-task dependencies of each candidate processing task included in the candidate fault self-healing solution;
  • the probability estimation subunit 804712 is configured to estimate, according to the execution time of each candidate processing task and the current network environment information of the above-mentioned business system, the second success probability of self-healing the fault corresponding to the above cause information by using the candidate fault self-healing scheme.
  • the estimated second success probability is related to the execution time of each candidate processing task, and that execution time affects the efficiency of fault self-healing; the second success probability therefore takes the efficiency of fault self-healing into account when the fault self-healing scheme is determined based on the first success probability and the second success probability.
  • the above fault self-healing module 805 includes:
  • the tool determination submodule is configured to determine the task execution tool matching the task type of each fault self-healing task
  • the fault self-healing sub-module is set to call the task execution tools corresponding to the above fault self-healing tasks according to the execution sequence of the fault self-healing tasks in the above fault self-healing scheme, execute the above fault self-healing tasks, and perform fault self-healing.
  • the task execution tool matching the task type of a task can execute that task; therefore, by calling the above-mentioned task execution tools, the fault self-healing tasks can be executed, so as to realize fault self-healing.
  • the above-mentioned device further includes:
  • the process monitoring module is configured to monitor the execution process of each fault self-healing task
  • the task control module is configured to adjust the scheduling sequence of the above-mentioned fault self-healing tasks, and/or control the execution progress of the above-mentioned fault self-healing tasks in the case of monitoring abnormal task execution.
  • the scheduling sequence of the fault self-healing tasks is adjusted, and/or the execution progress of the fault self-healing tasks is controlled, so that possible problems can be handled in time and fault self-healing can be realized successfully.
  • An embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the fault handling method.
  • An embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to make the computer execute the fault handling method.
  • An embodiment of the present disclosure provides a computer program product, including a computer program, and the computer program implements a fault handling method when executed by a processor.
  • FIG. 13 shows a schematic block diagram of an example electronic device 1300 that may be used to implement embodiments of the present disclosure.
  • Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the I/O interface 1305 includes: an input unit 1306, such as a keyboard, a mouse, etc.; an output unit 1307, such as various types of displays, speakers, etc.; a storage unit 1308, such as a magnetic disk, an optical disk etc.; and a communication unit 1309, such as a network card, a modem, a wireless communication transceiver, and the like.
  • the communication unit 1309 allows the electronic device 1300 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • the computing unit 1301 may be various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1301 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc.
  • the computing unit 1301 executes the various methods and processes described above, such as the fault handling method.
  • the fault handling method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1308 .
  • part or all of the computer program can be loaded and/or installed on the electronic device 1300 via the ROM 1302 and/or the communication unit 1309.
  • when the computer program is loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the fault handling method described above can be performed.
  • the computing unit 1301 may be configured in any other appropriate way (for example, by means of firmware) to execute the fault handling method.
  • Various implementations of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • steps may be reordered, added or deleted using the various forms of flow shown above.
  • each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.
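
The tool determination and fault self-healing sub-modules described above (a task execution tool is matched to each task type, and tasks are executed in the order given by the scheme) can be sketched as follows. This is only an illustrative sketch: the registry `TOOLS`, the task types `restart_service` and `switch_traffic`, and the dictionary layout of a task are hypothetical and are not taken from the disclosure.

```python
from typing import Callable, Dict, List

# Hypothetical registry: task type -> task execution tool.
TOOLS: Dict[str, Callable[[dict], str]] = {
    "restart_service": lambda p: f"restarted {p['service']}",
    "switch_traffic": lambda p: f"switched traffic to {p['target']}",
}


def run_scheme(tasks: List[dict]) -> List[str]:
    """Execute the tasks of a scheme in their declared execution order."""
    log = []
    for task in sorted(tasks, key=lambda t: t["order"]):
        tool = TOOLS.get(task["type"])  # tool matching the task type
        if tool is None:
            raise KeyError(f"no execution tool for task type {task['type']!r}")
        log.append(tool(task["params"]))
    return log


# Assumed example scheme with two tasks listed out of order.
tasks = [
    {"order": 2, "type": "restart_service", "params": {"service": "search"}},
    {"order": 1, "type": "switch_traffic", "params": {"target": "room-b"}},
]
log = run_scheme(tasks)
```

Keeping the mapping in a registry means a new task type only needs a new entry, which is one possible way to realize the matching between task types and tools.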

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention relates to the technical field of data processing, particularly, the technical field of fault processing, and provides a fault processing method and apparatus, a device, and a storage medium. The specific implementation solution is: obtaining alarm information generated by a service system; according to monitoring data of the service system, obtaining associated data having an association with the alarm information; according to the associated data, determining cause information for triggering a fault generating the alarm information; obtaining a fault self-healing solution comprising a fault self-healing task according to the cause information and the alarm information; and performing fault self-healing by executing the fault self-healing task comprised in the fault self-healing solution. By applying the solution provided in the embodiments of the present invention, fault self-healing can be implemented on a fault generated by a service system.

Description

A fault handling method, device, equipment, and storage medium
This disclosure claims priority to Chinese patent application No. 202110904245.7, filed with the China Patent Office on August 6, 2021 and entitled "A fault handling method, device, equipment and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of data processing, in particular to the technical field of fault handling, and further relates to a fault handling method, device, equipment, and storage medium.
Background
As enterprises vigorously promote IT digital transformation, more and more of the services they provide to users are delivered through online business systems. In addition, to meet users' ever-changing needs, these business systems offer an increasingly rich range of services. As a result, the business volume of business systems keeps growing, and so does the probability that a business system fails.
Summary
The present disclosure provides a fault handling method, device, equipment, and storage medium.
According to one aspect of the present disclosure, a fault handling method is provided, including:
obtaining alarm information generated by a business system; obtaining, according to monitoring data of the business system, associated data that is related to the alarm information; determining, according to the associated data, cause information of the fault that triggered the alarm information; obtaining, according to the cause information and the alarm information, a fault self-healing scheme including fault self-healing tasks; and performing fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
According to another aspect of the present disclosure, a fault handling device is provided, including:
an information obtaining module, configured to obtain alarm information generated by a business system; a data obtaining module, configured to obtain, according to monitoring data of the business system, associated data that is related to the alarm information; an information determination module, configured to determine, according to the associated data, cause information of the fault that triggered the alarm information; a solution obtaining module, configured to obtain, according to the cause information and the alarm information, a fault self-healing scheme including fault self-healing tasks; and a fault self-healing module, configured to perform fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
According to another aspect of the present disclosure, an electronic device is provided, including:
at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the fault handling method.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to make a computer execute the fault handling method.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that implements the fault handling method when executed by a processor.
According to another aspect of the present disclosure, a computer program is provided which, when run on a computer, causes the computer to execute the steps of any fault handling method provided in the above aspect.
It can be seen from the above that, when fault self-healing is performed using the scheme provided by the embodiments of the present disclosure, the fault self-healing tasks take into account both the alarm information itself and the cause information of the fault that triggered the alarm information, and the cause information reflects why that fault occurred. The fault self-healing tasks can therefore heal the fault not only at the surface level shown by the alarm information but also at the root-cause level shown by the cause information. This resolves the fault precisely and effectively improves the efficiency of stopping losses. In other words, the scheme provided by the embodiments of the present disclosure can perform fault self-healing on faults arising in a business system, resolve them precisely, and effectively improve stop-loss efficiency.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The accompanying drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:
FIG. 1 is a schematic diagram of the whole fault management process of a business system provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a first fault handling method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a second fault handling method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of a third fault handling method provided by an embodiment of the present disclosure;
FIG. 5a is a schematic flowchart of a fourth fault handling method provided by an embodiment of the present disclosure;
FIG. 5b is a schematic flowchart of a fifth fault handling method provided by an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of a sixth fault handling method provided by an embodiment of the present disclosure;
FIG. 7 is a flow block diagram of a fault handling method provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a first fault handling device provided by an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a second fault handling device provided by an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a third fault handling device provided by an embodiment of the present disclosure;
FIG. 11a is a schematic structural diagram of a fourth fault handling device provided by an embodiment of the present disclosure;
FIG. 11b is a schematic structural diagram of a fifth fault handling device provided by an embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of a sixth fault handling device provided by an embodiment of the present disclosure;
FIG. 13 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.
Because a fault in a business system needs to be resolved effectively when it occurs, the embodiments of the present disclosure provide a fault handling method, device, equipment, and storage medium.
In one embodiment of the present disclosure, a fault handling method is provided, including:
obtaining alarm information generated by a business system;
obtaining, according to monitoring data of the business system, associated data that is related to the alarm information;
determining, according to the associated data, cause information of the fault that triggered the alarm information;
obtaining, according to the cause information and the alarm information, a fault self-healing scheme including fault self-healing tasks;
performing fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
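
The five steps above can be sketched as a small pipeline. This is a minimal illustration only: the names `handle_fault`, `find_associated`, `determine_cause`, and `obtain_scheme`, the 30-second window, and the `rollback_update` task are hypothetical and do not come from the disclosure, and stopping at the first failed task is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SelfHealingTask:
    name: str
    action: Callable[[], bool]  # returns True when the task succeeds


@dataclass
class SelfHealingScheme:
    tasks: List[SelfHealingTask] = field(default_factory=list)


def handle_fault(alarm, monitoring_data, find_associated, determine_cause, obtain_scheme):
    """Run steps 2-5 for an alarm that has already been obtained (step 1)."""
    associated = find_associated(alarm, monitoring_data)  # step 2: associated data
    cause = determine_cause(associated)                   # step 3: cause information
    scheme = obtain_scheme(cause, alarm)                  # step 4: self-healing scheme
    results = []
    for task in scheme.tasks:                             # step 5: execute the tasks
        ok = task.action()
        results.append((task.name, ok))
        if not ok:  # assumption: stop at the first failed task
            break
    return cause, results


# Hypothetical stand-ins for the three decision steps.
def find_associated(alarm, monitoring):
    # Records on the same node within 30 seconds before the alarm.
    return [r for r in monitoring
            if r["node"] == alarm["node"] and 0 <= alarm["time"] - r["time"] <= 30]


def determine_cause(associated):
    return "system_update" if any(r["event"] == "system_update" for r in associated) else "unknown"


def obtain_scheme(cause, alarm):
    if cause == "system_update":
        return SelfHealingScheme([SelfHealingTask("rollback_update", lambda: True)])
    return SelfHealingScheme()


alarm = {"time": 100, "node": "search-frontend"}
monitoring = [
    {"time": 95, "event": "system_update", "node": "search-frontend"},
    {"time": 10, "event": "system_update", "node": "other"},
]
cause, results = handle_fault(alarm, monitoring, find_associated, determine_cause, obtain_scheme)
```

Passing the three decision steps in as callables keeps the sketch independent of how association, cause determination, and scheme lookup are actually realized.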
It should be noted that the fault handling method provided by the present disclosure can be applied to an electronic device, which may be a terminal or a server used to manage the business system; the present disclosure does not limit the specific form of the electronic device. Because the fault self-healing tasks take into account both the alarm information itself and the cause information of the fault that triggered the alarm information, and the cause information reflects why that fault occurred, the fault self-healing tasks can heal the fault not only at the surface level shown by the alarm information but also at the root-cause level shown by the cause information. This resolves the fault precisely and effectively improves the efficiency of stopping losses. In other words, the scheme provided by the embodiments of the present disclosure can perform fault self-healing on faults arising in a business system, resolve them precisely, and effectively improve stop-loss efficiency.
首先,对本公开实施例的应用场景进行说明。First, the application scenarios of the embodiments of the present disclosure are described.
本公开实施例的应用场景为:对业务***发生的故障进行故障自愈的运维场景。The application scenario of the embodiments of the present disclosure is: an operation and maintenance scenario in which fault self-healing is performed on a fault occurred in a business system.
业务***配置有故障管理全流程,故障管理全流程的各阶段从开始到结束可以依次为:故障预防阶段、发现故障阶段、止损、根因定位以及恢复服务阶段、总结改进阶段,故障自愈是上述止损、根因定位以及恢复服务阶段中的一个环节。其中,上述根因定位表示确定触发产生报警信息的故障的原因。The business system is configured with the whole process of fault management. The stages of the whole process of fault management can be sequenced from beginning to end: fault prevention stage, fault discovery stage, stop loss, root cause location and recovery service stage, summary improvement stage, fault self-healing It is a link in the stop loss, root cause location and restoration of service phase mentioned above. Wherein, the above-mentioned root cause location refers to determining the cause of the fault that triggers the generation of the alarm information.
以图1为例,图1示出了业务***的故障管理全流程的示意图。Taking FIG. 1 as an example, FIG. 1 shows a schematic diagram of the whole process of fault management of the business system.
从图1可以看到,在止损、根因定位以及恢复服务阶段中,故障自愈的过程为:从响应报警信息到故障自愈方案生成、从故障自愈方案生成到开始执行故障自愈方案、从开始执行故障自愈方案到结束执行故障自愈方案这一整个过程。故障自愈所经历的时长称为止损时长,止损时长越短,业务损失越小。As can be seen from Figure 1, in the stages of stop loss, root cause location, and service recovery, the process of fault self-healing is: from responding to alarm information to generating a fault self-healing plan, from generating a fault self-healing plan to starting to execute fault self-healing Program, the whole process from the beginning of implementing the fault self-healing program to the end of implementing the fault self-healing program. The length of time for self-healing of faults is called the stop loss time. The shorter the stop loss time, the smaller the business loss.
以下对本公开实施例提供的故障处理方法进行具体说明。The fault handling method provided by the embodiments of the present disclosure will be described in detail below.
参见图2,图2为本公开实施例提供的第一种故障处理方法的流程示意图,上述方法包括以下步骤S201-步骤S205。Referring to FIG. 2 , FIG. 2 is a schematic flowchart of a first fault handling method provided by an embodiment of the present disclosure. The above method includes the following steps S201 - S205 .
Step S201: Obtain alarm information generated by the business system.
A business system is a system that provides services to users; for example, it may be a search business system, a cloud storage business system, or a game business system. When the business system generates alarm information, a fault has occurred in the business system and an alarm has been raised. Alarm information generally records descriptive information about the alarm, such as the alarm time and the name of the business node on which the alarm occurred.
The alarm information may be obtained from the results of monitoring the business system. In one implementation, the different types of information generated while the business system runs are monitored, and when information of the alarm type, that is, alarm information, is detected, it is retrieved from the information store that holds alarm information. For example, the information store may reside on an event platform configured in the business system, which stores the alarm information generated by the business system. Alarm information can then be fetched from the event platform at a preset time interval, such as 1 s, 5 s, or 10 s.
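The periodic retrieval described above can be sketched as a simple polling loop. This is a minimal illustration, not the disclosed implementation: the `fetch` callable, the record layout, and the in-memory stand-in for the event platform are all assumptions made for the example.

```python
import time
from typing import Callable, Iterator, List


def poll_alarms(fetch: Callable[[], List[dict]],
                interval_s: float,
                max_polls: int,
                sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield alarm records fetched from an event platform at a fixed interval."""
    for i in range(max_polls):
        for alarm in fetch():
            yield alarm
        if i < max_polls - 1:
            sleep(interval_s)  # preset interval, e.g. 1 s, 5 s or 10 s


# Hypothetical in-memory stand-in for the event platform's information store.
_batches = [
    [{"time": "00:10:00", "node": "module-a"}],
    [],
    [{"time": "00:10:07", "node": "module-b"}],
]


def fake_fetch() -> List[dict]:
    """Return the next batch of stored alarms, or an empty list if none remain."""
    return _batches.pop(0) if _batches else []


# The sleep function is replaced with a no-op so the example finishes instantly.
alarms = list(poll_alarms(fake_fetch, interval_s=5, max_polls=3, sleep=lambda s: None))
```

Passing the sleep function as a parameter is only a convenience for demonstration; a real deployment would more likely rely on a scheduler or on notifications pushed by the event platform.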
Step S202: Based on the monitoring data of the business system, obtain associated data that is correlated with the alarm information.
The alarm information here is the alarm information generated by the business system and obtained in step S201.
Various types of data may be generated while the business system runs, and monitoring this data yields the monitoring data.
The monitoring data may come from different functional subsystems configured in the business system. For example, it may include the various alarm information generated by the business system and the running state of the business system, as recorded in a monitoring subsystem; business system change information recorded in a system change subsystem; and the amount of available resources of the business system recorded in a capacity subsystem.
The associated data may include monitoring data that is correlated with the alarm information, and may also include data that is derived by analyzing the monitoring data and is correlated with the alarm information.
For the specific way of obtaining the associated data, see the embodiment corresponding to FIG. 3; it is not detailed here.
Step S203: Based on the associated data, determine cause information for the fault that triggered the alarm information.
The alarm information here is the alarm information generated by the business system and obtained in step S201.
The cause information characterizes the cause of the fault that triggered the alarm information.
For example, suppose the fault in the business system is a network link fault, the alarm information it triggers is that the network traffic of a business node has dropped, and the cause of the fault is that the network link is broken. The cause information for the fault that triggered the alarm information is then "the network link is broken".
The associated data is correlated with the alarm information. When a fault occurs in the business system, it may set off a chain of knock-on effects, and the fault itself may have been caused by other problems. The various pieces of information produced along this chain are interrelated, and the cause of the fault can be determined from such interrelated information. The cause information for the fault that triggered the alarm information can therefore be determined from the associated data.
In one implementation, a correspondence between alarm information and cause information is preset. Because several different faults may raise the same alarm, one piece of alarm information may correspond to several pieces of cause information. Accordingly, the candidate cause information corresponding to the current alarm information is first determined from the correspondence, and the candidate cause information that is related to the associated data is then selected as the cause information for the fault that triggered the alarm information.
The correspondence may be determined by experts or operation and maintenance staff based on experience.
For example, suppose the current alarm information is "traffic of business system A has dropped", and the preset correspondence gives three candidate causes for it: "network failure", "system change", and "device A failure". The associated data obtained includes "network traffic fluctuates greatly" and "device B failure". Because the candidate cause "network failure" is related to "network traffic fluctuates greatly" in the associated data, "network failure" can be determined as the cause information for the fault that triggered the alarm information.
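The candidate-selection step in this example can be sketched as follows. The text does not say how "related to the associated data" is decided, so the keyword table used here is purely an assumption for illustration, as are the dictionary names and the alarm key.

```python
from typing import Dict, List

# Hypothetical expert-maintained correspondence: alarm -> candidate causes.
ALARM_TO_CAUSES: Dict[str, List[str]] = {
    "traffic drop": ["network failure", "system change", "device A failure"],
}

# Hypothetical keywords used to judge whether a cause relates to associated data.
CAUSE_KEYWORDS: Dict[str, List[str]] = {
    "network failure": ["network"],
    "system change": ["system update", "change"],
    "device A failure": ["device A"],
}


def select_causes(alarm: str, associated_data: List[str]) -> List[str]:
    """Keep the candidate causes whose keywords appear in the associated data."""
    selected = []
    for cause in ALARM_TO_CAUSES.get(alarm, []):
        keywords = CAUSE_KEYWORDS.get(cause, [])
        if any(k.lower() in item.lower()
               for k in keywords for item in associated_data):
            selected.append(cause)
    return selected


causes = select_causes(
    "traffic drop",
    ["network traffic fluctuates greatly", "device B failure"],
)
```

With the data from the example, only "network failure" survives: "device B failure" in the associated data does not match the candidate cause "device A failure".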
Step S204: Based on the cause information and the alarm information, obtain a fault self-healing plan that includes fault self-healing tasks.
The alarm information here is the alarm information generated by the business system and obtained in step S201.
A fault self-healing plan may include one fault self-healing task or several. When it includes several, the plan also specifies the order in which they are executed, and that order may be parallel or serial.
For specific ways of obtaining the fault self-healing plan, see the embodiments corresponding to FIG. 4, FIG. 5a, FIG. 5b, and FIG. 6; they are not detailed here.
Step S205: Perform fault self-healing by executing the fault self-healing tasks included in the fault self-healing plan.
In one implementation, the fault self-healing tasks are executed one after another in the execution order specified in the fault self-healing plan. That is, when there are several fault self-healing tasks, each is executed in turn according to the configured execution order.
In another implementation, when there are several fault self-healing tasks, a task execution tool matching the task type of each task is determined; then, following the execution order in the plan, the task execution tool corresponding to each task is invoked to execute it, thereby performing fault self-healing.
Each task execution tool executes tasks of a particular task type. For example, the task execution tools may include a network link shutdown tool, a restart tool, and a health check tool.
Because each task execution tool matches the task type of a fault self-healing task, and a tool that matches a task's type can execute that task, invoking these tools executes the fault self-healing tasks and thus achieves fault self-healing. Note that when there is only one fault self-healing task, it may be executed directly or with the help of a task execution tool.
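Dispatching each task to a tool that matches its task type can be sketched as a lookup table of callables. The task types, tool behavior, and target names below are hypothetical, and only a serial execution order is shown; a parallel order would need a different executor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HealingTask:
    task_type: str  # selects the matching task execution tool
    target: str     # e.g. a link, device, or module identifier


log: List[str] = []  # records what each stand-in tool did

# Hypothetical task execution tools, keyed by the task type they handle.
TOOLS: Dict[str, Callable[[str], None]] = {
    "close_link": lambda target: log.append(f"closed link {target}"),
    "restart": lambda target: log.append(f"restarted {target}"),
    "health_check": lambda target: log.append(f"checked {target}"),
}


def execute_plan(tasks: List[HealingTask]) -> None:
    """Run the tasks serially, in list order, with the tool for each task type."""
    for task in tasks:
        tool = TOOLS.get(task.task_type)
        if tool is None:
            raise KeyError(f"no task execution tool for type {task.task_type!r}")
        tool(task.target)


execute_plan([
    HealingTask("close_link", "link-7"),
    HealingTask("restart", "device-a"),
    HealingTask("health_check", "device-a"),
])
```

Failing loudly on an unknown task type is a choice made for the sketch; the source does not say how an unmatched task should be handled.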
As can be seen from the above, when the solution of this embodiment is applied, the fault self-healing tasks take into account both the alarm information itself and the cause information of the fault that triggered it, and the cause information reflects why that fault occurred. The tasks can therefore heal the fault not only at the surface level shown by the alarm information but also at the root level shown by the cause information, which resolves the fault precisely and effectively improves the efficiency of stopping loss.
In addition, the associated data is correlated with the alarm information. When a fault occurs in the business system, it may set off a chain of knock-on effects, and the fault itself may have been caused by other problems. The various pieces of information produced along this chain are interrelated, and the cause of the fault can be determined from such interrelated information. The cause information for the fault that triggered the alarm information can therefore be determined fairly accurately from the associated data, and the fault self-healing plan obtained from the cause information and the alarm information accordingly has a higher success rate in healing the fault corresponding to the cause information.
The specific way of obtaining the associated data in step S202 is described below with reference to FIG. 3. FIG. 3 is a schematic flowchart of a second fault handling method provided by an embodiment of the present disclosure. In step S202, information may be obtained through at least one of the following steps S2021 to S2024 and used as the associated data correlated with the alarm information.
Step S2021: From the monitoring data, obtain other alarm information that targets the target business node and falls within a first time period containing the alarm time recorded in the alarm information.
The alarm time is the time at which the alarm information was generated, where the alarm information is that obtained in step S201. In this step, other alarm information targeting the target business node within the first time period is obtained from the monitoring data. The first time period is set based on the alarm time of the obtained alarm information and contains that alarm time, and the alarm time recorded in each piece of other alarm information falls within the first time period.
For example, the first time period may extend from the alarm time to an earlier moment by a first preset duration, which may be set by staff based on experience. With an alarm time of 00:10:00 and a first preset duration of 5 min, the first time period is 00:05:00-00:10:00. The first time period may also extend from the alarm time in both directions by given durations. With an alarm time of 00:10:00, an extension of 5 min before and an extension of 8 min after, the first time period is 00:05:00-00:18:00. The first time period may also extend from the alarm time to a later moment by a second preset duration, which may likewise be set by staff based on experience. With an alarm time of 00:10:00 and a second preset duration of 8 min, the first time period is 00:10:00-00:18:00. The first preset duration and the second preset duration may be the same or different.
The target business node is the business node targeted by the alarm information obtained in step S201, that is, the business node that generated the alarm information. The target business node may be a business module, an equipment room, a device, or the like.
Optionally, the alarm information generated within the first time period is obtained from the monitoring data, and the other alarm information targeting the target business node is then identified among it.
Because alarms raised for the same node within one time period are likely to be related, the other alarm information for the target business node within the first time period is likely to be related to the alarm information, so using this other alarm information as associated data is highly accurate.
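The window arithmetic and the node filter of step S2021 can be sketched as follows. The `Alarm` record layout and the node names are assumptions; the durations reproduce the 00:05:00-00:18:00 example from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class Alarm:
    time: datetime  # alarm time recorded in the alarm information
    node: str       # business node the alarm targets
    text: str       # short description


def first_time_period(alarm_time: datetime,
                      before: timedelta,
                      after: timedelta) -> Tuple[datetime, datetime]:
    """Window around the alarm time; either extension may be zero."""
    return alarm_time - before, alarm_time + after


def other_alarms(current: Alarm, monitored: List[Alarm],
                 before: timedelta, after: timedelta) -> List[Alarm]:
    """Other alarms for the same node whose alarm time lies in the window."""
    start, end = first_time_period(current.time, before, after)
    return [a for a in monitored
            if a != current and a.node == current.node and start <= a.time <= end]


current = Alarm(datetime(2023, 1, 1, 0, 10, 0), "module-a", "traffic drop")
monitored = [
    current,
    Alarm(datetime(2023, 1, 1, 0, 7, 0), "module-a", "latency high"),
    Alarm(datetime(2023, 1, 1, 0, 20, 0), "module-a", "cpu high"),   # outside window
    Alarm(datetime(2023, 1, 1, 0, 12, 0), "module-b", "disk full"),  # other node
]
window = first_time_period(current.time, timedelta(minutes=5), timedelta(minutes=8))
related = other_alarms(current, monitored, timedelta(minutes=5), timedelta(minutes=8))
```

Setting `after` to zero gives the first variant in the text (00:05:00-00:10:00), and setting `before` to zero gives the third (00:10:00-00:18:00).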
Step S2022: Determine the influence factors that trigger faults of the target fault type and, from the values of the influence factors recorded in the monitoring data within a second time period containing the alarm time, obtain first information characterizing the fluctuation of the influence factors.
The values of the influence factors are recorded in the monitoring data. In this step, the first information characterizing their fluctuation is obtained from the values recorded within the second time period. The second time period is set based on the alarm time of the obtained alarm information and contains that alarm time.
The target fault type is the fault type recorded in the alarm information, where the alarm information is that obtained in step S201. The target fault type may be the type of the fault that triggered the alarm information.
An influence factor characterizes a factor that leads to faults of the target fault type.
Optionally, the influence factors corresponding to the target fault type are determined from a preset correspondence between fault types and influence factors, and used as the influence factors that trigger faults of the target fault type.
For example, if the target fault type is a network traffic drop, the preset correspondence may show that the influence factors for this fault type include the network traffic of the upstream business module, the quality of the external network link, and the quality of the internal network link.
The second time period may extend from the alarm time to an earlier moment by a third preset duration, may extend from the alarm time in both directions by given durations, or may extend from the alarm time to a later moment by a fourth preset duration. The third preset duration and the fourth preset duration may be the same or different. The alarm time is that of the alarm information obtained in step S201.
The first information characterizes how the influence factor fluctuates. Taking network traffic as the influence factor, the first information characterizes the fluctuation of the network traffic.
In one implementation, the difference between the maximum and minimum values of the influence factor within the second time period is calculated and used as the first information. A difference greater than a preset difference threshold indicates that the influence factor fluctuated strongly within the second time period; a difference not greater than the threshold indicates that it was relatively stable within the second time period.
In another implementation, the average of the values of the influence factor within the second time period is calculated and used as the first information. An average greater than a preset average threshold indicates that the influence factor fluctuated strongly within the second time period; an average not greater than the threshold indicates that it was relatively stable within the second time period.
The influence factors determined here are those that trigger faults of the fault type recorded in the alarm information, and the values of an influence factor within a given time period are related to the alarm information generated in that period. The first information is derived from the values of the influence factors within the second time period, which is itself determined from the alarm time of the alarm information. The first information is therefore related to the alarm information, and using it as associated data is highly accurate.
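Both ways of computing the first information described in step S2022 reduce to a few lines. The sample values and the threshold below are made up for illustration and carry no units from the source.

```python
from statistics import mean
from typing import Sequence


def range_first_info(values: Sequence[float]) -> float:
    """First information as the max-min difference over the second time period."""
    return max(values) - min(values)


def mean_first_info(values: Sequence[float]) -> float:
    """First information as the average value over the second time period."""
    return mean(values)


def exceeds(first_info: float, threshold: float) -> bool:
    """True when the first information is greater than the preset threshold."""
    return first_info > threshold


# Hypothetical upstream network traffic samples within the second time period.
traffic = [100, 95, 40, 45]
diff = range_first_info(traffic)   # 100 - 40
avg = mean_first_info(traffic)     # (100 + 95 + 40 + 45) / 4
large_fluctuation = exceeds(diff, threshold=30)
```

The comparison is strict, matching the text: a value equal to the threshold counts as relatively stable.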
Step S2023: Based on the latest system update time recorded in the monitoring data and the alarm time, obtain second information indicating whether a system update occurred in the business system within a third time period before the alarm information was generated.
The third time period extends from the alarm time to an earlier moment by a fifth preset duration. The latest system update time is the time of the most recent update to the system. A system update is an operation such as upgrading or repairing the system.
Optionally, it is determined whether the third time period contains the latest system update time. If so, the second information is that a system update occurred within the third time period; if not, the second information is that no system update occurred within the third time period.
Optionally, when a system update is found to have occurred within the third time period, information such as the content, object, and time of the update may also be obtained and used as associated data correlated with the alarm information.
The second information indicates whether a system update occurred within the third time period before the alarm information was generated, and a system update is relatively likely to cause a fault in the business system and thus to produce alarm information. The second information, which indicates whether a system update occurred within the third time period, is therefore strongly correlated with the alarm information, and using it as associated data is highly accurate.
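The check in step S2023 is a single interval test. The timestamps and the 15-minute lookback below are illustrative assumptions standing in for the fifth preset duration.

```python
from datetime import datetime, timedelta


def update_in_third_period(latest_update: datetime,
                           alarm_time: datetime,
                           lookback: timedelta) -> bool:
    """True if the latest system update lies within `lookback` before the alarm."""
    return alarm_time - lookback <= latest_update <= alarm_time


alarm_time = datetime(2023, 1, 1, 0, 10, 0)
lookback = timedelta(minutes=15)  # hypothetical fifth preset duration

# Update 8 minutes before the alarm: inside the third time period.
recent = update_in_third_period(datetime(2023, 1, 1, 0, 2, 0), alarm_time, lookback)
# Update several hours earlier: outside the third time period.
stale = update_in_third_period(datetime(2022, 12, 31, 20, 0, 0), alarm_time, lookback)
```

Treating both ends of the period as inclusive is a choice made for the sketch; the source does not specify boundary handling.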
Step S2024: From the monitoring data, select the amount of available resources of the business system within a fourth time period containing the alarm time.
The alarm time is that of the alarm information obtained in step S201. The amount of available resources selected from the monitoring data is the amount available within the fourth time period, and the fourth time period is a time period that contains the alarm time.
The fourth time period may extend from the alarm time to an earlier moment by a sixth preset duration, may extend from the alarm time in both directions by given durations, or may extend from the alarm time to a later moment by a seventh preset duration. The sixth preset duration and the seventh preset duration may be the same or different.
The amount of available resources is the amount of resources available to the business system for responding to user requests. The available resources may include bandwidth resources, computing resources, and the like.
Specifically, the amount of available resources at each moment within the fourth time period is obtained from the monitoring data, these amounts are statistically analyzed, and the resulting statistic is taken as the amount of available resources of the business system within the fourth time period. The statistical analysis may be, for example, computing the mean or the median.
The amount of available resources affects the business system, and alarm information is generated when a fault occurs. For example, a low amount of available resources can leave the business system unable to respond to user requests, which produces corresponding alarm information. The amount of available resources within the fourth time period, which contains the alarm time, is therefore related to the alarm information, and using it as associated data is highly accurate.
Across the steps above, the first, second, third, and fourth time periods may be the same or different.
When determining the associated data, information may be obtained through one of these steps and used as the associated data correlated with the alarm information, or through several of these steps, in which case all of the information obtained is used as the associated data.
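Summarizing the per-moment resource amounts into one value, as step S2024 describes, can be sketched with the standard `statistics` module. The sample values and their interpretation as free bandwidth are assumptions for the example.

```python
from statistics import mean, median
from typing import Sequence


def available_resources(samples: Sequence[float], method: str = "mean") -> float:
    """Summarize per-moment available resource amounts in the fourth time period."""
    if method == "mean":
        return mean(samples)
    if method == "median":
        return median(samples)
    raise ValueError(f"unsupported statistic: {method!r}")


# Hypothetical free-bandwidth samples taken within the fourth time period.
samples = [40, 35, 10, 15]
avg_available = available_resources(samples, "mean")      # (40+35+10+15) / 4
mid_available = available_resources(samples, "median")    # (15 + 35) / 2
```

The median is less sensitive than the mean to a single extreme sample, which may matter when resources collapse briefly around the alarm time.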
The specific process of obtaining the fault self-healing plan in step S204 is described below with reference to FIG. 4.
FIG. 4 is a schematic flowchart of a third fault handling method provided by an embodiment of the present disclosure. Building on the above embodiments, step S204 may be implemented through the following steps S2041 and S2042.
Step S2041: Based on the alarm information, search the known fault self-healing plans for a fault self-healing plan that performs self-healing on the fault corresponding to the cause information.
A known fault self-healing plan may be a plan that performed self-healing on a fault that has already occurred. In this case, the plan may be stored in a plan library on the server.
A known fault self-healing plan may also be a plan generated in advance for a fault that might occur. Optionally, experts or operation and maintenance staff determine from experience how such a fault should be self-healed. In this case, they may enter the plan through a plan entry interface provided by the client, and the server receives the plan and stores it in the plan library.
For example, a self-healing console may be provided in the user interface of the business system's client. Experts or operation and maintenance staff enter fault self-healing plans through the console, and the business system stores the entered plans in a self-healing plan rule library, which holds all known fault self-healing plans.
In one embodiment of the present disclosure, the target field values of preset fields in the alarm information may be extracted. Based on the target field values, the known fault self-healing schemes are then searched for a scheme that performs self-healing on the fault corresponding to the cause information and that includes the target self-healing task.
The preset fields may include: the alarm time of the alarm information, the identifier of the service node targeted by the alarm information, the identifier of the device that generated the alarm information, the identifier of the machine room where that device is located, the identifier of the instance (for example, a program or an algorithm) that triggered generation of the alarm information, and exception description information.
The identifier of the service node targeted by the alarm information is the identifier of the service node on which the fault that triggered generation of the alarm information occurred, such as the number or name of the service node. A service node may include a service module or the like.
The machine room identifier is the identifier of the machine room that houses the device generating the alarm information. It may be the location of the machine room, the number of the machine room, and so on.
The device identifier is the identifier of the device that generates the alarm information. It may be the IP (Internet Protocol) address of the device, its MAC (Media Access Control) address, and so on.
Because the preset fields include the alarm time of the alarm information, the identifier of the service node targeted by the alarm information, the identifier of the device that generated the alarm information, the identifier of the machine room where that device is located, and the identifier of the instance that triggered generation of the alarm information, and because the values of these fields describe the alarm from different aspects, extracting the values of these preset fields reflects the alarm information fairly accurately.
The target field values may be obtained by parsing the alarm information and extracting them from the parsed result.
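As a minimal sketch of this extraction step, the following assumes the alarm information arrives as a JSON object. The disclosure lists the preset fields but not a wire format, so the JSON encoding and every key name below are hypothetical.

```python
import json

# Hypothetical key names for the preset fields listed above.
PRESET_FIELDS = (
    "alarm_time",
    "service_node_id",
    "device_id",
    "machine_room_id",
    "instance_id",
    "exception_description",
)


def extract_target_field_values(raw_alarm: str) -> dict:
    """Parse a JSON alarm and keep only the preset fields that are present."""
    alarm = json.loads(raw_alarm)
    return {name: alarm[name] for name in PRESET_FIELDS if name in alarm}


raw = json.dumps({
    "alarm_time": "2022-07-19T08:30:00Z",
    "device_id": "10.0.0.12",
    "machine_room_id": "room-A",
    "severity": "critical",  # not a preset field, so it is dropped
})
target_values = extract_target_field_values(raw)
print(target_values)
```

Fields that are absent from a given alarm are simply skipped here. Whether a missing field should instead be treated as an error is a design choice the text leaves open.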
The target self-healing task is a fault self-healing task set according to the target field value of a preset field. For example, if the target field value is the identifier of a machine, the fault self-healing task set according to that target field value may be a restart of the machine identified by the target field value.
Because each known fault self-healing scheme records the cause information it targets, the cause information can be used as a keyword and matched against the cause information recorded in each known fault self-healing scheme. This yields the known fault self-healing schemes that perform self-healing on the fault corresponding to the cause information.
Furthermore, because a known fault self-healing scheme includes fault self-healing tasks, and a fault self-healing task may record the field values of the preset fields it targets, the target field value can be used as a keyword and matched against the field values recorded in the fault self-healing tasks of the schemes obtained above. This determines the fault self-healing task set according to the target field value and, in turn, the fault self-healing scheme that includes that task.
Because the scheme that includes the target self-healing task is searched for among the known fault self-healing schemes that already handle the fault corresponding to the cause information, and because the target self-healing task is set according to the target field value of a preset field, the fault self-healing task that is found is specific to the target field values in the alarm information. This improves the success rate of fault self-healing by that task.
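The two-stage lookup described above can be sketched as follows. The data classes, their attribute names, and the use of substring matching for "keyword matching" are illustrative assumptions, not the structures used in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class SelfHealingTask:
    command: str
    field_values: dict  # preset-field values this task was set for


@dataclass
class SelfHealingScheme:
    name: str
    cause: str  # cause information the scheme targets
    tasks: list = field(default_factory=list)


def find_schemes(known, cause, target_values):
    """Stage 1: match on cause. Stage 2: keep schemes with a task set for a target value."""
    by_cause = [s for s in known if cause in s.cause]  # substring match (assumption)
    wanted = set(target_values.items())
    return [
        s for s in by_cause
        if any(wanted & set(t.field_values.items()) for t in s.tasks)
    ]


known = [
    SelfHealingScheme("restart-host", "service module cannot run",
                      [SelfHealingTask("reboot", {"device_id": "10.0.0.12"})]),
    SelfHealingScheme("restart-other", "service module cannot run",
                      [SelfHealingTask("reboot", {"device_id": "10.0.0.99"})]),
    SelfHealingScheme("free-disk", "disk full",
                      [SelfHealingTask("clean", {"device_id": "10.0.0.12"})]),
]
matches = find_schemes(known, "service module cannot run", {"device_id": "10.0.0.12"})
```

In this example the first stage discards the scheme for a different cause, and the second stage discards the scheme whose task targets a different device.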
Step S2042: Determine the fault self-healing scheme that was found as the fault self-healing scheme including fault self-healing tasks.
Because a known fault self-healing scheme is an already known scheme for self-healing a fault, a scheme found among the known fault self-healing schemes can be used directly for fault self-healing, which improves the efficiency of fault self-healing.
In addition, because the scheme is found according to the alarm information and performs self-healing on the fault corresponding to the cause information, the scheme that is found can handle that fault, so executing it achieves fault self-healing.
In step S2041, if no fault self-healing scheme is found that performs self-healing on the fault corresponding to the cause information, the fault self-healing scheme may be determined as in the embodiment shown in FIG. 5a.
FIG. 5a is a schematic flowchart of a fourth fault processing method provided by an embodiment of the present disclosure. If no fault self-healing scheme that performs self-healing on the fault corresponding to the cause information is found in step S2041, the method may further include the following steps S2043-S2047.
Step S2043: Obtain a first similarity between each piece of known operation and maintenance information and the cause information, according to a first semantics of the description information in each piece of known operation and maintenance information and a second semantics of the cause information.
Each piece of known operation and maintenance information includes description information of a system exception and description information of a system exception handling method. The description information may be in text form.
For example, the description information of a system exception may be "the service module cannot run", and the description information of the system exception handling method may be "restart the device on which the service module is installed".
The known operation and maintenance information may be determined from relevant operation and maintenance documents such as operation and maintenance manuals, operation and maintenance contingency plans, and historical operation and maintenance documents. The known operation and maintenance information may be stored in an operation and maintenance knowledge base.
Optionally, structured extraction may be performed on the operation and maintenance documents to obtain the description information of system exceptions and of system exception handling methods, and thereby the known operation and maintenance information that includes both. Alternatively, after the description information has been extracted from the operation and maintenance documents, staff may adjust its content, the order of the system exception handling methods it includes, and similar details, yielding known operation and maintenance information that includes the adjusted description information of the system exception and of the system exception handling method.
The first semantics represents the meaning expressed by the description information in the known operation and maintenance information, and the second semantics represents the meaning expressed by the cause information.
In one implementation, natural language understanding techniques may be used to identify the semantics of the description information in each piece of known operation and maintenance information and the semantics of the cause information, yielding the first semantics and the second semantics.
In another implementation, a semantic extraction model may be used. The known operation and maintenance information is input to the semantic extraction model, and the semantic features of its description information output by the model serve as the first semantics. The cause information is likewise input to the semantic extraction model, and the semantic features of the cause information output by the model serve as the second semantics.
Optionally, a distance between the first semantics and the second semantics, such as a Euclidean distance or a cosine distance, may be computed, and the similarity between the first semantics and the second semantics determined from that distance. This similarity serves as the first similarity between each piece of known operation and maintenance information and the cause information.
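To make the distance-based similarity concrete, the sketch below assumes the first and second semantics are available as numeric feature vectors of equal length. The disclosure does not specify how a distance is turned into a similarity; the `1 / (1 + d)` mapping used for the Euclidean case is one common choice and is an assumption here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_similarity(a, b):
    """Map Euclidean distance d to a similarity 1 / (1 + d) in (0, 1] (assumed mapping)."""
    return 1.0 / (1.0 + math.dist(a, b))


# Hypothetical semantic feature vectors.
first_semantics = [0.2, 0.7, 0.1]   # description information of one knowledge-base entry
second_semantics = [0.1, 0.8, 0.1]  # cause information
first_similarity = cosine_similarity(first_semantics, second_semantics)
```

The same functions apply unchanged when comparing the first semantics with the semantics of the alarm information.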
Step S2044: Obtain a second similarity between each piece of known operation and maintenance information and the alarm information, according to the first semantics and a third semantics of the alarm information.
Here, the first semantics is the first semantics of the description information in each piece of known operation and maintenance information, and the third semantics represents the meaning expressed by the alarm information.
Optionally, the semantics of the target field values of the preset fields of the alarm information may be identified, and the identification result taken as the third semantics.
In one implementation, natural language understanding techniques may be used to identify the semantics of the alarm information, yielding the third semantics. In another implementation, a semantic extraction model may be used: the alarm information is input to the semantic extraction model, and the semantic features of the alarm information output by the model serve as the third semantics.
Optionally, a distance between the first semantics and the third semantics, such as a Euclidean distance or a cosine distance, may be computed, and the similarity between the first semantics and the third semantics determined from that distance. This similarity serves as the second similarity between each piece of known operation and maintenance information and the alarm information.
Step S2045: According to the first similarity and the second similarity, select the description information of candidate processing methods from the description information of the system exception handling methods included in the known operation and maintenance information.
Here, the first similarity is the first similarity between each piece of known operation and maintenance information and the cause information, and the second similarity is the second similarity between each piece of known operation and maintenance information and the alarm information obtained in step S2044.
The description information of a candidate processing method is information that describes the candidate processing method. It may be in text form.
Optionally, for each piece of known operation and maintenance information, data fusion may be performed on its first similarity and second similarity, for example by computing their weighted sum, to obtain a target value for that piece of known operation and maintenance information. Based on the target values computed for all pieces of known operation and maintenance information, the description information of candidate processing methods is selected from the description information of the system exception handling methods they include.
In one implementation, the description information of the system exception handling method included in the known operation and maintenance information with the highest target value may be selected as the description information of the candidate processing method.
In another implementation, the description information of the system exception handling methods included in the known operation and maintenance information whose target value exceeds a preset target threshold may be selected as the description information of the candidate processing methods.
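The fusion and the two selection rules can be sketched as follows. The equal weights, the threshold, and the similarity values are placeholders chosen for illustration; the disclosure does not give concrete numbers.

```python
def target_value(first_sim, second_sim, w1=0.5, w2=0.5):
    """Weighted sum of the first and second similarity (weights are assumptions)."""
    return w1 * first_sim + w2 * second_sim


def select_candidates(scored, threshold=None):
    """scored maps a handling-method description to its target value.

    With no threshold, return the single highest-scoring description;
    otherwise return every description whose target value exceeds the threshold.
    """
    if not scored:
        return []
    if threshold is None:
        return [max(scored, key=scored.get)]
    return [desc for desc, value in scored.items() if value > threshold]


# Hypothetical (first similarity, second similarity) pairs per knowledge-base entry.
similarities = {
    "restart the device": (0.9, 0.7),
    "clear the cache": (0.6, 0.8),
    "expand capacity": (0.2, 0.3),
}
scored = {desc: target_value(s1, s2) for desc, (s1, s2) in similarities.items()}
best = select_candidates(scored)
above_threshold = select_candidates(scored, threshold=0.6)
```

The threshold variant can return several candidates or none, so later steps need to handle both cases.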
Step S2046: For each candidate processing method, obtain candidate processing tasks based on the description information of that candidate processing method, so as to obtain a candidate fault self-healing scheme that includes the candidate processing tasks.
A candidate processing task contains executable commands. For example, if the description information of a candidate processing method is "start program A", the executable command contained in the candidate processing task is: start A.
Optionally, semantic features of the description information of the candidate processing method may be obtained, and based on those features the description information may be converted into executable commands, yielding a candidate processing task that contains those executable commands.
Alternatively, after the description information has been converted into executable commands, operation and maintenance staff may calibrate the execution order, execution parameters, and similar details of the converted commands, yielding a candidate processing task that contains the calibrated executable commands.
Step S2047: Determine the fault self-healing scheme from among the candidate fault self-healing schemes.
In one implementation, one scheme may be selected at random from the candidate fault self-healing schemes as the fault self-healing scheme. For other ways of determining the fault self-healing scheme, see the embodiment corresponding to FIG. 5b.
The first similarity is the similarity between the first semantics of the description information in each piece of known operation and maintenance information and the second semantics of the cause information, and the second similarity is the similarity between the first semantics and the third semantics of the alarm information. Determining the description information of the candidate processing methods from both similarities therefore takes into account how closely the description information in each piece of known operation and maintenance information matches both the cause information and the alarm information. As a result, the candidate processing methods corresponding to the selected description information can handle the fault behind the alarm information fairly accurately, and the fault self-healing scheme that is determined is correspondingly more accurate.
Referring to FIG. 5b, which is a schematic flowchart of a fifth fault processing method provided by an embodiment of the present disclosure, the following step S2048 may further be included after step S2045.
Step S2048: Obtain a first success probability of performing fault self-healing on the fault corresponding to the cause information with each candidate processing method.
In this step, the first success probability corresponding to each candidate processing method may be obtained, where the first success probability of a candidate processing method is the probability of success when that candidate processing method is used to perform fault self-healing on the fault corresponding to the cause information.
The first success probability indicates the probability that the candidate processing method can successfully self-heal the fault corresponding to the cause information.
Optionally, the first success probability may be determined from the target value obtained by fusing the first similarity and the second similarity corresponding to each candidate processing method. For example, the target values may be normalized and the first success probability determined from the normalized values.
The higher the target value, the higher the probability that the candidate processing method can successfully self-heal the fault corresponding to the cause information, that is, the higher the first success probability. The lower the target value, the lower that probability, that is, the lower the first success probability.
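One way to realize the normalization is sketched below. The disclosure says only that the target values are normalized; dividing each value by the sum of all values is an assumption, chosen because it preserves the ordering of the candidates and makes the results sum to one.

```python
def first_success_probabilities(target_values):
    """Normalize target values so they sum to 1 (sum normalization is an assumption)."""
    total = sum(target_values.values())
    if total <= 0:
        return {name: 0.0 for name in target_values}
    return {name: value / total for name, value in target_values.items()}


# Hypothetical target values for three candidate processing methods.
targets = {"restart the device": 0.8, "clear the cache": 0.7, "expand capacity": 0.5}
probs = first_success_probabilities(targets)
```

Because the mapping is monotonic, a candidate with a higher target value always receives a higher first success probability, which matches the relationship stated above.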
On the basis of the above embodiment, step S2047 may also be implemented by the following steps S20471-S20472.
Step S20471: For each candidate fault self-healing scheme, estimate a second success probability of performing fault self-healing on the fault corresponding to the cause information with that candidate scheme, according to the current network environment information of the service system and the candidate processing tasks included in the candidate scheme.
By way of example, the current network environment information includes the current network traffic, the amount of resources currently available on the network, and similar information.
The second success probability is estimated from the current network environment information of the service system and the candidate processing tasks included in the candidate fault self-healing scheme, so it depends on the current network environment. Because the current network environment of the service system affects the probability that a fault self-healing scheme succeeds, a second success probability computed in this way is adapted to the current network environment and is therefore more accurate.
Step S20472: Determine the fault self-healing scheme from among the candidate fault self-healing schemes according to the first success probability and the second success probability.
Optionally, data fusion may be performed on the first success probability and the second success probability to obtain a fused probability, for example by computing their weighted sum with preset weights and taking the result as the fused probability. The fault self-healing scheme is then determined based on the fused probability of each candidate fault self-healing scheme.
For example, the candidate fault self-healing scheme with the highest fused probability may be determined as the fault self-healing scheme, or the candidate fault self-healing schemes whose fused probability exceeds a preset probability threshold may be determined as the fault self-healing scheme.
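The sketch below illustrates the weighted-sum fusion and the highest-probability rule. The weights and probabilities are invented for illustration. The example is arranged so that a scheme with the lower first success probability is still chosen once the second success probability is taken into account.

```python
def fuse(p1, p2, w1=0.6, w2=0.4):
    """Weighted sum of the first and second success probabilities (weights assumed)."""
    return w1 * p1 + w2 * p2


def rank_schemes(probabilities, w1=0.6, w2=0.4):
    """Return (scheme, fused probability) pairs, highest fused probability first."""
    fused = {name: fuse(p1, p2, w1, w2) for name, (p1, p2) in probabilities.items()}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (first, second) success probabilities per candidate scheme.
probabilities = {
    "restart-device": (0.40, 0.30),
    "clear-cache": (0.35, 0.90),
}
ranking = rank_schemes(probabilities)
chosen = ranking[0][0]
```

Returning the full ranking rather than only the winner also supports the threshold variant, since the caller can keep every entry whose fused probability exceeds the preset threshold.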
The fault self-healing scheme is determined from the candidate fault self-healing schemes according to both the first success probability and the second success probability. The first success probability represents the probability that each candidate processing method succeeds at fault self-healing on its own merits. The second success probability considers not only the candidate processing tasks of the candidate fault self-healing scheme but also the current network environment information of the service system, so it is adapted to the current network environment. The two probabilities thus assess each candidate scheme from two different angles, and selecting on both raises the probability that fault self-healing with the chosen scheme succeeds.
In one embodiment of the present disclosure, steps S2043-S2045 may use a recommendation model to obtain the description information of the candidate processing methods, and step S2048 may use the same recommendation model to obtain the first success probability.
Optionally, the description information in the known operation and maintenance information, the cause information, and the alarm information may be used as the input of the recommendation model.
The recommendation model computes the first similarity between each piece of known operation and maintenance information and the cause information, and the second similarity between each piece of known operation and maintenance information and the alarm information. According to the first similarity and the second similarity, it determines the description information of the candidate processing methods from the description information of the system exception handling methods included in the known operation and maintenance information. It also determines the first success probability of performing fault self-healing on the fault corresponding to the cause information with each candidate processing method, and outputs the description information of the candidate processing methods together with the first success probability.
To obtain the second success probability more accurately, see the embodiment shown in FIG. 6, which is a schematic flowchart of a sixth fault processing method provided by an embodiment of the present disclosure. On the basis of the embodiment corresponding to FIG. 5b, step S20471 may be implemented by the following steps S204711-S204712.
Step S204711: Determine the execution time of each candidate processing task according to the task parameters of the candidate processing tasks included in the candidate fault self-healing scheme and the dependencies between those tasks.
The task parameters include the execution parameters required to run a candidate processing task, such as memory parameters, computing resource parameters, and bandwidth resource parameters.
The dependencies between tasks may be determined from the execution order of the candidate processing tasks. For example, if the candidate processing tasks are executed serially in the order task A1, task A2, task A3, then tasks A1, A2, and A3 all depend on one another, and the dependency between two adjacent tasks is the strongest. If the candidate processing tasks are executed in parallel, the dependency between the tasks executed in parallel is the weakest.
Optionally, the execution time of each candidate processing task may be determined from a preset correspondence among the task parameters of fault self-healing tasks, the dependencies between tasks, and execution time. This correspondence may be determined by experts based on experience.
Step S204712: Estimate the second success probability of performing fault self-healing on the fault corresponding to the cause information with the candidate fault self-healing scheme, according to the execution time of each candidate processing task and the current network environment information of the service system.
Optionally, the success probability of each candidate fault self-healing scheme under the current network environment information of the service system may be determined first. For this, a preset correspondence between network environment information and the success probability of fault self-healing schemes may be used. That success probability is then adjusted based on the execution time of the candidate processing tasks, and the adjusted success probability is taken as the second success probability.
For example, consider different candidate fault self-healing schemes with the same success probability. If the candidate processing tasks included in one candidate scheme P1 take less time to execute than those included in another candidate scheme P2, the success probability of P1 may be increased and the success probability of P2 decreased, and the adjusted success probabilities are taken as the second success probabilities.
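The P1/P2 adjustment can be sketched as follows. The disclosure does not say how large the adjustment is or what the execution time is compared against. Comparing each scheme's total execution time with the mean across candidates and shifting the probability by a fixed step are both assumptions made for illustration.

```python
def second_success_probabilities(base, durations, step=0.05):
    """Adjust environment-based success probabilities by execution time.

    base:      scheme name -> success probability under the current network environment
    durations: scheme name -> total execution time of its candidate tasks, in seconds
    step:      size of the adjustment (an assumed constant)
    """
    mean = sum(durations.values()) / len(durations)
    adjusted = {}
    for name, probability in base.items():
        if durations[name] < mean:
            probability += step  # faster than average: raise
        elif durations[name] > mean:
            probability -= step  # slower than average: lower
        adjusted[name] = min(1.0, max(0.0, probability))  # keep within [0, 1]
    return adjusted


base = {"P1": 0.7, "P2": 0.7}        # same starting probability
durations = {"P1": 30.0, "P2": 90.0}  # P1's tasks finish sooner
adjusted = second_success_probabilities(base, durations)
```

With equal starting probabilities, the faster scheme P1 ends up above P2, mirroring the example in the text.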
Steps S204711-S204712 may use an effect prediction model to obtain the second success probability. The candidate processing tasks included in the candidate fault self-healing scheme are used as the input of the effect prediction model. The model determines the execution time of each candidate processing task from the task parameters and the dependencies between tasks, and then estimates the second success probability from those execution times and the current network environment of the service system.
The second success probability is estimated from the execution time of each candidate processing task and the current network environment information, so it depends on those execution times. Because the execution time of the candidate processing tasks affects the efficiency of fault self-healing, the estimated second success probability takes that efficiency into account. Determining the fault self-healing scheme from the first success probability and the second success probability therefore improves the efficiency of fault self-healing when the scheme is executed.
To self-heal faults more effectively, in one embodiment of the present disclosure, when there are multiple fault self-healing tasks, the execution of each fault self-healing task may also be monitored. When a task execution exception is detected, the scheduling order of the fault self-healing tasks is adjusted, and/or the execution progress of the fault self-healing tasks is controlled.
Task execution exceptions may include a task conflict arising during execution, slow progress of the currently executing task, and so on.
Optionally, information such as the execution state, the percentage of progress, and the execution description of each fault self-healing task may be monitored. Based on this monitoring information, when a task execution exception occurs, an operation to adjust the scheduling order of the fault self-healing tasks and/or an operation to control their execution progress is determined and then performed.
Because the scheduling order of the fault self-healing tasks is adjusted and/or their execution progress is controlled when a task execution exception occurs, problems that arise can be corrected promptly, so that fault self-healing completes smoothly.
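A minimal sketch of such monitoring follows. The state names, the progress-rate criterion for "slow", and the policy of moving abnormal tasks behind healthy ones are all assumptions; the disclosure names the kinds of exception and the kinds of response but not a specific policy.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    name: str
    state: str        # assumed states: "pending", "running", "conflict", "done"
    progress: float   # percentage complete, 0-100
    elapsed_s: float  # seconds since the task started


def is_abnormal(task, min_rate=0.5):
    """Flag a conflict, or a running task advancing slower than min_rate percent per second."""
    if task.state == "conflict":
        return True
    return (
        task.state == "running"
        and task.elapsed_s > 0
        and task.progress / task.elapsed_s < min_rate
    )


def reschedule(queue):
    """Move abnormal tasks behind healthy ones, keeping relative order within each group."""
    healthy = [t for t in queue if not is_abnormal(t)]
    abnormal = [t for t in queue if is_abnormal(t)]
    return healthy + abnormal


queue = [
    TaskStatus("restart-db", "conflict", 0.0, 5.0),
    TaskStatus("flush-cache", "running", 80.0, 20.0),
    TaskStatus("check-service", "pending", 0.0, 0.0),
]
new_order = [t.name for t in reschedule(queue)]
```

Deferring an abnormal task is only one possible response. Pausing it, or retrying it after the conflicting task finishes, would fit the same monitoring data.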
In one embodiment of the present disclosure, while the fault self-healing tasks included in the fault self-healing scheme are being executed, operation and maintenance personnel may concurrently monitor, adjust, and schedule the execution progress of the tasks.
Optionally, after the fault self-healing scheme is generated, operation and maintenance personnel may adjust the fault self-healing tasks included in the scheme, control in real time the start, end, pause, and resumption of the tasks, confirm the execution results, and enter the scheme into the scheme library through the self-healing console.
While operation and maintenance personnel make adjustments, the operations they perform during scheme execution may also be stored. Invalid information is removed from the stored operations, the operations are converted to another format, and the converted data is used as training samples to retrain the recommendation model, so that the model learns the patterns and characteristics of the operations performed by operation and maintenance personnel.
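As a rough illustration of the clean-and-convert step, the following Python sketch turns stored operator actions into training samples. The record layout, the rule that a record without an action or a task is invalid, and the input/label sample format are assumptions; the disclosure does not define the stored format or what the recommendation model consumes.

```python
def to_training_samples(operations):
    """Drop invalid operator records and convert the rest to samples."""
    samples = []
    for op in operations:
        # Assumed validity rule: a record must name an action and a task.
        if not op.get("action") or not op.get("task"):
            continue
        samples.append({
            # Assumed sample format: context text in, operator decision out.
            "input": f"{op.get('alarm', '')} | {op.get('cause', '')}",
            "label": f"{op['action']}:{op['task']}",
        })
    return samples


operations = [
    {"alarm": "high latency", "cause": "low capacity",
     "action": "pause", "task": "restart_service"},
    {"alarm": "high latency", "cause": "low capacity",
     "action": "", "task": "restart_service"},  # invalid: no action
    {"alarm": "error rate", "cause": "recent change",
     "action": "reorder", "task": "health_check"},
]
samples = to_training_samples(operations)
print(samples)
```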
A fault self-healing console is configured in the user interface of the client of the business system. Through the self-healing task management function module of the console, a user can browse the fault self-healing schemes currently generated or being executed and edit the fault self-healing tasks they include, for example by adding, deleting, or modifying operation tasks, check tasks, and the like. The parameters of the execution tool corresponding to each task can also be configured.
Through the self-healing scheme editing function module of the fault self-healing console, the user can also, for example, adjust the execution order of the fault self-healing tasks in a scheme and add, delete, or modify the content of the scheme.
The specific process of a fault processing method provided by an embodiment of the present disclosure is described below with reference to FIG. 7.
FIG. 7 is a block flow diagram of a fault processing method provided by an embodiment of the present disclosure.
FIG. 7 includes five functional modules: a perception engine, a decision engine, an execution engine, a collaboration engine, and a fault self-healing console. The perception engine, decision engine, execution engine, and collaboration engine are functional modules installed on the server, and the fault self-healing console is a functional module installed on the client.
When fault self-healing is performed with the scheme provided by the embodiments of the present disclosure, first, the perception engine obtains the alarm information and the monitoring data of the business system, and inputs both to the decision engine;
second, the decision engine obtains, from the monitoring data input by the perception engine, associated data that is correlated with the alarm information, and determines from the associated data the cause information of the fault that triggered the alarm information;
then, based on the cause information and the alarm information obtained by the perception engine, the decision engine determines a fault self-healing scheme that includes fault self-healing tasks, and inputs the scheme to the execution engine;
finally, the execution engine performs fault self-healing by executing the fault self-healing tasks included in the scheme.
While the fault self-healing tasks are being executed, operation and maintenance personnel can monitor their execution through the decision engine and the collaboration engine, and can adjust the scheduling order of the tasks and/or control their execution progress.
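The flow from perception to decision to execution can be sketched in Python as follows. Every function name and data shape here is hypothetical, and the one-line cause rule is only a placeholder for the analysis performed by the decision engine.

```python
def perception_engine(alarm_source, monitoring_source):
    """Obtain the alarm information and the monitoring data."""
    return alarm_source(), monitoring_source()


def decision_engine(alarm, monitoring_data, scheme_library):
    """Derive associated data, cause information, and a self-healing scheme."""
    # Associated data: here, simply the records for the alarmed node.
    associated = [r for r in monitoring_data if r["node"] == alarm["node"]]
    # Cause information: a toy rule standing in for the real analysis.
    changed = any(r["kind"] == "change" for r in associated)
    cause = "recent_change" if changed else "unknown"
    # Scheme: looked up by alarm type and cause (assumed library layout).
    return cause, scheme_library.get((alarm["type"], cause), [])


def execution_engine(scheme, run_task):
    """Execute each fault self-healing task in order and collect results."""
    return [run_task(task) for task in scheme]


alarm, data = perception_engine(
    lambda: {"node": "svc-a", "type": "error_rate"},
    lambda: [{"node": "svc-a", "kind": "change"},
             {"node": "svc-b", "kind": "alarm"}],
)
library = {("error_rate", "recent_change"): ["rollback_change", "health_check"]}
cause, scheme = decision_engine(alarm, data, library)
results = execution_engine(scheme, lambda task: f"{task}: ok")
print(cause, results)
```

The collaboration engine and the console are left out of this sketch; they would sit beside the three stages and let personnel inspect or override each intermediate result.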
The composition and functions of each functional module are described in detail below.
The perception engine includes three functional units: document data subscription, alarm data subscription, and metric data extraction.
The document data subscription unit subscribes to documents on the document platform and performs structured extraction on the obtained documents to obtain known operation and maintenance information in "question-answer" form, also called operation and maintenance knowledge, which is added to the operation and maintenance knowledge base.
The alarm data subscription unit subscribes to alarm information on the event platform and extracts the target field values of the preset fields from the obtained alarm information.
The metric data extraction unit obtains monitoring metrics, change orders, and capacity data from the monitoring system, the change system, and the capacity system. Specifically, the monitoring metrics include other alarm information generated by the business system for the target node on which the alarm information was raised, the running state of the business system, abnormality information, and the like; the change orders include information on whether a system change occurred in the business system before the alarm information was generated and the content of that change; and the capacity data includes the amount of available resources of the business system.
The decision engine includes four functional units: situation understanding, plan recommendation, scheme generation, and a self-healing scheme controller.
The situation understanding unit obtains the alarm information, extracts from the metric data extraction functional unit, according to the target field values of the preset fields recorded in the alarm information, the associated data that is correlated with the alarm information, and performs multi-dimensional analysis on the associated data to obtain the cause information of the fault that triggered the alarm information.
The plan recommendation unit includes two functional subunits, rule matching and scheme recommendation, together with a recommendation model. The recommendation model may be an NLP/KG (Natural Language Processing/Knowledge Graph) model.
The rule matching subunit obtains the cause information, uses the cause information and the alarm information as keywords, performs keyword matching against the known fault self-healing schemes stored in the scheme library corresponding to the self-healing configuration module, and determines the fault self-healing scheme that is matched successfully.
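A minimal Python sketch of the keyword-matching idea follows. It assumes each library entry carries a keyword list and that a match requires every alarm and cause keyword to appear in that list; the actual matching rule is not stated in the disclosure.

```python
def match_known_schemes(alarm_keywords, cause_keywords, scheme_library):
    """Return names of known schemes whose keywords cover all query keywords."""
    wanted = {k.lower() for k in alarm_keywords} | {k.lower() for k in cause_keywords}
    return [
        scheme["name"]
        for scheme in scheme_library
        if wanted <= {k.lower() for k in scheme["keywords"]}
    ]


# Hypothetical scheme library entries.
scheme_library = [
    {"name": "rollback-release", "keywords": ["error_rate", "recent_change"]},
    {"name": "scale-out", "keywords": ["high_latency", "low_capacity"]},
]

matched = match_known_schemes(["error_rate"], ["recent_change"], scheme_library)
unmatched = match_known_schemes(["disk_full"], ["log_growth"], scheme_library)
print(matched, unmatched)
```

An empty result corresponds to the case in which rule matching fails and the scheme recommendation subunit takes over.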
When rule matching fails, the scheme recommendation subunit invokes the recommendation model with the alarm information and the cause information, obtains the description information of candidate processing methods from the operation and maintenance knowledge base, that is, several "answers" among the "question-answer" entries stored in the knowledge base, and determines the confidence of each piece of description information (the first success probability mentioned above).
The scheme generation unit includes a scheme generator. The scheme generator obtains the "answers" output by the scheme recommendation subunit, sorted in descending order of confidence, and invokes an effect prediction algorithm to predict the effect of each "answer", obtaining the second success probability. Based on the first success probability and the second success probability, and combined with the adjustment and control of operation and maintenance personnel, it generates the fault self-healing scheme.
The self-healing scheme controller obtains the fault self-healing scheme generated by the scheme generator and inputs it to the execution engine functional module; it also controls the risk and progress of the execution process while the scheme is being executed.
The execution engine determines the execution tool matching the task type of each fault self-healing task in the fault self-healing scheme, and invokes that tool when executing the task. The execution tools include a link shutdown tool, a restart tool, a health check tool, and the like.
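The task-type-to-tool dispatch can be sketched as a lookup table in Python. The tool functions, their parameters, and the task record layout are hypothetical stand-ins for the link shutdown, restart, and health check tools named above.

```python
def close_link(params):
    return f"closed link {params['link']}"


def restart(params):
    return f"restarted {params['service']}"


def health_check(params):
    return f"checked {params['service']}"


# Hypothetical registry mapping a task type to its execution tool.
TOOLS = {"close_link": close_link, "restart": restart, "health_check": health_check}


def run_scheme(tasks, tools=TOOLS):
    """Invoke the tool matching each task's type, in scheme order."""
    results = []
    for task in tasks:
        tool = tools.get(task["type"])
        if tool is None:
            raise ValueError(f"no execution tool for task type {task['type']!r}")
        results.append(tool(task["params"]))
    return results


scheme = [
    {"type": "restart", "params": {"service": "svc-a"}},
    {"type": "health_check", "params": {"service": "svc-a"}},
]
outputs = run_scheme(scheme)
print(outputs)
```

Keeping the registry separate from the dispatch loop means per-tool parameters could be edited, as the console allows, without touching the loop itself.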
Through the manual takeover module of the collaboration engine, operation and maintenance personnel can confirm the knowledge entered into the operation and maintenance knowledge base, adjust the generated fault self-healing scheme, and intervene in the execution of the scheme. The collaboration engine functional module also collects behavior data of operation and maintenance personnel and uses that data as training samples to iteratively update the recommendation model.
The fault self-healing console includes a fault self-healing scheme recommendation function module, an operation and maintenance knowledge base function module, an effect statistical analysis function module, a self-healing configuration function module, a login authentication function module, a self-healing task management function module, and a self-healing scheme editing function module.
The fault self-healing scheme recommendation function module displays the generated fault self-healing scheme.
The operation and maintenance knowledge base function module allows operation and maintenance personnel to enter and confirm operation and maintenance knowledge.
The effect statistical analysis function module displays the effect of fault self-healing schemes that have already been run.
The self-healing configuration function module allows operation and maintenance personnel to enter fault self-healing schemes that have already been run.
The login authentication function module authenticates users who log in.
The permission management function module manages user permissions.
The self-healing task management function module displays the fault self-healing schemes currently generated or being executed, and provides functions for editing the fault self-healing tasks they include, such as adding, deleting, or modifying operation tasks, check tasks, and the like. The parameters of the execution tool corresponding to each task can also be configured.
The self-healing scheme editing function module is used to adjust the execution order of the fault self-healing tasks in a scheme and to add, delete, or modify the content of the scheme.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a first fault processing apparatus provided by an embodiment of the present disclosure. The apparatus includes the following modules 801-805.
an information obtaining module 801, configured to obtain alarm information generated by a business system;
a data obtaining module 802, configured to obtain, according to monitoring data of the business system, associated data that is correlated with the alarm information;
an information determination module 803, configured to determine, according to the associated data, cause information of the fault that triggered the alarm information;
a scheme obtaining module 804, configured to obtain, according to the cause information and the alarm information, a fault self-healing scheme that includes fault self-healing tasks;
a fault self-healing module 805, configured to perform fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
As can be seen from the above, when fault self-healing is performed with the scheme provided by this embodiment, the fault self-healing tasks take into account both the alarm information itself and the cause information of the fault that triggered it, and the cause information reflects why that fault occurred. The fault self-healing tasks can therefore heal the fault not only at the surface level presented by the alarm information but also at the root level presented by the cause information, so that the fault is resolved precisely and losses from the fault are stopped more efficiently.
In addition, the associated data is data correlated with the alarm information. When a fault occurs in the business system, it may set off a chain of effects, and it may itself have been caused by other problems; the various pieces of information produced along this chain are interrelated, and the cause of the fault can be determined from such interrelated information. The cause information of the fault that triggered the alarm information can therefore be determined fairly accurately from the associated data, and a fault self-healing scheme obtained from the cause information and the alarm information in turn has a higher success rate when healing the fault corresponding to the cause information.
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of a second fault processing apparatus provided by an embodiment of the present disclosure. On the basis of the above embodiment, the data obtaining module 802 includes at least one of the following submodules:
an alarm information obtaining submodule 8021, configured to obtain, from the monitoring data, other alarm information that is directed at a target service node and falls within a first time period containing the alarm time recorded in the alarm information, where the target service node is the service node at which the alarm information is directed;
a first information obtaining submodule 8022, configured to determine an influence factor that triggers faults of a target fault type, and to obtain, from the values of the influence factor recorded in the monitoring data within a second time period containing the alarm time, first information characterizing the fluctuation of the influence factor, where the target fault type is the fault type recorded in the alarm information;
a second information obtaining submodule 8023, configured to obtain, according to the latest system update time recorded in the monitoring data and the alarm time, second information characterizing whether a system update occurred in the business system within a third time period before the alarm information was generated;
a resource amount selection submodule 8024, configured to select, from the monitoring data, the amount of available resources of the business system within a fourth time period containing the alarm time.
For the alarm information obtaining submodule 8021: alarm information items directed at the same node within one time period are likely to be correlated with one another. The other alarm information directed at the target node within the first time period is therefore likely to be correlated with the alarm information, and treating it as associated data is highly accurate.
For the first information obtaining submodule 8022: the influence factor is a factor that triggers faults of the fault type recorded in the alarm information, so its values within a given time period are correlated with the alarm information generated in that period. The first information is determined from the values of the influence factor within the second time period, which contains the alarm time of the alarm information. The first information is therefore correlated with the alarm information, and treating it as associated data is highly accurate.
For the second information obtaining submodule 8023: the second information characterizes whether a system update occurred within the third time period before the alarm information was generated, and a system update is likely to cause a fault in the business system and hence alarm information. The second information is therefore strongly correlated with the alarm information, and treating it as associated data is highly accurate.
For the resource amount selection submodule 8024: the amount of available resources affects the business system, and alarm information is generated when a fault occurs; for example, a low amount of available resources can leave the business system unable to respond to user requests, which generates corresponding alarm information. The amount of available resources of the business system within the fourth time period, which contains the alarm time, is therefore correlated with the alarm information, and treating it as associated data is highly accurate.
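The four kinds of associated data gathered by submodules 8021-8024 can be sketched in Python as follows. The record layout, the fault-type-to-influence-factor mapping, the use of max minus min as the fluctuation measure, and the symmetric windows around the alarm time are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a fault type to the influence factor that triggers it.
INFLUENCE_FACTORS = {"high_latency": "cpu_usage"}


def collect_associated_data(alarm, monitoring, w1, w2, w3, w4):
    """Gather the four kinds of associated data around the alarm time."""
    t = alarm["time"]

    def near(ts, window):
        # True when ts lies within +/- window of the alarm time.
        return abs(ts - t) <= window

    other_alarms = [
        a["id"] for a in monitoring["alarms"]
        if a["node"] == alarm["node"] and a["id"] != alarm["id"]
        and near(a["time"], w1)
    ]
    factor = INFLUENCE_FACTORS[alarm["fault_type"]]
    values = [v for ts, v in monitoring["metrics"][factor] if near(ts, w2)]
    fluctuation = max(values) - min(values) if values else None
    # The third window only looks backwards from the alarm time.
    updated = timedelta(0) <= t - monitoring["last_update"] <= w3
    resources = [v for ts, v in monitoring["available_resources"] if near(ts, w4)]
    return {
        "other_alarms": other_alarms,
        "factor_fluctuation": fluctuation,
        "updated_before_alarm": updated,
        "available_resources": resources,
    }


t0 = datetime(2022, 7, 19, 12, 0)
alarm = {"id": 1, "node": "svc-a", "fault_type": "high_latency", "time": t0}
monitoring = {
    "alarms": [
        alarm,
        {"id": 2, "node": "svc-a", "time": t0 - timedelta(minutes=2)},
        {"id": 3, "node": "svc-b", "time": t0 + timedelta(minutes=1)},
        {"id": 4, "node": "svc-a", "time": t0 - timedelta(hours=2)},
    ],
    "metrics": {"cpu_usage": [
        (t0 - timedelta(minutes=10), 40),
        (t0 - timedelta(minutes=5), 45),
        (t0, 95),
    ]},
    "last_update": t0 - timedelta(minutes=30),
    "available_resources": [
        (t0 - timedelta(minutes=1), 2),
        (t0 + timedelta(minutes=30), 8),
    ],
}
result = collect_associated_data(
    alarm, monitoring,
    timedelta(minutes=5), timedelta(minutes=10),
    timedelta(hours=1), timedelta(minutes=5),
)
print(result)
```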
Referring to FIG. 10, FIG. 10 is a schematic structural diagram of a third fault processing apparatus provided by an embodiment of the present disclosure. On the basis of the above embodiment, the scheme obtaining module 804 includes the following submodules 8041-8042:
a scheme search submodule 8041, configured to search the known fault self-healing schemes, according to the alarm information, for a fault self-healing scheme that performs self-healing on the fault corresponding to the cause information;
a scheme determination submodule 8042, configured to determine the found fault self-healing scheme as the fault self-healing scheme that includes fault self-healing tasks.
Because a known fault self-healing scheme is a scheme already known to heal a fault, searching the known schemes makes it possible to perform fault self-healing directly with the scheme that is found, which improves the efficiency of fault self-healing.
In addition, because the scheme is found according to the alarm information and performs self-healing on the fault corresponding to the cause information, it is able to heal that fault, so that fault self-healing is achieved by executing it.
In one embodiment of the present disclosure, the scheme search submodule 8041 is further configured to extract the target field values of preset fields in the alarm information and, based on the target field values, to search the known fault self-healing schemes for a scheme that performs self-healing on the fault corresponding to the cause information and that includes a target self-healing task, where the target self-healing task is a fault self-healing task set according to the target field values of the preset fields.
Because the search is narrowed, among the known schemes that perform self-healing on the fault corresponding to the cause information, to those that include the target self-healing task, and the target self-healing task is set according to the target field values of the preset fields, the fault self-healing task that is found is tailored to the target field values of the preset fields in the alarm information, which improves the success rate of fault self-healing.
In one embodiment of the present disclosure, the preset fields include at least one of the following fields:
the alarm time of the alarm information, the identifier of the service node at which the alarm information is directed, the identifier of the device that generated the alarm information, the identifier of the equipment room in which the device is located, the identifier of the instance that triggered generation of the alarm information, and abnormality description information.
Because the preset fields include the alarm time of the alarm information, the identifier of the service node at which the alarm information is directed, the identifier of the device that generated the alarm information, the identifier of the equipment room in which the device is located, and the identifier of the instance that triggered generation of the alarm information, and the values of these fields describe the specifics of the alarm information from different aspects, extracting the values of these preset fields characterizes the alarm information fairly accurately.
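The field extraction and the narrowed search for a scheme containing a target self-healing task can be sketched in Python as follows. The field names, the scheme record layout, and the rule that a task counts as a target self-healing task when its target equals one of the extracted field values are assumptions for illustration.

```python
# Hypothetical names for the preset fields listed above.
PRESET_FIELDS = (
    "alarm_time", "node_id", "device_id", "room_id", "instance_id", "description",
)


def extract_target_fields(alarm):
    """Keep only the preset fields that are present in the alarm."""
    return {name: alarm[name] for name in PRESET_FIELDS if name in alarm}


def find_schemes(cause, target_fields, known_schemes):
    """Known schemes for this cause that contain a task aimed at a field value."""
    targets = set(target_fields.values())
    return [
        scheme["name"]
        for scheme in known_schemes
        if scheme["cause"] == cause
        and any(task["target"] in targets for task in scheme["tasks"])
    ]


alarm = {
    "alarm_time": "2022-07-19T12:00:00",
    "node_id": "svc-a",
    "device_id": "host-17",
    "severity": "major",  # not a preset field, so it is ignored
}
known_schemes = [
    {"name": "restart-svc-a", "cause": "memory_leak",
     "tasks": [{"type": "restart", "target": "svc-a"}]},
    {"name": "restart-svc-b", "cause": "memory_leak",
     "tasks": [{"type": "restart", "target": "svc-b"}]},
]
fields = extract_target_fields(alarm)
found = find_schemes("memory_leak", fields, known_schemes)
print(fields, found)
```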
Referring to FIG. 11a, FIG. 11a is a schematic structural diagram of a fourth fault processing apparatus provided by an embodiment of the present disclosure. On the basis of the above embodiment, the scheme obtaining module 804 further includes the following submodules 8043-8047.
a first similarity obtaining submodule 8043, configured to obtain, after the scheme search submodule 8041 fails to find a fault self-healing scheme that performs self-healing on the fault corresponding to the cause information, a first similarity between each piece of known operation and maintenance information and the cause information according to first semantics of the description information in that piece of known operation and maintenance information and second semantics of the cause information, where each piece of known operation and maintenance information includes description information of a system abnormality and description information of a method for handling the system abnormality;
a second similarity obtaining submodule 8044, configured to obtain a second similarity between each piece of known operation and maintenance information and the alarm information according to the first semantics and third semantics of the alarm information;
an information selection submodule 8045, configured to select description information of candidate processing methods, according to the first similarity and the second similarity, from the description information of system abnormality handling methods included in the pieces of known operation and maintenance information;
a candidate scheme determination submodule 8046, configured to obtain, for each candidate processing method, candidate processing tasks based on the description information of that candidate processing method, so as to obtain a candidate fault self-healing scheme containing the candidate processing tasks;
a self-healing scheme determination submodule 8047, configured to determine the fault self-healing scheme from the candidate fault self-healing schemes.
The first similarity is the similarity between the first semantics of the description information in each piece of known operation and maintenance information and the second semantics of the cause information, and the second similarity is the similarity between the first semantics and the third semantics of the alarm information. Determining the description information of the candidate processing methods from both similarities therefore takes into account how close the semantics of the cause information and of the alarm information each are to the semantics of the description information in the known operation and maintenance information. As a result, the candidate processing methods corresponding to the selected description information can handle the fault behind the alarm information fairly accurately, which in turn makes the determined fault self-healing scheme more accurate.
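The selection by two similarities can be illustrated with the Python sketch below. Token-overlap (Jaccard) similarity is swapped in here as a simple stand-in for semantic similarity, and averaging the two scores and keeping the top k non-zero results are assumptions; the disclosure does not prescribe the similarity measure or the combination rule.

```python
def jaccard(a, b):
    """Token-overlap similarity, a crude stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_candidates(knowledge, cause, alarm, top_k=2):
    """Pick handling descriptions whose problem text best matches both inputs."""
    scored = []
    for entry in knowledge:
        first = jaccard(entry["problem"], cause)   # first similarity
        second = jaccard(entry["problem"], alarm)  # second similarity
        scored.append(((first + second) / 2, entry["answer"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [answer for score, answer in scored[:top_k] if score > 0]


# Hypothetical "question-answer" knowledge base entries.
knowledge = [
    {"problem": "disk full on node", "answer": "clean logs"},
    {"problem": "network link down", "answer": "close link and reroute"},
    {"problem": "memory leak in service", "answer": "restart service"},
]
candidates = select_candidates(knowledge, "disk full", "node disk usage alarm")
print(candidates)
```

A production system would more plausibly compare sentence embeddings from the NLP model than raw tokens, but the selection logic around the similarity function would look much the same.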
Referring to FIG. 11b, FIG. 11b is a schematic structural diagram of a fifth fault processing apparatus provided by an embodiment of the present disclosure. On the basis of the above embodiment, the scheme obtaining module 804 further includes:
a probability obtaining submodule 8048, configured to obtain, after the information selection submodule 8045 has selected the description information, a first success probability of healing the fault corresponding to the cause information with each candidate processing method;
The self-healing scheme determination submodule 8047 includes:
a probability estimation unit 80471, configured to estimate, for each candidate fault self-healing scheme, according to current network environment information of the business system and the candidate processing tasks included in that candidate scheme, a second success probability of healing the fault corresponding to the cause information with that candidate scheme;
a self-healing scheme determination unit 80472, configured to determine the fault self-healing scheme from the candidate fault self-healing schemes according to the first success probability and the second success probability.
Because the fault self-healing scheme is determined from the candidate fault self-healing schemes according to the first success probability and the second success probability, and because the first success probability represents the probability that each candidate processing method itself succeeds at fault self-healing while the second success probability takes into account both the candidate processing tasks of the candidate fault self-healing scheme and the current network environment information of the business system (so that it is adapted to that environment), the two probabilities assess the chance of success of each candidate fault self-healing scheme from two different angles. Selecting on the basis of both probabilities therefore increases the probability that fault self-healing with the determined scheme succeeds.
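One way to combine the two success probabilities is sketched below. The disclosure does not state how they are combined; multiplying them is an assumption made here for illustration, and the `name`, `p_first`, and `p_second` fields are hypothetical.

```python
def choose_scheme(candidates):
    """Pick the candidate scheme with the highest combined success probability.

    candidates: list of dicts with hypothetical keys "name", "p_first"
                (success probability of the processing method itself) and
                "p_second" (success probability under the current network
                environment). The product is an assumed combination rule.
    """
    if not candidates:
        raise ValueError("no candidate fault self-healing schemes")
    return max(candidates, key=lambda c: c["p_first"] * c["p_second"])
```

With this rule, a scheme that is usually effective but poorly suited to the current network environment can lose to one that is slightly less effective in general but well suited to current conditions.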
Referring to FIG. 12, which is a schematic structural diagram of a sixth fault handling apparatus provided by an embodiment of the present disclosure, on the basis of the above embodiment the probability estimation unit 80471 includes:
a time determining subunit 804711, configured to determine the execution time of each candidate processing task according to the task parameters of the candidate processing tasks included in the candidate fault self-healing scheme and the dependencies between those tasks;
a probability estimation subunit 804712, configured to estimate, according to the execution time of each candidate processing task and the current network environment information of the business system, the second success probability of self-healing the fault corresponding to the cause information by using the candidate fault self-healing scheme.
Because the second success probability is estimated from the execution time of each candidate processing task and the current network environment information, it is tied to those execution times. The execution time of the candidate processing tasks affects the efficiency of fault self-healing, so the estimated second success probability reflects that efficiency. Determining the fault self-healing scheme from the first and second success probabilities therefore improves the efficiency of fault self-healing when the scheme is executed.
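The two steps above (execution time from task parameters and dependencies, then a probability from execution time and network conditions) might look like the following sketch. The per-task durations, the `slowdown` multiplier representing the network environment, the `time_budget`, and the linear mapping to a probability are all hypothetical; the disclosure does not give a formula.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def finish_times(durations, deps):
    """Earliest finish time of each task, honoring inter-task dependencies.

    durations: task name -> estimated execution time in seconds (assumed to
               be derived from the task parameters).
    deps:      task name -> set of tasks that must finish first.
    """
    graph = {task: deps.get(task, set()) for task in durations}
    finish = {}
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[task]), default=0.0)
        finish[task] = start + durations[task]
    return finish


def second_success_probability(finish, slowdown, time_budget):
    """Hypothetical mapping from total time and network state to a probability.

    slowdown:    multiplier >= 1 standing in for the current network environment.
    time_budget: seconds after which the self-healing attempt counts as failed.
    """
    total = max(finish.values()) * slowdown
    return max(0.0, 1.0 - total / time_budget)
```

For a chain of three tasks taking 10, 20, and 5 seconds, the last task finishes at 35 seconds; a scheme whose total time approaches the budget under the current network conditions receives a lower estimated probability.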
In one embodiment of the present disclosure, the fault self-healing module 805 includes:
a tool determining submodule, configured to determine a task execution tool that matches the task type of each fault self-healing task;
a fault self-healing submodule, configured to invoke, in the execution order of the fault self-healing tasks in the fault self-healing scheme, the task execution tool corresponding to each fault self-healing task, so as to execute the fault self-healing tasks and perform fault self-healing.
Because each task execution tool matches the task type of a fault self-healing task, and a tool that matches a task's type is able to execute that task, invoking these tools executes the fault self-healing tasks and thereby achieves fault self-healing.
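Matching tools to task types and invoking them in order can be pictured as a simple registry lookup. In this sketch the task types (`restart`, `scale`), the task fields, and the tools themselves are hypothetical placeholders rather than anything named in the disclosure.

```python
def run_scheme(tasks, tools):
    """Execute tasks in the scheme's order, each with the tool for its type.

    tasks: list of dicts with a "type" key, already in execution order.
    tools: mapping from task type to a callable that executes such a task.
    """
    results = []
    for task in tasks:
        tool = tools.get(task["type"])
        if tool is None:
            raise LookupError(f"no execution tool registered for {task['type']!r}")
        results.append(tool(task))
    return results


# Hypothetical tools and scheme, for illustration only.
tools = {
    "restart": lambda t: f"restarted {t['target']}",
    "scale": lambda t: f"scaled {t['target']} to {t['replicas']}",
}
scheme = [
    {"type": "scale", "target": "api", "replicas": 4},
    {"type": "restart", "target": "cache"},
]
```

Raising an error for an unregistered task type is a design choice of this sketch; an implementation could equally skip the task or fall back to a default tool.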
In one embodiment of the present disclosure, the apparatus further includes:
a process monitoring module, configured to monitor the execution process of each fault self-healing task;
a task control module, configured to adjust the scheduling order of the fault self-healing tasks and/or control the execution progress of the fault self-healing tasks when abnormal task execution is detected.
Because the scheduling order of the fault self-healing tasks is adjusted and/or their execution progress is controlled when a task executes abnormally, problems that arise can be corrected in time, so that fault self-healing completes smoothly.
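A minimal sketch of monitoring with schedule adjustment follows. It assumes one particular policy that the disclosure does not prescribe: a failing task is moved to the back of the queue (adjusting the scheduling order) up to `max_retries` times, after which execution is halted (controlling the execution progress). The function and parameter names are hypothetical.

```python
from collections import deque


def run_with_monitoring(tasks, max_retries=1):
    """Run (name, callable) tasks, rescheduling or halting on failure.

    Returns a log of (task name, outcome) tuples, where outcome is one of
    "ok", "rescheduled", or "aborted".
    """
    queue = deque((name, fn, 0) for name, fn in tasks)
    log = []
    while queue:
        name, fn, attempts = queue.popleft()
        try:
            fn()
            log.append((name, "ok"))
        except Exception:
            if attempts < max_retries:
                # Adjust the scheduling order: retry after the remaining tasks.
                queue.append((name, fn, attempts + 1))
                log.append((name, "rescheduled"))
            else:
                # Control execution progress: stop the remaining tasks.
                log.append((name, "aborted"))
                break
    return log
```

Moving a failed task behind the others is only safe when the remaining tasks do not depend on it; a fuller implementation would consult the inter-task dependencies before reordering.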
An embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the fault handling method.
An embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the fault handling method.
An embodiment of the present disclosure provides a computer program product, including a computer program that, when executed by a processor, implements the fault handling method.
FIG. 13 shows a schematic block diagram of an example electronic device 1300 that may be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 13, the electronic device 1300 includes a computing unit 1301, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 1302 or a computer program loaded from a storage unit 1308 into a random access memory (RAM) 1303. The RAM 1303 can also store various programs and data required for the operation of the electronic device 1300. The computing unit 1301, the ROM 1302, and the RAM 1303 are connected to one another through a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
Multiple components of the electronic device 1300 are connected to the I/O interface 1305, including: an input unit 1306, such as a keyboard or a mouse; an output unit 1307, such as various types of displays and speakers; a storage unit 1308, such as a magnetic disk or an optical disc; and a communication unit 1309, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1309 allows the electronic device 1300 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1301 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the computing unit 1301 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1301 performs the methods and processing described above, such as the fault handling method. For example, in some embodiments, the fault handling method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the fault handling method described above can be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured to perform the fault handling method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the result expected from the technical solution disclosed herein can be achieved; no limitation is imposed in this respect.
The specific implementations described above do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (22)

  1. A fault handling method, comprising:
    obtaining alarm information generated by a business system;
    obtaining, according to monitoring data of the business system, associated data related to the alarm information;
    determining, according to the associated data, cause information of the fault that triggered generation of the alarm information;
    obtaining, according to the cause information and the alarm information, a fault self-healing scheme that includes fault self-healing tasks;
    performing fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
  2. The method according to claim 1, wherein the obtaining, according to the monitoring data of the business system, associated data related to the alarm information comprises:
    obtaining, according to the monitoring data of the business system, information in at least one of the following ways as the associated data related to the alarm information:
    obtaining, from the monitoring data, other alarm information that falls within a first time period containing the alarm time recorded in the alarm information and that is directed at a target service node, wherein the target service node is the service node at which the alarm information is directed;
    determining an influencing factor that triggers faults of a target fault type, and obtaining, according to the values of the influencing factor recorded in the monitoring data within a second time period containing the alarm time, first information characterizing the fluctuation of the influencing factor, wherein the target fault type is the fault type recorded in the alarm information;
    obtaining, according to the latest system update time recorded in the monitoring data and the alarm time, second information indicating whether a system update occurred in the business system within a third time period before the alarm information was generated;
    selecting, from the monitoring data, the amount of available resources of the business system within a fourth time period containing the alarm time.
  3. The method according to claim 1, wherein the obtaining, according to the cause information and the alarm information, a fault self-healing scheme that includes fault self-healing tasks comprises:
    searching, according to the alarm information, among known fault self-healing schemes for a fault self-healing scheme that performs self-healing processing on the fault corresponding to the cause information;
    determining the fault self-healing scheme that is found as the fault self-healing scheme that includes fault self-healing tasks.
  4. The method according to claim 3, wherein the searching, according to the alarm information, among known fault self-healing schemes for a fault self-healing scheme that performs self-healing processing on the fault corresponding to the cause information comprises:
    extracting a target field value of a preset field in the alarm information;
    searching, based on the target field value, among the known fault self-healing schemes for a fault self-healing scheme that performs self-healing processing on the fault corresponding to the cause information and that includes a target self-healing task, wherein the target self-healing task is a fault self-healing task set according to the target field value of the preset field.
  5. The method according to claim 4, wherein
    the preset field comprises at least one of the following fields:
    the alarm time of the alarm information, an identifier of the service node at which the alarm information is directed, an identifier of the device that generated the alarm information, an identifier of the equipment room in which the device is located, an identifier of the instance that triggered generation of the alarm information, and abnormality description information.
  6. The method according to any one of claims 3-5, wherein, if no fault self-healing scheme that performs self-healing processing on the fault corresponding to the cause information is found, the method further comprises:
    obtaining a first similarity between each piece of known operation and maintenance information and the cause information according to first semantics of the description information in each piece of known operation and maintenance information and second semantics of the cause information, wherein each piece of known operation and maintenance information includes description information of a system abnormality and description information of a method of handling the system abnormality;
    obtaining a second similarity between each piece of known operation and maintenance information and the alarm information according to the first semantics and third semantics of the alarm information;
    selecting, according to the first similarity and the second similarity, description information of candidate processing methods from the description information of the methods of handling system abnormalities included in the pieces of known operation and maintenance information;
    obtaining, for each candidate processing method and based on the description information of that candidate processing method, candidate processing tasks, so as to obtain a candidate fault self-healing scheme that includes the candidate processing tasks;
    determining the fault self-healing scheme from the candidate fault self-healing schemes.
  7. The method according to claim 6, wherein
    after the selecting, according to the first similarity and the second similarity, description information of candidate processing methods from the description information of the methods of handling system abnormalities included in the pieces of known operation and maintenance information, the method further comprises:
    obtaining a first success probability of fault self-healing when each candidate processing method is applied to the fault corresponding to the cause information;
    the determining the fault self-healing scheme from the candidate fault self-healing schemes comprises:
    estimating, for each candidate fault self-healing scheme and according to the current network environment information of the business system and the candidate processing tasks included in that candidate fault self-healing scheme, a second success probability of self-healing the fault corresponding to the cause information by using that candidate fault self-healing scheme;
    determining the fault self-healing scheme from the candidate fault self-healing schemes according to the first success probability and the second success probability.
  8. The method according to claim 7, wherein the estimating, according to the current network environment information of the business system and the candidate processing tasks included in the candidate fault self-healing scheme, a second success probability of self-healing the fault corresponding to the cause information by using the candidate fault self-healing scheme comprises:
    determining the execution time of each candidate processing task according to the task parameters of the candidate processing tasks included in the candidate fault self-healing scheme and the dependencies between those tasks;
    estimating, according to the execution time of each candidate processing task and the current network environment information of the business system, the second success probability of self-healing the fault corresponding to the cause information by using the candidate fault self-healing scheme.
  9. The method according to any one of claims 1-3, wherein the performing fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme comprises:
    determining a task execution tool that matches the task type of each fault self-healing task;
    invoking, in the execution order of the fault self-healing tasks in the fault self-healing scheme, the task execution tool corresponding to each fault self-healing task, so as to execute the fault self-healing tasks and perform fault self-healing.
  10. The method according to any one of claims 1-3, further comprising:
    monitoring the execution process of each fault self-healing task;
    adjusting the scheduling order of the fault self-healing tasks and/or controlling the execution progress of the fault self-healing tasks when abnormal task execution is detected.
  11. A fault handling apparatus, comprising:
    an information obtaining module, configured to obtain alarm information generated by a business system;
    a data obtaining module, configured to obtain, according to monitoring data of the business system, associated data related to the alarm information;
    an information determining module, configured to determine, according to the associated data, cause information of the fault that triggered generation of the alarm information;
    a scheme obtaining module, configured to obtain, according to the cause information and the alarm information, a fault self-healing scheme that includes fault self-healing tasks;
    a fault self-healing module, configured to perform fault self-healing by executing the fault self-healing tasks included in the fault self-healing scheme.
  12. The apparatus according to claim 11, wherein the data obtaining module comprises:
    at least one of the following submodules, which obtains information according to the monitoring data of the business system as the associated data related to the alarm information:
    an alarm information obtaining submodule, configured to obtain, from the monitoring data, other alarm information that falls within a first time period containing the alarm time recorded in the alarm information and that is directed at a target service node, wherein the target service node is the service node at which the alarm information is directed;
    a first information obtaining submodule, configured to determine an influencing factor that triggers faults of a target fault type, and to obtain, according to the values of the influencing factor recorded in the monitoring data within a second time period containing the alarm time, first information characterizing the fluctuation of the influencing factor, wherein the target fault type is the fault type recorded in the alarm information;
    a second information obtaining submodule, configured to obtain, according to the latest system update time recorded in the monitoring data and the alarm time, second information indicating whether a system update occurred in the business system within a third time period before the alarm information was generated;
    a resource amount selection submodule, configured to select, from the monitoring data, the amount of available resources of the business system within a fourth time period containing the alarm time.
  13. The apparatus according to claim 11, wherein the scheme obtaining module comprises:
    a scheme search submodule, configured to search, according to the alarm information, among known fault self-healing schemes for a fault self-healing scheme that performs self-healing processing on the fault corresponding to the cause information;
    a scheme determining submodule, configured to determine the fault self-healing scheme that is found as the fault self-healing scheme that includes fault self-healing tasks.
  14. The apparatus according to claim 13, wherein the scheme search sub-module is further configured to extract a target field value of a preset field from the alarm information, and to search, based on the target field value, among the known fault self-healing schemes for a fault self-healing scheme that performs self-healing on the fault corresponding to the cause information and that includes a target self-healing task, wherein the target self-healing task is a fault self-healing task configured according to the target field value of the preset field.
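As a rough illustration of the field-based lookup in this claim, the following Python sketch extracts preset fields from an alarm and instantiates the tasks of a matching known scheme. The dictionary layout, the field names `node_id` and `room_id`, and the template syntax are hypothetical; the claim does not prescribe any of them.

```python
# Hypothetical preset fields; claim 15 lists the kinds of fields that may be used.
PRESET_FIELDS = ("node_id", "room_id")


def extract_fields(alarm, preset_fields=PRESET_FIELDS):
    """Return the target field values of the preset fields present in the alarm."""
    return {f: alarm[f] for f in preset_fields if f in alarm}


def find_scheme(cause, field_values, known_schemes):
    """Find a known scheme for `cause` whose tasks can be set from the field values.

    Each known scheme is assumed to be a dict with keys "cause",
    "required_fields" and "task_templates".
    """
    for scheme in known_schemes:
        if scheme["cause"] != cause:
            continue
        if all(f in field_values for f in scheme["required_fields"]):
            # Fill each task template with the extracted target field values.
            tasks = [t.format(**field_values) for t in scheme["task_templates"]]
            return {"cause": cause, "tasks": tasks}
    return None
```

Returning `None` when nothing matches leaves room for a fallback such as the similarity-based selection of claim 16.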
  15. The apparatus according to claim 14, wherein
    the preset field comprises at least one of the following fields:
    the alarm time of the alarm information, an identifier of the service node at which the alarm information is directed, an identifier of the device that generated the alarm information, an identifier of the equipment room in which the device is located, an identifier of the instance that triggered generation of the alarm information, and abnormality description information.
  16. The apparatus according to any one of claims 13-15, wherein the scheme obtaining module further comprises:
    a first similarity obtaining sub-module, configured to obtain a first similarity between each piece of known operation and maintenance information and the cause information according to first semantics of the description information in that piece of known operation and maintenance information and second semantics of the cause information, wherein each piece of known operation and maintenance information comprises description information of a system abnormality and description information of a handling manner for the system abnormality;
    a second similarity obtaining sub-module, configured to obtain a second similarity between each piece of known operation and maintenance information and the alarm information according to the first semantics and third semantics of the alarm information;
    an information selection sub-module, configured to select, according to the first similarity and the second similarity, description information of candidate handling manners from the description information of system abnormality handling manners included in the pieces of known operation and maintenance information;
    a candidate scheme determining sub-module, configured to obtain, for each candidate handling manner, a candidate processing task based on the description information of that candidate handling manner, so as to obtain a candidate fault self-healing scheme including the candidate processing task;
    a self-healing scheme determining sub-module, configured to determine the fault self-healing scheme from among the candidate fault self-healing schemes.
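The two-similarity selection in this claim can be sketched as follows. The claim does not say how semantics are compared, so this sketch substitutes a simple bag-of-words cosine similarity for a real semantic model; the record keys `abnormality` and `handling`, the weighted average, and the `top_k` cut-off are likewise assumptions.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, used here as a stand-in for semantic similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def select_candidates(records, cause, alarm_text, top_k=2, weight=0.5):
    """Rank known O&M records by a weighted mix of the two similarities.

    Returns the handling descriptions of the `top_k` best-scoring records.
    """
    scored = []
    for rec in records:
        s1 = cosine(rec["abnormality"], cause)       # first similarity
        s2 = cosine(rec["abnormality"], alarm_text)  # second similarity
        scored.append((weight * s1 + (1 - weight) * s2, rec["handling"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [handling for _, handling in scored[:top_k]]
```

Each selected handling description would then be turned into candidate processing tasks by the candidate scheme determining sub-module.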
  17. The apparatus according to claim 16, wherein the scheme obtaining module further comprises:
    a probability obtaining sub-module, configured to obtain, after the information selection sub-module has performed its selection, a first success probability of self-healing the fault corresponding to the cause information using each candidate handling manner;
    wherein the self-healing scheme determining sub-module comprises:
    a probability estimation unit, configured to estimate, for each candidate fault self-healing scheme, according to current network environment information of the business system and the candidate processing tasks included in that candidate fault self-healing scheme, a second success probability of self-healing the fault corresponding to the cause information using that candidate fault self-healing scheme;
    a self-healing scheme determining unit, configured to determine the fault self-healing scheme from among the candidate fault self-healing schemes according to the first success probability and the second success probability.
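One possible way to combine the two success probabilities is sketched below. The claim only states that the scheme is determined according to both probabilities; multiplying them and taking the maximum is an assumption chosen for simplicity, and the tuple layout is hypothetical.

```python
def choose_scheme(candidates):
    """Pick the candidate with the highest combined success probability.

    `candidates` is a list of (name, first_probability, second_probability)
    tuples. The product of the two probabilities is used as the combined
    score, which is an illustrative choice rather than the claimed rule.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c[1] * c[2])
    return best[0]
```

For example, with candidates `("restart", 0.9, 0.5)` and `("failover", 0.7, 0.8)`, the combined scores are 0.45 and 0.56, so `"failover"` is chosen.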
  18. The apparatus according to claim 17, wherein the probability estimation unit comprises:
    a time consumption determining sub-unit, configured to determine the execution time of each candidate processing task according to task parameters of the candidate processing tasks included in the candidate fault self-healing scheme and the dependencies between those tasks;
    a probability estimation sub-unit, configured to estimate, according to the execution time of each candidate processing task and the current network environment information of the business system, the second success probability of self-healing the fault corresponding to the cause information using the candidate fault self-healing scheme.
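The dependency-aware timing in this claim can be sketched in Python as below. Several things are assumptions: "execution time" is interpreted as the time at which each task completes when independent tasks run in parallel, the network environment is reduced to a single `loss_rate`, and the probability formula with its `time_budget` parameter is purely illustrative.

```python
import math
from graphlib import TopologicalSorter


def finish_times(durations, deps):
    """Completion time of each task, given per-task durations and dependencies.

    `deps` maps a task to the tasks it depends on. A task starts once all of
    its dependencies have finished; independent tasks run in parallel.
    """
    graph = {task: deps.get(task, ()) for task in durations}
    finish = {}
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in deps.get(task, ())), default=0.0)
        finish[task] = start + durations[task]
    return finish


def second_success_probability(durations, deps, loss_rate, time_budget):
    """Illustrative estimate: decays with total time and with network loss."""
    total = max(finish_times(durations, deps).values(), default=0.0)
    network_ok = (1.0 - loss_rate) ** len(durations)
    return network_ok * math.exp(-total / time_budget)
```

With this heuristic, a scheme whose tasks form a long dependency chain scores lower than one with the same tasks run in parallel, because its total completion time is larger.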
  19. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-10.
  20. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-10.
  21. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the method according to any one of claims 1-10.
  22. A computer program which, when run on a computer, causes the computer to perform the steps of the method according to any one of claims 1-10.
PCT/CN2022/106444 2021-08-06 2022-07-19 Fault processing method and apparatus, device, and storage medium WO2023011160A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110904245.7 2021-08-06
CN202110904245.7A CN113590370B (en) 2021-08-06 2021-08-06 Fault processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023011160A1 (en) 2023-02-09

Family

ID=78256004

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/106444 WO2023011160A1 (en) 2021-08-06 2022-07-19 Fault processing method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN113590370B (en)
WO (1) WO2023011160A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117271100A (en) * 2023-11-21 2023-12-22 北京国科天迅科技股份有限公司 Algorithm chip cluster scheduling method, device, computer equipment and storage medium
CN117811897A (en) * 2024-02-23 2024-04-02 济南通华电子技术有限公司 Intelligent analysis management system for internet of things card communication operation and maintenance worksheet data
CN117830961A (en) * 2024-03-06 2024-04-05 山东达斯特信息技术有限公司 Environment-friendly equipment operation and maintenance behavior analysis method and system based on image analysis
CN117834386A (en) * 2023-12-20 2024-04-05 北京联广通网络科技有限公司 Automatic alarm system and method for flow chart network monitoring faults
CN118042492A (en) * 2024-04-11 2024-05-14 深圳市友恺通信技术有限公司 Network data operation and maintenance management system and method based on 5G communication

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590370B (en) * 2021-08-06 2022-06-21 北京百度网讯科技有限公司 Fault processing method, device, equipment and storage medium
CN114996119B (en) * 2022-04-20 2023-03-03 中国工商银行股份有限公司 Fault diagnosis method, fault diagnosis device, electronic device and storage medium
CN116049146B (en) * 2023-02-13 2023-09-01 北京优特捷信息技术有限公司 Database fault processing method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080010522A1 (en) * 2006-06-09 2008-01-10 Fuji Xerox Co., Ltd. Failure diagnosis system, image forming apparatus, computer readable medium and data signal
CN107342878A (en) * 2016-04-29 2017-11-10 中兴通讯股份有限公司 A kind of fault handling method and device
CN108846484A (en) * 2018-04-11 2018-11-20 北京百度网讯科技有限公司 Fault self-recovery system, method, computer equipment and storage medium
CN108989132A (en) * 2018-08-24 2018-12-11 深圳前海微众银行股份有限公司 Fault warning processing method, system and computer readable storage medium
CN109088773A (en) * 2018-08-24 2018-12-25 广州视源电子科技股份有限公司 Fault self-recovery method, apparatus, server and storage medium
CN110380907A (en) * 2019-07-26 2019-10-25 京信通信系统(中国)有限公司 A kind of network fault diagnosis method, device, the network equipment and storage medium
CN110430071A (en) * 2019-07-19 2019-11-08 云南电网有限责任公司信息中心 Service node fault self-recovery method, apparatus, computer equipment and storage medium
CN110704231A (en) * 2019-09-30 2020-01-17 深圳前海微众银行股份有限公司 Fault processing method and device
CN113590370A (en) * 2021-08-06 2021-11-02 北京百度网讯科技有限公司 Fault processing method, device, equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4893663B2 (en) * 2008-03-06 2012-03-07 日本電気株式会社 Fault recovery device
US10223189B1 (en) * 2015-06-25 2019-03-05 Amazon Technologies, Inc. Root cause detection and monitoring for storage systems
CN105337765B (en) * 2015-10-10 2018-10-12 上海新炬网络信息技术股份有限公司 A kind of distribution hadoop cluster automatic fault diagnosis repair system
CN108446184B (en) * 2018-02-23 2021-09-07 北京天元创新科技有限公司 Method and system for analyzing fault root cause
CN112152830B (en) * 2019-06-28 2023-08-04 中国电力科学研究院有限公司 Intelligent fault root cause analysis method and system
CN110941528B (en) * 2019-11-08 2022-04-08 支付宝(杭州)信息技术有限公司 Log buried point setting method, device and system based on fault
CN111181767A (en) * 2019-12-10 2020-05-19 中国航空工业集团公司成都飞机设计研究所 Monitoring and fault self-healing system and method for complex system
CN111796959B (en) * 2020-06-30 2023-08-08 中国工商银行股份有限公司 Self-healing method, device and system for host container
CN112506695A (en) * 2021-01-16 2021-03-16 鸣飞伟业技术有限公司 IT operation and maintenance risk early warning method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117271100A (en) * 2023-11-21 2023-12-22 北京国科天迅科技股份有限公司 Algorithm chip cluster scheduling method, device, computer equipment and storage medium
CN117271100B (en) * 2023-11-21 2024-02-06 北京国科天迅科技股份有限公司 Algorithm chip cluster scheduling method, device, computer equipment and storage medium
CN117834386A (en) * 2023-12-20 2024-04-05 北京联广通网络科技有限公司 Automatic alarm system and method for flow chart network monitoring faults
CN117811897A (en) * 2024-02-23 2024-04-02 济南通华电子技术有限公司 Intelligent analysis management system for internet of things card communication operation and maintenance worksheet data
CN117811897B (en) * 2024-02-23 2024-04-30 济南通华电子技术有限公司 Intelligent analysis management system for internet of things card communication operation and maintenance worksheet data
CN117830961A (en) * 2024-03-06 2024-04-05 山东达斯特信息技术有限公司 Environment-friendly equipment operation and maintenance behavior analysis method and system based on image analysis
CN117830961B (en) * 2024-03-06 2024-05-10 山东达斯特信息技术有限公司 Environment-friendly equipment operation and maintenance behavior analysis method and system based on image analysis
CN118042492A (en) * 2024-04-11 2024-05-14 深圳市友恺通信技术有限公司 Network data operation and maintenance management system and method based on 5G communication

Also Published As

Publication number Publication date
CN113590370A (en) 2021-11-02
CN113590370B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
WO2023011160A1 (en) Fault processing method and apparatus, device, and storage medium
EP3889777A1 (en) System and method for automating fault detection in multi-tenant environments
US20180114234A1 (en) Systems and methods for monitoring and analyzing computer and network activity
US8516499B2 (en) Assistance in performing action responsive to detected event
JP2019536185A (en) System and method for monitoring and analyzing computer and network activity
CN112087334A (en) Alarm root cause analysis method, electronic device and storage medium
US20220350690A1 (en) Training method and apparatus for fault recognition model, fault recognition method and apparatus, and electronic device
AU2019201510A1 (en) Platform for supporting multiple virtual agent applications
US20210158210A1 (en) Hybrid in-domain and out-of-domain document processing for non-vocabulary tokens of electronic documents
CN114328132A (en) Method, device, equipment and medium for monitoring state of external data source
CN115603955B (en) Abnormal access object identification method, device, equipment and medium
CN116755974A (en) Cloud computing platform operation and maintenance method and device, electronic equipment and storage medium
WO2023103344A1 (en) Data processing method and apparatus, device, and storage medium
CN113239054B (en) Information generation method and related device
US11188405B1 (en) Similar alert identification based on application fingerprints
CN111340222B (en) Neural network model searching method and device and electronic equipment
CN113590774A (en) Event query method, device and storage medium
CN113112311A (en) Method for training causal inference model, information prompting method and device
US11539650B2 (en) System and method for alerts for missing coverage of chatbot conversation messages
US11977439B2 (en) Method and system for actionable smart monitoring of error messages
US20230139008A1 (en) Failure analysis and recommendation service for automated executions
WO2024027127A1 (en) Fault detection method and apparatus, and electronic device and readable storage medium
US20240118960A1 (en) Error context for bot optimization
US11675644B2 (en) Method and system for managing notifications for flapping incidents
US11546415B2 (en) Intelligent server migration platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22851874

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE