CN112749064A

CN112749064A - Method and system for predicting and self-healing fault of software application service

Info

Publication number: CN112749064A
Application number: CN202110082882.0A
Authority: CN
Inventors: 孙国良
Original assignee: Beijing Minglue Zhaohui Technology Co Ltd
Current assignee: Beijing Minglue Zhaohui Technology Co Ltd
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2021-05-04

Abstract

The invention relates to a method and a system for predicting and self-healing software application service faults. The method comprises the following steps: a monitoring system monitors the service system; when an alarm is triggered, the monitoring system sends the alarm to an alarm control node; and the alarm control node responds according to preset logic. Through the monitoring system, the invention can sense possible problems of the service system in advance and can raise an alarm and respond to the affected service in time without manual intervention, thereby reducing both the avalanche effect in which a problem in one service brings down all services and the occurrence of situations in which the service cannot be provided externally.

Description

Method and system for predicting and self-healing fault of software application service

Technical Field

The invention relates to the field of system fault processing, in particular to a method and a system for predicting and self-healing a software application service fault.

Background

In traditional application service operation and maintenance, a failure is handled by manual intervention only after the application service has failed. By that point the service is already affected, and if the back end has no circuit-breaking mechanism or service degradation, an avalanche effect can make the service completely unavailable, with very serious consequences for the business.

Specifically, after the application service fails and the responsible person receives the alarm, the operation and maintenance engineer and the corresponding development engineer go online together and decide how to handle the problem based on the logs and the state of the service. During this process, the engineer may be unable to get online, or may be unable to resolve the problem promptly because of an incomplete understanding of the business, so the application service takes too long to recover and remains inaccessible for a long time.

The prior art is a reactive, after-the-fact mechanism: the application service is dealt with only after a problem has already appeared, which affects users. In addition, the service may remain unusable for an even longer time because the personnel handling it cannot get online in time.

Disclosure of Invention

In order to solve the technical problems, the invention provides a method and a system for software application service fault prediction and fault self-healing.

The technical scheme for solving the technical problems is as follows:

a method for predicting and self-healing a software application service fault comprises the following steps:

the monitoring system monitors the service system;

when an alarm is triggered, the monitoring system sends the alarm to an alarm control node;

and the alarm control node responds according to preset logic.

On the basis of the technical scheme, the invention can be further improved as follows.

Further, the monitoring system monitors the service system, and specifically includes:

transmitting the application logs collected by Filebeat to Logstash, preprocessing the application logs in Logstash, transmitting them through Kafka to Elasticsearch, and monitoring for user-defined key error strings with ElastAlert;

monitoring the internal call links of the service through the SkyWalking full-link monitoring system;

and monitoring the corresponding service by simulating user access through Zabbix.

Further, the number of the monitoring systems is three, and the alarm control node responds according to preset logic, specifically including:

if only one monitoring system raises an alarm in the same time period, sending the alarm to the corresponding responsible person and recording the alarm;

if two monitoring systems raise alarms in the same time period, sending the alarm to the corresponding responsible person, checking the load of the corresponding application system, and, if the load is high, automatically expanding the application nodes;

if all three monitoring systems raise alarms in the same time period, sending the alarm to the corresponding responsible person, automatically expanding the application nodes, then automatically removing the application node from load balancing and restarting it, and bringing the application node back online after detecting that the service is normal.
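
As an illustration of how alarms could be grouped into "the same time period", the following minimal Python sketch counts the distinct monitoring systems that alarmed within a recent window. The function name, the (timestamp, source) tuple layout, and the 180-second period are assumptions made for the example; the text above does not specify the length of the period.

```python
def distinct_sources_in_period(alarms, now, period_seconds=180):
    """Return the set of monitoring systems that alarmed in the last period.

    alarms: iterable of (timestamp_seconds, source_name) tuples.
    period_seconds: assumed window length; not specified by the source text.
    """
    window_start = now - period_seconds
    return {source for ts, source in alarms if window_start < ts <= now}


# Hypothetical alarm history from the three monitoring systems.
history = [(0, "log"), (100, "skywalking"), (170, "log"), (200, "zabbix")]
active = distinct_sources_in_period(history, now=200)
```

The size of the returned set (one, two, or three) is what selects among the three responses listed above; repeated alarms from the same monitoring system count only once.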

In order to achieve the above object, the present invention further provides a system for predicting and self-healing a failure of a software application service, comprising:

the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;

and the alarm control node is used for responding according to preset logic.

Further, the monitoring system is specifically configured to:

Further, the number of the monitoring systems is three, and the alarm control node is specifically configured to:

The invention has the beneficial effects that:

the monitoring system senses possible problems of the service system in advance, and an alarm is raised and a response is made in time for the service that is about to have problems, without manual intervention, so that the avalanche effect in which a problem in one service brings down all services is reduced and situations in which the service cannot be provided externally are avoided.

Drawings

Fig. 1 is a flowchart of a method for predicting and self-healing a failure of a software application service according to an embodiment of the present invention;

Fig. 2 is a flowchart of another method for predicting and self-healing a failure of a software application service according to an embodiment of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

As businesses grow, more and more services are delivered online, releases become more frequent, and access volume increases dramatically. An application can crash at any time. By combining logs with monitoring systems, this possibility can be sensed in advance and nipped in the bud. Even if an application cannot provide service normally for some reason, the invention can sense this in advance and take the application offline and restart it, minimizing the scope of impact. The method thereby stabilizes the service, reduces application service faults, and improves the availability of the applications that serve users.

Fig. 1 is a flowchart of a method for predicting and self-healing a software application service failure according to an embodiment of the present invention, and as shown in fig. 1, the method includes:

110. the monitoring system monitors the service system;

optionally, in this embodiment, step 110 specifically includes:

1101. transmitting the application logs collected by Filebeat to Logstash, preprocessing the application logs in Logstash, transmitting them through Kafka to Elasticsearch, and monitoring for user-defined key error strings with ElastAlert;

1102. monitoring the internal call links of the service through the SkyWalking full-link monitoring system;

1103. monitoring the corresponding service by simulating user access through Zabbix.

120. When an alarm is triggered, the monitoring system sends the alarm to an alarm control node;

Specifically, when the monitoring system performs step 1101, an alarm is sent to the alarm control node if a key error string, such as an Nginx 4xx or 5xx status code, occurs 5 times within 3 minutes.
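
The counting rule for step 1101 can be sketched as a sliding-window counter. This is a minimal Python illustration under stated assumptions, not ElastAlert's own rule engine: the class and function names are hypothetical, timestamps are assumed to be seconds in non-decreasing order, and only the 5-events-in-3-minutes threshold comes from the text.

```python
from collections import deque


class KeywordAlarmCounter:
    """Counts key-error log events inside a sliding time window."""

    def __init__(self, threshold=5, window_seconds=180):
        self.threshold = threshold            # 5 occurrences (from the text)
        self.window_seconds = window_seconds  # 3 minutes (from the text)
        self._timestamps = deque()

    def record(self, timestamp):
        """Register one matching log line; return True if an alarm should fire.

        Timestamps are assumed to arrive in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


def is_key_error(status_code):
    """Treat Nginx 4xx and 5xx responses as key error events."""
    return 400 <= status_code <= 599
```

A caller would pass each parsed log line's status code through `is_key_error` and, on a match, call `record` with the line's timestamp; a `True` result corresponds to sending the alarm to the alarm control node.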

When the monitoring system performs step 1102, if a call link inside the service takes longer than 1 second and this condition lasts for 3 minutes, the corresponding alarm is sent to the alarm control node.

When the monitoring system performs step 1103, if the access response time is more than 2 times the usual service access response time and this condition lasts for 3 minutes, an alarm is sent to the alarm control node.
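
The conditions for steps 1102 and 1103 share one pattern: a metric must stay above a limit for a sustained period before an alarm fires. The following Python sketch shows that pattern under stated assumptions; the class name, the sample-by-sample interface, and the 0.2-second baseline are hypothetical, and it is not how SkyWalking or Zabbix evaluate their own rules.

```python
class SustainedThresholdDetector:
    """Fires when a metric stays above a limit for a minimum duration."""

    def __init__(self, limit, hold_seconds=180):
        self.limit = limit
        self.hold_seconds = hold_seconds  # 3 minutes (from the text)
        self._breach_started = None       # time of the first sample above the limit

    def observe(self, timestamp, value):
        """Feed one sample; return True once the breach has lasted long enough."""
        if value <= self.limit:
            # Any sample back under the limit resets the breach.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.hold_seconds


# Step 1102: internal call link slower than 1 second for 3 minutes.
link_detector = SustainedThresholdDetector(limit=1.0)

# Step 1103: response time above twice the usual value for 3 minutes.
USUAL_RESPONSE_SECONDS = 0.2  # hypothetical baseline, not from the source
access_detector = SustainedThresholdDetector(limit=2 * USUAL_RESPONSE_SECONDS)
```

Resetting on the first healthy sample is one possible reading of "lasts for 3 minutes"; a more tolerant variant could allow a small number of healthy samples inside the period.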

130. The alarm control node responds according to preset logic.

Specifically, in this step, the alarm control node determines, through a preset algorithm, whether the node needs to be taken offline and restarted.

As shown in fig. 2, based on the three monitoring systems in the foregoing process, step 130 specifically includes:

1301. if only one monitoring system raises an alarm in the same time period, sending the alarm to the corresponding responsible person and recording the alarm;

1302. if two monitoring systems raise alarms in the same time period, sending the alarm to the corresponding responsible person, checking the load of the corresponding application system, and, if the load is high, automatically expanding the application nodes;

in step 1302, while the application nodes are being automatically expanded, an alarm may also be sent prompting the corresponding responsible person to decide whether to intervene manually and restart the corresponding application through the program.

1303. if all three monitoring systems raise alarms in the same time period, sending the alarm to the corresponding responsible person, automatically expanding the application nodes, then automatically removing the application node from load balancing and restarting it, and bringing the application node back online after detecting that the service is normal.
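
The response logic of steps 1301 to 1303 can be sketched as a small decision function. This is a hedged illustration rather than the patented implementation: the `Action` names and the `plan_response` function are hypothetical, and the boolean `load_is_high` stands in for the load check, whose criterion is not specified here.

```python
from enum import Enum


class Action(Enum):
    NOTIFY = "notify responsible person"
    RECORD = "record alarm"
    SCALE_OUT = "expand application nodes"
    REMOVE_FROM_LB = "remove node from load balancing"
    RESTART = "restart node"
    HEALTH_CHECK = "verify service is normal"
    BRING_ONLINE = "bring node back online"


def plan_response(alarming_systems, load_is_high=False):
    """Return the ordered actions for the monitors alarming in one period."""
    count = len(set(alarming_systems))  # duplicates from one monitor count once
    if count == 0:
        return []
    if count == 1:  # step 1301
        return [Action.NOTIFY, Action.RECORD]
    if count == 2:  # step 1302
        actions = [Action.NOTIFY]
        if load_is_high:
            actions.append(Action.SCALE_OUT)
        return actions
    # Step 1303: expand first, then restart the node outside load balancing.
    return [
        Action.NOTIFY,
        Action.SCALE_OUT,
        Action.REMOVE_FROM_LB,
        Action.RESTART,
        Action.HEALTH_CHECK,
        Action.BRING_ONLINE,
    ]
```

Returning an ordered plan instead of executing side effects directly keeps the decision testable; in step 1303 the order matters, because capacity is added before the faulty node is removed from load balancing, and the node rejoins only after the health check.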

Traditional application operation and maintenance is based on an after-the-fact mechanism: the application is dealt with only after it has gone down and users' normal use has been affected. Handling faults this way can seriously harm the company's development and the user experience, and may even cost the company many users.

The invention aims to sense possible problems of a service system in advance by collecting application service logs and combining them with the corresponding service monitoring systems, and, without manual intervention, to automatically expand, take offline and restart, automatically check, and automatically bring back online a service that is about to have problems, thereby reducing both the avalanche effect in which a problem in one service brings down all services and the occurrence of situations in which the service cannot be provided externally.

In this method, a purpose-built alarm and alarm-processing control node receives alarms from alarm systems such as log alarms, SkyWalking, and Zabbix, and processes each alarm according to preset conditions. The ELK log collection system collects the front-end proxy logs and back-end service logs of the corresponding application services and stores and analyzes them; using the preset alarm mechanism as the trigger condition, it sends an alarm to the control node when the condition is met. The full-link monitoring system SkyWalking monitors the access time of the links between services and sends an alarm to the control node when the access time exceeds a preset time. Zabbix simulates user access to the services and judges whether the application is healthy from its response speed and the returned access status code; if a threshold is exceeded, corresponding alarm information is sent to the control node.
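
The health judgment made from a simulated user access can be illustrated with a short Python sketch. It assumes that the access code is an HTTP status code, that 2xx and 3xx codes count as success, and that "too slow" means more than twice the usual response time (the factor used in step 1103); the function name and parameters are hypothetical, and this is not Zabbix's own web-scenario logic.

```python
def is_healthy(status_code, response_seconds, usual_seconds, slow_factor=2.0):
    """Judge application health from one simulated user access.

    Assumptions (not from the source): 2xx/3xx status codes are treated as
    success, and usual_seconds is a baseline measured elsewhere.
    """
    code_ok = 200 <= status_code < 400
    fast_enough = response_seconds <= slow_factor * usual_seconds
    return code_ok and fast_enough
```

For example, with a usual response time of 0.2 seconds, a 200 response in 0.3 seconds is judged healthy, while a 200 response in 0.5 seconds or any 503 response is not.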

This patent intervenes as soon as the application first shows abnormal behavior, much as one starts taking precautions at the first signs of a cold in order to avoid falling ill as far as possible. By intervening early and handling the problem, it avoids the situation in which one failed application node triggers an avalanche effect and brings down the entire application. The availability of the application can therefore be greatly improved, user satisfaction increases, and the loss of users caused by application outages is reduced.

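The embodiment of the invention provides a system for predicting and self-healing a software application service fault, which comprises:

the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;

and the alarm control node is used for responding according to preset logic.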

Optionally, in this embodiment, the monitoring system is specifically configured to:

Optionally, in this embodiment, the number of the monitoring systems is three, and the alarm control node is specifically configured to:

The reader should understand that, in the description of this specification, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the modules and units in the above described system embodiment may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for predicting and self-healing a software application service fault is characterized by comprising the following steps:

the monitoring system monitors the service system;

when an alarm is triggered, the monitoring system sends the alarm to an alarm control node;

and the alarm control node responds according to preset logic.

2. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:

3. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:

4. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:

5. The method according to any one of claims 1 to 4, wherein the number of the monitoring systems is three, and the alarm control node responds according to a preset logic, specifically comprising:

6. A system for predicting and self-healing a software application service failure is characterized by comprising the following components:

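the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;

and the alarm control node is used for responding according to preset logic.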

7. The system of claim 6, wherein the monitoring system is specifically configured to:

8. The system of claim 6, wherein the monitoring system is specifically configured to:

9. The system of claim 6, wherein the monitoring system is specifically configured to:

10. The system according to any one of claims 6 to 9, wherein the number of monitoring systems is three, and the alarm control node is specifically configured to: