CN112749064A - Method and system for predicting and self-healing fault of software application service


Info

Publication number
CN112749064A
Authority
CN
China
Prior art keywords
service
alarm
monitoring
application
monitoring system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110082882.0A
Other languages
Chinese (zh)
Inventor
孙国良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Minglue Zhaohui Technology Co Ltd
Original Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Minglue Zhaohui Technology Co Ltd
Priority to CN202110082882.0A
Publication of CN112749064A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Abstract

The invention relates to a method and a system for predicting and self-healing software application service faults. The method comprises the following steps: a monitoring system monitors the service system; when an alarm is triggered, the monitoring system sends the alarm to an alarm control node; and the alarm control node responds according to preset logic. Through the monitoring system, the invention can detect potential problems in the service system in advance and, without manual intervention, raise an alarm and respond promptly to the service that is about to fail, thereby reducing both the avalanche effect in which one faulty service brings down all services and the occurrence of situations in which the service cannot be provided externally.

Description

Method and system for predicting and self-healing fault of software application service
Technical Field
The invention relates to the field of system fault processing, in particular to a method and a system for predicting and self-healing a software application service fault.
Background
In traditional application service operation and maintenance, manual intervention begins only after the application service has failed, by which time the business is already affected. If the back end has no circuit-breaking mechanism or service degradation, an avalanche effect can then make the service completely unavailable, with very serious consequences for the business.
Specifically, after the application service fails and the responsible person receives the alarm, the operation and maintenance engineer and the corresponding development engineer go online together and decide how to handle the problem based on the logs and the state of the service. During handling, the relevant engineer may be unable to get online, or may be unable to resolve the problem promptly because of an incomplete understanding of the business, so recovery of the application service takes too long and the service remains inaccessible for an extended period.
The prior art is therefore an after-the-fact mechanism: the problem is handled only after the application service has already failed, which affects users. In addition, the service may remain unusable for even longer if the responsible personnel cannot get online in time.
Disclosure of Invention
In order to solve the technical problems, the invention provides a method and a system for software application service fault prediction and fault self-healing.
The technical scheme for solving the technical problems is as follows:
a method for predicting and self-healing a software application service fault comprises the following steps:
the monitoring system monitors the service system;
when an alarm is triggered, the monitoring system sends the alarm to an alarm control node;
and the alarm control node responds according to preset logic.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the monitoring system monitors the service system, and specifically includes:
sending the application logs collected by Filebeat to Logstash, preprocessing them in Logstash, forwarding them through Kafka to Elasticsearch, and monitoring user-defined key error strings with ElastAlert.
Further, the monitoring system monitors the service system, and specifically includes:
monitoring the internal call links of the service through the SkyWalking full-link monitoring system.
Further, the monitoring system monitors the service system, and specifically includes:
monitoring the corresponding service by simulating user access with Zabbix.
Further, the number of the monitoring systems is three, and the alarm control node responds according to a preset logic, specifically including:
if only one monitoring system raises an alarm in a given time period, sending the alarm to the corresponding responsible person and recording the alarm;
if two monitoring systems raise alarms in the same time period, sending the alarms to the corresponding responsible person, checking the load of the corresponding application system, and, if the load is high, automatically scaling out the application nodes;
if all three monitoring systems raise alarms in the same time period, sending the alarms to the corresponding responsible person, automatically scaling out the application nodes, then automatically removing the application nodes from load balancing and restarting them, and bringing them back online after detecting that the service is normal.
In order to achieve the above object, the present invention further provides a system for predicting and self-healing a failure of a software application service, comprising:
the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;
and the alarm control node is used for responding according to preset logic.
Further, the monitoring system is specifically configured to:
send the application logs collected by Filebeat to Logstash, preprocess them in Logstash, forward them through Kafka to Elasticsearch, and monitor user-defined key error strings with ElastAlert.
Further, the monitoring system is specifically configured to:
monitor the internal call links of the service through the SkyWalking full-link monitoring system.
Further, the monitoring system is specifically configured to:
monitor the corresponding service by simulating user access with Zabbix.
Further, the number of the monitoring systems is three, and the alarm control node is specifically configured to:
if only one monitoring system raises an alarm in a given time period, send the alarm to the corresponding responsible person and record the alarm;
if two monitoring systems raise alarms in the same time period, send the alarms to the corresponding responsible person, check the load of the corresponding application system, and, if the load is high, automatically scale out the application nodes;
if all three monitoring systems raise alarms in the same time period, send the alarms to the corresponding responsible person, automatically scale out the application nodes, then automatically remove the application nodes from load balancing and restart them, and bring them back online after detecting that the service is normal.
The invention has the beneficial effects that:
the monitoring system detects potential problems in the service system in advance, and an alarm is raised and a response made promptly, without manual intervention, for the service that is about to fail. This reduces both the avalanche effect in which one faulty service brings down all services and the occurrence of situations in which the service cannot be provided externally.
Drawings
Fig. 1 is a flowchart of a method for predicting and self-healing a failure of a software application service according to an embodiment of the present invention;
Fig. 2 is a flowchart of another method for predicting and self-healing a failure of a software application service according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
As the business grows, more and more online services are deployed, releases become more frequent, and traffic increases sharply. An application may crash at any time. By combining logs with the monitoring systems, this possibility can be detected in advance and eliminated at an early stage. Even if some applications cannot provide service normally for some reason, the invention can detect this in advance and take those applications offline and restart them, minimizing the scope of impact. This keeps the service stable, reduces application service faults, and improves the availability of user-facing applications.
Fig. 1 is a flowchart of a method for predicting and self-healing a software application service failure according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
110. the monitoring system monitors the service system;
optionally, in this embodiment, step 110 specifically includes:
1101. sending the application logs collected by Filebeat to Logstash, preprocessing them in Logstash, forwarding them through Kafka to Elasticsearch, and monitoring user-defined key error strings with ElastAlert;
1102. monitoring the internal call links of the service through the SkyWalking full-link monitoring system;
1103. monitoring the corresponding service by simulating user access with Zabbix.
120. When an alarm is triggered, the monitoring system sends the alarm to an alarm control node;
Specifically, when the monitoring system performs step 1101, if 5 key error strings (for example, 4xx and 5xx status codes logged by Nginx) occur within 3 minutes, an alarm is sent to the alarm control node.
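The count-within-a-window rule for step 1101 can be illustrated with a minimal sketch. It is not the ElastAlert implementation; the class name `KeyErrorWindow` and its parameters are hypothetical, and timestamps are assumed to be plain seconds supplied by the caller.

```python
from collections import deque


class KeyErrorWindow:
    """Counts key error events inside a sliding time window (hypothetical helper)."""

    def __init__(self, max_events=5, window_seconds=180):
        # Defaults mirror the example in the text: 5 events within 3 minutes.
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one key error event; return True if an alarm should be sent."""
        self._timestamps.append(timestamp)
        # Discard events that are older than the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.max_events
```

With the default parameters, events recorded at 0, 30, 60, 90 and 120 seconds make the fifth call return True, while events spaced 100 seconds apart never reach the limit because older events drop out of the window.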
When the monitoring system performs step 1102, if the response time of a call link inside the service exceeds 1 second and this persists for 3 minutes, the corresponding alarm is sent to the alarm control node.
When the monitoring system performs step 1103, if the access time is more than 2 times the usual access response time of the service and this persists for 3 minutes, an alarm is sent to the alarm control node.
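The two duration-based conditions for steps 1102 and 1103 share one pattern: a measured value stays above a limit continuously for 3 minutes. A minimal sketch of that pattern follows. The class name `SustainedBreach` and the baseline value are hypothetical, and "2 times the usual response time" is interpreted here as a limit of twice a known baseline.

```python
class SustainedBreach:
    """Fires when a value stays above a limit for a continuous duration (hypothetical helper)."""

    def __init__(self, limit, hold_seconds=180):
        self.limit = limit
        self.hold_seconds = hold_seconds
        self._breach_start = None  # time at which the current breach began

    def observe(self, timestamp, value):
        """Feed one measurement; return True if an alarm should be sent."""
        if value <= self.limit:
            # Back within the limit: the breach is over, start again.
            self._breach_start = None
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        return timestamp - self._breach_start >= self.hold_seconds


# Step 1102: an internal call link slower than 1 second for 3 minutes.
link_check = SustainedBreach(limit=1.0)

# Step 1103: simulated access slower than twice the usual response time.
usual_response_seconds = 0.4  # assumed baseline, not given in the text
access_check = SustainedBreach(limit=2 * usual_response_seconds)
```

A single slow measurement does not fire the alarm, and any measurement back within the limit resets the timer, so only a continuous slowdown is reported.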
130. And the alarm control node responds according to preset logic.
Specifically, in this step, the alarm control node determines, through a preset algorithm, whether the node needs to be taken offline and restarted.
As shown in Fig. 2, based on the three monitoring systems in the foregoing process, step 130 specifically includes:
1301. if only one monitoring system gives an alarm in the same time period, sending the alarm to a corresponding responsible person and recording the alarm;
1302. if two monitoring systems raise alarms in the same time period, sending the alarms to the corresponding responsible person, checking the load of the corresponding application system, and, if the load is high, automatically scaling out the application nodes;
In step 1302, while the application nodes are being scaled out automatically, an alarm may also be sent asking the responsible person whether to intervene manually and restart the corresponding application.
1303. if all three monitoring systems raise alarms in the same time period, sending the alarms to the corresponding responsible person, automatically scaling out the application nodes, then automatically removing the application nodes from load balancing and restarting them, and bringing them back online after detecting that the service is normal.
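The three-level response of steps 1301 to 1303 can be summarized as a small decision function. This is only a sketch under stated assumptions: the function name `plan_response` and the action labels are hypothetical, and the load check is reduced to a boolean supplied by the caller.

```python
def plan_response(alarming_systems, load_is_high):
    """Return the ordered actions for one time period.

    alarming_systems: names of the monitoring systems that raised an alarm.
    load_is_high: result of the load check on the application system.
    """
    count = len(set(alarming_systems))
    if count == 0:
        return []
    actions = ["notify_owner"]  # every level alerts the responsible person
    if count == 1:
        actions.append("record_alarm")
    elif count == 2:
        actions.append("check_load")
        if load_is_high:
            actions.append("scale_out")
    else:
        # Three systems: scale out first, then cycle the affected node.
        actions += ["scale_out", "remove_from_load_balancer",
                    "restart_node", "verify_service", "bring_online"]
    return actions
```

For example, `plan_response({"log", "trace"}, load_is_high=True)` returns `["notify_owner", "check_load", "scale_out"]`. Returning a plan rather than executing it keeps the decision logic easy to test separately from the scaling and restart tooling.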
Traditional application operation and maintenance is based on an after-the-fact mechanism: the application is dealt with only once it has gone down and normal use by users is already affected. This can seriously harm both the company's development and the user experience, and may even cost the company many users.
The invention aims to detect potential problems in a service system in advance by collecting application service logs and combining them with the corresponding service monitoring systems, and, without manual intervention, to automatically scale out, take offline and restart, automatically check, and automatically bring back online the service that is about to fail. This reduces both the avalanche effect in which one faulty service brings down all services and the occurrence of situations in which the service cannot be provided externally.
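The offline restart, check and re-online sequence could be sketched as follows. The specification does not name a load balancer or restart tool, so the four operations are passed in as caller-supplied functions; `restart_cycle` and all parameter names are hypothetical.

```python
import time


def restart_cycle(node, remove, restart, is_healthy, add_back,
                  retries=5, wait_seconds=1.0, sleep=time.sleep):
    """Take a node out of rotation, restart it, and re-add it once healthy.

    remove, restart, is_healthy and add_back are caller-supplied functions
    that wrap whatever load balancer and process tooling is in use.
    Returns True if the node was brought back online, False otherwise.
    """
    remove(node)    # stop routing traffic to the node
    restart(node)   # restart the application on the node
    for _ in range(retries):
        if is_healthy(node):
            add_back(node)  # service is normal again: bring it online
            return True
        sleep(wait_seconds)
    return False  # node stays offline; a person should take a look
```

The node is returned to load balancing only after a health check succeeds. If every check fails, the node stays out of rotation, so a node that is still broken does not receive traffic again.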
The invention receives alarms from alarm sources such as log alerts, SkyWalking, and Zabbix through a purpose-built alarm-processing control node, which handles each alarm according to preset conditions. The ELK log collection system collects the front-end proxy logs and back-end service logs of the corresponding application services, then stores and analyzes them; when the trigger condition defined by the preset alerting mechanism is met, an alarm is sent to the control node. The full-link monitoring system SkyWalking monitors the access time of links between services and sends an alarm to the control node when the access time exceeds a preset value. Zabbix simulates user access to the service and judges from the response speed and the returned status code whether the application is healthy; if the threshold is exceeded, the corresponding alarm information is sent to the control node.
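To act on alarms from several sources, the control node has to decide which alarms fall in the same time period. The specification does not define that period, so the sketch below assumes fixed 3-minute buckets; the function name `systems_alarming_together` and the source labels are hypothetical.

```python
def systems_alarming_together(alarms, period_seconds=180):
    """Group alarms into fixed time periods and list the distinct sources in each.

    alarms: iterable of (timestamp_seconds, source_name) tuples.
    Returns a dict mapping period index to the set of sources seen in it.
    """
    periods = {}
    for timestamp, source in alarms:
        index = int(timestamp // period_seconds)  # which bucket the alarm falls in
        periods.setdefault(index, set()).add(source)
    return periods


alarms = [(10, "log"), (50, "trace"), (170, "log"), (200, "probe")]
grouped = systems_alarming_together(alarms)
# grouped == {0: {"log", "trace"}, 1: {"probe"}}
```

Counting distinct sources rather than raw alarms means that repeated alerts from one monitoring system are still treated as a single system alarming. A sliding window would be a reasonable alternative if alarms close to a bucket boundary need to be correlated.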
The invention intervenes as soon as the application first shows abnormal behavior, much as treating the first signs of a cold can keep it from developing further. Intervening early and handling the problem prevents one failed application node from triggering an avalanche effect that brings down the whole application. This can greatly improve application availability and user satisfaction, and reduces the loss of users caused by application outages.
An embodiment of the invention provides a system for predicting and self-healing a software application service fault, which comprises:
the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;
and the alarm control node is used for responding according to preset logic.
Optionally, in this embodiment, the monitoring system is specifically configured to:
send the application logs collected by Filebeat to Logstash, preprocess them in Logstash, forward them through Kafka to Elasticsearch, and monitor user-defined key error strings with ElastAlert.
Optionally, in this embodiment, the monitoring system is specifically configured to:
monitor the internal call links of the service through the SkyWalking full-link monitoring system.
Optionally, in this embodiment, the monitoring system is specifically configured to:
monitor the corresponding service by simulating user access with Zabbix.
Optionally, in this embodiment, the number of the monitoring systems is three, and the alarm control node is specifically configured to:
if only one monitoring system raises an alarm in a given time period, send the alarm to the corresponding responsible person and record the alarm;
if two monitoring systems raise alarms in the same time period, send the alarms to the corresponding responsible person, check the load of the corresponding application system, and, if the load is high, automatically scale out the application nodes;
if all three monitoring systems raise alarms in the same time period, send the alarms to the corresponding responsible person, automatically scale out the application nodes, then automatically remove the application nodes from load balancing and restart them, and bring them back online after detecting that the service is normal.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the modules and units in the above described system embodiment may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for predicting and self-healing a software application service fault is characterized by comprising the following steps:
the monitoring system monitors the service system;
when an alarm is triggered, the monitoring system sends the alarm to an alarm control node;
and the alarm control node responds according to preset logic.
2. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:
sending the application logs collected by Filebeat to Logstash, preprocessing them in Logstash, forwarding them through Kafka to Elasticsearch, and monitoring user-defined key error strings with ElastAlert.
3. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:
monitoring the internal call links of the service through the SkyWalking full-link monitoring system.
4. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:
monitoring the corresponding service by simulating user access with Zabbix.
5. The method according to any one of claims 1 to 4, wherein the number of the monitoring systems is three, and the alarm control node responds according to a preset logic, specifically comprising:
if only one monitoring system raises an alarm in a given time period, sending the alarm to the corresponding responsible person and recording the alarm;
if two monitoring systems raise alarms in the same time period, sending the alarms to the corresponding responsible person, checking the load of the corresponding application system, and, if the load is high, automatically scaling out the application nodes;
if all three monitoring systems raise alarms in the same time period, sending the alarms to the corresponding responsible person, automatically scaling out the application nodes, then automatically removing the application nodes from load balancing and restarting them, and bringing them back online after detecting that the service is normal.
6. A system for predicting and self-healing a software application service failure is characterized by comprising the following components:
the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;
and the alarm control node is used for responding according to preset logic.
7. The system of claim 6, wherein the monitoring system is specifically configured to:
send the application logs collected by Filebeat to Logstash, preprocess them in Logstash, forward them through Kafka to Elasticsearch, and monitor user-defined key error strings with ElastAlert.
8. The system of claim 6, wherein the monitoring system is specifically configured to:
monitor the internal call links of the service through the SkyWalking full-link monitoring system.
9. The system of claim 6, wherein the monitoring system is specifically configured to:
monitor the corresponding service by simulating user access with Zabbix.
10. The system according to any one of claims 6 to 9, wherein the number of monitoring systems is three, and the alarm control node is specifically configured to:
if only one monitoring system raises an alarm in a given time period, send the alarm to the corresponding responsible person and record the alarm;
if two monitoring systems raise alarms in the same time period, send the alarms to the corresponding responsible person, check the load of the corresponding application system, and, if the load is high, automatically scale out the application nodes;
if all three monitoring systems raise alarms in the same time period, send the alarms to the corresponding responsible person, automatically scale out the application nodes, then automatically remove the application nodes from load balancing and restart them, and bring them back online after detecting that the service is normal.
CN202110082882.0A 2021-01-21 2021-01-21 Method and system for predicting and self-healing fault of software application service Pending CN112749064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110082882.0A CN112749064A (en) 2021-01-21 2021-01-21 Method and system for predicting and self-healing fault of software application service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110082882.0A CN112749064A (en) 2021-01-21 2021-01-21 Method and system for predicting and self-healing fault of software application service

Publications (1)

Publication Number Publication Date
CN112749064A true CN112749064A (en) 2021-05-04

Family

ID=75652810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110082882.0A Pending CN112749064A (en) 2021-01-21 2021-01-21 Method and system for predicting and self-healing fault of software application service

Country Status (1)

Country Link
CN (1) CN112749064A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030037288A1 (en) * 2001-08-15 2003-02-20 International Business Machines Corporation Method and system for reduction of service costs by discrimination between software and hardware induced outages
CN102981943A (en) * 2012-10-29 2013-03-20 新浪技术(中国)有限公司 Method and system for monitoring application logs
CN107819614A (en) * 2017-10-27 2018-03-20 中航信移动科技有限公司 Application monitoring system and method based on analog subscriber request
CN110750426A (en) * 2019-10-30 2020-02-04 北京明朝万达科技股份有限公司 Service state monitoring method and device, electronic equipment and readable storage medium
CN111752795A (en) * 2020-06-18 2020-10-09 多加网络科技(北京)有限公司 Full-process monitoring alarm platform and method thereof

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312202A (en) * 2021-07-29 2021-08-27 太平金融科技服务(上海)有限公司 Fault processing logic generation method, device, equipment and medium based on component
CN113312202B (en) * 2021-07-29 2021-11-12 太平金融科技服务(上海)有限公司 Fault processing logic generation method, device, equipment and medium based on component
CN115396291A (en) * 2022-08-23 2022-11-25 度小满科技(北京)有限公司 Redis cluster fault self-healing method based on kubernets trustees

Similar Documents

Publication Publication Date Title
CN107179957B (en) Physical machine fault classification processing method and device and virtual machine recovery method and system
CN112749064A (en) Method and system for predicting and self-healing fault of software application service
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
CN110677480B (en) Node health management method and device and computer readable storage medium
CN101296135A (en) Fault information processing method and device
CN103812675A (en) Method and system for realizing allopatric disaster recovery switching of service delivery platform
CN113282635A (en) Micro-service system fault root cause positioning method and device
US7278048B2 (en) Method, system and computer program product for improving system reliability
CN110727533A (en) Alarm method, device, equipment and medium
CN101488881A (en) A fault processing method
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
CN111142801B (en) Distributed storage system network sub-health detection method and device
CN108243031B (en) Method and device for realizing dual-computer hot standby
CN101102217A (en) Processing method for duplicate alert and discontinuous reporting and monitoring in telecom network management system
CN115202958A (en) Power abnormity monitoring method and device, electronic equipment and storage medium
Koutras et al. Semi-Markov availability modeling of a redundant system with partial and full rejuvenation actions
CN115632706B (en) FC link management method, device, equipment and readable storage medium
CN113541982B (en) Health early warning method and device for network element, computing equipment and computer storage medium
CN103684862A (en) Alarm information processing method, device and system and equipment
CN113064798A (en) Exception handling method and device, electronic equipment and system
CN113157493A (en) Backup method, device and system based on ticket checking system and computer equipment
CN111309504A (en) Control method for embedded module serial port redundant transmission and related components
CN110795263B (en) Hard disk link protection method and related device
CN115426247B (en) Fault node processing method and device, storage medium and electronic equipment
CN112486765B (en) Java application interface management method, system and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination