CN112749064A - Method and system for predicting and self-healing fault of software application service - Google Patents
- Publication number
- CN112749064A (application CN202110082882.0A)
- Authority
- CN
- China
- Prior art keywords
- service
- alarm
- monitoring
- application
- monitoring system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/302—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/32—Monitoring with visual or acoustical indication of the functioning of the machine
- G06F11/324—Display of status information
- G06F11/327—Alarm or error message display
Abstract
The invention relates to a method and a system for predicting and self-healing software application service faults. The method comprises the following steps: a monitoring system monitors a service system; when an alarm is triggered, the monitoring system sends the alarm to an alarm control node; and the alarm control node responds according to preset logic. Through the monitoring system, the invention can sense likely problems in the service system in advance and, without manual intervention, raise an alarm and respond promptly to the service that is about to fail. This reduces the avalanche effect, in which a problem in one service brings down all services, and reduces cases in which the service cannot be provided externally.
Description
Technical Field
The invention relates to the field of system fault processing, in particular to a method and a system for predicting and self-healing a software application service fault.
Background
In traditional application service operation and maintenance, manual intervention to handle a fault begins only after the application service has failed, by which time the business is already affected. If the back end has no circuit-breaking mechanism and no service degradation, the avalanche effect can make the service completely unavailable, with very serious consequences for the business.
Specifically, after the application service fails and the responsible person receives the alarm, the operation and maintenance engineer and the corresponding development engineer go online together and decide how to handle the problem based on the logs and the state of the service. During handling, the responsible engineer may be unable to get online, or may be unable to resolve the problem promptly because of an incomplete understanding of the business, so the application service takes too long to recover and remains inaccessible for a long time.
The prior art is an after-the-fact reaction mechanism: the problem is handled only after the application service has already failed, which affects users. The service may also remain unusable for even longer when the handling personnel cannot get online in time.
Disclosure of Invention
In order to solve the technical problems, the invention provides a method and a system for software application service fault prediction and fault self-healing.
The technical scheme for solving the technical problems is as follows:
a method for predicting and self-healing a software application service fault comprises the following steps:
the monitoring system monitors the service system;
when an alarm is triggered, the monitoring system sends the alarm to an alarm control node;
and the alarm control node responds according to preset logic.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the monitoring system monitors the service system, and specifically includes:
passing the application logs collected by Filebeat to Logstash, preprocessing them in Logstash, forwarding them through Kafka to Elasticsearch, and monitoring for user-defined key error strings with ElastAlert.
Further, the monitoring system monitors the service system, and specifically includes:
monitoring the internal call links of the service through the SkyWalking full-link monitoring system.
Further, the monitoring system monitors the service system, and specifically includes:
monitoring the corresponding service through simulated user access performed by Zabbix.
Further, the number of the monitoring systems is three, and the alarm control node responds according to a preset logic, specifically including:
if only one monitoring system raises an alarm within a given time period, sending the alarm to the responsible person and recording it;
if two monitoring systems raise alarms within the same time period, sending alarms to the responsible person and checking the load of the corresponding application system, and, if the load is high, automatically scaling out the application nodes;
if all three monitoring systems raise alarms within the same time period, sending alarms to the responsible person, automatically scaling out the application nodes, then automatically removing the application nodes from the load balancer and restarting them, and bringing them back online once the service is detected to be normal.
In order to achieve the above object, the present invention further provides a system for predicting and self-healing a failure of a software application service, comprising:
the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;
and the alarm control node is used for responding according to preset logic.
Further, the monitoring system is specifically configured to:
pass the application logs collected by Filebeat to Logstash, preprocess them in Logstash, forward them through Kafka to Elasticsearch, and monitor for user-defined key error strings with ElastAlert.
Further, the monitoring system is specifically configured to:
monitor the internal call links of the service through the SkyWalking full-link monitoring system.
Further, the monitoring system is specifically configured to:
monitor the corresponding service through simulated user access performed by Zabbix.
Further, the number of the monitoring systems is three, and the alarm control node is specifically configured to:
if only one monitoring system raises an alarm within a given time period, send the alarm to the responsible person and record it;
if two monitoring systems raise alarms within the same time period, send alarms to the responsible person and check the load of the corresponding application system, and, if the load is high, automatically scale out the application nodes;
if all three monitoring systems raise alarms within the same time period, send alarms to the responsible person, automatically scale out the application nodes, then automatically remove the application nodes from the load balancer and restart them, and bring them back online once the service is detected to be normal.
The invention has the beneficial effects that:
the monitoring system senses likely problems in the service system in advance and, without manual intervention, raises an alarm and responds promptly to the service that is about to fail, which reduces the avalanche effect in which a problem in one service brings down all services and reduces cases in which the service cannot be provided externally.
Drawings
Fig. 1 is a flowchart of a method for predicting and self-healing a failure of a software application service according to an embodiment of the present invention;
Fig. 2 is a flowchart of another method for predicting and self-healing a failure of a software application service according to an embodiment of the present invention.
Detailed Description
The principles and features of the invention are described below with reference to the drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
As the business grows, more and more online services are deployed, releases become frequent, and access volume rises sharply. An application may crash at any time. By combining logs with the monitoring systems, this risk can be sensed in advance and nipped in the bud. Even when some applications cannot serve normally for some reason, the invention can detect this early, take those applications offline and restart them, and so minimize the scope of impact. This keeps the service stable, reduces application service faults, and improves the availability of the applications that serve users.
Fig. 1 is a flowchart of a method for predicting and self-healing a software application service failure according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
110. the monitoring system monitors the service system;
optionally, in this embodiment, step 110 specifically includes:
1101. passing the application logs collected by Filebeat to Logstash, preprocessing them in Logstash, forwarding them through Kafka to Elasticsearch, and monitoring for user-defined key error strings with ElastAlert;
1102. monitoring the internal call links of the service through the SkyWalking full-link monitoring system;
1103. monitoring the corresponding service through simulated user access performed by Zabbix.
120. When an alarm is triggered, the monitoring system sends the alarm to an alarm control node;
Specifically, when the monitoring system performs step 1101, if 5 key error strings (for example, Nginx 4xx and 5xx status codes) occur within 3 minutes, an alarm is sent to the alarm control node.
When the monitoring system performs step 1102, if an internal call link of the service takes longer than 1 second and this persists for 3 minutes, the corresponding alarm is sent to the alarm control node.
When the monitoring system performs step 1103, if the access time is more than twice the usual response time of the service and this persists for 3 minutes, an alarm is sent to the alarm control node.
130. And the alarm control node responds according to preset logic.
Specifically, in this step, the alarm control node determines, through a preset algorithm, whether the node needs to be taken offline and restarted.
As shown in fig. 2, based on the three monitoring systems described above, step 130 specifically includes:
1301. if only one monitoring system raises an alarm within a given time period, sending the alarm to the responsible person and recording it;
1302. if two monitoring systems raise alarms within the same time period, sending alarms to the responsible person and checking the load of the corresponding application system, and, if the load is high, automatically scaling out the application nodes;
In step 1302, while the application nodes are being scaled out automatically, an alarm may also be sent asking the responsible person whether to intervene manually through the program and restart the corresponding application.
1303. if all three monitoring systems raise alarms within the same time period, sending alarms to the responsible person, automatically scaling out the application nodes, then automatically removing the application nodes from the load balancer and restarting them, and bringing them back online once the service is detected to be normal.
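The three-level response in step 130 can be illustrated with a minimal Python sketch. It is not the patented implementation: the source labels (`"log"`, `"trace"`, `"probe"`), the `Alarm` and `Action` types, and the 180-second window are assumptions made for illustration, since the description only says "the same time period" without giving its length.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    NOTIFY_AND_RECORD = "notify_and_record"                    # step 1301
    NOTIFY_AND_SCALE_IF_LOADED = "notify_and_scale_if_loaded"  # step 1302
    NOTIFY_SCALE_AND_RESTART = "notify_scale_and_restart"      # step 1303


@dataclass(frozen=True)
class Alarm:
    source: str       # hypothetical label for the monitor that raised it
    timestamp: float  # seconds


def decide(alarms, now, window_seconds=180.0):
    """Map the number of distinct alarming monitors in the window to an action.

    window_seconds is an assumed value; the patent does not specify it.
    """
    recent_sources = {
        a.source for a in alarms if 0 <= now - a.timestamp <= window_seconds
    }
    count = len(recent_sources)
    if count == 0:
        return Action.NONE
    if count == 1:
        return Action.NOTIFY_AND_RECORD
    if count == 2:
        return Action.NOTIFY_AND_SCALE_IF_LOADED
    return Action.NOTIFY_SCALE_AND_RESTART
```

Counting distinct sources rather than raw alarms means that one noisy monitor firing repeatedly still maps to step 1301, which matches the wording "only one monitoring system gives an alarm".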
Traditional application operation and maintenance relies on an after-the-fact mechanism: the application is dealt with only after it has gone down and normal use has already been affected. Handling faults this way can seriously harm the company's development and the user experience, and may even cost the company many users.
The invention aims to sense likely problems in the service system in advance by collecting application service logs and combining them with the corresponding service monitoring systems. Without manual intervention, the service that is about to fail is automatically scaled out, taken offline and restarted, checked, and brought back online. This reduces the avalanche effect in which a problem in one service brings down all services, and reduces cases in which the service cannot be provided externally.
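The scale-out, offline restart, check, and return-to-service sequence can be sketched as follows. This is a hedged illustration only: `InMemoryCluster` and all of its methods are hypothetical stand-ins for whatever load balancer and orchestration layer a deployment uses, and `max_checks` is an assumed parameter. The description does not say how the health check is performed or what happens when it fails.

```python
class InMemoryCluster:
    """Hypothetical stand-in for a real load balancer and orchestrator."""

    def __init__(self, nodes):
        self.in_rotation = list(nodes)
        self.log = []

    def scale_out(self):
        new_node = f"node-{len(self.in_rotation) + 1}"
        self.in_rotation.append(new_node)
        self.log.append(("scale_out", new_node))
        return new_node

    def remove_from_lb(self, node):
        self.in_rotation.remove(node)
        self.log.append(("remove", node))

    def restart(self, node):
        self.log.append(("restart", node))

    def is_healthy(self, node):
        self.log.append(("check", node))
        return True  # a real check would probe the service

    def add_to_lb(self, node):
        self.in_rotation.append(node)
        self.log.append(("add", node))


def self_heal(cluster, node, max_checks=3):
    """Scale out first, then restart `node` outside the load balancer."""
    cluster.scale_out()            # add capacity before taking a node away
    cluster.remove_from_lb(node)   # stop routing traffic to the suspect node
    cluster.restart(node)
    for _ in range(max_checks):
        if cluster.is_healthy(node):
            cluster.add_to_lb(node)  # back online only after a passing check
            return True
    return False  # assumption: leave the node out of rotation for a human
```

Scaling out before removing the node follows the order given in the description, so serving capacity does not drop while the suspect node restarts.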
The invention uses a purpose-built alarm and alarm-processing control node that receives alarms from alarm systems such as log alarms, SkyWalking and Zabbix, and handles each alarm according to preset conditions. The ELK log collection system collects the front-end proxy logs and back-end service logs of the corresponding application services, and stores and analyzes them; a preset alarm mechanism defines the trigger condition, and an alarm is sent to the control node when that condition is met. The full-link monitoring system SkyWalking monitors the access time of the links between services and sends an alarm to the control node when the access time exceeds a preset duration. Zabbix simulates user access to the service and judges from the response speed and the returned response code whether the application is healthy; if a threshold is exceeded, the corresponding alarm information is sent to the control node.
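The two kinds of trigger condition used by the monitors, a count of errors within a window and a value that stays above a threshold for a sustained period, can be sketched in plain Python. The class names are hypothetical, and the sketch does not use the ElastAlert, SkyWalking, or Zabbix APIs; in practice these rules would be expressed in each tool's own alerting configuration. The defaults (5 events, 180 seconds) come from the embodiment described earlier.

```python
from collections import deque


class ErrorRateTrigger:
    """Fires when at least `limit` error events fall within `window` seconds."""

    def __init__(self, limit=5, window=180.0):
        self.limit = limit
        self.window = window
        self.events = deque()

    def record(self, timestamp):
        self.events.append(timestamp)
        # Drop events that have aged out of the sliding window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.limit


class SustainedTrigger:
    """Fires when a value stays above `threshold` for `duration` seconds."""

    def __init__(self, threshold, duration=180.0):
        self.threshold = threshold
        self.duration = duration
        self.breach_start = None

    def observe(self, timestamp, value):
        if value <= self.threshold:
            self.breach_start = None  # condition cleared, reset the timer
            return False
        if self.breach_start is None:
            self.breach_start = timestamp
        return timestamp - self.breach_start >= self.duration
```

Under these assumptions, `SustainedTrigger(threshold=1.0)` would model the 1-second call-link rule, and `SustainedTrigger(threshold=2 * baseline)` the simulated-access rule, where `baseline` is the usual response time measured by the operator.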
The invention intervenes as soon as an application starts to behave abnormally, much as one takes precautions at the first sign of a cold so that the cold can be avoided as far as possible. Intervening early and handling the application in advance avoids the avalanche effect in which one failed application node brings down the whole application. This greatly improves the availability of the application and user satisfaction, and reduces the loss of users caused by application outages.
An embodiment of the invention provides a system for predicting and self-healing a software application service fault, comprising:
the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;
and the alarm control node is used for responding according to preset logic.
Optionally, in this embodiment, the monitoring system is specifically configured to:
pass the application logs collected by Filebeat to Logstash, preprocess them in Logstash, forward them through Kafka to Elasticsearch, and monitor for user-defined key error strings with ElastAlert.
Optionally, in this embodiment, the monitoring system is specifically configured to:
monitor the internal call links of the service through the SkyWalking full-link monitoring system.
Optionally, in this embodiment, the monitoring system is specifically configured to:
monitor the corresponding service through simulated user access performed by Zabbix.
Optionally, in this embodiment, the number of the monitoring systems is three, and the alarm control node is specifically configured to:
if only one monitoring system raises an alarm within a given time period, send the alarm to the responsible person and record it;
if two monitoring systems raise alarms within the same time period, send alarms to the responsible person and check the load of the corresponding application system, and, if the load is high, automatically scale out the application nodes;
if all three monitoring systems raise alarms within the same time period, send alarms to the responsible person, automatically scale out the application nodes, then automatically remove the application nodes from the load balancer and restart them, and bring them back online once the service is detected to be normal.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the modules and units in the above described system embodiment may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for predicting and self-healing a software application service fault is characterized by comprising the following steps:
the monitoring system monitors the service system;
when an alarm is triggered, the monitoring system sends the alarm to an alarm control node;
and the alarm control node responds according to preset logic.
2. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:
passing the application logs collected by Filebeat to Logstash, preprocessing them in Logstash, forwarding them through Kafka to Elasticsearch, and monitoring for user-defined key error strings with ElastAlert.
3. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:
monitoring the internal call links of the service through the SkyWalking full-link monitoring system.
4. The method according to claim 1, wherein the monitoring system monitors a service system, and specifically comprises:
monitoring the corresponding service through simulated user access performed by Zabbix.
5. The method according to any one of claims 1 to 4, wherein the number of the monitoring systems is three, and the alarm control node responds according to a preset logic, specifically comprising:
if only one monitoring system raises an alarm within a given time period, sending the alarm to the responsible person and recording it;
if two monitoring systems raise alarms within the same time period, sending alarms to the responsible person and checking the load of the corresponding application system, and, if the load is high, automatically scaling out the application nodes;
if all three monitoring systems raise alarms within the same time period, sending alarms to the responsible person, automatically scaling out the application nodes, then automatically removing the application nodes from the load balancer and restarting them, and bringing them back online once the service is detected to be normal.
6. A system for predicting and self-healing a software application service failure is characterized by comprising the following components:
the monitoring system is used for monitoring the service system and sending an alarm to the alarm control node when the alarm is triggered;
and the alarm control node is used for responding according to preset logic.
7. The system of claim 6, wherein the monitoring system is specifically configured to:
pass the application logs collected by Filebeat to Logstash, preprocess them in Logstash, forward them through Kafka to Elasticsearch, and monitor for user-defined key error strings with ElastAlert.
8. The system of claim 6, wherein the monitoring system is specifically configured to:
monitor the internal call links of the service through the SkyWalking full-link monitoring system.
9. The system of claim 6, wherein the monitoring system is specifically configured to:
monitor the corresponding service through simulated user access performed by Zabbix.
10. The system according to any one of claims 6 to 9, wherein the number of monitoring systems is three, and the alarm control node is specifically configured to:
if only one monitoring system raises an alarm within a given time period, send the alarm to the responsible person and record it;
if two monitoring systems raise alarms within the same time period, send alarms to the responsible person and check the load of the corresponding application system, and, if the load is high, automatically scale out the application nodes;
if all three monitoring systems raise alarms within the same time period, send alarms to the responsible person, automatically scale out the application nodes, then automatically remove the application nodes from the load balancer and restart them, and bring them back online once the service is detected to be normal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110082882.0A CN112749064A (en) | 2021-01-21 | 2021-01-21 | Method and system for predicting and self-healing fault of software application service |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112749064A (en) | 2021-05-04 |
Family
ID=75652810
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110082882.0A Pending CN112749064A (en) | 2021-01-21 | 2021-01-21 | Method and system for predicting and self-healing fault of software application service |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112749064A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030037288A1 (en) * | 2001-08-15 | 2003-02-20 | International Business Machines Corporation | Method and system for reduction of service costs by discrimination between software and hardware induced outages |
CN102981943A (en) * | 2012-10-29 | 2013-03-20 | 新浪技术(中国)有限公司 | Method and system for monitoring application logs |
CN107819614A (en) * | 2017-10-27 | 2018-03-20 | 中航信移动科技有限公司 | Application monitoring system and method based on analog subscriber request |
CN110750426A (en) * | 2019-10-30 | 2020-02-04 | 北京明朝万达科技股份有限公司 | Service state monitoring method and device, electronic equipment and readable storage medium |
CN111752795A (en) * | 2020-06-18 | 2020-10-09 | 多加网络科技(北京)有限公司 | Full-process monitoring alarm platform and method thereof |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113312202A (en) * | 2021-07-29 | 2021-08-27 | 太平金融科技服务(上海)有限公司 | Fault processing logic generation method, device, equipment and medium based on component |
CN113312202B (en) * | 2021-07-29 | 2021-11-12 | 太平金融科技服务(上海)有限公司 | Fault processing logic generation method, device, equipment and medium based on component |
CN115396291A (en) * | 2022-08-23 | 2022-11-25 | 度小满科技(北京)有限公司 | Redis cluster fault self-healing method based on kubernets trustees |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||