CN111581062A - Service fault processing method and server - Google Patents

Info

Publication number
CN111581062A
Authority
CN
China
Prior art keywords
monitoring data
service
fault
diagnosis
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010419919.XA
Other languages
Chinese (zh)
Inventor
赵贝
崔贺
矫恒浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Hisense Media Network Technology Co Ltd
Juhaokan Technology Co Ltd
Original Assignee
Qingdao Hisense Media Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Hisense Media Network Technology Co Ltd filed Critical Qingdao Hisense Media Network Technology Co Ltd
Priority to CN202010419919.XA priority Critical patent/CN111581062A/en
Publication of CN111581062A publication Critical patent/CN111581062A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3051Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/20Administration of product repair or maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a service fault processing method and a server, wherein service monitoring data are acquired, and comprise monitoring data of a container corresponding to a service and monitoring data of a physical machine bearing the container; performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm; and when the result of the fault prediction and diagnosis of the service monitoring data is determined to be that the pre-occurred fault exists, determining the positioning problem of the pre-occurred fault, and carrying the positioning problem in the early warning information to send the early warning information to the terminal equipment of the operation and maintenance personnel. The method and the device have the advantages that the pre-generated faults are predicted and positioned, the efficiency of positioning the pre-generated faults is improved, the positioning problems are carried in the early warning information and are sent to the terminal equipment of operation and maintenance personnel, the pre-generated faults are early warned, and the timeliness of finding the faults is improved.

Description

Service fault processing method and server
Technical Field
The present application relates to the technical field of servers, and in particular, to a service fault handling method and a server.
Background
The container cloud is a mainstream cloud computing mode at present, and has the advantages of high starting speed, low resource consumption and the like. The container cloud environment faces a great reliability challenge, and services in the container cloud environment may often fail for some reason, so that one service or even multiple services need to be executed again, and normal operation of the services is seriously affected.
In the prior art, when a service in a container cloud environment has a problem, a large amount of manpower, material resources and time is spent, after the fault has occurred, checking service-related information to locate the problem, and the fault is then resolved with a conventional fault handling method. Because faults are detected only after they have occurred, they cannot be discovered in time.
Disclosure of Invention
The application provides a service fault processing method and a server, so that early warning of service faults is achieved, and timeliness of finding the faults is improved.
In a first aspect, an embodiment of the present application provides a method for processing a service failure, including:
acquiring monitoring data of a service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container; performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm; and when the result of the fault prediction diagnosis of the monitoring data of the service is determined to be that a pre-occurred fault exists, determining the positioning problem of the pre-occurred fault, carrying the positioning problem in early warning information, and sending the early warning information to the terminal equipment of the operation and maintenance personnel.
In a second aspect, an embodiment of the present application provides a server, including:
the acquisition module is used for acquiring monitoring data of the service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container;
the prediction module is used for performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm;
and the processing module is used for determining the positioning problem of the pre-occurred fault when the result of the fault prediction and diagnosis of the service monitoring data is determined to be that the pre-occurred fault exists, and carrying the positioning problem in the early warning information to be sent to the terminal equipment of the operation and maintenance personnel.
In a third aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer program product, including: executable instructions for implementing the method as provided by the first aspect.
According to the service fault processing method and the server, the monitoring data of the service are acquired, wherein the monitoring data of the service comprise the monitoring data of a container corresponding to the service and the monitoring data of a physical machine bearing the container; performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm; and when the result of the fault prediction and diagnosis of the service monitoring data is determined to be that the pre-occurred fault exists, determining the positioning problem of the pre-occurred fault, and carrying the positioning problem in the early warning information to send the early warning information to the terminal equipment of the operation and maintenance personnel. In the embodiment of the application, the fault prediction diagnosis is carried out on the monitoring data of the service through the preset algorithm, and when the fault prediction diagnosis result of the monitoring data of the service is determined to be that the pre-occurred fault exists, the positioning problem of the pre-occurred fault is determined, the prediction and the positioning of the pre-occurred fault are realized, the efficiency of positioning the pre-occurred fault is improved, the positioning problem is carried in the early warning information and is sent to the terminal equipment of the operation and maintenance personnel, the early warning of the pre-occurred fault is realized, and the timeliness of the fault being discovered is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a diagram of an exemplary application scenario provided by an embodiment of the present application;
FIG. 2 is an exemplary application scenario architecture diagram provided by an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for handling a failure of a service provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a decision tree provided in an embodiment of the present application;
FIG. 5 is a flow chart illustrating a method for handling a failure of a service provided in another embodiment of the present application;
FIG. 6 is a flow chart illustrating a method for handling a failure of a service provided by another embodiment of the present application;
FIG. 7 is a flow chart illustrating a method for handling a failure of a service provided by yet another embodiment of the present application;
FIG. 8 is a block diagram of a server according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a server according to another embodiment of the present application;
FIG. 10 is a schematic structural diagram of a server according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The container cloud is a mainstream cloud computing mode at present, and has the advantages of high starting speed, low resource consumption and the like. The container cloud environment faces a great reliability challenge, and services in the container cloud environment may often fail for some reason, so that one service or even multiple services need to be executed again, and normal operation of the services is seriously affected. In the prior art, when a service in a container cloud environment has a problem, a large amount of manpower, material resources and time is spent, after the fault has occurred, checking service-related information to locate the problem, and the fault is then resolved with a conventional fault handling method. Because faults are detected only after they have occurred, they cannot be discovered in time.
The inventive concept of the service fault processing method and the server provided by the embodiments of the application is to acquire the monitoring data related to the service and to perform fault prediction diagnosis on that monitoring data with a decision tree algorithm. In this way, faults that may occur can be predicted, and the positioning problems of those faults can be sent to the terminal equipment of operation and maintenance personnel, reminding them to handle the positioning problems and improving the timeliness of fault discovery.
An exemplary application scenario of the embodiments of the present application is described below.
The fault handling method for a service provided by the embodiments of the present application may be executed by the server provided by the embodiments of the present application. Fig. 1 is an exemplary application scenario diagram provided by an embodiment of the present application, and fig. 2 is an exemplary application scenario architecture diagram provided by an embodiment of the present application. As shown in fig. 1 and fig. 2, the method may be applied to a server 11, and the server 11 may include a plurality of containers. A container is a set of interfaces located between an application or component and the server platform, which makes it convenient to deploy the application or component to a server to run; each service can be executed through one container. There is data communication between the server 11 and the terminal device 12, and the manner of communication is not limited in the embodiments of the present application. The specific types of the terminal device and the server are likewise not limited; for example, the terminal device may be a smart phone, a personal computer, a tablet computer, a wearable device, a vehicle-mounted terminal, and the like, and the server may be an application server, a Web server, a Web application server, and the like.
Fig. 3 is a schematic flowchart of a fault handling method for a service according to an embodiment of the present application, where the method may be executed by a server, and the following describes a fault handling method for a service by using the server as an execution subject, and as shown in fig. 3, the fault handling method for a service according to the embodiment of the present application may include:
Step S101: acquiring monitoring data of the service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container.
Different services may involve different data flows. For example, in the application scenario shown in fig. 2, a service may be a process executed by an application program or a component, and whether the service is failing can generally be judged from the monitoring data of the container corresponding to the service and the monitoring data of the physical machine bearing the container. In a possible implementation manner, the monitoring data of the container may include the number of requests of the container, the central processing unit (CPU) memory ratio of the container, the request status code distribution, the request time, the response time, the queries per second (QPS), and the like, and the monitoring data of the physical machine may include the CPU memory ratio of the physical machine, network card information of the physical machine, and the like.
In a possible implementation manner, log data from a past preset time period may be queried through a configured ES (Elasticsearch) address to obtain the request status code distribution, request time, response time, QPS, and the like. The resource occupation of the node where a service is located may be determined by querying the container and the physical machine corresponding to the service, for example, the CPU memory ratio of the container, the CPU memory ratio of the physical machine, network card information of the physical machine, and the like.
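As an illustration of this acquisition step, the following minimal Python sketch summarizes log records that have already been fetched for a past time window. The record layout (`status`, `response_time`), the function name `summarize_logs` and the window length are assumptions made for the example; the application does not specify them, and the query against the log store itself is left out.

```python
from collections import Counter


def summarize_logs(records, window_seconds):
    """Summarize hypothetical request-log records from a past time window.

    Each record is assumed to be a dict with a "status" code and a
    "response_time" in seconds; this layout is illustrative only.
    """
    total = len(records)
    if total == 0:
        # No traffic in the window: nothing to distribute or average.
        return {"status_distribution": {}, "avg_response_time": None, "qps": 0.0}
    status_counts = Counter(r["status"] for r in records)
    avg_response = sum(r["response_time"] for r in records) / total
    return {
        "status_distribution": dict(status_counts),
        "avg_response_time": avg_response,
        "qps": total / window_seconds,  # requests per second over the window
    }


records = [
    {"status": 200, "response_time": 0.12},
    {"status": 200, "response_time": 0.08},
    {"status": 500, "response_time": 1.30},
    {"status": 499, "response_time": 2.50},
]
summary = summarize_logs(records, window_seconds=2)
print(summary)
```

The resulting dictionary corresponds to the container-side monitoring data described above; resource figures for the container and the physical machine would be collected separately.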
Step S102: performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm.
The embodiment of the present application does not limit the specific algorithm of the preset algorithm, as long as the fault prediction diagnosis can be performed on the monitoring data of the service to obtain the result of the fault prediction diagnosis, for example, the preset algorithm may include algorithms such as a decision tree algorithm and a neural network algorithm, and may also be a combination of multiple algorithms. In the following, a preset algorithm is taken as an example of the decision tree algorithm.
For convenience of introduction, fig. 4 is a schematic structural diagram of a decision tree provided in this embodiment of the present application. The decision tree is a tree structure, which may be a binary tree or a non-binary tree; this is not limited in this embodiment of the present application. The tree structure comprises a root node, non-leaf nodes and leaf nodes. Fault prediction diagnosis of the monitoring data of the service using the decision tree algorithm starts from the root node and tests each item of monitoring data in turn, where each non-leaf node represents a judgment on one item of monitoring data and each leaf node represents one result of the fault prediction diagnosis. The specific type of the decision tree algorithm is not limited in the embodiment of the present application; for example, the decision tree algorithm may be CART (Classification and Regression Tree), CLS (Concept Learning System), or the like. In fig. 4, by way of example, the tree structure is a binary tree and the service monitoring data includes monitoring data A, monitoring data B, and monitoring data C. As shown in fig. 4, the root node may be the service monitoring data. Starting from the root node, monitoring data A is judged: if monitoring data A satisfies the preset condition of monitoring data A, non-leaf node 1 is entered, and if it does not, non-leaf node 2 is entered. Monitoring data B is then judged at non-leaf node 1 and at non-leaf node 2 respectively. The judgment at non-leaf node 1 is described below; the judgment of monitoring data B at non-leaf node 2 is similar and is not described again.
If monitoring data B satisfies the preset condition of monitoring data B, leaf node 1 is entered to obtain the result of the fault prediction diagnosis of the service monitoring data. If monitoring data B does not satisfy the preset condition of monitoring data B, non-leaf node 3 is entered and monitoring data C is judged there: if monitoring data C satisfies the preset condition of monitoring data C, leaf node 2 is entered to obtain the result of the fault prediction diagnosis of the service monitoring data, and if it does not, leaf node 3 is entered to obtain the result of the fault prediction diagnosis of the service monitoring data. The embodiments of the present application are merely examples, and are not limited thereto.
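The traversal described for fig. 4 can be sketched in Python as follows. This is only an illustration: the `Node` and `Leaf` classes, the boolean inputs standing for "satisfies the preset condition", and the subtree under non-leaf node 2 (which the text says is similar but does not spell out) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    result: str  # one result of the fault prediction diagnosis


@dataclass
class Node:
    name: str
    condition: Callable[[dict], bool]  # judgment on one item of monitoring data
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def diagnose(node, data):
    """Walk from the root until a leaf is reached and return its result."""
    while isinstance(node, Node):
        node = node.if_true if node.condition(data) else node.if_false
    return node.result


def meets(key):
    # In this sketch the input already records whether each item of
    # monitoring data satisfies its preset condition.
    return lambda data: bool(data[key])


non_leaf_3 = Node("non-leaf 3 (judges C)", meets("C"), Leaf("leaf 2"), Leaf("leaf 3"))
non_leaf_1 = Node("non-leaf 1 (judges B)", meets("B"), Leaf("leaf 1"), non_leaf_3)
# Assumed subtree: the text only says non-leaf node 2 is handled similarly.
non_leaf_2 = Node("non-leaf 2 (judges B)", meets("B"),
                  Leaf("leaf 4 (assumed)"), Leaf("leaf 5 (assumed)"))
root = Node("root (judges A)", meets("A"), non_leaf_1, non_leaf_2)

print(diagnose(root, {"A": True, "B": False, "C": True}))
```

Each leaf label here is a placeholder; in practice a leaf would carry a diagnosis such as "no pre-occurred fault" or a specific positioning problem.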
Step S103: when the result of the fault prediction diagnosis of the monitoring data of the service is determined to be that a pre-occurred fault exists, determining the positioning problem of the pre-occurred fault, carrying the positioning problem in early warning information, and sending the early warning information to the terminal equipment of the operation and maintenance personnel.
The results of the fault predictive diagnosis of the monitored data of the service may include the presence or absence of a pre-occurred fault. In a possible implementation manner, if the result of the fault prediction diagnosis is that no pre-occurred fault exists, the notification information may not be sent to the terminal device of the operation and maintenance personnel, so as to save resources; the notification information may also be sent to the terminal device of the operation and maintenance staff, so that the operation and maintenance staff can know the operation status of the service, for example, the failure prediction diagnosis result of the monitoring data of the service is sent to the terminal device of the operation and maintenance staff at regular time, which is not limited in this embodiment of the application.
In another possible implementation manner, if the result of the fault prediction diagnosis is that a pre-occurred fault exists, the positioning problem of the pre-occurred fault is determined, which improves the accuracy and timeliness of determining the positioning problem. The positioning problem is carried in the early warning information and sent to the terminal device of the operation and maintenance personnel. For example, if the positioning problem of the pre-occurred fault is that the network card of the physical machine may fail, the information that the network card of the physical machine may fail is carried in the early warning information and sent to the terminal device of the operation and maintenance personnel, so that they can take corresponding remedial measures before the fault occurs, avoiding the fault and the losses that a service failure would cause. The embodiment of the present application does not limit the specific manner of sending the early warning information to the terminal device of the operation and maintenance personnel; for example, it may be sent by mail, short message, push message, system prompt, and the like.
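A minimal Python sketch of this step is given below, assuming the diagnosis result is a dictionary with a `pre_fault` flag and a `positioning_problem` string, and that each notification channel is a callable. These names, the payload fields and the in-memory `outbox` standing in for real mail or short-message delivery are all hypothetical.

```python
def build_early_warning(service, diagnosis):
    """Return an early-warning payload, or None when no pre-occurred fault is predicted."""
    if not diagnosis.get("pre_fault"):
        return None
    return {
        "service": service,
        "type": "early_warning",
        # The positioning problem is carried inside the early warning information.
        "positioning_problem": diagnosis["positioning_problem"],
    }


def dispatch(alert, channels):
    """Pass the alert to every configured channel callable; return the channel names used."""
    used = []
    for name, send in channels.items():
        send(alert)
        used.append(name)
    return used


outbox = []  # stands in for real delivery to the terminal device
channels = {"mail": outbox.append, "sms": outbox.append}
diagnosis = {
    "pre_fault": True,
    "positioning_problem": "network card of the physical machine may fail",
}
alert = build_early_warning("demo-service", diagnosis)
if alert is not None:
    used = dispatch(alert, channels)
    print(used, alert["positioning_problem"])
```

Returning `None` when no pre-occurred fault is predicted matches the resource-saving option described above; a periodic status report would be a separate path.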
According to the embodiment of the application, the fault prediction diagnosis is carried out on the monitoring data of the service through the decision tree algorithm, when the fault prediction diagnosis result of the monitoring data of the service is determined to be that the pre-occurred fault exists, the positioning problem of the pre-occurred fault is determined, the prediction and the positioning of the pre-occurred fault are realized, the efficiency of positioning the pre-occurred fault is improved, the positioning problem is carried in the early warning information and is sent to the terminal equipment of operation and maintenance personnel, the early warning on the pre-occurred fault is realized, and the timeliness of the fault being discovered is further improved.
Fig. 5 is a schematic flowchart of a fault handling method for a service provided in another embodiment of the present application, and based on the foregoing embodiment, as shown in fig. 5, before step S102, the fault handling method for a service provided in the embodiment of the present application may further include:
Step S201: converting the monitoring data of the service to obtain array-type service monitoring data, wherein the array-type service monitoring data comprises the element values corresponding to the monitoring data.
The step S102 is to perform a fault prediction diagnosis on the monitoring data of the service through a preset algorithm, and may be implemented through the step S202.
Step S202: inputting the array-type service monitoring data into a decision tree algorithm, and performing fault prediction diagnosis on the service monitoring data to obtain a fault prediction diagnosis result.
Before the fault prediction diagnosis is carried out on the service monitoring data through the preset algorithm, the service monitoring data can be converted to obtain the array type service monitoring data, the standardized processing of the service monitoring data is realized, the array type service monitoring data is input into the decision tree algorithm, the fault prediction diagnosis is carried out on the service monitoring data, the fault prediction diagnosis result is obtained, and the reliability of the fault prediction diagnosis result is improved.
The embodiment of the present application does not limit the specific manner of converting the service monitoring data. For example, the service monitoring data may be converted according to its specific data type and the conditions that may cause a fault. For instance, if the service may be faulty when request status codes indicating a 499 exception or a 500 exception account for more than 50% of all request status codes, then the proportion of 499 and 500 exceptions in the request status code distribution may be used as the element value of the request status code distribution, or the element value may be determined by judging whether that proportion exceeds 50%. As another example, if a sudden increase in QPS may indicate a problem with the service, the element value of QPS is determined by judging whether a sudden increase in QPS has occurred. The embodiments of the present application are merely examples, and are not limited thereto.
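The two examples above can be sketched in Python as follows. The 50% figure comes from the text; the definition of a "sudden increase" (the current QPS exceeding a multiple of the recent average) and the `factor` parameter are assumptions, since the application does not define how a surge is detected.

```python
def abnormal_ratio(status_counts, abnormal_codes=(499, 500)):
    """Proportion of requests whose status code is 499 or 500."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    return sum(status_counts.get(code, 0) for code in abnormal_codes) / total


def qps_surged(history, current, factor=2.0):
    """Assumed surge test: current QPS exceeds `factor` times the recent average."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return current > factor * baseline


counts = {200: 4, 499: 3, 500: 3}
ratio = abnormal_ratio(counts)
# Either the ratio itself or the yes/no judgment can serve as the element value.
print(ratio, ratio > 0.5, qps_surged([100, 110, 90], 350))
```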
For different monitoring data of the service, different manners may be adopted to determine the element values corresponding to the monitoring data to obtain the array type of service monitoring data, and in a possible implementation manner, the conversion processing is performed on the monitoring data of the service to obtain the array type of service monitoring data, including: and if the service monitoring data comprises any one or more combinations of the container request number, the CPU memory ratio of the container and the CPU memory ratio of the physical machine, respectively using the container request number, the CPU memory ratio of the container or the CPU memory ratio of the physical machine as the respective corresponding element values of the request number, the CPU memory ratio of the container or the CPU memory ratio of the physical machine in the array type service monitoring data.
In yet another possible implementation, if the monitoring data of the service includes any one or a combination of the request status code distribution, request time, response time and queries per second (QPS), it is judged whether the proportion of abnormal requests in the request status code distribution exceeds a first preset threshold, whether the request time exceeds a second preset threshold, whether the response time exceeds a third preset threshold, or whether the QPS exceeds a fourth preset threshold. For each item of monitoring data that exceeds its corresponding preset threshold, the element value of that item in the array-type service monitoring data is set to a first numerical value; for each item that does not exceed its corresponding preset threshold, the element value of that item is set to a third numerical value.
In another possible implementation manner, if an item of the monitoring data of the service is empty, the element value corresponding to that empty item is set to a second numerical value.
The specific numerical values of the first preset threshold, the second preset threshold, the third preset threshold and the fourth preset threshold are not limited in the embodiment of the application. Taking the example that the monitoring data of the service includes the request time, the first value is 1, the second value is 2, and the third value is 0, if the request time exceeds the second preset threshold, the element value of the request time is set to 1, if the request time does not exceed the second preset threshold, the element value of the request time is set to 0, and if the request time is empty, the element value of the request time is set to 2.
The specific values of the first value, the second value and the third value are not limited in this application, and in a possible implementation manner, the first value, the second value and the third value are different values. For example, the first numerical value may be 3, the second numerical value may be 5, and the third numerical value may be 4, and still take the example that the monitoring data of the service includes the request time, if the request time exceeds the second preset threshold, the element value of the request time is set to 3, if the request time does not exceed the second preset threshold, the element value of the request time is set to 4, and if the request time is empty, the element value of the request time is set to 5.
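The conversion rules above can be sketched in Python as follows, using the example values 1, 2 and 0 for the first, second and third numerical values. The field names and the concrete thresholds are placeholders, since the application leaves the thresholds unspecified. Applying the second numerical value to empty raw-valued items as well is one reading of the text, and it means the placeholder can coincide with a genuine raw value.

```python
FIRST, SECOND, THIRD = 1, 2, 0  # exceeds threshold / empty / does not exceed

# Items copied into the array as their own values (hypothetical field names).
RAW_FIELDS = ("request_count", "container_cpu_mem_ratio", "host_cpu_mem_ratio")

# Items compared with a preset threshold (placeholder thresholds).
THRESHOLD_FIELDS = {
    "abnormal_status_ratio": 0.5,  # first preset threshold
    "request_time": 1.0,           # second preset threshold
    "response_time": 1.0,          # third preset threshold
    "qps": 1000.0,                 # fourth preset threshold
}


def to_array(data):
    """Convert a dict of monitoring data into array-type service monitoring data."""
    row = []
    for field in RAW_FIELDS:
        value = data.get(field)
        row.append(SECOND if value is None else value)
    for field, threshold in THRESHOLD_FIELDS.items():
        value = data.get(field)
        if value is None:
            row.append(SECOND)
        elif value > threshold:
            row.append(FIRST)
        else:
            row.append(THIRD)
    return row


sample = {
    "request_count": 120,
    "container_cpu_mem_ratio": 0.4,
    "host_cpu_mem_ratio": 0.7,
    "abnormal_status_ratio": 0.6,
    "request_time": 0.2,
    "response_time": None,  # empty monitoring data
    "qps": 1500,
}
print(to_array(sample))
```

Keeping the field order fixed matters here, because the decision tree addresses each item of monitoring data by its position in the array.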
In the embodiment of the application, converting the monitoring data of the service standardizes it into array-type service monitoring data, which improves the reliability of that data. The array-type service monitoring data is then input into the decision tree algorithm to perform fault prediction diagnosis on the service monitoring data, which in turn improves the reliability of the fault prediction diagnosis result.
Fig. 6 is a schematic flowchart of a fault handling method for a service provided in another embodiment of the present application, and based on any one of the embodiments shown in fig. 3 or fig. 5, as shown in fig. 6, before step S102, the fault handling method for a service provided in the embodiment of the present application may further include:
Step S301: acquire historical monitoring data of the service and historical fault diagnosis results of the historical monitoring data.
Step S302: train the preset algorithm with the historical monitoring data and the historical fault diagnosis results to generate the trained preset algorithm.
The step S102, that is, performing the fault prediction diagnosis on the monitoring data of the service through the preset algorithm, may be implemented through the step S303.
Step S303: perform fault prediction diagnosis on the service monitoring data through the trained preset algorithm, and determine the fault prediction diagnosis result of the service monitoring data.
In this embodiment of the present application, the historical monitoring data of the service may be the monitoring data collected within a preset time period, for example, the monitoring data of the service from the previous two years. This is only an example; the embodiment of the present application does not limit the specific length of the preset time period. The historical fault diagnosis result of the historical monitoring data may indicate that a fault exists or that no fault exists, and if a fault exists, the historical fault diagnosis result may also include the positioning problem of the fault.
After the historical monitoring data of the service and its historical fault diagnosis results are acquired, the preset algorithm is trained with them to generate the trained preset algorithm. The trained preset algorithm can then perform fault prediction diagnosis on input monitoring data of the service and output the fault prediction diagnosis result for that data.
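Steps S301 to S303 can be illustrated with a small sketch. It assumes that the preset algorithm is an ID3-style decision tree trained on array-type element values (0, 1 or 2); the application does not specify the splitting criterion, and the historical records, labels and function names below are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def train(rows, labels):
    """Build a tree: a leaf is a label; a node is (feature, {value: subtree}, default)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority
    best_gain, best_feature = 0.0, None
    for f in range(len(rows[0])):
        remainder = 0.0
        for v in set(r[f] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[f] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        gain = entropy(labels) - remainder
        if gain > best_gain + 1e-12:  # keep the first feature on ties
            best_gain, best_feature = gain, f
    if best_feature is None:  # no feature separates the labels any further
        return majority
    branches = {}
    for v in set(r[best_feature] for r in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best_feature] == v]
        branches[v] = train([r for r, _ in pairs], [l for _, l in pairs])
    return (best_feature, branches, majority)


def predict(tree, row):
    """Walk the tree; unseen element values fall back to the node's majority label."""
    while isinstance(tree, tuple):
        feature, branches, default = tree
        tree = branches.get(row[feature], default)
    return tree


NO_FAULT = "no fault"
NIC_FAULT = "network card of the physical machine may fail"

# Invented history: [abnormal ratio, request time, response time, QPS] element values.
history_rows = [
    [0, 0, 0, 0], [0, 1, 1, 0], [1, 1, 1, 0],
    [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1],
]
history_labels = [NO_FAULT, NIC_FAULT, NIC_FAULT, NO_FAULT, NO_FAULT, NO_FAULT]

tree = train(history_rows, history_labels)   # steps S301 and S302
diagnosis = predict(tree, [0, 1, 1, 1])      # step S303
```

In this toy history a fault is recorded only when both the request time and the response time exceed their thresholds, so the new sample above is diagnosed with the network card problem.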
When the result of the fault prediction and diagnosis of the service monitoring data is determined to be that the pre-occurred fault exists, the positioning problem of the pre-occurred fault can be determined, and the positioning problem is carried in the early warning information and is sent to the terminal equipment of the operation and maintenance personnel. In a possible implementation manner, the method for processing a fault of a service provided in an embodiment of the present application further includes:
determining a fault solution to the positioning problem; and acquiring an operation instruction corresponding to the fault solution so as to debug the physical machine according to the operation instruction.
Each positioning problem has at least one corresponding fault solution. For example, if the positioning problem is that a network card of a physical machine may fail, the corresponding fault solution may be to replace the network card of the physical machine, or to replace the physical machine carrying the container corresponding to the service. After the fault solution to the positioning problem is determined, an operation instruction corresponding to the fault solution may further be acquired, so that the physical machine is debugged according to the operation instruction. Still taking a possible network card failure of the physical machine as an example, the operation instruction corresponding to the fault solution may include an operation instruction on whether to replace the physical machine and/or an operation instruction on how to replace the physical machine, and the like.
In the embodiment of the application, the fault solution to the positioning problem is determined, the operation instruction corresponding to the fault solution is acquired, and the physical machine is debugged according to the operation instruction. In this way, the pre-occurred fault can be prevented from occurring, and the loss caused by a service fault is avoided.
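One simple way to picture the relationship between positioning problems, fault solutions and operation instructions is a pair of lookup tables. The sketch below is an assumption for illustration only: the application does not state how solutions or instructions are stored, and every entry and function name here is hypothetical.

```python
from typing import Dict, List

# Hypothetical catalogue: each positioning problem maps to one or more fault solutions.
FAULT_SOLUTIONS: Dict[str, List[str]] = {
    "network card of the physical machine may fail": [
        "replace the network card of the physical machine",
        "replace the physical machine carrying the container",
    ],
}

# Hypothetical operation instructions for each fault solution.
OPERATION_INSTRUCTIONS: Dict[str, List[str]] = {
    "replace the network card of the physical machine": [
        "migrate containers off the physical machine",
        "swap the network card",
        "return the physical machine to service",
    ],
    "replace the physical machine carrying the container": [
        "provision a replacement physical machine",
        "reschedule the container onto the replacement",
    ],
}


def operation_instructions(positioning_problem: str, choice: int = 0) -> List[str]:
    """Return the instructions for the chosen fault solution, or [] when none is known."""
    solutions = FAULT_SOLUTIONS.get(positioning_problem, [])
    if not 0 <= choice < len(solutions):
        return []
    return OPERATION_INSTRUCTIONS.get(solutions[choice], [])


steps = operation_instructions("network card of the physical machine may fail")
```

Returning an empty list for an unknown positioning problem leaves the decision to the operation and maintenance personnel instead of guessing at a remedy.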
Fig. 7 is a flowchart illustrating a method for processing a fault of a service provided in a further embodiment of the present application, where on the basis of any one of the embodiments shown in fig. 3 or fig. 5, as shown in fig. 7, the method for processing a fault of a service provided in an embodiment of the present application may further include, before step S102:
Step S401: acquire historical monitoring data of the service, historical fault diagnosis results of the historical monitoring data, and fault solutions to the positioning problems in the historical fault diagnosis results.
Step S402: train the preset algorithm with the historical monitoring data, the historical fault diagnosis results, and the fault solutions to the positioning problems in the historical fault diagnosis results to generate the trained preset algorithm.
The step S102, that is, performing the fault prediction diagnosis on the monitoring data of the service through the preset algorithm, may be implemented through the step S403.
Step S403: perform fault prediction diagnosis on the service monitoring data through the trained preset algorithm, and determine the fault prediction diagnosis result of the service monitoring data and a fault solution to the positioning problem in the fault prediction diagnosis result.
The difference between this embodiment and the embodiment shown in fig. 6 is that, in the process of training the decision tree algorithm, the fault solutions to the positioning problems in the historical fault diagnosis results are also taken into account. The specific training manner of this embodiment is similar to that of the embodiment shown in fig. 6 and is not repeated here.
When the result of the fault prediction and diagnosis of the service monitoring data is determined to be that the pre-occurred fault exists, the positioning problem of the pre-occurred fault and the fault solution of the positioning problem can be determined, and the positioning problem and the fault solution of the positioning problem in the fault diagnosis result are carried in the early warning information and sent to the terminal equipment of the operation and maintenance personnel.
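As a rough illustration of early warning information that carries both the positioning problem and its fault solution, the sketch below serializes them as JSON. The field names, the service name and the choice of JSON are assumptions; the application does not define a message format.

```python
import json
from typing import Optional


def build_early_warning(
    service: str,
    positioning_problem: str,
    fault_solution: Optional[str] = None,
) -> str:
    """Serialize early warning information; the fault solution is optional."""
    payload = {
        "service": service,
        "positioning_problem": positioning_problem,
    }
    if fault_solution is not None:
        payload["fault_solution"] = fault_solution
    # sort_keys keeps the output stable, which simplifies logging and comparison.
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)


message = build_early_warning(
    service="example-service",  # hypothetical service name
    positioning_problem="network card of the physical machine may fail",
    fault_solution="replace the network card of the physical machine",
)
```

The resulting string would then be sent to the terminal equipment of the operation and maintenance personnel over whatever channel the deployment uses.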
In a possible implementation, the method for processing a fault of a service provided in the embodiment of the present application further includes: acquiring an operation instruction corresponding to the fault solution, so as to debug the physical machine according to the operation instruction. For the specific implementation of acquiring the operation instruction and debugging the physical machine according to it, refer to the corresponding description in the embodiment shown in fig. 6, which is not repeated here.
The embodiment of the present application provides a service fault handling apparatus, which may be used to execute the embodiment of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application. In some possible embodiments, the fault handling apparatus of the service provided by the embodiment of the present application may be a server.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application, and as shown in fig. 8, the server according to the embodiment of the present application may include an obtaining module 51, a predicting module 52, and a processing module 53.
The obtaining module 51 is configured to obtain monitoring data of a service, where the monitoring data of the service includes monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container.
The prediction module 52 is configured to perform fault prediction diagnosis on the monitoring data of the service through a preset algorithm.
The processing module 53 is configured to determine the positioning problem of the pre-occurred fault when it is determined that the result of the fault prediction diagnosis of the service monitoring data is that a pre-occurred fault exists, and to carry the positioning problem in early warning information and send the early warning information to the terminal equipment of the operation and maintenance personnel.
The apparatus of this embodiment may perform the method embodiment shown in fig. 3, and the technical principle and technical effect are similar to those of the above embodiment, which are not described herein again.
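The cooperation of the three modules in fig. 8 can be sketched as a small pipeline. This is only one possible arrangement: the class, the callable signatures and the stub functions are hypothetical, and the prediction stub stands in for the preset algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Server:
    obtain: Callable[[], Dict[str, float]]                # obtaining module 51
    predict: Callable[[Dict[str, float]], Optional[str]]  # prediction module 52
    send: Callable[[str], None]                           # delivery to the terminal equipment

    def handle_once(self) -> Optional[str]:
        """Processing module 53: send early warning information when a fault is predicted."""
        data = self.obtain()
        positioning_problem = self.predict(data)
        if positioning_problem is not None:
            self.send("early warning: " + positioning_problem)
        return positioning_problem


def obtain_stub() -> Dict[str, float]:
    # Hypothetical container and physical machine monitoring data.
    return {"container_request_time_ms": 950.0, "host_cpu_ratio": 0.4}


def predict_stub(data: Dict[str, float]) -> Optional[str]:
    # Stand-in for the preset algorithm; the 500 ms limit is invented.
    if data["container_request_time_ms"] > 500.0:
        return "network card of the physical machine may fail"
    return None


sent: List[str] = []
server = Server(obtain=obtain_stub, predict=predict_stub, send=sent.append)
result = server.handle_once()
```

Passing the modules in as callables mirrors the remark that the module division is only a logical one and may be arranged differently in an actual implementation.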
Based on the embodiment shown in fig. 8, further, in another embodiment provided by the present application, the processing module 53 is further configured to perform conversion processing on the monitoring data of the service to obtain array-type service monitoring data, where the array-type service monitoring data includes element values corresponding to each monitoring data.
In a possible implementation, the processing module 53 is specifically configured to:
if the service monitoring data includes any one or a combination of the container request number, the Central Processing Unit (CPU) memory ratio of the container, and the CPU memory ratio of the physical machine, use the container request number, the CPU memory ratio of the container, or the CPU memory ratio of the physical machine directly as its corresponding element value in the array-type service monitoring data.
In a possible implementation, the processing module 53 is further configured to:
if the service monitoring data includes any one or a combination of request status code distribution, request time, response time, or queries per second (QPS), judge whether the proportion of abnormal requests in the request status code distribution exceeds a first preset threshold, whether the request time exceeds a second preset threshold, whether the response time exceeds a third preset threshold, or whether the QPS exceeds a fourth preset threshold, and set the element value of each item of monitoring data that exceeds its corresponding preset threshold to a first numerical value, as the corresponding element value of the request status code distribution, the request time, the response time, or the QPS in the array-type service monitoring data.
In a possible implementation, the processing module 53 is further configured to:
if an item of the service monitoring data is empty, assign the element value corresponding to that empty item a second numerical value.
The prediction module 52 is specifically configured to: input the array-type service monitoring data into a decision tree algorithm, and perform fault prediction diagnosis on the service monitoring data to obtain a fault prediction diagnosis result.
The apparatus of this embodiment may perform the method embodiment shown in fig. 5, and the technical principle and technical effect are similar to those of the above embodiment, which are not described herein again.
Based on the embodiment shown in fig. 8, further, fig. 9 is a schematic structural diagram of a server provided in another embodiment of the present application, and as shown in fig. 9, the server provided in the present application further includes a training module 61 and a determining module 62.
The training module 61 is used for acquiring historical monitoring data of the service and historical fault diagnosis results of the historical monitoring data; and training the preset algorithm through the historical monitoring data and the historical fault diagnosis result to generate the trained preset algorithm.
The prediction module 52 is specifically configured to: perform fault prediction diagnosis on the service monitoring data through the trained preset algorithm, and determine the fault prediction diagnosis result of the service monitoring data.
In a possible implementation manner, the server provided in the embodiment of the present application further includes:
a determining module 62, configured to determine a fault solution to the positioning problem; and a debugging module 63, configured to acquire an operation instruction corresponding to the fault solution, so as to debug the physical machine according to the operation instruction.
The apparatus of this embodiment may perform the method embodiment shown in fig. 6, and the technical principle and technical effect are similar to those of the above embodiment, which are not described herein again.
On the basis of the embodiment shown in fig. 9, in another embodiment provided by the present application, the training module 61 is configured to acquire historical monitoring data of the service, historical fault diagnosis results of the historical monitoring data, and fault solutions to the positioning problems in the historical fault diagnosis results; and to train the preset algorithm with the historical monitoring data, the historical fault diagnosis results, and the fault solutions to the positioning problems in the historical fault diagnosis results to generate the trained preset algorithm.
The prediction module 52 is specifically configured to: perform fault prediction diagnosis on the service monitoring data through the trained preset algorithm, and determine the fault prediction diagnosis result of the service monitoring data and a fault solution to the positioning problem in the fault prediction diagnosis result.
In a possible implementation manner, in the server provided in the embodiment of the present application, the early warning information further carries a fault solution to the positioning problem, and the server further includes:
and the debugging module 63 is configured to obtain an operation instruction corresponding to the failure solution, so as to perform debugging processing on the physical machine according to the operation instruction.
The apparatus of this embodiment can execute the method embodiment shown in fig. 7, and the technical principle and technical effect thereof are similar to those of the above embodiment, and are not described herein again.
The device embodiments provided in the present application are merely schematic, and the module division in fig. 8 or fig. 9 is only one logic function division, and there may be other division ways in actual implementation. For example, multiple modules may be combined or may be integrated into another system. The coupling of the various modules to each other may be through interfaces that are typically electrical communication interfaces, but mechanical or other forms of interfaces are not excluded. Thus, modules described as separate components may or may not be physically separate, may be located in one place, or may be distributed in different locations on the same or different devices.
Fig. 10 is a schematic structural diagram of a server according to another embodiment of the present application, and as shown in fig. 10, the server according to the embodiment of the present application may include:
a processor 61, a memory 62, a transceiver 63 and a computer program; wherein the transceiver 63 enables data transmission with other devices, a computer program is stored in the memory 62 and configured to be executed by the processor 61, the computer program comprising instructions for performing the fault handling method of the above-mentioned service, the contents and effects of which refer to the method embodiments.
In addition, embodiments of the present application further provide a computer-readable storage medium, in which computer-executable instructions are stored, and when at least one processor of the user equipment executes the computer-executable instructions, the user equipment performs the above-mentioned various possible methods.
Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for handling a failure of a service, comprising:
acquiring monitoring data of a service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container;
performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm;
and when the result of the fault prediction and diagnosis of the service monitoring data is determined to be that a pre-occurred fault exists, determining the positioning problem of the pre-occurred fault, carrying the positioning problem in early warning information, and sending the early warning information to the terminal equipment of operation and maintenance personnel.
2. The method of claim 1, further comprising, after the obtaining the monitoring data for the service:
performing conversion processing on the monitoring data of the service to acquire array type service monitoring data, wherein the array type service monitoring data comprises element values corresponding to each monitoring data;
the performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm comprises the following steps:
and inputting the array type service monitoring data into a decision tree algorithm, and performing fault prediction diagnosis on the service monitoring data to obtain a fault prediction diagnosis result.
3. The method according to claim 2, wherein the converting the monitoring data of the service to obtain the array type of service monitoring data comprises:
if the monitoring data of the service includes any one or more combinations of a container request number, a Central Processing Unit (CPU) memory ratio of the container and a CPU memory ratio of the physical machine, the container request number, the CPU memory ratio of the container or the CPU memory ratio of the physical machine are respectively used as the element values corresponding to the request number, the CPU memory ratio of the container or the CPU memory ratio of the physical machine in the service monitoring data of the array type.
4. The method of claim 3, further comprising:
if the monitoring data of the service includes any one or a combination of request status code distribution, request time, response time, or queries per second (QPS), determining whether the proportion of abnormal requests in the request status code distribution exceeds a first preset threshold, whether the request time exceeds a second preset threshold, whether the response time exceeds a third preset threshold, or whether the QPS exceeds a fourth preset threshold, and setting the element value of each item of monitoring data that exceeds its corresponding preset threshold to a first numerical value, as the corresponding element value of the request status code distribution, the request time, the response time, or the QPS in the array-type service monitoring data.
5. The method of claim 3 or 4, further comprising:
and if the monitoring data in the service monitoring data is empty, assigning the element value corresponding to the empty monitoring data in the service monitoring data as a second numerical value.
6. The method according to any one of claims 1-3, further comprising, prior to performing fault predictive diagnosis on the monitored data of the service by a preset algorithm:
acquiring historical monitoring data of the service and historical fault diagnosis results of the historical monitoring data;
training the preset algorithm according to the historical monitoring data and the historical fault diagnosis result to generate a trained preset algorithm;
the performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm comprises the following steps:
and performing fault prediction diagnosis on the service monitoring data through the trained preset algorithm, and determining the fault prediction diagnosis result of the service monitoring data.
7. The method according to any one of claims 1-3, further comprising, prior to performing fault predictive diagnosis on the monitored data of the service by a preset algorithm:
acquiring historical monitoring data of the service, historical fault diagnosis results of the historical monitoring data and fault solutions of positioning problems in the historical fault diagnosis results;
training the preset algorithm through the historical monitoring data, the historical fault diagnosis result and the fault solution of the positioning problem in the historical fault diagnosis result to generate a trained preset algorithm;
the performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm comprises the following steps:
and performing fault prediction diagnosis on the service monitoring data through the trained preset algorithm, and determining the fault prediction diagnosis result of the service monitoring data and a fault solution of the positioning problem in the fault prediction diagnosis result.
8. The method of claim 7, wherein the early warning information further carries a fault solution to the positioning problem, the method further comprising:
and acquiring an operation instruction corresponding to the fault solution so as to debug the physical machine according to the operation instruction.
9. A server, comprising:
the acquisition module is used for acquiring monitoring data of a service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container;
the prediction module is used for carrying out fault prediction diagnosis on the monitoring data of the service through a preset algorithm;
and the processing module is used for determining the positioning problem of the pre-occurred fault when the result of the fault prediction and diagnosis of the monitoring data of the service is determined to be that the pre-occurred fault exists, and carrying the positioning problem in early warning information to be sent to the terminal equipment of the operation and maintenance personnel.
10. The server according to claim 9, wherein the processing module is further configured to perform conversion processing on the monitoring data of the service to obtain array-type service monitoring data, where the array-type service monitoring data includes an element value corresponding to each monitoring data;
the prediction module is specifically configured to:
and inputting the array type service monitoring data into a decision tree algorithm, and performing fault prediction diagnosis on the service monitoring data to obtain a fault prediction diagnosis result.
CN202010419919.XA 2020-05-18 2020-05-18 Service fault processing method and server Pending CN111581062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010419919.XA CN111581062A (en) 2020-05-18 2020-05-18 Service fault processing method and server

Publications (1)

Publication Number Publication Date
CN111581062A true CN111581062A (en) 2020-08-25

Family

ID=72113626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010419919.XA Pending CN111581062A (en) 2020-05-18 2020-05-18 Service fault processing method and server

Country Status (1)

Country Link
CN (1) CN111581062A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112099983A (en) * 2020-09-22 2020-12-18 北京知道创宇信息技术股份有限公司 Service exception handling method and device, electronic equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015109443A1 (en) * 2014-01-21 2015-07-30 华为技术有限公司 Method for processing network service faults, service management system and system management module
CN106330576A (en) * 2016-11-18 2017-01-11 北京红马传媒文化发展有限公司 Automatic scaling and migration scheduling method, system and device for containerization micro-service
CN109634828A (en) * 2018-12-17 2019-04-16 浪潮电子信息产业股份有限公司 Failure prediction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112162878B (en) Database fault discovery method and device, electronic equipment and storage medium
CN109039833B (en) Method and device for monitoring bandwidth state
CN110213068B (en) Message middleware monitoring method and related equipment
CN110740061B (en) Fault early warning method and device and computer storage medium
CN111897705B (en) Service state processing and model training method, device, equipment and storage medium
US11743237B2 (en) Utilizing machine learning models to determine customer care actions for telecommunications network providers
CN111651595A (en) Abnormal log processing method and device
CN111400294B (en) Data anomaly monitoring method, device and system
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
CN112685207A (en) Method, apparatus and computer program product for error assessment
CN108039971A (en) A kind of alarm method and device
CN111581062A (en) Service fault processing method and server
CN113778960A (en) Fault determination method and device for Internet of things system and storage medium
CN109522184A (en) A kind of server system method for safety monitoring, device and terminal
CN110609761B (en) Method and device for determining fault source, storage medium and electronic equipment
CN114116128B (en) Container instance fault diagnosis method, device, equipment and storage medium
US20050198640A1 (en) Methods, systems and computer program products for selecting among alert conditions for resource management systems
CN111211938B (en) Biological information software monitoring system and method
CN115941441A (en) System link automation monitoring operation and maintenance method, system, equipment and medium
CN114861909A (en) Model quality monitoring method and device, electronic equipment and storage medium
CN114661506A (en) Fault isolation method and fault isolation device
CN113238888A (en) Data processing method, system and device
CN108959100A (en) Test method, the device and system of application program
CN111131292B (en) Message distribution method and device, network security detection equipment and storage medium
CN112199247B (en) Method and device for checking Docker container process activity in non-service state

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination