CN111581062A - Service fault processing method and server

Info

Publication number: CN111581062A
Application number: CN202010419919.XA
Authority: CN
Inventors: 赵贝; 崔贺; 矫恒浩
Original assignee: Qingdao Hisense Media Network Technology Co Ltd
Current assignee: Qingdao Hisense Media Network Technology Co Ltd; Juhaokan Technology Co Ltd
Priority date: 2020-05-18
Filing date: 2020-05-18
Publication date: 2020-08-25

Abstract

The application provides a service fault processing method and a server. Monitoring data of a service is acquired, the monitoring data comprising monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container. Fault prediction diagnosis is performed on the monitoring data of the service through a preset algorithm. When the result of the fault prediction diagnosis is that a pre-occurred fault exists, the positioning problem of the pre-occurred fault is determined, carried in early warning information, and sent to the terminal equipment of operation and maintenance personnel. Pre-occurred faults are thereby predicted and located, which improves the efficiency of locating them, and the early warning sent to the terminal equipment of operation and maintenance personnel improves the timeliness with which faults are discovered.

Description

Service fault processing method and server

Technical Field

The present application relates to the technical field of servers, and in particular, to a service fault handling method and a server.

Background

The container cloud is a mainstream cloud computing mode at present, and has the advantages of high starting speed, low resource consumption and the like. The container cloud environment faces a great reliability challenge, and services in the container cloud environment may often fail for some reason, so that one service or even multiple services need to be executed again, and normal operation of the services is seriously affected.

In the prior art, when a problem occurs in a service in a container cloud environment, a large amount of manpower, material resources and time is consumed after the fault occurs to check service-related information and locate the problem, and the fault is then resolved with a common fault processing method. Because this approach detects a fault only after it has occurred, the problem cannot be discovered in time.

Disclosure of Invention

The application provides a service fault processing method and a server, so that early warning of service faults is achieved, and timeliness of finding the faults is improved.

In a first aspect, an embodiment of the present application provides a method for processing a service failure, including:

acquiring monitoring data of a service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container; performing fault prediction diagnosis on the monitoring data of the service through a prediction algorithm; and when the result of the fault prediction diagnosis of the monitoring data of the service is that a pre-occurred fault exists, determining the positioning problem of the pre-occurred fault, carrying the positioning problem in early warning information, and sending the early warning information to the terminal equipment of operation and maintenance personnel.

In a second aspect, an embodiment of the present application provides a server, including:

an acquisition module, configured to acquire monitoring data of a service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container;

a prediction module, configured to perform fault prediction diagnosis on the monitoring data of the service through a prediction algorithm; and

a processing module, configured to, when the result of the fault prediction diagnosis of the monitoring data of the service is that a pre-occurred fault exists, determine the positioning problem of the pre-occurred fault, carry the positioning problem in early warning information, and send the early warning information to the terminal equipment of operation and maintenance personnel.

In a third aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as provided in the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer program product, including: executable instructions for implementing the method as provided by the first aspect.

According to the service fault processing method and the server provided by the embodiments of the present application, monitoring data of a service is acquired, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container; fault prediction diagnosis is performed on the monitoring data of the service through a preset algorithm; and when the result of the fault prediction diagnosis is that a pre-occurred fault exists, the positioning problem of the pre-occurred fault is determined, carried in early warning information, and sent to the terminal equipment of operation and maintenance personnel. Because the pre-occurred fault is both predicted and located, the efficiency of locating it is improved. Because the positioning problem is carried in the early warning information sent to the terminal equipment of operation and maintenance personnel, the pre-occurred fault is warned of in advance, which further improves the timeliness with which the fault is discovered.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a diagram of an exemplary application scenario provided by an embodiment of the present application;

FIG. 2 is an exemplary application scenario architecture diagram provided by an embodiment of the present application;

FIG. 3 is a flowchart illustrating a method for handling a failure of a service provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a decision tree provided in an embodiment of the present application;

FIG. 5 is a flow chart illustrating a method for handling a failure of a service provided in another embodiment of the present application;

FIG. 6 is a flow chart illustrating a method for handling a failure of a service provided by another embodiment of the present application;

FIG. 7 is a flow chart illustrating a method for handling a failure of a service provided by yet another embodiment of the present application;

FIG. 8 is a block diagram of a server according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a server according to another embodiment of the present application;

FIG. 10 is a schematic structural diagram of a server according to another embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The container cloud is currently a mainstream cloud computing mode, with advantages such as fast startup and low resource consumption. The container cloud environment nevertheless faces a great reliability challenge: services in it may often fail for various reasons, so that one or even several services need to be executed again, which seriously affects normal operation. In the prior art, when a problem occurs in a service in a container cloud environment, a large amount of manpower, material resources and time is consumed after the fault occurs to check service-related information and locate the problem, and the fault is then resolved with a common fault processing method. Because this approach detects a fault only after it has occurred, the problem cannot be discovered in time.

The inventive concept of the service fault processing method and the server provided by the embodiments of the present application is to acquire monitoring data related to a service and to perform fault prediction diagnosis on that monitoring data with a decision tree algorithm. In this way, faults that may occur can be predicted, and the positioning problems of those faults can be sent to the terminal equipment of operation and maintenance personnel to remind them to handle the positioning problems, which improves the timeliness of fault discovery.

An exemplary application scenario of the embodiments of the present application is described below.

The service fault processing method provided by the embodiments of the present application may be executed by the server provided by the embodiments of the present application. FIG. 1 is an exemplary application scenario diagram provided by an embodiment of the present application, and FIG. 2 is an exemplary application scenario architecture diagram provided by an embodiment of the present application. As shown in FIG. 1 and FIG. 2, the method may be applied to a server 11, and the server 11 may include a plurality of containers. A container is an interface set located between an application or component and the server platform, which makes it convenient to deploy the application or component to a server to run; each service can be executed through one container. There is data communication between the server 11 and the terminal device 12, and the manner of communication is not limited in the embodiments of the present application. The specific types of the terminal device and the server are likewise not limited. For example, the terminal device may be a smart phone, a personal computer, a tablet computer, a wearable device, a vehicle-mounted terminal, and the like, and the server may be an application server, a Web server, a Web application server, and the like.

FIG. 3 is a schematic flowchart of a service fault processing method according to an embodiment of the present application. The method may be executed by a server and is described below with the server as the execution subject. As shown in FIG. 3, the service fault processing method provided by the embodiment of the present application may include:

Step S101: acquiring monitoring data of a service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container.

Different services may involve different data flows. For example, in the application scenario shown in FIG. 2, a service may be a process executed by an application program or a component. Whether a service is faulty can generally be judged from the monitoring data of the container corresponding to the service and the monitoring data of the physical machine bearing the container. In a possible implementation manner, the monitoring data of the container may include the number of requests of the container, the central processing unit (CPU) memory ratio of the container, the request status code distribution, the request time, the response time, the queries per second (QPS), and the like, and the monitoring data of the physical machine may include the CPU memory ratio of the physical machine, network card information of the physical machine, and the like.

In a possible implementation manner, log data from a past preset time period may be queried through a configured Elasticsearch (ES) address to obtain the request status code distribution, request time, response time, QPS, and the like. The resource usage of the node where the service is located may be determined by querying the container and the physical machine corresponding to the service, for example, the CPU memory ratio of the container, the CPU memory ratio of the physical machine, and the network card information of the physical machine.
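As a rough illustration of the kind of aggregation described above, the following sketch derives a request status code distribution, a mean response time and a QPS figure from log records that have already been fetched for a past time window. The record keys (`status`, `response_ms`) and the function name `summarize_logs` are assumptions made for this example only; the application does not specify a log format, and the Elasticsearch query itself is omitted.

```python
from collections import Counter

def summarize_logs(records, window_seconds):
    """Aggregate log records from one time window into service monitoring data.

    `records` is a list of dicts with the hypothetical keys `status` and
    `response_ms`; `window_seconds` is the length of the queried window.
    """
    if not records:
        # No log data in the window: report every metric as missing (empty).
        return {"status_distribution": None, "mean_response_ms": None, "qps": None}
    counts = Counter(r["status"] for r in records)
    total = len(records)
    return {
        # Share of each status code among all requests in the window.
        "status_distribution": {code: n / total for code, n in counts.items()},
        "mean_response_ms": sum(r["response_ms"] for r in records) / total,
        # Queries per second averaged over the window.
        "qps": total / window_seconds,
    }

sample = [
    {"status": 200, "response_ms": 40},
    {"status": 200, "response_ms": 60},
    {"status": 500, "response_ms": 200},
    {"status": 499, "response_ms": 100},
]
summary = summarize_logs(sample, window_seconds=2)
```

Returning `None` for an empty window keeps missing data distinguishable from a genuine zero, which matters if empty monitoring data is later given its own element value.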

Step S102: performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm.

The embodiments of the present application do not limit the specific preset algorithm, as long as it can perform fault prediction diagnosis on the monitoring data of the service and produce a fault prediction diagnosis result. For example, the preset algorithm may be a decision tree algorithm, a neural network algorithm, or a combination of multiple algorithms. In the following, the decision tree algorithm is taken as an example of the preset algorithm.

For convenience of introduction, FIG. 4 is a schematic structural diagram of a decision tree provided in an embodiment of the present application. The decision tree is a tree structure, which may be a binary tree or a non-binary tree; this is not limited in the embodiments of the present application. The tree structure comprises a root node, non-leaf nodes and leaf nodes. Fault prediction diagnosis with the decision tree algorithm starts from the root node and tests each item of monitoring data in the monitoring data of the service in turn, where each non-leaf node represents a judgment on one item of monitoring data and each leaf node represents one fault prediction diagnosis result. The specific type of decision tree algorithm is not limited in the embodiments of the present application; for example, it may be CART (Classification and Regression Tree), CLS (Concept Learning System), or the like.

In FIG. 4, the tree structure is a binary tree and the monitoring data of the service includes monitoring data A, monitoring data B and monitoring data C. As shown in FIG. 4, the root node may be the monitoring data of the service. Starting from the root node, monitoring data A is judged: if monitoring data A satisfies its preset condition, non-leaf node 1 is entered; otherwise, non-leaf node 2 is entered. Monitoring data B is then judged at non-leaf node 1 and at non-leaf node 2. The judgment at non-leaf node 1 is described here; the judgment at non-leaf node 2 is similar and is not described again. If monitoring data B satisfies its preset condition, leaf node 1 is entered and a fault prediction diagnosis result of the monitoring data of the service is obtained. If monitoring data B does not satisfy its preset condition, non-leaf node 3 is entered and monitoring data C is judged there: if monitoring data C satisfies its preset condition, leaf node 2 is entered and a fault prediction diagnosis result is obtained; otherwise, leaf node 3 is entered and a fault prediction diagnosis result is obtained. The embodiments of the present application are merely examples and are not limited thereto.
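The traversal described for FIG. 4 can be sketched as a small binary tree walk. This is an illustrative sketch only: the `Node` class, the `diagnose` function and the three preset conditions are hypothetical, and because the description gives no detail for the subtree under non-leaf node 2, its two leaves are assumed here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    """A decision tree node: a leaf holds `result`, a non-leaf holds `test`."""
    result: Optional[str] = None
    test: Optional[Callable[[dict], bool]] = None
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None

def diagnose(node, data):
    """Walk from the root to a leaf and return that leaf's diagnosis result."""
    while node.result is None:
        node = node.if_true if node.test(data) else node.if_false
    return node.result

# Hypothetical preset conditions on monitoring data A, B and C.
def cond_a(d): return d["A"] < 0.8
def cond_b(d): return d["B"] < 0.8
def cond_c(d): return d["C"] < 0.8

non_leaf_3 = Node(test=cond_c,
                  if_true=Node(result="leaf node 2"),
                  if_false=Node(result="leaf node 3"))
non_leaf_1 = Node(test=cond_b,
                  if_true=Node(result="leaf node 1"),
                  if_false=non_leaf_3)
# The subtree below non-leaf node 2 is not detailed in the text; assumed here.
non_leaf_2 = Node(test=cond_b,
                  if_true=Node(result="leaf node 4 (assumed)"),
                  if_false=Node(result="leaf node 5 (assumed)"))
root = Node(test=cond_a, if_true=non_leaf_1, if_false=non_leaf_2)

# A satisfies its condition, B and C do not: root -> node 1 -> node 3 -> leaf 3.
result = diagnose(root, {"A": 0.5, "B": 0.9, "C": 0.95})
```

In a real deployment each leaf would carry a fault prediction diagnosis result (for example, whether a pre-occurred fault exists and its positioning problem) rather than a node label.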

Step S103: when the result of the fault prediction diagnosis of the monitoring data of the service is that a pre-occurred fault exists, determining the positioning problem of the pre-occurred fault, carrying the positioning problem in early warning information, and sending the early warning information to the terminal equipment of operation and maintenance personnel.

The result of the fault prediction diagnosis of the monitoring data of the service may be that a pre-occurred fault exists or that no pre-occurred fault exists. In a possible implementation manner, if the result is that no pre-occurred fault exists, notification information need not be sent to the terminal equipment of operation and maintenance personnel, which saves resources. Alternatively, notification information may still be sent so that operation and maintenance personnel know the running state of the service; for example, the fault prediction diagnosis result of the monitoring data of the service may be sent to their terminal equipment at regular intervals. This is not limited in the embodiments of the present application.

In another possible implementation manner, if the result of the fault prediction diagnosis is that a pre-occurred fault exists, the positioning problem of the pre-occurred fault is determined, which improves the accuracy and timeliness of determining the positioning problem. The positioning problem is carried in early warning information and sent to the terminal equipment of operation and maintenance personnel. For example, if the positioning problem of the pre-occurred fault is that the network card of the physical machine may fail, this information is carried in the early warning information and sent to the terminal equipment of operation and maintenance personnel, so that they can take corresponding remedial measures before the fault occurs, avoiding both the fault and the loss that a service fault would cause. The embodiments of the present application do not limit how the early warning information is sent to the terminal equipment of operation and maintenance personnel; for example, it may be sent by mail, short message, push message, system prompt, and the like.
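One way to picture step S103 is a function that turns a diagnosis result into early warning information only when a pre-occurred fault exists. The `Diagnosis` class, the message fields and the service name below are hypothetical, and the delivery channel (mail, short message, push) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Diagnosis:
    """Result of the fault prediction diagnosis for one service."""
    has_pre_occurred_fault: bool
    positioning_problem: Optional[str] = None

def build_early_warning(service, diagnosis):
    """Return early warning information carrying the positioning problem, or None."""
    if not diagnosis.has_pre_occurred_fault:
        return None  # No pre-occurred fault: nothing to warn about.
    return {
        "service": service,
        "level": "early-warning",
        "positioning_problem": diagnosis.positioning_problem,
    }

warning = build_early_warning(
    "video-api",  # hypothetical service name
    Diagnosis(True, "network card of the physical machine may fail"),
)
quiet = build_early_warning("video-api", Diagnosis(False))
```

Here `quiet` is `None`, matching the option of sending nothing when no pre-occurred fault exists; a variant that always reports status would return a message in both branches.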

In the embodiment of the present application, fault prediction diagnosis is performed on the monitoring data of the service through the decision tree algorithm, and when the result is that a pre-occurred fault exists, the positioning problem of the pre-occurred fault is determined. The pre-occurred fault is thus both predicted and located, which improves the efficiency of locating it. The positioning problem is carried in early warning information and sent to the terminal equipment of operation and maintenance personnel, so that the pre-occurred fault is warned of in advance, which further improves the timeliness with which the fault is discovered.

FIG. 5 is a schematic flowchart of a service fault processing method provided in another embodiment of the present application. On the basis of the foregoing embodiment, as shown in FIG. 5, before step S102 the service fault processing method provided by the embodiment of the present application may further include:

Step S201: converting the monitoring data of the service to obtain array-type service monitoring data, wherein the array-type service monitoring data comprises an element value corresponding to each item of monitoring data.

The step S102 is to perform a fault prediction diagnosis on the monitoring data of the service through a preset algorithm, and may be implemented through the step S202.

Step S202: inputting the array-type service monitoring data into the decision tree algorithm and performing fault prediction diagnosis on it to obtain a fault prediction diagnosis result.

Before fault prediction diagnosis is performed on the monitoring data of the service through the preset algorithm, the monitoring data may be converted into array-type service monitoring data, which standardizes it. The array-type service monitoring data is then input into the decision tree algorithm to perform fault prediction diagnosis and obtain a fault prediction diagnosis result, which improves the reliability of that result.

The embodiments of the present application do not limit how the monitoring data of the service is converted. For example, the monitoring data may be converted according to its specific data type and the conditions that may indicate a fault. For example, if the service may be faulty when requests with status code 499 or 500 account for more than 50% of all requests in the request status code distribution, then either the proportion of 499 and 500 status codes may be used as the element value of the request status code distribution, or the element value may be determined by whether that proportion exceeds 50%. As another example, if a sudden increase in QPS may indicate a problem in the service, the element value of the QPS may be determined by whether a sudden increase in QPS has occurred. The embodiments of the present application are merely examples and are not limited thereto.
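The two examples above can be written out as small helper functions. This is a sketch under stated assumptions: the function names are hypothetical, and the definition of a "sudden increase" in QPS (at least double the previous value) is an arbitrary illustrative choice that the application does not specify.

```python
def abnormal_ratio(status_counts, abnormal_codes=(499, 500)):
    """Share of requests whose status code is one of `abnormal_codes`."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    return sum(status_counts.get(code, 0) for code in abnormal_codes) / total

def qps_surged(previous_qps, current_qps, factor=2.0):
    """Flag a sudden increase: current QPS is at least `factor` times the previous.

    The default factor of 2.0 is an assumption made for illustration.
    """
    return previous_qps > 0 and current_qps >= factor * previous_qps

counts = {200: 40, 499: 25, 500: 35}
ratio = abnormal_ratio(counts)   # 60 of 100 requests are abnormal
flag = 1 if ratio > 0.5 else 0   # thresholded element value
```

Either `ratio` or `flag` could serve as the element value of the request status code distribution, corresponding to the two options described in the text.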

For different monitoring data of the service, the corresponding element values may be determined in different manners to obtain the array-type service monitoring data. In a possible implementation manner, converting the monitoring data of the service to obtain the array-type service monitoring data includes: if the monitoring data of the service comprises any one or more of the number of requests of the container, the CPU memory ratio of the container, and the CPU memory ratio of the physical machine, using each of these values directly as its corresponding element value in the array-type service monitoring data.

In yet another possible implementation manner, if the monitoring data of the service comprises any one or more of the request status code distribution, the request time, the response time, and the queries per second (QPS), it is judged whether the proportion of abnormal requests in the request status code distribution exceeds a first preset threshold, whether the request time exceeds a second preset threshold, whether the response time exceeds a third preset threshold, or whether the QPS exceeds a fourth preset threshold. The element value of monitoring data that exceeds its corresponding preset threshold is set to a first numerical value, and the element value of monitoring data that does not exceed its corresponding preset threshold is set to a third numerical value; these serve as the element values of the request status code distribution, request time, response time, or QPS in the array-type service monitoring data.

In another possible implementation manner, if an item of monitoring data in the monitoring data of the service is empty, the element value corresponding to that item is set to a second numerical value.

The specific numerical values of the first preset threshold, the second preset threshold, the third preset threshold and the fourth preset threshold are not limited in the embodiment of the application. Taking the example that the monitoring data of the service includes the request time, the first value is 1, the second value is 2, and the third value is 0, if the request time exceeds the second preset threshold, the element value of the request time is set to 1, if the request time does not exceed the second preset threshold, the element value of the request time is set to 0, and if the request time is empty, the element value of the request time is set to 2.

The specific values of the first value, the second value and the third value are not limited in this application, and in a possible implementation manner, the first value, the second value and the third value are different values. For example, the first numerical value may be 3, the second numerical value may be 5, and the third numerical value may be 4, and still take the example that the monitoring data of the service includes the request time, if the request time exceeds the second preset threshold, the element value of the request time is set to 3, if the request time does not exceed the second preset threshold, the element value of the request time is set to 4, and if the request time is empty, the element value of the request time is set to 5.
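Putting the conversion rules together, the following sketch builds array-type service monitoring data using the values 1, 2 and 0 from the first example (exceeded, empty, not exceeded). The field names, the threshold values and the function name `to_array` are all hypothetical; the application leaves the thresholds and the three numerical values open.

```python
FIRST, SECOND, THIRD = 1, 2, 0  # exceeded / empty / not exceeded

# Hypothetical thresholds (first to fourth preset thresholds, in order).
THRESHOLDS = {
    "abnormal_ratio": 0.5,
    "request_time": 1.0,
    "response_time": 0.5,
    "qps": 1000,
}
# Metrics whose raw value is used directly as the element value.
RAW_FIELDS = ("request_count", "container_cpu_mem", "host_cpu_mem")

def to_array(monitoring):
    """Convert a dict of monitoring data into array-type service monitoring data."""
    array = []
    for name in RAW_FIELDS:
        value = monitoring.get(name)
        # Raw metrics are copied as-is; empty ones get the second value.
        array.append(SECOND if value is None else value)
    for name, limit in THRESHOLDS.items():
        value = monitoring.get(name)
        if value is None:
            array.append(SECOND)
        else:
            array.append(FIRST if value > limit else THIRD)
    return array

encoded = to_array({
    "request_count": 120,
    "container_cpu_mem": 0.4,
    "host_cpu_mem": 0.7,
    "abnormal_ratio": 0.6,
    "request_time": 0.2,
    "response_time": None,
    "qps": 1500,
})
# encoded == [120, 0.4, 0.7, 1, 0, 2, 1]
```

One caveat of this sketch: a raw metric that happens to equal the second value cannot be told apart from an empty one, so a real encoding would likely pick a sentinel outside the valid range of each raw metric.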

In the embodiment of the present application, converting the monitoring data of the service standardizes it and yields array-type service monitoring data, which improves the reliability of the data. The array-type service monitoring data is then input into the decision tree algorithm to perform fault prediction diagnosis, which improves the reliability of the fault prediction diagnosis result.

FIG. 6 is a schematic flowchart of a service fault processing method provided in another embodiment of the present application. On the basis of any one of the embodiments shown in FIG. 3 or FIG. 5, as shown in FIG. 6, before step S102 the service fault processing method provided by the embodiment of the present application may further include:

Step S301: acquiring historical monitoring data of the service and historical fault diagnosis results of the historical monitoring data.

Step S302: training the preset algorithm with the historical monitoring data and the historical fault diagnosis results to generate a trained preset algorithm.

The step S102, that is, performing the fault prediction diagnosis on the monitoring data of the service through the preset algorithm, may be implemented through the step S303.

Step S303: performing fault prediction diagnosis on the monitoring data of the service through the trained preset algorithm to determine the fault prediction diagnosis result of the monitoring data of the service.

In this embodiment of the present application, the historical monitoring data of the service may be the monitoring data within a preset time period; for example, the monitoring data of the service from the previous two years may be used as the historical monitoring data. The specific length of the preset time period is not limited in the embodiments of the present application, and this is merely an example. The historical fault diagnosis result of the historical monitoring data may indicate that a fault exists or that no fault exists, and if a fault exists, it may also include the positioning problem of the fault.

After the historical monitoring data of the service and the historical fault diagnosis results of the historical monitoring data are obtained, the preset algorithm is trained with them to generate the trained preset algorithm. The trained preset algorithm can then perform fault prediction diagnosis on input monitoring data of the service and output the fault prediction diagnosis result of that monitoring data.
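To make the training step concrete, here is a minimal sketch that assumes the preset algorithm is a decision tree and trains it in a simplified CART style, choosing at each node the threshold split that minimises weighted Gini impurity. The application does not prescribe a training procedure, and the historical samples and labels below are invented purely for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) minimising weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def train(rows, labels):
    """Build a tree as nested tuples (feature, threshold, left, right) or a leaf label."""
    split = best_split(rows, labels)
    if split is None:
        # No split improves purity: make a leaf with the majority label.
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            train([r for r, _ in left], [y for _, y in left]),
            train([r for r, _ in right], [y for _, y in right]))

def predict(tree, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Invented history: [abnormal-ratio flag, request-time flag, host CPU memory ratio].
history = [[1, 1, 0.9], [1, 0, 0.8], [0, 0, 0.3],
           [0, 1, 0.4], [0, 0, 0.95], [1, 0, 0.2]]
results = ["fault", "fault", "no fault", "no fault", "fault", "no fault"]
model = train(history, results)
prediction = predict(model, [0, 0, 0.97])
```

On this toy data a single split on the third feature separates the two classes, so `prediction` is `"fault"`. Leaf labels could equally be positioning problems rather than a plain fault or no-fault label, provided they are not tuples.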

When the result of the fault prediction and diagnosis of the service monitoring data is determined to be that the pre-occurred fault exists, the positioning problem of the pre-occurred fault can be determined, and the positioning problem is carried in the early warning information and is sent to the terminal equipment of the operation and maintenance personnel. In a possible implementation manner, the method for processing a fault of a service provided in an embodiment of the present application further includes:

determining a fault solution to the positioning problem; and acquiring an operation instruction corresponding to the fault solution so as to debug the physical machine according to the operation instruction.

Each positioning problem has at least one corresponding fault solution. For example, if the positioning problem is that a network card of a physical machine may fail, the corresponding fault solution may be to replace the network card of the physical machine, or to replace the physical machine carrying the container corresponding to the service. After the fault solution of the positioning problem is determined, an operation instruction corresponding to the fault solution may further be obtained, so that the physical machine is debugged according to the operation instruction. Taking the possible failure of a network card of the physical machine as an example, the operation instruction corresponding to the fault solution may include an operation instruction on whether to replace the physical machine and/or an operation instruction on how to replace the physical machine, and the like.

In the embodiment of the application, a fault solution of the positioning problem is determined, an operation instruction corresponding to the fault solution is obtained, and the physical machine is debugged according to the operation instruction, so that the pre-occurring fault can be prevented from occurring and the loss caused by a service fault can be avoided.
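The mapping from a positioning problem to its fault solutions, and from each fault solution to its operation instructions, can be sketched as two lookup tables. This is only an illustrative sketch in Python: the table names, the function `plan_debugging` and the instruction texts are hypothetical. Only the network card example itself comes from the description above.

```python
# Hypothetical table: positioning problem -> candidate fault solutions.
FAULT_SOLUTIONS = {
    "network card of the physical machine may fail": [
        "replace the network card of the physical machine",
        "replace the physical machine carrying the service container",
    ],
}

# Hypothetical table: fault solution -> operation instructions.
# The instruction wording is invented for illustration.
OPERATION_INSTRUCTIONS = {
    "replace the network card of the physical machine": [
        "confirm whether the network card should be replaced",
        "schedule the network card replacement",
    ],
    "replace the physical machine carrying the service container": [
        "confirm whether the physical machine should be replaced",
        "move the container to a standby physical machine",
    ],
}

def plan_debugging(positioning_problem):
    """Return a list of (fault solution, operation instructions) pairs.

    An unknown positioning problem yields an empty list.
    """
    solutions = FAULT_SOLUTIONS.get(positioning_problem, [])
    return [(s, OPERATION_INSTRUCTIONS.get(s, [])) for s in solutions]

plan = plan_debugging("network card of the physical machine may fail")
```

Keeping the two tables separate reflects the statement that one positioning problem may have several fault solutions, each with its own operation instructions.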

Fig. 7 is a flowchart illustrating a method for processing a fault of a service provided in a further embodiment of the present application. On the basis of any one of the embodiments shown in fig. 3 or fig. 5, as shown in fig. 7, the method for processing a fault of a service provided in this embodiment may further include, before step S102:

Step S401: obtaining historical monitoring data of the service, historical fault diagnosis results of the historical monitoring data, and fault solutions of the positioning problems in the historical fault diagnosis results.

Step S402: training the preset algorithm through the historical monitoring data, the historical fault diagnosis results and the fault solutions of the positioning problems in the historical fault diagnosis results to generate the trained preset algorithm.

Step S102, that is, performing the fault prediction diagnosis on the monitoring data of the service through the preset algorithm, may be implemented through step S403.

Step S403: performing fault prediction diagnosis on the monitoring data of the service through the trained preset algorithm, and determining the fault prediction diagnosis result of the monitoring data of the service and a fault solution of the positioning problem in the fault prediction diagnosis result.

The difference between this embodiment and the embodiment shown in fig. 6 is that the fault solution of the positioning problem in the historical fault diagnosis result is also taken into account when training the decision tree algorithm. The specific training mode of this embodiment is similar to that of the embodiment shown in fig. 6 and is not repeated here.

When the result of the fault prediction and diagnosis of the service monitoring data is determined to be that the pre-occurred fault exists, the positioning problem of the pre-occurred fault and the fault solution of the positioning problem can be determined, and the positioning problem and the fault solution of the positioning problem in the fault diagnosis result are carried in the early warning information and sent to the terminal equipment of the operation and maintenance personnel.
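A minimal sketch of how the early warning information might carry both the positioning problem and its fault solution is given below in Python. The `EarlyWarning` class, its field names, the shape of the `diagnosis` dictionary and the JSON encoding are all assumptions made for illustration. The application does not specify a message format or a transport to the terminal equipment.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class EarlyWarning:
    """Hypothetical early warning payload."""
    service: str
    positioning_problem: str
    fault_solution: Optional[str] = None

def build_early_warning(service, diagnosis):
    """Return an EarlyWarning when a pre-occurring fault is predicted, else None."""
    if not diagnosis.get("fault_predicted"):
        return None
    return EarlyWarning(
        service=service,
        positioning_problem=diagnosis["positioning_problem"],
        fault_solution=diagnosis.get("fault_solution"),
    )

def serialize(warning):
    """Encode the warning as JSON for delivery to the operator terminal."""
    return json.dumps(asdict(warning), ensure_ascii=False)

# Hypothetical diagnosis result produced by the trained algorithm.
diagnosis = {
    "fault_predicted": True,
    "positioning_problem": "network card of the physical machine may fail",
    "fault_solution": "replace the network card of the physical machine",
}
warning = build_early_warning("video-service", diagnosis)
payload = serialize(warning)
```

The `fault_solution` field is optional so that the same payload could also serve the embodiment in which only the positioning problem is carried.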

In a possible implementation manner, the method for processing a fault of a service provided in an embodiment of the present application further includes: acquiring an operation instruction corresponding to the fault solution, so as to debug the physical machine according to the operation instruction. For the specific implementation of acquiring the operation instruction corresponding to the fault solution and debugging the physical machine according to the operation instruction, refer to the corresponding description in the embodiment shown in fig. 6, which is not repeated here.

The embodiment of the present application provides a service fault handling apparatus, which may be used to execute the embodiment of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application. In some possible embodiments, the fault handling apparatus of the service provided by the embodiment of the present application may be a server.

Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application, and as shown in fig. 8, the server according to the embodiment of the present application may include an obtaining module 51, a predicting module 52, and a processing module 53.

The obtaining module 51 is configured to obtain monitoring data of a service, where the monitoring data of the service includes monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container.

The prediction module 52 is configured to perform fault prediction diagnosis on the monitoring data of the service through a preset algorithm.

The processing module 53 is configured to determine the positioning problem of the pre-occurred fault when it is determined that the result of the fault prediction diagnosis of the monitoring data of the service is that a pre-occurred fault exists, and to carry the positioning problem in the early warning information and send the early warning information to the terminal device of the operation and maintenance personnel.

The apparatus of this embodiment may perform the method embodiment shown in fig. 3, and the technical principle and technical effect are similar to those of the above embodiment, which are not described herein again.
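The cooperation of the obtaining module, the prediction module and the processing module can be sketched as three injected callables. The Python below is a sketch under stated assumptions: the class `FaultHandlingServer`, the stub functions, the threshold of 0.9 and the dictionary layout are hypothetical and are used only to show the data flow between the three roles.

```python
from typing import Callable, Optional

class FaultHandlingServer:
    """Wires together the obtaining, prediction and processing roles."""

    def __init__(self,
                 obtain: Callable[[str], dict],
                 predict: Callable[[dict], Optional[str]],
                 send: Callable[[dict], None]):
        self.obtain = obtain      # role of the obtaining module 51
        self.predict = predict    # role of the prediction module 52
        self.send = send          # delivery to the operator terminal (assumed)

    def handle(self, service: str) -> bool:
        """Return True if an early warning was sent for the service."""
        data = self.obtain(service)
        problem = self.predict(data)   # None means no pre-occurring fault
        if problem is None:
            return False
        # Role of the processing module 53: carry the positioning problem
        # in the early warning information.
        self.send({"service": service, "positioning_problem": problem})
        return True

def obtain_stub(service):
    """Stub: monitoring data of the container and of its physical machine."""
    return {"container": {"cpu_memory_ratio": 0.95},
            "physical_machine": {"cpu_memory_ratio": 0.40}}

def predict_stub(data):
    """Stub standing in for the preset algorithm (threshold is invented)."""
    if data["container"]["cpu_memory_ratio"] > 0.9:
        return "container resources may be exhausted"
    return None

sent = []
server = FaultHandlingServer(obtain_stub, predict_stub, sent.append)
warned = server.handle("video-service")
```

Passing the three roles in as callables mirrors the later remark that the module division is only a logical one and may be realised differently in practice.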

Based on the embodiment shown in fig. 8, further, in another embodiment provided by the present application, the processing module 53 is further configured to perform conversion processing on the monitoring data of the service to obtain array-type service monitoring data, where the array-type service monitoring data includes element values corresponding to each monitoring data.

In a possible implementation, the processing module 53 is specifically configured to:

if the monitoring data of the service includes any one or a combination of the container request number, the central processing unit (CPU) memory ratio of the container, and the CPU memory ratio of the physical machine, use the container request number, the CPU memory ratio of the container, or the CPU memory ratio of the physical machine, respectively, as the corresponding element value in the array-type service monitoring data.

In a possible implementation, the processing module 53 is further configured to:

if the monitoring data of the service includes any one or a combination of request status code distribution, request time, response time, or queries per second (QPS), judge whether the proportion of abnormal requests in the request status code distribution exceeds a first preset threshold, whether the request time exceeds a second preset threshold, whether the response time exceeds a third preset threshold, or whether the QPS exceeds a fourth preset threshold, and set the element value of any monitoring data that exceeds its corresponding preset threshold to a first numerical value, as the element value corresponding to the request status code distribution, the request time, the response time, or the QPS in the array-type service monitoring data;

and if the monitoring data in the service monitoring data is empty, assigning the element value corresponding to the empty monitoring data in the service monitoring data as a second numerical value.
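The conversion rules above can be sketched in Python as follows. The field names, the threshold values, and the concrete choices of 1 for the first numerical value and -1 for the second numerical value are assumptions. The value 0 for an item that does not exceed its threshold is likewise an assumption, since that case is not spelled out in the text above.

```python
FIRST_VALUE = 1     # assumed first numerical value: threshold exceeded
NOT_EXCEEDED = 0    # assumed value when the threshold is not exceeded
SECOND_VALUE = -1   # assumed second numerical value: monitoring item is empty

# Items copied into the array unchanged (hypothetical field names).
RAW_FIELDS = [
    "container_request_count",
    "container_cpu_memory_ratio",
    "machine_cpu_memory_ratio",
]

# Items compared against preset thresholds (hypothetical names and limits).
THRESHOLDS = {
    "abnormal_request_ratio": 0.05,  # first preset threshold
    "request_time_ms": 500,          # second preset threshold
    "response_time_ms": 800,         # third preset threshold
    "qps": 1000,                     # fourth preset threshold
}

def to_array(monitoring):
    """Convert a dict of monitoring items into array-type monitoring data."""
    array = []
    for name in RAW_FIELDS:
        value = monitoring.get(name)
        array.append(SECOND_VALUE if value is None else value)
    for name, limit in THRESHOLDS.items():
        value = monitoring.get(name)
        if value is None:
            array.append(SECOND_VALUE)
        else:
            array.append(FIRST_VALUE if value > limit else NOT_EXCEEDED)
    return array

sample = {
    "container_request_count": 120,
    "container_cpu_memory_ratio": 0.9,
    "abnormal_request_ratio": 0.2,
    "qps": 300,
}
features = to_array(sample)
```

Because every item maps to a fixed position, the resulting array has the same length for every sample, which is what allows it to be fed to a decision tree.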

The prediction module 52 is specifically configured to: input the array-type service monitoring data into a decision tree algorithm and perform fault prediction diagnosis on the monitoring data of the service to obtain a fault prediction diagnosis result.

The apparatus of this embodiment may perform the method embodiment shown in fig. 5, and the technical principle and technical effect are similar to those of the above embodiment, which are not described herein again.

On the basis of the embodiment shown in fig. 8, fig. 9 is a schematic structural diagram of a server provided in another embodiment of the present application. As shown in fig. 9, the server provided in the present application further includes a training module 61 and a determining module 62.

The training module 61 is configured to acquire historical monitoring data of the service and historical fault diagnosis results of the historical monitoring data, and to train the preset algorithm through the historical monitoring data and the historical fault diagnosis results to generate the trained preset algorithm.

The prediction module 52 is specifically configured to: perform fault prediction diagnosis on the monitoring data of the service through the trained preset algorithm, and determine the fault prediction diagnosis result of the monitoring data of the service.

In a possible implementation manner, the server provided in the embodiment of the present application further includes:

a determining module 62, configured to determine a fault solution of the positioning problem; and a debugging module 63, configured to obtain an operation instruction corresponding to the fault solution, so as to debug the physical machine according to the operation instruction.

The apparatus of this embodiment may perform the method embodiment shown in fig. 6, and the technical principle and technical effect are similar to those of the above embodiment, which are not described herein again.

On the basis of the embodiment shown in fig. 9, in another embodiment provided by the present application, the training module 61 is configured to obtain historical monitoring data of the service, historical fault diagnosis results of the historical monitoring data, and fault solutions of the positioning problems in the historical fault diagnosis results, and to train the preset algorithm through the historical monitoring data, the historical fault diagnosis results and the fault solutions of the positioning problems in the historical fault diagnosis results to generate the trained preset algorithm.

The prediction module 52 is specifically configured to: perform fault prediction diagnosis on the monitoring data of the service through the trained preset algorithm, and determine the fault prediction diagnosis result of the monitoring data of the service and a fault solution of the positioning problem in the fault prediction diagnosis result.

In a possible implementation manner, in the server provided in the embodiment of the present application, the early warning information further carries a fault solution to the positioning problem, and the server further includes:

a debugging module 63, configured to obtain an operation instruction corresponding to the fault solution, so as to debug the physical machine according to the operation instruction.

The apparatus of this embodiment can execute the method embodiment shown in fig. 7, and the technical principle and technical effect thereof are similar to those of the above embodiment, and are not described herein again.

The device embodiments provided in the present application are merely schematic, and the module division in fig. 8 or fig. 9 is only one logic function division, and there may be other division ways in actual implementation. For example, multiple modules may be combined or may be integrated into another system. The coupling of the various modules to each other may be through interfaces that are typically electrical communication interfaces, but mechanical or other forms of interfaces are not excluded. Thus, modules described as separate components may or may not be physically separate, may be located in one place, or may be distributed in different locations on the same or different devices.

Fig. 10 is a schematic structural diagram of a server according to another embodiment of the present application, and as shown in fig. 10, the server according to the embodiment of the present application may include:

a processor 61, a memory 62, a transceiver 63 and a computer program; wherein the transceiver 63 implements data transmission with other devices, and the computer program is stored in the memory 62 and configured to be executed by the processor 61. The computer program comprises instructions for performing the above-mentioned service fault handling method; for its contents and effects, refer to the method embodiments.

In addition, embodiments of the present application further provide a computer-readable storage medium, in which computer-executable instructions are stored, and when at least one processor of the user equipment executes the computer-executable instructions, the user equipment performs the above-mentioned various possible methods.

Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for handling a failure of a service, comprising:

acquiring monitoring data of a service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container;

performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm;

and when the result of the fault prediction and diagnosis of the service monitoring data is determined to be that a pre-occurred fault exists, determining the positioning problem of the pre-occurred fault, carrying the positioning problem in early warning information, and sending the early warning information to the terminal equipment of operation and maintenance personnel.

2. The method of claim 1, further comprising, after the obtaining the monitoring data for the service:

performing conversion processing on the monitoring data of the service to acquire array type service monitoring data, wherein the array type service monitoring data comprises element values corresponding to each monitoring data;

the performing fault prediction diagnosis on the monitoring data of the service through a preset algorithm comprises the following steps:

and inputting the array type service monitoring data into a decision tree algorithm, and performing fault prediction diagnosis on the service monitoring data to obtain a fault prediction diagnosis result.

3. The method according to claim 2, wherein the converting the monitoring data of the service to obtain the array type of service monitoring data comprises:

if the monitoring data of the service includes any one or more combinations of a container request number, a Central Processing Unit (CPU) memory ratio of the container and a CPU memory ratio of the physical machine, the container request number, the CPU memory ratio of the container or the CPU memory ratio of the physical machine are respectively used as the element values corresponding to the request number, the CPU memory ratio of the container or the CPU memory ratio of the physical machine in the service monitoring data of the array type.

4. The method of claim 3, further comprising:

if the monitoring data of the service includes any one or a combination of request status code distribution, request time, response time, or queries per second (QPS), determining whether the proportion of abnormal requests in the request status code distribution exceeds a first preset threshold, whether the request time exceeds a second preset threshold, whether the response time exceeds a third preset threshold, or whether the QPS exceeds a fourth preset threshold, and setting the element value of the monitoring data exceeding the corresponding preset threshold to a first numerical value, as the element value corresponding to each of the request status code distribution, the request time, the response time, or the QPS in the array-type service monitoring data.

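5. The method of claim 3 or 4, further comprising:

if monitoring data in the monitoring data of the service is empty, assigning the element value corresponding to the empty monitoring data as a second numerical value.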

6. The method according to any one of claims 1-3, further comprising, prior to performing fault predictive diagnosis on the monitored data of the service by a preset algorithm:

acquiring historical monitoring data of the service and historical fault diagnosis results of the historical monitoring data;

training the preset algorithm according to the historical monitoring data and the historical fault diagnosis result to generate a trained preset algorithm;

and performing fault prediction diagnosis on the service monitoring data through the trained preset algorithm, and determining the fault prediction diagnosis result of the service monitoring data.

7. The method according to any one of claims 1-3, further comprising, prior to performing fault predictive diagnosis on the monitored data of the service by a preset algorithm:

acquiring historical monitoring data of the service, historical fault diagnosis results of the historical monitoring data and fault solutions of positioning problems in the historical fault diagnosis results;

training the preset algorithm through the historical monitoring data, the historical fault diagnosis result and a fault solution for positioning the problem in the historical fault diagnosis result to generate a trained preset algorithm;

and performing fault prediction diagnosis on the service monitoring data through the trained preset algorithm, and determining the fault prediction diagnosis result of the service monitoring data and a fault solution for positioning the problem in the fault prediction diagnosis result.

8. The method of claim 7, wherein the early warning information further carries a fault solution to the positioning problem, the method further comprising:

and acquiring an operation instruction corresponding to the fault solution so as to debug the physical machine according to the operation instruction.

9. A server, comprising:

an acquisition module, configured to acquire monitoring data of a service, wherein the monitoring data of the service comprises monitoring data of a container corresponding to the service and monitoring data of a physical machine bearing the container;

the prediction module is used for carrying out fault prediction diagnosis on the monitoring data of the service through a preset algorithm;

and the processing module is used for determining the positioning problem of the pre-occurred fault when the result of the fault prediction and diagnosis of the monitoring data of the service is determined to be that the pre-occurred fault exists, and carrying the positioning problem in early warning information to be sent to the terminal equipment of the operation and maintenance personnel.

10. The server according to claim 9, wherein the processing module is further configured to perform conversion processing on the monitoring data of the service to obtain array-type service monitoring data, where the array-type service monitoring data includes an element value corresponding to each monitoring data;

the prediction module is specifically configured to: