CN114116428A - Fault diagnosis method and equipment for dispatching system - Google Patents

Fault diagnosis method and equipment for dispatching system

Info

Publication number
CN114116428A
Authority
CN
China
Prior art keywords
diagnosis
index
index data
rule
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111459357.2A
Other languages
Chinese (zh)
Inventor
苏超然
赖海滨
翁世清
陈守当
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3495 Performance evaluation by tracing or monitoring for systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging
    • G06F11/366 Software debugging using diagnostics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the application provide a fault diagnosis method and equipment for a scheduling system. The method comprises: receiving index data sent by a scheduling system, aggregating the received index data, and uploading the aggregated index data to a first database; when a diagnosis request sent by a user is received, acquiring a target diagnosis rule from a second database according to the diagnosis type corresponding to the diagnosis request; acquiring the target index data required for diagnosis from the first database according to the diagnosis type, and processing the target index data according to the target diagnosis rule to obtain a diagnosis report; and sending the file address of the diagnosis report to the scheduling system. The method and equipment present the fault points of job flows, jobs, and the scheduling service to the user more clearly, so that the user can troubleshoot independently or promptly contact operation and maintenance personnel, which lowers the usage threshold for platform users and improves fault diagnosis efficiency.

Description

Fault diagnosis method and equipment for dispatching system
Technical Field
The embodiment of the application relates to the technical field of public clouds, in particular to a fault diagnosis method and equipment of a scheduling system.
Background
In terms of service objects and scope, cloud computing platforms can be classified into public clouds, private clouds, and hybrid clouds. A public cloud platform generally uses a job scheduling system to respond to, allocate, and forward the tasks it receives.
Given the massive number of tasks on a public cloud platform, some faults are inevitable during job scheduling, so the faults arising in the scheduling process need to be analyzed and diagnosed at any time. When analyzing and diagnosing such faults, users of existing cloud platforms generally face the following problems: 1) the existing scheduling process has many steps, and the state transitions of job flows and jobs are complex; every transition involves several key steps, and an error in any link prevents the job from being scheduled and run normally, yet fault location relies only on the error points reported in background logs; 2) the existing operating indexes are all statistical indexes for the scheduling system as a whole, with no index records for individual jobs, so it is difficult to aggregate a record of a single job's scheduling process and find the fault points in it; 3) the existing scheduling framework lacks a means of analyzing scheduling service faults, so its usage threshold is high.
Therefore, how to improve the ability to diagnose faults in the job flows, jobs, or scheduling service of a scheduling system is a technical problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a fault diagnosis method and equipment for a scheduling system, which can effectively improve the fault diagnosis capability of a job flow, a job or a scheduling service in the scheduling system.
In a first aspect, an embodiment of the present application provides a method for diagnosing a fault of a scheduling system, where the method includes:
receiving index data sent by a scheduling system, performing aggregation processing on the received index data, and uploading the aggregated index data to a first database;
when a diagnosis request sent by a user is received, acquiring a target diagnosis rule from a second database according to a diagnosis type corresponding to the diagnosis request, wherein the diagnosis type corresponding to the diagnosis request comprises any one of job flow diagnosis, job diagnosis and scheduling service diagnosis;
acquiring target index data required by diagnosis from the first database according to the diagnosis type, and processing the target index data according to the target diagnosis rule to obtain a diagnosis report;
and sending the file address of the diagnosis report to the scheduling system.
In a second aspect, an embodiment of the present application provides a fault diagnosis apparatus for a scheduling system, including:
the index aggregation module is used for receiving the index data sent by the scheduling system, performing aggregation processing on the received index data, and uploading the aggregated index data to a first database;
the diagnosis analysis module is used for acquiring a target diagnosis rule from a second database according to a diagnosis type corresponding to a diagnosis request when the diagnosis request sent by a user is received, wherein the diagnosis type corresponding to the diagnosis request comprises any one of job flow diagnosis, job diagnosis and scheduling service diagnosis;
the diagnosis analysis module is further configured to acquire target index data required for diagnosis from the first database according to the diagnosis type, and process the target index data according to the target diagnosis rule to obtain a diagnosis report;
the diagnosis analysis module is also used for sending the file address of the diagnosis report to the scheduling system.
In a third aspect, the present application provides an electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the fault diagnosis method of a scheduling system provided in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the fault diagnosis method of a scheduling system provided in the first aspect.
The application provides a fault diagnosis method for a scheduling system. By aggregating index data, it forms records of the job flow and job scheduling processes and a diagnosis of the scheduling service; by further analyzing and diagnosing the aggregated index data, it presents the fault points of job flows, jobs, and the scheduling service to the user more clearly, so that the user can troubleshoot independently or promptly contact operation and maintenance personnel, which lowers the usage threshold for platform users and improves fault diagnosis efficiency.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a fault diagnosis method for a scheduling system according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a fault diagnosis system of a scheduling system according to an embodiment of the present application;
Fig. 3 is a job flow state flow diagram provided in an embodiment of the present application;
Fig. 4 is a job state transition diagram provided in an embodiment of the present application;
Fig. 5 is a schematic diagram of a job flow or job diagnostic analysis process according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a service index diagnostic analysis process according to an embodiment of the present application;
Fig. 7 is a schematic diagram of the program modules of a fault diagnosis apparatus of a scheduling system provided in an embodiment of the present application;
Fig. 8 is a schematic hardware structure diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application. In addition, while the disclosure herein is presented in terms of one or more exemplary examples, it should be appreciated that each aspect of the disclosure may also constitute a complete embodiment on its own.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or device.
The term "module," as used herein, refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
The application relates to the job scheduling system of a public cloud platform. For the scheduling of massive jobs on a public cloud platform, it provides a method for index aggregation and for fault analysis and diagnosis of the scheduling business and the scheduling service, thereby realizing intelligent diagnosis of the job scheduling process.
At present, a cloud platform user generally encounters the following problems when performing fault analysis and diagnosis on a scheduling process:
(1) The existing scheduling framework has many scheduling steps, and the state transitions of job flows and jobs are complex. Every transition involves several key steps, and an error in any link prevents the job from being scheduled and run normally, yet fault location essentially depends on checking the error points reported in background logs.
(2) The operating indexes of the existing scheduling framework are all statistical indexes for the scheduling system as a whole. There are no index records for individual jobs, so it is difficult to aggregate a record of a single job's scheduling process and to find the fault points in it.
(3) The existing scheduling framework lacks a means of analyzing and diagnosing scheduling service faults, so the framework has a high usage threshold for ordinary users.
(4) Heterogeneous and diverse computing platforms now produce many types of jobs. The existing framework generally submits a platform-specific job to the corresponding computing platform, but it lacks monitoring of the job state and does not expose its job state monitoring indexes externally, so a user must correlate the logs of the specific computing platform when troubleshooting, which makes operation and maintenance troubleshooting difficult.
For example, in some embodiments, the Apache Airflow scheduling system typically implements index aggregation and diagnostic analysis using the StatsD framework. In this technical solution, the Apache Airflow scheduling system is divided into three main components: the Airflow WebServer (page server), the Airflow Scheduler (back-end scheduling engine), and the Airflow Worker (client that executes jobs). The code of all three components embeds logic that creates a StatsD client and sends the corresponding indexes to the StatsD server. The StatsD server is a daemon that listens for statistical indexes in UDP and TCP requests and sends the aggregated results to a back-end database. After receiving an index, the StatsD server parses and aggregates the data and periodically pushes it to a back-end database such as InfluxDB or Elasticsearch. An external monitoring and alerting system can query the back-end database for the indexes to implement monitoring and alerting. Diagnostic analysis in this solution is relatively simple: the overall scheduling situation can be assessed by analyzing the index data, for example with tools such as Kibana or Grafana; in addition, the user can reach a diagnosis of service or job abnormalities by analyzing the logs.
In the above technical solution, the indexes supported by the Airflow scheduling system fall into three main types: count types (Counters), scalar types (Gauges), and timer types (Timers). Almost all of these indexes reflect the scheduling situation at the global level; only timer indexes can record data for a specific job. The count type means that the index is aggregated by summing counts; the Airflow scheduling system supports 26 count indexes, including the total number of job instances that succeeded, failed, or are queued and the total number of running job flows. The scalar type means that the index needs no aggregation and keeps its previous value until the next update; the Airflow scheduling system supports 22 scalar indexes, including the total number of jobs running or queued on a Worker service node and the total number of tasks that timed out while parsing DAG files. The timer type means that the index is aggregated over a time window and can represent the average, total, percentile, and so on of the data within the specified window; Airflow supports 9 timer indexes, including the execution time of a specific task, the scheduling delay of a specific task, and the time a specific task takes to reach the success state.
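The three aggregation modes described above can be sketched as follows. This is a minimal illustration, not StatsD's or Airflow's implementation; the class name `MetricAggregator`, the metric names, and the choice of summary statistics (count, mean, max) are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


class MetricAggregator:
    """Toy aggregator for counter, gauge, and timer metrics (hypothetical)."""

    def __init__(self):
        self.counters = defaultdict(float)  # summed within a window
        self.gauges = {}                    # last value wins, kept across windows
        self.timers = defaultdict(list)     # raw samples within a window

    def record(self, kind, name, value):
        if kind == "counter":
            self.counters[name] += value
        elif kind == "gauge":
            self.gauges[name] = value
        elif kind == "timer":
            self.timers[name].append(value)
        else:
            raise ValueError(f"unknown metric kind: {kind}")

    def flush(self):
        """Return one snapshot and reset the window-scoped state."""
        snapshot = {
            "counters": dict(self.counters),
            "gauges": dict(self.gauges),
            "timers": {
                name: {"count": len(v), "mean": mean(v), "max": max(v)}
                for name, v in self.timers.items()
                if v
            },
        }
        # Counters and timers restart each window; gauges keep their value.
        self.counters.clear()
        self.timers.clear()
        return snapshot
```

Calling `flush()` at the end of each window yields the values that would be pushed to the back-end database; a gauge that was not updated simply reappears with its old value in the next snapshot.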
In addition, if the StatsD service needs to be scaled horizontally, a StatsD Proxy service must additionally be deployed; this service ensures that the same index is always sent to a fixed node.
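The idea of pinning each index to a fixed node can be illustrated with simple hash-based routing. The sketch below uses plain modulo hashing as a stand-in and makes no claim about the algorithm the StatsD Proxy actually uses; the function name `pick_node` and the node addresses are hypothetical.

```python
import hashlib


def pick_node(metric_name, nodes):
    """Map a metric name to one node deterministically (illustrative only)."""
    digest = hashlib.md5(metric_name.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


nodes = ["10.0.0.1:8125", "10.0.0.2:8125", "10.0.0.3:8125"]
target = pick_node("scheduler.job_success", nodes)
```

Because the node depends only on the metric name, all traffic for one very busy index lands on a single node, which is why this kind of routing cannot balance the load of one oversized index.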
This technical solution has the following defects: the StatsD server must parse the data and aggregate the indexes, so when the volume of reported data is too large the scheduling framework has to lower the sampling rate at the sending side, otherwise data processing on the server becomes a performance bottleneck; the sampling rate is chosen by the user from experience, which is a significant limitation; horizontal scaling of the StatsD server cannot handle the case where the traffic of one specific index is so large that it needs load balancing; the framework provides no scheduling indexes for a single job flow or job, so abnormality diagnosis and analysis of job flows and jobs is hard to build and users must inspect the logs themselves; and the framework itself provides no diagnostic analysis function, so users must analyze the index data, and even the service logs and job logs, themselves.
In other embodiments, the scheduling system may implement task index collection and diagnostic analysis within the Apache Azkaban framework. In this technical solution, the Apache Azkaban scheduling system consists of three key components: the WebServer, the ExecutionServer, and a MySQL relational database. The Azkaban ExecutionServer is mainly responsible for submitting and executing specific job flows; scheduling capacity can be expanded by starting multiple ExecutionServers, which coordinate task execution through the MySQL database. The ExecutionServer embeds index aggregation code in the logic that submits and executes job flows, writes the indexes to the MySQL database, and reports them to the WebServer periodically. The Azkaban scheduling system can expose access links for the corresponding indexes to external applications through the WebServer, and an external application can obtain the index data through an HTTP request. However, the current Azkaban framework provides only 5 indexes, covering the total number of job flows that succeeded, failed, or are queued and the total number of jobs that are running or failed; by modifying the HTTP request parameters, the data can be aggregated by time window or by number of instances.
In this technical solution, the core components of the framework provide service logs and job logs to the user, who can analyze the log content independently to diagnose abnormalities.
This technical solution has the following defects: the framework provides no scheduling indexes for a single job flow or job, so abnormality diagnosis and analysis of job flows and jobs is hard to build and users must inspect the logs themselves; the framework provides no diagnostic analysis function, so users must analyze the index data, and even the service logs and job logs, themselves; and little index data is provided, while high-frequency requests for index data from external systems can easily degrade the processing performance of the core scheduling components.
In view of the above technical problems, the present application provides a fault diagnosis method for a scheduling system. By aggregating index data, it forms records of the job flow and job scheduling processes and a diagnosis of the scheduling service; by further analyzing and diagnosing the aggregated index data, it presents the fault points of job flows, jobs, and the scheduling service to the user more clearly, so that the user can troubleshoot independently or promptly contact operation and maintenance personnel, which lowers the usage threshold for platform users and improves fault diagnosis efficiency.
In some embodiments, existing scheduling systems generally provide only aggregated indexes that express the global scheduling situation, with few or no indexes for a single job flow or job. Few systems currently support scheduling and operating indexes for jobs on heterogeneous computing platforms either. As a result, fault location in existing scheduling systems essentially comes down to scanning the error points reported in service logs or job logs. On top of the common aggregated index data, the present application abstracts and extends business indexes for each scheduling stage of a single job flow or job, cleans, prunes, and processes the index data by defining and constructing a scheduling business rule tree, and finally forms a diagnosis report for the corresponding job flow or job.
In some embodiments, with existing scheduling systems the user must actively analyze the service logs, or visually analyze the aggregated index data with tools such as Kibana or Grafana, to arrive at a diagnosis of the scheduling service, so the usage threshold is high. The present application defines and constructs service index rules, which act as a display template for the aggregated index data; the existing index data is cleaned and processed according to this template to finally form an overall diagnosis report for the scheduling service.
In some embodiments, to address the limited horizontal scalability of index aggregation services in the prior art, the present application combines index aggregation and diagnostic analysis and abstracts them into a single index diagnosis module, whose microservice architecture supports traffic load balancing and elastic scaling. The index diagnosis module is decoupled from the scheduling system through HTTP, so it does not directly affect the performance of the scheduling system, and the scheduling system can choose whether to enable the function through a configuration file.
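A minimal sketch of this decoupling, seen from the scheduling system's side, is given below. The class name `MetricReporter`, the `enabled` flag (standing in for the configuration-file switch), and the endpoint URL are all assumptions; the point illustrated is only that reporting can be switched off and that a reporting failure never propagates into scheduling logic.

```python
import json
import urllib.request


class MetricReporter:
    """Hypothetical client the scheduler uses to report indexes over HTTP."""

    def __init__(self, enabled, endpoint, send=None):
        self.enabled = enabled          # would come from a configuration file
        self.endpoint = endpoint        # hypothetical URL of the diagnosis module
        self._send = send or self._http_post

    def _http_post(self, body):
        request = urllib.request.Request(
            self.endpoint,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=2):
            pass

    def report(self, metric):
        """Return True if the metric was sent, False if skipped or failed."""
        if not self.enabled:
            return False
        try:
            self._send(json.dumps(metric).encode("utf-8"))
            return True
        except OSError:
            # Network errors (including urllib's URLError) are swallowed so
            # that the scheduler keeps running when the module is unreachable.
            return False
```

The optional `send` argument exists only so the sketch can be exercised without a network; a real deployment would rely on the HTTP path.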
The following examples are given for illustrative purposes.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a fault diagnosis method of a scheduling system according to an embodiment of the present application. As shown in Fig. 1, the fault diagnosis method of the scheduling system includes:
S101, receiving the index data sent by the scheduling system, performing aggregation processing on the received index data, and uploading the aggregated index data to a first database.
In the embodiment of the application, an index diagnosis module can be constructed in advance, and the index diagnosis module is divided into two sub-modules, namely an index aggregation module and a diagnosis analysis module.
In a feasible implementation, each core component of the scheduling system may report index data to the index aggregation sub-module over HTTP. After receiving the index data sent by the scheduling system, the index aggregation sub-module parses the index data in the message, determines the specific index type, and performs aggregation calculations as needed.
The index aggregation module can push the index data to a first database at the back end at regular intervals through a timer thread.
Optionally, the first database may be a time series database (InfluxDB) or a non-relational database (Elasticsearch).
In some embodiments, the scheduling business index data, such as the scheduling business indexes of job flows and jobs, is stored in time order, so that the index data can be filtered by the start time and end time of a job flow or job; the scheduling service index data, such as the total numbers of running, succeeded, failed, and queued jobs, the service heartbeat time, and the time required to execute a job flow, is stored in a specific data area according to the meaning of each index, so that external services can read it conveniently.
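The timed push in S101 can be sketched with a background thread that drains a buffer at a fixed interval. This is an illustration under stated assumptions: `PeriodicPusher` is a hypothetical name, and `sink` is any callable that stands in for the write to the first database (no real InfluxDB or Elasticsearch client is used).

```python
import threading


class PeriodicPusher:
    """Buffers index points and pushes them in batches on a timer thread."""

    def __init__(self, interval_seconds, sink):
        self._interval = interval_seconds
        self._sink = sink                  # stand-in for the database write
        self._buffer = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def add(self, point):
        with self._lock:
            self._buffer.append(point)

    def flush_once(self):
        # Swap the buffer under the lock, write outside of it.
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._sink(batch)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self.flush_once()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        self.flush_once()  # push whatever is still buffered
```

Swapping the buffer before writing keeps `add` calls from blocking on a slow database write, which matters if reporting sits on the scheduler's request path.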
S102, when a diagnosis request sent by a user is received, a target diagnosis rule is obtained from a second database according to a diagnosis type corresponding to the diagnosis request.
In the embodiment of the application, when a user initiates a diagnosis request to the scheduling system, the scheduling system performs user authentication, if the user authentication is passed, the diagnosis request is forwarded to the diagnosis analysis module, otherwise, the diagnosis request is not forwarded to the diagnosis analysis module.
After receiving the diagnosis request, the diagnosis analysis module parses the diagnosis type from the message, then acquires the diagnosis rules from the second database and assembles a diagnosis rule tree or a diagnosis rule list.
Optionally, the second database is a relational database.
The types of diagnosis described above generally include: job flow diagnostics, job diagnostics, and dispatch service diagnostics.
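Step S102 can be sketched as follows, with an in-memory SQLite database standing in for the relational second database. The table name `diagnosis_rule`, its columns, and the type labels `"job"` and `"service"` are assumptions chosen for the example, not taken from the application.

```python
import sqlite3


def load_rules(conn, diag_type):
    """Fetch the rules stored for one diagnosis type."""
    rows = conn.execute(
        "SELECT id, parent_id, name FROM diagnosis_rule "
        "WHERE diag_type = ? ORDER BY id",
        (diag_type,),
    ).fetchall()
    return [
        {"id": r[0], "parent_id": r[1], "name": r[2], "children": []}
        for r in rows
    ]


def assemble(rules, diag_type):
    """Return a flat list for service diagnosis, otherwise a list of tree roots."""
    if diag_type == "service":
        return rules
    by_id = {rule["id"]: rule for rule in rules}
    roots = []
    for rule in rules:
        parent = by_id.get(rule["parent_id"])
        # Rules without a known parent become roots of the tree.
        (parent["children"] if parent else roots).append(rule)
    return roots


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diagnosis_rule "
    "(id INTEGER PRIMARY KEY, parent_id INTEGER, diag_type TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO diagnosis_rule VALUES (?, ?, ?, ?)",
    [
        (1, None, "job", "submitted"),
        (2, 1, "job", "queued"),
        (3, 1, "job", "dispatched"),
        (4, None, "service", "heartbeat"),
        (5, None, "service", "queue_depth"),
    ],
)
job_tree = assemble(load_rules(conn, "job"), "job")
service_list = assemble(load_rules(conn, "service"), "service")
```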
S103, acquiring target index data required by diagnosis from the first database according to the diagnosis type, and processing the target index data according to the target diagnosis rule to obtain a diagnosis report.
In the embodiment of the application, target index data required by diagnosis is acquired from the first database according to the diagnosis type.
In some embodiments, when the diagnosis type corresponding to the diagnosis request is scheduling service diagnosis, the target index data most recently uploaded to the first database is acquired; when the diagnosis type is job flow diagnosis or job diagnosis, the target index data uploaded to the first database within a specified time range is acquired according to the job flow start time in the diagnosis request.
The specified time range runs from the job flow start time in the diagnosis request to the time at which the diagnosis analysis module receives the diagnosis request.
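The selection rule above can be sketched over an in-memory list of index points. The function name `select_target_data`, the `ts` timestamp key, and the type label `"service"` are hypothetical; a real implementation would express the same filter as a query against the first database.

```python
def select_target_data(points, diag_type, received_at, flow_start=None):
    """Pick the index points a diagnosis needs (illustrative filter)."""
    if diag_type == "service":
        if not points:
            return []
        # Service diagnosis only looks at the most recent upload.
        latest = max(p["ts"] for p in points)
        return [p for p in points if p["ts"] == latest]
    if flow_start is None:
        raise ValueError("job flow and job diagnosis need a start time")
    # Job flow / job diagnosis: from flow start to request receipt, inclusive.
    return [p for p in points if flow_start <= p["ts"] <= received_at]


points = [
    {"ts": 100, "metricName": "flow_created"},
    {"ts": 140, "metricName": "job_queued"},
    {"ts": 180, "metricName": "job_failed"},
    {"ts": 220, "metricName": "heartbeat"},
]
window = select_target_data(points, "job", received_at=200, flow_start=120)
latest = select_target_data(points, "service", received_at=230)
```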
In some embodiments, when the diagnosis type is job flow diagnosis or job diagnosis, a diagnosis rule tree is assembled from the acquired diagnosis rules, and a diagnosis report is generated after the acquired index data is cleaned, pruned, and processed according to the rule tree; when the diagnosis type is scheduling service diagnosis, a diagnosis rule list is assembled from the acquired diagnosis rules, and a diagnosis report is generated after the acquired index data is cleaned, pruned, and processed according to the rule list.
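One way to picture processing index data against a rule tree is a depth-first walk that drops branches with no data and stops at the first failed step. The application does not spell out the rule semantics, so everything here is an assumption: the rule shape (`step`, `children`), the use of `errMsg` as the failure marker, and the report layout.

```python
def diagnose(rule, records_by_step, depth=0):
    """Return report lines for one rule node and its descendants."""
    records = records_by_step.get(rule["step"], [])
    if not records:
        return []  # prune: no index data means the step was never reached
    last = records[-1]
    indent = "  " * depth
    if last.get("errMsg"):
        # Stop descending at the first failing step; it is the fault point.
        return [f"{indent}{rule['step']}: FAILED ({last['errMsg']})"]
    lines = [f"{indent}{rule['step']}: OK"]
    for child in rule.get("children", []):
        lines.extend(diagnose(child, records_by_step, depth + 1))
    return lines


rule_tree = {
    "step": "submit",
    "children": [
        {"step": "queue", "children": [{"step": "run", "children": []}]},
        {"step": "notify", "children": []},
    ],
}
records = {
    "submit": [{"errMsg": None}],
    "queue": [{"errMsg": "no free worker"}],
}
report_lines = diagnose(rule_tree, records)
```

For scheduling service diagnosis the same idea degenerates to iterating over a flat rule list, since there is no parent-child structure to walk.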
And S104, sending the file address of the diagnosis report to the scheduling system.
In the embodiment of the application, after the diagnosis report is generated, the address of the diagnosis report file is returned to the scheduling system. After asynchronously obtaining the address of the diagnosis report file, the scheduling system reads the corresponding content and displays it on a front-end page for the user to view.
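The hand-off in S104 can be sketched as writing the report to a file and returning only its address. The JSON file format, the function names, and the use of a local directory are assumptions for illustration; the application only states that a file address is returned and later read by the scheduling system.

```python
import json
import os
import tempfile


def write_report(report, directory):
    """Persist a diagnosis report and return the file address."""
    fd, path = tempfile.mkstemp(prefix="diagnosis_", suffix=".json", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(report, handle, ensure_ascii=False)
    return path


def read_report(path):
    """What the scheduling system would do once it has the address."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Returning an address rather than the report body keeps the response to the scheduling system small and lets the front end fetch the content when the user opens the page.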
In addition, as the scheduling system evolves, the user can access a web interface of the diagnosis analysis sub-module and upload JSON-format data to modify, add, or remove existing diagnosis rules.
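A rule update of this kind might be applied as sketched below. The payload shape with `upsert` and `delete` keys is purely hypothetical, as is the function name; the application only says that JSON data is uploaded to modify, add, or remove rules.

```python
import json


def apply_rule_update(rules, payload):
    """Return a new rule mapping with the uploaded JSON changes applied."""
    update = json.loads(payload)
    result = dict(rules)
    for rule in update.get("upsert", []):
        result[rule["id"]] = rule      # add a new rule or replace an existing one
    for rule_id in update.get("delete", []):
        result.pop(rule_id, None)      # deleting an unknown id is a no-op
    return result


existing = {
    "r1": {"id": "r1", "name": "queued"},
    "r2": {"id": "r2", "name": "dispatched"},
}
payload = json.dumps({
    "upsert": [
        {"id": "r1", "name": "queued_too_long"},
        {"id": "r3", "name": "heartbeat_lost"},
    ],
    "delete": ["r2"],
})
updated = apply_rule_update(existing, payload)
```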
The fault diagnosis method of the scheduling system forms records of the job flow and job scheduling processes and a diagnosis of the scheduling service by aggregating index data, and further analyzes and diagnoses the aggregated index data, so that the fault points of job flows, jobs, and the scheduling service are presented to the user more clearly, the user can troubleshoot independently or promptly contact operation and maintenance personnel, the usage threshold for platform users is lowered, and fault diagnosis efficiency is improved.
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of a fault diagnosis system of a scheduling system according to an embodiment of the present application.
In some embodiments, the index aggregation module is responsible for receiving the index data sent by the scheduling system, and sending the index data to the first database at regular time after performing aggregation operation as required.
The index data is derived from the logic code of the scheduling system, and the logic for transmitting the index is embedded in the code logic.
In some embodiments, the message carrying the index data contains the following information: metricName (index name), metricTime (index sending time), metricType (index type), operator (index operand), operator element (index operation object), operator result (index operation result), instId (instance id), instStatus (instance state), zxid (globally unique id), metricContext (index content), errCauseReason (cause of the error in the operation step), errMsg (specific error information), module (scheduling module name), hostIP (module network IP), and the like.
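The message layout can be modeled as a small data class. The field list follows the text, but several identifiers are garbled in the source, so the spellings `metricTime`, `metricType`, and `metricContext` used here are assumptions, and the three operator-related fields are left out of the sketch because their exact names cannot be recovered. Which fields are optional is also an assumption.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MetricMessage:
    """Hypothetical model of one reported index message."""

    metricName: str
    metricTime: str
    metricType: str
    zxid: str                              # globally unique id
    module: str                            # scheduling module name
    hostIP: str                            # module network IP
    metricContext: Optional[str] = None    # index content
    instId: Optional[str] = None           # instance id (business indexes)
    instStatus: Optional[str] = None       # instance state (business indexes)
    errCauseReason: Optional[str] = None
    errMsg: Optional[str] = None

    @classmethod
    def from_json(cls, raw):
        data = json.loads(raw)
        known = {f.name for f in fields(cls)}
        # Keys this sketch does not model are ignored rather than rejected.
        return cls(**{k: v for k, v in data.items() if k in known})


raw = json.dumps({
    "metricName": "job_state_change",
    "metricTime": "2021-12-01T10:00:00",
    "metricType": "JOB_BUSINESS",
    "zxid": "0x1f",
    "module": "scheduler",
    "hostIP": "10.0.0.5",
    "instId": "job-42",
    "instStatus": "QUEUED",
    "extraField": "ignored",
})
message = MetricMessage.from_json(raw)
```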
The indexes are divided into scheduling business indexes and scheduling service indexes. The difference is that business indexes lean toward recording the result of a process, so the instance state and instance id of the corresponding operation must be added, whereas service indexes lean toward counting and recording the latest information, which is handled by the index aggregation module, so operation errors and the corresponding instance information need not be filled in.
In some embodiments, the scheduling business indexes are subdivided into job flow scheduling business indexes and job scheduling business indexes, while the scheduling service indexes are generally classified by aggregation mode into count, scalar, and timer types. Therefore, although the service indexes come in only three types, metricType needs to map 5 types in total, corresponding to 5 values.
When the index aggregation sub-module receives a message, it first checks the index type, then checks whether the message is correct and complete for that type, and then applies different processing logic to scheduling business indexes and scheduling service indexes.
It should be added that, firstly, because each core component of the scheduling system may be deployed in a distributed manner, a globally unique id needs to be sent for differentiation, and the module name and the module network IP are also used for differentiation; secondly, errors may be reported in the process of scheduling operation, so that the error reasons and specific information of the operation steps need to be supplemented additionally; finally, regarding the content of the index, it is the process record information for the business index, and the value of the corresponding index for the service index.
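The type check and completeness check described above might look like the following minimal sketch. The five metricType values and the exact set of mandatory fields are assumptions made for illustration; the description only states that business indexes carry the instance id and instance state while service indexes do not.

```python
# Hypothetical metricType values: two business types and three service types.
BUSINESS_TYPES = {"JOB_FLOW", "JOB"}
SERVICE_TYPES = {"COUNT", "SCALAR", "TIMING"}

# Fields assumed to be mandatory for every message.
COMMON_FIELDS = ("metricName", "metricType", "zxid", "module", "hostIP")
# Extra fields that only business indexes must carry.
BUSINESS_FIELDS = ("instId", "instStatus")


def classify_message(message: dict) -> str:
    """Return "business" or "service"; raise ValueError for a bad message."""
    metric_type = message.get("metricType")
    if metric_type in BUSINESS_TYPES:
        kind, required = "business", COMMON_FIELDS + BUSINESS_FIELDS
    elif metric_type in SERVICE_TYPES:
        kind, required = "service", COMMON_FIELDS
    else:
        raise ValueError(f"unknown metricType: {metric_type!r}")
    missing = [name for name in required if name not in message]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return kind
```

The returned kind can then be used to route the message to the business index processing flow or the service index processing flow.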
The scheduling business index is instrumentation (buried-point) data that records the whole process of a job flow instance and a job instance in the scheduling system, from creation to destruction. In the scheduling system, job flows and jobs have many states, and there are many possible state transitions. In the job flow state transition diagram there are 7 job flow states and 8 state transitions in total. In the job state transition diagram there are 11 job states and 14 state transitions. The job flow and job state transition diagrams are largely common across different scheduling systems, but the intermediate steps involved in each state transition differ considerably.
For better understanding of the embodiment of the present application, refer to fig. 3 and fig. 4, where fig. 3 is a workflow state flow diagram provided in the embodiment of the present application; fig. 4 is an operation state transition diagram provided in the embodiment of the present application.
In the embodiment of the present application, an operation of the abstracted business index includes an operand, an operation object, and an operation result. To cover as far as possible the technologies used by the intermediate operations of different systems, such as files, message queues, relational databases, Redis, Kubernetes, Yarn, memory queues, ZooKeeper and the like, 15 kinds of operation objects, 19 kinds of operands and the corresponding operation results are abstracted.
Once the key business intermediate steps in the scheduling system have been sorted out and the scheduling system reports the indexes, an algorithm only needs to match the indexes to the sequence of intermediate steps to obtain a report of the whole scheduling flow of a job flow or job instance, and the report can then be processed or analyzed to find abnormal points. The key intermediate steps that have been determined can be expressed as diagnosis rules through the abstract operations, and the various possible results of the state transition operations in the whole scheduling process can be expressed by constructing a diagnosis rule tree.
For better understanding of the embodiments of the present application, refer to Table 1, which is a schematic table of the operands, operation objects and operation results corresponding to the indexes.
Table 1: Schematic table of operands, operation objects and operation results corresponding to indexes
[Table 1 appears as an image in the original publication: Figure BDA0003387602840000111]
Different modules in the scheduling system execute different key intermediate steps of the job flow and job state transition diagrams, and after a step is executed, logic for sending the business index must be added for the execution result or exception. For example, after the execution module responsible for running jobs receives job information, parses the job parameters and successfully submits them to Kubernetes to create a job running instance, it needs to send an http request to the index aggregation sub-module. The request states that the operand is submission, the operation object is Kubernetes and the operation result is success, carries the instance id with the instance state set to running, and the corresponding index content records that the job was successfully submitted to Kubernetes to create the job instance. Subsequently, the execution module also needs to keep requesting the API Server service of Kubernetes to obtain the running state of the job instance. Each time monitoring data is obtained, another http request is sent to the index aggregation sub-module, and the redundant index content is cleaned during diagnostic analysis.
In some embodiments, after the index aggregation module receives the http request, it checks the index type. If the data is not business index data, it switches to the processing flow for the other index types. If it is a business index, the module verifies the integrity of the information, parses the index data, calls the business index processing service to convert the data into an Append command, and adds the command to the processing queue. If the command is queued successfully, the request returns success; otherwise the request fails. The business index processing service contains a Sender instance, and the configuration file selects whether the Sender instance sends data to Elasticsearch or InfluxDB. The business index processing service has a thread that continuously takes processing commands out of the processing queue. When the command is Append, the index data is added to a memory queue of the Sender instance, namely the job flow queue or the job queue according to the business index type. The business index processing service also starts a timing thread whose interval is the refresh interval, set in the configuration file, at which business indexes are sent to the back-end database; at each interval a Flush command is put into the processing queue. Finally, when the thread takes a Flush command out of the processing queue, it checks whether the data area exists. If the data area does not exist, it is created (an index, in the case of an ES database). If it exists, the data in the memory queue of the Sender instance is written asynchronously and in batches into the corresponding data area of the back-end database by calling the corresponding API of the database.
For naming of the data area, the user can set a prefix in the configuration file; the business index uses the date as the suffix and creates the data area under that name. When the business index processing service is to be closed, a Flush command and then a Close command must be added to the processing queue. When the thread of the business index processing service processes the Close command, the polling of the queue processing thread is stopped, the timing thread is waited on until it exits, and the connection to the back-end database is closed.
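The command-queue flow for business indexes can be sketched as follows. This is a single-threaded illustration under stated assumptions: the class name, the command spellings, the `date` key and the in-memory `backend` dictionary (a stand-in for the back-end database) are all hypothetical, and the real service uses a polling thread and a timing thread rather than an explicit `drain` call.

```python
import queue
from collections import defaultdict


class BusinessMetricService:
    """Toy model of the Append / Flush / Close command processing."""

    def __init__(self, prefix="sched_business"):
        self.commands = queue.Queue()      # processing queue
        self.buffer = []                   # memory queue of the Sender instance
        self.backend = defaultdict(list)   # stand-in database: data area -> records
        self.prefix = prefix               # user-configured data area prefix
        self.closed = False

    def submit(self, command, payload=None):
        self.commands.put((command, payload))

    def drain(self):
        """Process queued commands in order until the queue is empty or closed."""
        while not self.closed and not self.commands.empty():
            command, payload = self.commands.get()
            if command == "APPEND":
                self.buffer.append(payload)
            elif command == "FLUSH":
                # Batch-write buffered records into date-suffixed data areas.
                for record in self.buffer:
                    area = f"{self.prefix}-{record['date']}"
                    self.backend[area].append(record)
                self.buffer.clear()
            elif command == "CLOSE":
                self.closed = True


service = BusinessMetricService()
service.submit("APPEND", {"date": "2021-12-01", "content": "job submitted"})
service.submit("APPEND", {"date": "2021-12-02", "content": "job finished"})
service.submit("FLUSH")
service.submit("CLOSE")
service.drain()
```

Queuing a Flush before the Close command, as the description requires, ensures that buffered records are written before the service stops polling.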
In some embodiments, the scheduling service index is an index related to the performance of a scheduling service in the scheduling system, and is mainly classified into three types: scalar, counting, and timing. The three types are distinguished by the index type in the message. When the scheduling system sends a scheduling service index, the instance id and the instance state are not needed in the message.
It should be noted that a single message structure is reused for both kinds of index, but the index content field records the operation process information in a scheduling business index and holds the value of the corresponding index in a service index.
In some embodiments, users may define different scheduling service indexes for different scheduling systems. They only need to follow the message structure of the service index, decide the aggregation mode, and send the index to the index aggregation sub-module over http; the index is then recorded in the data area that the user configured in the configuration file. For example, if the diagnosis report must show the latest heartbeat time of the execution module of the scheduling system, the index can be configured as a scalar type, and code is embedded in the execution module to report it: the message specifies that the index name is the heartbeat time of the execution module, the index content is the reported heartbeat time, and the operation object is TIME. Likewise, if the diagnosis report must show the total number of job instances successfully executed by the scheduling system, the index can be configured as a counting type, and a counting index is reported each time the execution module completes a job: the message specifies that the index name is the total number of job instances successfully executed by the execution module, the index content is 1, and the operation object is NUM. When a service index is checked, the operation object must be NUM or TIME; for any other object, failure is returned directly and the data is discarded.
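The two user-defined service indexes from the example above can be sketched as plain dictionaries, together with the NUM/TIME check. All key names, index names and type values below are hypothetical placeholders chosen for illustration, not the field spellings of the actual message.

```python
# Only these operation objects are accepted for service indexes.
ALLOWED_SERVICE_OBJECTS = {"NUM", "TIME"}


def build_service_metric(name, metric_type, content, operation_object):
    """Assemble a service index message (key names are illustrative)."""
    return {
        "name": name,
        "type": metric_type,
        "content": content,
        "operationObject": operation_object,
    }


def accept_service_metric(message):
    """Return True if the message may be aggregated, False if it is discarded."""
    return message.get("operationObject") in ALLOWED_SERVICE_OBJECTS


# Scalar index: latest heartbeat time of the execution module.
heartbeat = build_service_metric(
    "executor_heartbeat", "SCALAR", "2021-12-01T12:00:00", "TIME")
# Counting index: one more successfully executed job instance.
finished = build_service_metric("executor_ok_jobs", "COUNT", 1, "NUM")
# An index with any other operation object is rejected.
invalid = build_service_metric("executor_ok_jobs", "COUNT", 1, "FILE")
```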
In some embodiments, after the index aggregation module receives the http request, it checks the index type. If the data is not service index data, it switches to the processing flow for the other index types. If it is a service index, the module verifies the integrity of the information, parses the index data, and calls the service index processing service to convert the data into an Append command for the processing queue. If the service index is of the counting type or the timing type, a Compute command must also be added to the processing queue. If the commands are queued successfully, the request returns success; otherwise the request fails. The service index processing service contains a Sender instance, and the configuration file selects whether the Sender instance sends data to Elasticsearch or InfluxDB.
Second, the service index processing service has a thread that continuously takes processing commands out of the processing queue. When the command is Append, the index data is added to the memory structures of the Sender instance. For counting type indexes, the Sender instance maintains a Map whose key is the index name and whose value is a List; an Append command only needs to add the index value to the List. For timing type indexes, two Maps are maintained, both keyed by index name: the value of one is a fixed-length Queue, to which an Append command only adds the index value; the value of the other is a Pair that stores the result once the computation is finished, namely the total duration and the accumulated number of samples obtained by in-memory computation. For scalar type indexes, one Map is maintained whose key is the index name and whose value is the index value; an Append command directly overwrites it with the current value. The Compute command is special in that it applies only to the timing type. Durations may contain abnormal extreme values; for example, when a service is abnormal, the time from creation to completion of a job is much longer than normal. The data in the Queue is therefore sorted, the first 10% and the last 10% of the data are removed, the remainder is accumulated, and the accumulated result is stored in memory.
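The trimming step for timing indexes can be illustrated with a short sketch. The function name and the queue length of 10 are assumptions; the 10% cut at each end of the sorted samples follows the description.

```python
from collections import deque


def compute_trimmed_total(samples, trim_ratio=0.10):
    """Sort the samples, drop trim_ratio of them at each end, and return
    (total duration, number of samples kept)."""
    ordered = sorted(samples)
    cut = int(len(ordered) * trim_ratio)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept), len(kept)


# Fixed-length queue of job durations; 900 is an abnormal extreme value.
durations = deque(maxlen=10)
for value in [3, 5, 4, 6, 5, 4, 900, 5, 1, 4]:
    durations.append(value)

total, count = compute_trimmed_total(durations)
```

With ten samples, one value is removed from each end, so the outlier 900 and the smallest value 1 do not contribute to the stored total and count.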
The service index processing service also starts a timing thread whose interval is the refresh interval, set in the configuration file, at which service indexes are sent to the back-end database; at each interval a SingleFlush command and a ReadFlush command are put into the processing queue. When the thread takes one of these commands out of the processing queue, SingleFlush processes only scalar indexes, and ReadFlush processes only counting and timing indexes. Both first check whether the data area exists and, if not, create it (an index, in the case of an ES database). If the data area exists, the SingleFlush command asynchronously writes the data in the scalar type memory Map of the Sender instance, one entry at a time, into the corresponding data area of the back-end database by calling the corresponding API of the database. The ReadFlush command first processes the counting type indexes: it loops over the counting type Map maintained in the memory of the Sender instance, and for each index reads the accumulated value from the data area under an optimistic locking mechanism, adds the elements of the List to it to obtain the new value, writes the new value back to the data area, and clears the List. It then processes the timing type indexes: it reads the stored value of each timing index from the data area under the optimistic locking mechanism to obtain the total duration and the count, and accumulates them with the total duration and count held in the Map.
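The read-accumulate-write cycle of ReadFlush for a counting index can be sketched with a version-checked store. `VersionedStore` is a hypothetical in-memory stand-in for the back-end data area, and the retry limit is an assumption; a real back end would provide the atomic version check itself.

```python
class VersionedStore:
    """In-memory stand-in for a data area with optimistic concurrency control."""

    def __init__(self):
        self._data = {}  # index name -> (value, version)

    def read(self, name):
        return self._data.get(name, (0, 0))

    def compare_and_set(self, name, new_value, expected_version):
        """Write only if nobody else has written since expected_version was read."""
        _, version = self._data.get(name, (0, 0))
        if version != expected_version:
            return False
        self._data[name] = (new_value, version + 1)
        return True


def flush_counter(store, name, pending, max_retries=5):
    """Add the locally buffered increments to the stored total, then clear them."""
    delta = sum(pending)
    for _ in range(max_retries):
        value, version = store.read(name)
        if store.compare_and_set(name, value + delta, version):
            pending.clear()
            return True
    return False  # Keep the buffer so the next flush can try again.


store = VersionedStore()
node_a = [1, 1, 1]   # increments buffered on one aggregation node
node_b = [1, 1]      # increments buffered on another node
flush_counter(store, "executor_ok_jobs", node_a)
flush_counter(store, "executor_ok_jobs", node_b)
```

Because each node re-reads the current value and version before writing, several aggregation nodes can flush the same index without routing it to a fixed node.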
When the service processing index service is to be closed, two flush commands (namely, SingleFlush and ReadFlush commands) are added to the processing queue firstly, and then a close command is added, if the thread of the service processing index service processes the close command, the polling operation of the queue processing thread of the index processing service is stopped, the timing thread is waited to quit together, and meanwhile, the connection of the back-end processing library is closed.
It should be noted that, in service index aggregation, although the aggregation modes are basically the same as those of StatsD, the way the aggregation is computed differs substantially from StatsD. The reason is that a clustered StatsD deployment depends on the StatsD proxy service, which, to stay compatible with the logic of a single service, always sends a given index to a fixed node, and thereby simply avoids the concurrency problem of multi-node writes. However, when the request traffic of a specific index increases sharply, the traffic cannot be load-balanced even if N service nodes are added. By introducing an optimistic lock for concurrent multi-node writes of counting type and timing type service indexes, the present invention largely balances performance and scalability, and directly supports load balancing of traffic when traffic surges and the service is scaled horizontally.
In some embodiments, since the job flow and job state flow in the scheduling system are complex, before performing diagnostic analysis, the operation of a key intermediate step in the state flow is abstracted in an index reporting stage.
In the abstract operations, the application defines 15 operation objects, 19 operands, and the corresponding operation results. A diagnosis rule can then be written simply as a grammar that combines an operand, an operation object and the corresponding operation result, thereby expressing an intermediate step in a state transition or a constraint relation.
For example, the transition of a job from the waiting-for-resources state to the canceled state is a termination operation that requires human intervention, and the key step of the termination operation is to change the job state in the relational database from waiting for resources to terminated. The diagnosis rule can therefore be written as update DB_JOB_STATUS SUCCESS. When a job in the waiting-for-resources state is diagnosed, the operation information in the business index data is compared against the same update DB_JOB_STATUS operation to determine whether the operation result is SUCCESS; if it is FAILED, the fault point is shown in the diagnosis report. Similarly, the heartbeat of each module in the scheduling system must not go more than 10 minutes without being updated, so the diagnosis rule can be written as less_than TIME 10, meaning that the heartbeat time is less than 10 minutes before the current time. When the heartbeat reporting times of all services are diagnosed, the heartbeat time is taken from the index content of the service index data and compared with the current time according to the diagnosis rule to check whether the gap exceeds the 10-minute constraint, which identifies any heartbeat that does not meet the requirement and thus whether the service heartbeat reporting is abnormal.
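A minimal sketch of how the two example rules, `update DB_JOB_STATUS SUCCESS` and `less_than TIME 10`, might be parsed and evaluated is given below. The three-token split, the function names and the dictionary keys of the index record are assumptions for illustration.

```python
from datetime import datetime, timedelta


def parse_rule(rule_text):
    """Split a rule into (operand, operation object, result or constraint)."""
    operand, operation_object, third = rule_text.split()
    return operand, operation_object, third


def check_operation(rule_text, metric):
    """Return None if the rule does not apply to this index record,
    otherwise True or False depending on the operation result."""
    operand, operation_object, expected = parse_rule(rule_text)
    if (metric["operand"], metric["object"]) != (operand, operation_object):
        return None
    return metric["result"] == expected


def check_heartbeat(rule_text, heartbeat, now):
    """Return True if the heartbeat is within the rule's limit in minutes."""
    operand, operation_object, minutes = parse_rule(rule_text)
    if (operand, operation_object) != ("less_than", "TIME"):
        raise ValueError("not a heartbeat rule")
    return now - heartbeat < timedelta(minutes=int(minutes))


record = {"operand": "update", "object": "DB_JOB_STATUS", "result": "FAILED"}
fault_free = check_operation("update DB_JOB_STATUS SUCCESS", record)
```

Here `fault_free` is `False`, which corresponds to a fault point that would be shown in the diagnosis report.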
The diagnosis rule tree is constructed by embedding the abstracted diagnosis rules into the state transition diagram while reconstructing the state transitions of the whole scheduling system. The tree structure and its node relationships can be stored in a relational database. The present application therefore creates a table diagnosese_rules in the relational database with the following fields: rule_id, rule_detail (rule content details), rule_sub_condition (rule usage condition), parent_rule_id, level (tree level, 0 indicating a root node), err_tips (abnormal information prompt), info_tips (normal information prompt), and rule_type (rule type, 1 indicating a job flow diagnosis rule, 2 a job diagnosis rule, and 3 a service diagnosis rule). The rule content details are the written rule itself, and the rule usage condition is either a matching expression for the service index name or a job flow/job state. It should be noted that, because service index information essentially has no state transition relationship, service diagnosis rules all sit on a single level, which is set to -1 by default.
The diagnostic analysis flows for job flows and for jobs are basically identical; the only differences are the diagnosis rule trees stored in the database and the state transitions.
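Rows read from the rule table can be assembled into trees by following parent_rule_id, as in the sketch below. The sample rows, state names and rule texts are invented for illustration; only the column names come from the description.

```python
from collections import defaultdict


def build_rule_forest(rows):
    """Turn flat rule rows into nested dicts; rows with level 0 are roots."""
    children = defaultdict(list)
    for row in rows:
        children[row["parent_rule_id"]].append(row)

    def attach(row):
        # Copy the row and recursively attach its child rules.
        node = dict(row)
        node["children"] = [attach(child) for child in children[row["rule_id"]]]
        return node

    return [attach(row) for row in rows if row["level"] == 0]


rows = [
    {"rule_id": 1, "parent_rule_id": None, "level": 0,
     "rule_sub_condition": "WAITING", "rule_detail": "submit KUBERNETES SUCCESS"},
    {"rule_id": 2, "parent_rule_id": 1, "level": 1,
     "rule_sub_condition": "WAITING", "rule_detail": "update DB_JOB_STATUS SUCCESS"},
    {"rule_id": 3, "parent_rule_id": None, "level": 0,
     "rule_sub_condition": "RUNNING", "rule_detail": "monitor KUBERNETES SUCCESS"},
]
forest = build_rule_forest(rows)
```

Each root in `forest` corresponds to one instance state, which matches the per-state traversal used during diagnosis.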
Referring to fig. 5, fig. 5 is a schematic view illustrating a flow of a job flow or a job diagnosis analysis in the embodiment of the present application.
In some embodiments, the job flow or job diagnosis analysis process includes:
S501, determining whether an instance id, a start time and an end time exist in the diagnosis request, where the instance id, the start time and the end time belong to a job flow instance or a job instance.
In some embodiments, the diagnosis type is verified, and the diagnosis request is checked for an instance id, a start time and an end time. The time range is given so that data can be fetched from the different data areas it covers; for example, the business index data of a job instance may be stored in Elasticsearch indexes for several dates, so the range must be determined.
S502, if the instance id, the starting time and the ending time exist in the diagnosis request, traversing each node of the diagnosis rule tree, and traversing the target index data at each traversed node.
In some embodiments, if the instance id, the start time and the end time exist in the diagnosis request, obtaining the diagnosis rule from the second database, and assembling into a diagnosis rule tree according to the obtained diagnosis rule; and if the example id, the starting time and the ending time do not exist in the diagnosis request, returning to the failure of the request and prompting that the content of the message is wrong.
In some embodiments, the index data is obtained from the first database according to the instance state, the instance id, the start time and the end time, and converted into a List. The List may be sorted by zxid, where a larger zxid means a newer index, and then deduplicated by operation object, operand and operation result.
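The sorting and deduplication step might be sketched as follows. The dictionary keys are hypothetical, and keeping the earliest record of each (operation object, operand, operation result) triple is an assumption, since the description does not say which duplicate is retained.

```python
def sort_and_dedupe(metrics):
    """Order index records by zxid (older first) and keep one record per
    (operation object, operand, operation result) triple."""
    seen = set()
    unique = []
    for metric in sorted(metrics, key=lambda m: m["zxid"]):
        key = (metric["object"], metric["operand"], metric["result"])
        if key not in seen:
            seen.add(key)
            unique.append(metric)
    return unique


metrics = [
    {"zxid": 30, "object": "KUBERNETES", "operand": "monitor", "result": "SUCCESS"},
    {"zxid": 10, "object": "KUBERNETES", "operand": "submit", "result": "SUCCESS"},
    {"zxid": 20, "object": "KUBERNETES", "operand": "monitor", "result": "SUCCESS"},
]
cleaned = sort_and_dedupe(metrics)
```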
S503, if the example state of the index in the target index data meets the diagnosis rule of the currently traversed node, adding the currently traversed node to a diagnosis result list.
In some embodiments, all job flow/job instance state values are traversed, where the traversal order is determined by the flow order.
Part of the diagnosis rule tree is traversed with a breadth-first search. A partQueue is first initialized with the root node, in the diagnosis rule tree, of the instance state of the current loop iteration. Then, in breadth-first order, all index data of the instance is traversed at each node; if the instance state of an index matches the rule usage condition of the current node and the operation of the index matches the rule details of the current node, the node is added to the diagnosis result list.
If the index has errMsg or errCausedReason, the level needs to be converted into ERROR when the index is output, otherwise, the level is INFO by default, and all displayed index contents are the scheduling operation records.
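The breadth-first traversal for one instance state can be sketched as below. The node layout (a dictionary with a `children` list), the record keys and the sample data are assumptions; the match on instance state and operation, and the switch to ERROR when an error field is present, follow the steps above.

```python
from collections import deque


def diagnose_state(root, metrics):
    """Breadth-first walk over one rule tree, matching every index record
    of the instance against each visited node."""
    results = []
    part_queue = deque([root])
    while part_queue:
        node = part_queue.popleft()
        operand, operation_object, _expected = node["rule_detail"].split()
        for metric in metrics:
            same_state = metric["instStatus"] == node["rule_sub_condition"]
            same_operation = (metric["operand"], metric["object"]) == (
                operand, operation_object)
            if same_state and same_operation:
                failed = metric.get("errMsg") or metric.get("errCauseReason")
                results.append({
                    "rule_id": node["rule_id"],
                    "level": "ERROR" if failed else "INFO",
                    "content": metric["content"],
                })
        part_queue.extend(node["children"])
    return results


tree = {
    "rule_id": 1, "rule_sub_condition": "WAITING",
    "rule_detail": "submit KUBERNETES SUCCESS",
    "children": [{
        "rule_id": 2, "rule_sub_condition": "WAITING",
        "rule_detail": "update DB_JOB_STATUS SUCCESS", "children": [],
    }],
}
records = [
    {"instStatus": "WAITING", "operand": "update", "object": "DB_JOB_STATUS",
     "content": "state update failed", "errMsg": "connection refused"},
    {"instStatus": "WAITING", "operand": "submit", "object": "KUBERNETES",
     "content": "job submitted to Kubernetes"},
]
report = diagnose_state(tree, records)
```

The results come out in tree order rather than input order, so the submit record of the root rule precedes the failed update of the child rule.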
And S504, generating the diagnosis report according to the diagnosis result list.
After a breadth-first search has been performed for each instance state, the results are combined into the diagnosis result list. The list is then traversed, and for job monitoring index data only the first and last monitoring results are kept, to make the display cleaner.
All index results are written into the file in order, as index reporting time, information level and index content; the request is then returned as successful, with the file address attached, to the scheduling system.
Through the above steps, the format of the finally formed job flow/job diagnosis report is as follows:
2021-;
2021-;
2021-;
2021-;
2021-;
……
In some embodiments, compared with job flow/job diagnosis analysis, the diagnosis rules of scheduling service diagnosis analysis have no tree hierarchy. Referring to fig. 6, fig. 6 is a schematic diagram of the service index diagnosis analysis process in the embodiment of the present application.
In some embodiments, the service index diagnosis analysis process includes:
S601, determining whether the scheduling module name exists in the diagnosis request.
In some embodiments, when a diagnosis request is received, the diagnosis type is verified and the request is checked for a specific module name. If a module name is present, the service diagnosis analysis analyzes only the index data of that module; otherwise, all service index data is analyzed.
S602, when the scheduling module name exists in the diagnosis request, traversing each diagnosis rule in the diagnosis rule list, and traversing the target index data when the diagnosis rule is traversed.
In some embodiments, the diagnostic rules may be retrieved from a second database and converted to a list of diagnostic rules. And simultaneously, acquiring all service diagnosis rules from the database and converting the service diagnosis rules into a list.
All service diagnosis rules are traversed, and for each rule all service index data is traversed. If the expression in the rule usage condition matches the current index name and the service index data satisfies the constraint of the diagnosis rule, the index is recorded in the diagnosis result List and the info_tips of the rule is appended to the index content. Otherwise, the index is still recorded in the result List, but its output level is changed from the default INFO to ERROR and the err_tips of the rule is appended to the index content.
S603, if the service index traversed currently in the target index data meets the diagnostic rule traversed currently, adding the service index traversed currently to a preset diagnostic result list.
And S604, generating the diagnosis report according to the diagnosis result list.
In some embodiments, after the traversal is finished, all the index data in the diagnosis result list is written into the file, and then the request success is returned and the file address is attached.
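The service diagnosis loop of S601 to S604 can be sketched for heartbeat rules as follows. The regular-expression form of the usage condition, the ISO timestamp format of the index content and the sample rule are assumptions; in this sketch an index whose name does not match the rule is skipped rather than reported.

```python
import re
from datetime import datetime, timedelta


def diagnose_services(rules, metrics, now):
    """Apply every flat service rule to every service index record."""
    results = []
    for rule in rules:
        name_pattern = re.compile(rule["rule_sub_condition"])
        _operand, _object, minutes = rule["rule_detail"].split()
        limit = timedelta(minutes=int(minutes))
        for metric in metrics:
            if not name_pattern.fullmatch(metric["metricName"]):
                continue  # The rule does not apply to this index.
            heartbeat = datetime.fromisoformat(metric["content"])
            if now - heartbeat < limit:
                level, tip = "INFO", rule["info_tips"]
            else:
                level, tip = "ERROR", rule["err_tips"]
            results.append({
                "name": metric["metricName"],
                "level": level,
                "content": f"{metric['content']} {tip}",
            })
    return results


rules = [{
    "rule_detail": "less_than TIME 10",
    "rule_sub_condition": ".*_heartbeat",
    "info_tips": "heartbeat is normal",
    "err_tips": "heartbeat is overdue",
}]
metrics = [
    {"metricName": "executor_heartbeat", "content": "2021-12-01T11:58:00"},
    {"metricName": "master_heartbeat", "content": "2021-12-01T11:00:00"},
    {"metricName": "executor_ok_jobs", "content": "7"},
]
report = diagnose_services(rules, metrics, datetime(2021, 12, 1, 12, 0, 0))
```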
Through the above steps, when the diagnosis request is the scheduling execution machine, the finally formed diagnosis part report about the scheduling execution machine is as follows:
2021-;
2021-;
2021-;
……
In some embodiments, under a micro-service architecture, both the scheduling system and the index diagnosis module may be deployed in a Kubernetes container environment, with an Nginx service added for load balancing and traffic forwarding. Because the scheduling system and the index diagnosis module are decoupled through http, that is, both index reports and diagnosis requests pass through Nginx, the scheduling system always accesses the same service domain name and does not know which index diagnosis module instance it reaches. The index diagnosis modules are independent of one another, logically identical, and access the same InfluxDB or Elasticsearch back-end database. When traffic is heavy and the bandwidth used exceeds a preset threshold, the AutoScale mechanism preconfigured in Kubernetes is triggered and the index diagnosis module is scaled out horizontally. When traffic is light, the same mechanism scales the module in.
In the present application, the user only needs to assemble data in json format according to the field format of the database table and to confirm the content of the diagnosis rules. When the interface is requested for the first time and the relational database contains no data, the rule data in the user's message is parsed directly and used as initialization data. On subsequent requests, the rule data in the user's message is traversed: a rule that does not exist in the data table is inserted, and a rule that already exists is updated, with the user's message taken as authoritative.
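The insert-or-update behaviour of the rule interface can be sketched with SQLite standing in for the relational database. The table name `diagnose_rules`, the column types and the use of `INSERT OR REPLACE` are assumptions made for this illustration; the column names follow the field list of the rule table.

```python
import json
import sqlite3

COLUMNS = ("rule_id", "rule_detail", "rule_sub_condition", "parent_rule_id",
           "level", "err_tips", "info_tips", "rule_type")


def upsert_rules(conn, payload):
    """Insert new rules and overwrite existing ones from a JSON array."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS diagnose_rules ("
        "rule_id INTEGER PRIMARY KEY, rule_detail TEXT, "
        "rule_sub_condition TEXT, parent_rule_id INTEGER, level INTEGER, "
        "err_tips TEXT, info_tips TEXT, rule_type INTEGER)"
    )
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = (f"INSERT OR REPLACE INTO diagnose_rules ({', '.join(COLUMNS)}) "
           f"VALUES ({placeholders})")
    # Missing keys become NULL; the user's message always wins on conflict.
    rows = [tuple(rule.get(column) for column in COLUMNS)
            for rule in json.loads(payload)]
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
first_request = json.dumps([{
    "rule_id": 1, "rule_detail": "less_than TIME 10",
    "rule_sub_condition": ".*_heartbeat", "parent_rule_id": None,
    "level": -1, "err_tips": "heartbeat is overdue",
    "info_tips": "heartbeat is normal", "rule_type": 3,
}])
upsert_rules(conn, first_request)
```

Calling `upsert_rules` again with a changed rule 1 and a new rule 2 would update the first row and insert the second.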
The index aggregation method for the scheduling system has the following advantages:
Besides the common aggregated indexes, the business indexes of each scheduling stage of a single job flow or job are abstracted and extended; index data is cleaned, pruned and processed by defining and constructing a scheduling business rule tree, and a diagnosis report for the job flow or job is finally formed.
Defining a service index rule tree is equivalent to defining a display template for the aggregated index data; the existing index data is cleaned and processed according to the template, and an overall diagnosis report for the scheduling service is finally formed.
Service nodes can be scaled elastically with traffic under the micro-service architecture and are decoupled from the scheduling system over http, which reduces the impact on scheduling performance.
By modifying the configuration file, the scheduling system can flexibly turn index reporting and diagnostic analysis support on or off.
Based on the content described in the foregoing embodiment, an embodiment of the present application further provides a fault diagnosis apparatus for a dispatch system, and referring to fig. 7, fig. 7 is a schematic diagram of program modules of the fault diagnosis apparatus for the dispatch system provided in the embodiment of the present application, where the fault diagnosis apparatus for the dispatch system includes:
the index aggregation module 701 is configured to receive index data sent by a scheduling system, aggregate the received index data, and upload the aggregated index data to a first database.
A diagnosis analysis module 702, configured to, when a diagnosis request sent by a user is received, obtain a target diagnosis rule from a second database according to a diagnosis type corresponding to the diagnosis request, where the diagnosis type corresponding to the diagnosis request includes any one of job flow diagnosis, job diagnosis, and scheduling service diagnosis;
acquiring target index data required by diagnosis from the first database according to the diagnosis type, and processing the target index data according to the target diagnosis rule to obtain a diagnosis report;
and sending the file address of the diagnosis report to the dispatching system.
It should be noted that, in the embodiment of the present application, the content of the specific execution of the index aggregation module 701 and the diagnosis analysis module 702 may refer to the related content in the above embodiment, which is not described herein again.
Further, based on the content described in the foregoing embodiments, an electronic device is also provided in the embodiments of the present application, where the electronic device includes at least one processor and a memory; wherein the memory stores computer execution instructions; the at least one processor executes computer-executable instructions stored in the memory to implement the steps of the fault diagnosis of the scheduling system described in the above embodiments, which are not described herein again.
For better understanding of the embodiment of the present application, refer to fig. 8, and fig. 8 is a schematic diagram of a hardware structure of an electronic device according to the embodiment of the present application.
As shown in fig. 8, the electronic apparatus 80 of the present embodiment includes: a processor 801 and a memory 802; wherein:
a memory 802 for storing computer-executable instructions;
the processor 801 is configured to execute computer-executable instructions stored in the memory to implement the steps of the fault diagnosis of the scheduling system described in the foregoing embodiments, which are not described herein again.
Alternatively, the memory 802 may be separate or integrated with the processor 801.
When the memory 802 is provided separately, the apparatus further includes a bus 803 for connecting the memory 802 and the processor 801.
Further, based on the content described in the foregoing embodiments, an embodiment of the present application further provides a computer-readable storage medium, where a computer-executable instruction is stored in the computer-readable storage medium, and when the processor executes the computer-executable instruction, the steps of diagnosing the fault of the scheduling system described in the foregoing embodiments are implemented, and details of the embodiment are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present application.
It should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory may comprise a high-speed RAM, and may further comprise non-volatile memory (NVM), such as at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be performed by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. A method for fault diagnosis of a scheduling system, the method comprising:
receiving index data sent by a scheduling system, performing aggregation processing on the received index data, and uploading the aggregated index data to a first database;
when a diagnosis request sent by a user is received, acquiring a target diagnosis rule from a second database according to a diagnosis type corresponding to the diagnosis request, wherein the diagnosis type corresponding to the diagnosis request comprises any one of workflow diagnosis, job diagnosis and scheduling service diagnosis;
acquiring target index data required by diagnosis from the first database according to the diagnosis type, and processing the target index data according to the target diagnosis rule to obtain a diagnosis report;
and sending the file address of the diagnosis report to the scheduling system.
2. The method according to claim 1, wherein the obtaining target index data required for diagnosis from the first database according to the diagnosis type comprises:
when the diagnosis type corresponding to the diagnosis request is the workflow diagnosis or the job diagnosis, acquiring the index data uploaded within a specified time range from the first database as the target index data according to the workflow operation starting time in the diagnosis request;
and when the diagnosis type corresponding to the diagnosis request is the scheduling service diagnosis, acquiring the most recently uploaded index data from the first database as the target index data.
3. The method of claim 2, wherein processing the target index data according to the target diagnosis rule to obtain a diagnosis report comprises:
when the diagnosis type corresponding to the diagnosis request is the workflow diagnosis or the job diagnosis, determining a state transition process of the scheduling system, and determining a state transition diagram of the scheduling system according to the state transition process, wherein the state transition diagram is of a tree structure;
adding the target diagnosis rule to each node in the state transition diagram respectively to generate a diagnosis rule tree;
and processing the target index data according to the diagnosis rule tree to obtain a diagnosis report.
4. The method of claim 3, wherein processing the target index data according to the diagnosis rule tree to obtain a diagnosis report comprises:
determining whether an instance id, a start time, and an end time exist in the diagnosis request, the instance id, the start time, and the end time being those of a workflow instance or a job instance;
if the diagnosis request has an instance id, a start time and an end time, traversing each node of the diagnosis rule tree, and traversing the target index data at each traversed node;
if the instance state of an index in the target index data meets the diagnosis rule of the currently traversed node, adding the currently traversed node to a diagnosis result list;
and generating the diagnosis report according to the diagnosis result list.
5. The method of claim 2, wherein processing the target index data according to the target diagnosis rule to obtain a diagnosis report comprises:
when the diagnosis type corresponding to the diagnosis request is the scheduling service diagnosis, generating a diagnosis rule list according to the target diagnosis rule;
and processing the target index data according to the diagnosis rule list to obtain a diagnosis report.
6. The method of claim 5, wherein processing the target index data according to the diagnosis rule list to obtain a diagnosis report comprises:
determining whether a scheduling module name exists in the diagnosis request;
when the scheduling module name exists in the diagnosis request, traversing each diagnosis rule in the diagnosis rule list, and traversing the target index data for each traversed diagnosis rule;
if the service index currently traversed in the target index data meets the diagnosis rule currently traversed, adding the currently traversed service index to a preset diagnosis result list;
and generating the diagnosis report according to the diagnosis result list.
7. The method of claim 1, wherein receiving index data sent by a scheduling system comprises:
receiving a message sent by the scheduling system, wherein the message comprises at least one of the following items of information: an index name, an index sending time, an index type, an index operand, an index operation object, an index operation result, an instance id, an instance state, a globally unique id, index content, an error reason of an operation step, specific error information, a scheduling module name, and a module network IP;
and parsing the message to obtain the index data.
8. The method of claim 1, wherein the index data comprises scheduling business index data and scheduling service index data;
the scheduling business index data comprises index data recorded by instrumentation points (buried points) over the whole process, from creation to destruction, of workflow instances and job instances in the scheduling system;
the scheduling service index data includes performance index data of a scheduling service in the scheduling system, and types of the scheduling service index data in the scheduling system include a scalar type, a counting type, and a timing type.
9. The method of claim 1, further comprising:
and newly adding a diagnosis rule in the second database or modifying the diagnosis rule stored in the second database according to the received diagnosis rule message.
10. A fault diagnosis apparatus of a scheduling system, comprising:
an index aggregation module, configured to receive index data sent by the scheduling system, perform aggregation processing on the received index data, and upload the aggregated index data to a first database;
a diagnosis analysis module, configured to, when a diagnosis request sent by a user is received, acquire a target diagnosis rule from a second database according to a diagnosis type corresponding to the diagnosis request, wherein the diagnosis type corresponding to the diagnosis request comprises any one of workflow diagnosis, job diagnosis and scheduling service diagnosis;
the diagnosis analysis module being further configured to acquire target index data required for diagnosis from the first database according to the diagnosis type, and process the target index data according to the target diagnosis rule to obtain a diagnosis report;
the diagnosis analysis module being further configured to send the file address of the diagnosis report to the scheduling system.
11. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the fault diagnosis method of a scheduling system according to any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the fault diagnosis method of a scheduling system according to any one of claims 1 to 9.
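
As a reading aid for claims 1, 3 and 4, the following Python sketch shows one way the aggregation step and the rule-tree traversal could fit together. It is a minimal illustration under stated assumptions, not the implementation of the application: the names `Index`, `RuleNode`, `aggregate` and `diagnose`, the state names, the sample records, and the in-memory dictionary standing in for the first database are all hypothetical, and generation of the report file and its address is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Every name here is hypothetical; the application defines no concrete API.


@dataclass
class Index:
    """One index record, using a subset of the message fields of claim 7."""
    name: str
    instance_id: str
    instance_state: str
    send_time: float


@dataclass
class RuleNode:
    """A node of the diagnosis rule tree: a state plus the rule attached to it."""
    state: str
    rule: Callable[[Index], bool]
    conclusion: str
    children: List["RuleNode"] = field(default_factory=list)


def aggregate(records: List[Index]) -> Dict[str, List[Index]]:
    """Group index data by instance id, ordered by send time.

    The returned dict plays the role of the first database (an assumption).
    """
    first_db: Dict[str, List[Index]] = defaultdict(list)
    for rec in sorted(records, key=lambda r: r.send_time):
        first_db[rec.instance_id].append(rec)
    return dict(first_db)


def diagnose(root: RuleNode, target: List[Index]) -> List[str]:
    """Visit every node of the rule tree in pre-order; at each node, scan the
    target index data and record the node if any index satisfies its rule."""
    results: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if any(node.rule(ix) for ix in target):
            results.append(f"{node.state}: {node.conclusion}")
        stack.extend(reversed(node.children))  # keep left-to-right order
    return results


def state_is(state: str) -> Callable[[Index], bool]:
    """Build a rule that matches indexes whose instance state equals `state`."""
    return lambda ix: ix.instance_state == state


# A small state transition tree with one rule attached to each node.
tree = RuleNode("CREATED", state_is("CREATED"), "instance was created", [
    RuleNode("RUNNING", state_is("RUNNING"), "instance started running", [
        RuleNode("FAILED", state_is("FAILED"), "instance failed while running"),
    ]),
    RuleNode("TIMEOUT", state_is("TIMEOUT"), "instance timed out before running"),
])

records = [
    Index("state_change", "wf-1", "RUNNING", 2.0),
    Index("state_change", "wf-1", "CREATED", 1.0),
    Index("state_change", "wf-1", "FAILED", 3.0),
    Index("state_change", "wf-2", "CREATED", 1.5),
]

first_db = aggregate(records)
report = diagnose(tree, first_db["wf-1"])
```

With these sample records, `report` holds the three states that instance `wf-1` passed through (created, running, failed), in the pre-order of the tree, while the `TIMEOUT` node is skipped because no index matches its rule. A scheduling service diagnosis in the sense of claims 5 and 6 could reuse the same matching loop over a flat list of rules instead of a tree.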
CN202111459357.2A 2021-12-01 2021-12-01 Fault diagnosis method and equipment for dispatching system Pending CN114116428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111459357.2A CN114116428A (en) 2021-12-01 2021-12-01 Fault diagnosis method and equipment for dispatching system

Publications (1)

Publication Number Publication Date
CN114116428A true CN114116428A (en) 2022-03-01

Family

ID=80365377

Country Status (1)

Country Link
CN (1) CN114116428A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102137416A (en) * 2010-12-16 2011-07-27 华为软件技术有限公司 Method and device for analyzing network equipment fault
WO2016090929A1 (en) * 2014-12-10 2016-06-16 中兴通讯股份有限公司 Method, server and system for software system fault diagnosis
CN106528390A (en) * 2016-11-04 2017-03-22 智者四海(北京)技术有限公司 Application monitoring method and device
CN110990461A (en) * 2019-12-12 2020-04-10 国家电网有限公司大数据中心 Big data analysis model algorithm model selection method and device, electronic equipment and medium
CN111209131A (en) * 2019-12-30 2020-05-29 航天信息股份有限公司广州航天软件分公司 Method and system for determining fault of heterogeneous system based on machine learning
CN113485989A (en) * 2021-07-02 2021-10-08 中国建设银行股份有限公司 Comprehensive analysis method, system, medium and equipment for supervision data
CN113608916A (en) * 2021-10-08 2021-11-05 苏州浪潮智能科技有限公司 Fault diagnosis method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
屋顶看飞机: "Apache Airflow指标监控实践", Retrieved from the Internet <URL:https://blog.csdn.net/Young2018/article/details/117190823> *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination