CN111190713A

CN111190713A - Job scheduling management method and device

Info

Publication number: CN111190713A
Application number: CN201911370441.XA
Authority: CN
Inventors: 王雄斌
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: Dawning Information Industry Beijing Co Ltd
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2020-05-22

Abstract

The invention provides a job scheduling management method and device. The method comprises the following steps: submitting a job and distributing the job to computing nodes for running; collecting the hardware fault content occurring on the computing nodes; obtaining a hardware health score of each computing node according to the hardware fault content; and, when the job exits on a computing node, feeding the hardware health score back to the user who submitted the job. Adding the hardware health score of each node associated with the job to the job exit information helps the user analyze how to improve the running efficiency of the job and determine the cause of an abnormal exit.

Description

Job scheduling management method and device

Technical Field

The invention relates to a job scheduling management method and device.

Background

Cluster computing systems are low in cost and high in performance, provide strong batch processing and parallel computing capability, and represent the mainstream direction of high-performance computer development. In such systems, the complex and varied requirements of users, especially those of large-scale scientific computing and commercial applications, cannot be fully met by improving hardware performance alone; the computing resources must also be managed efficiently.

It is in response to this need that cluster job management systems have emerged and developed rapidly. Such a system uniformly manages and schedules the software and hardware resources of the cluster according to user requirements, ensures that users' jobs share the cluster resources fairly and reasonably, and improves system utilization and throughput. Popular job management systems currently include PBS and Slurm.

A user converts his or her computing requirements into individual jobs and hands the jobs to the job scheduling system for scheduling. The job scheduling system first places a newly submitted job in a job queue and checks whether the user's currently idle resources satisfy the hardware resources the job needs to run. If they do, the job is distributed to a plurality of nodes to run; if they do not, the job waits. Once a job is in the running state, it exits (successfully or with a failure) after a period of time that depends on the size of the task.

Existing job scheduling management systems mainly manage the states of a job (queued, running, completed, suspended, etc.) from the perspective of software resources. When a node suffers a hardware fault (such as a network failure, a sudden power loss, or an excessive CPU temperature), the jobs running on that node may exit immediately. These hardware faults are not included in the error information that the scheduling management system feeds back to the user. The user can only re-check the logic of the job, resubmit it, and try running it again.

The prior art has the following defects:

1. Users of the cluster's computing power cannot tell whether a node has failed, which leaves a blind spot when they analyze why a job exited abnormally.

2. Operation and maintenance personnel can only trace and resolve node faults at the infrastructure layer, and cannot give users of upper-layer applications accurate hardware fault notifications.

Disclosure of Invention

In view of the problems in the related art, an object of the present invention is to provide a job scheduling management method and device which, by adding the hardware health score of each node associated with a job to the job exit information, help a user analyze how to improve job running efficiency and determine the cause of an abnormal exit.

According to an embodiment of the present invention, there is provided a job scheduling management method including: submitting a job and distributing the job to computing nodes for running; collecting the hardware fault content occurring on the computing nodes; obtaining a hardware health score of each computing node according to the hardware fault content; and, when the job exits on a computing node, feeding the hardware health score back to the user who submitted the job.

According to the embodiment of the invention, obtaining the hardware health score of the computing node according to the hardware fault content includes obtaining the hardware health score of the node according to whether the fault type affects the running of the job, wherein the hardware health score corresponding to a fault type that affects the running of the job is lower than the hardware health score corresponding to a fault type that does not.

According to the embodiment of the invention, the fault types influencing the operation of the job comprise a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the influence degree of the first fault type on the operation of the job is smaller than the influence degree of the second fault type on the operation of the job.

According to the embodiment of the invention, collecting the hardware fault content occurring on the computing node includes: collecting hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.

According to the embodiment of the invention, feeding the hardware health score back to the user who submitted the job includes feeding back to the user the lowest health score recorded while the job was running.

According to the embodiment of the invention, the job scheduling management method further comprises the following steps: monitoring the hardware fault content occurring on the computing nodes; and adjusting the scheduling strategy used for distribution according to the monitoring result.

According to an embodiment of the present invention, there is provided a job scheduling management apparatus including: a job submitting and distributing module, used for submitting a job and distributing the job to computing nodes to run; a hardware fault collection module, used for collecting the hardware fault content occurring on the computing nodes; a hardware health score obtaining module, used for obtaining a hardware health score of each computing node according to the hardware fault content; and a feedback module, used for feeding the hardware health score back to the user who submitted the job when the job exits on a computing node.

According to the embodiment of the invention, the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type affects the operation of the job, wherein the hardware health score corresponding to the fault type affecting the operation of the job is lower than the hardware health score corresponding to the fault type not affecting the operation of the job.

According to the embodiment of the invention, the hardware fault collection module is also used for collecting the hardware fault content from the power supply, the CPU, the memory, the hard disk, the network and the fan component.

The invention has the beneficial technical effects that:

According to the invention, the hardware health score of each node associated with a job is added to the job exit information, which helps a user analyze how to improve the running efficiency of the job and determine the cause of an abnormal exit. Moreover, after a node fails, the scores provide data support for cluster maintenance by operation and maintenance managers, and a reference for users of the cluster's computing power when they analyze why a job behaved abnormally. In addition, the scores improve understanding of the relationship between the node fault curve and the job abnormality curve, building experience across different types of nodes and different types of job characteristics.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram of a job scheduling management method according to one embodiment of the present invention;

FIG. 2 is a schematic diagram of a typical job scheduling management platform architecture according to the prior art;

FIG. 3 is a schematic diagram of the position of the present solution in a typical job scheduling management platform according to one embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.

As shown in FIG. 1, the present invention provides a job scheduling management method, including:

S11, submitting a job and distributing the job to computing nodes for running;

S12, collecting the hardware fault content occurring on the computing nodes;

S13, obtaining a hardware health score of each computing node according to the hardware fault content;

and S14, when the job exits on a computing node, feeding the hardware health score back to the user who submitted the job.

In this technical scheme, the hardware health score of each node associated with a job is added to the job exit information, which helps a user analyze how to improve the running efficiency of the job and determine the cause of an abnormal exit. Moreover, after a node fails, the scores provide data support for cluster maintenance by operation and maintenance managers, and a reference for users of the cluster's computing power when they analyze why a job behaved abnormally. In addition, the scores improve understanding of the relationship between the node fault curve and the job abnormality curve, building experience across different types of nodes and different types of job characteristics.

S12 may specifically include: collecting hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
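As an illustration only, the collected fault content could be represented as one record per fault, tagged with the component it came from. The sketch below is not taken from the patent: the names `Component`, `FaultRecord`, and `group_by_component`, and the record fields, are assumptions chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    """The six hardware sources named in S12."""
    POWER_SUPPLY = "power_supply"
    CPU = "cpu"
    MEMORY = "memory"
    HARD_DISK = "hard_disk"
    NETWORK = "network"
    FAN = "fan"


@dataclass(frozen=True)
class FaultRecord:
    """One collected hardware fault (field names are hypothetical)."""
    node: str
    component: Component
    description: str
    timestamp: float


def group_by_component(records):
    """Group fault records by component; components without faults map to []."""
    grouped = {component: [] for component in Component}
    for record in records:
        grouped[record.component].append(record)
    return grouped
```

Keeping the component as an enumeration rather than a free-form string makes it harder for a collector to report a source outside the six listed in S12.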

S13 may specifically include obtaining the hardware health score of the node according to whether the fault type may affect the job operation, where the hardware health score corresponding to the fault type that affects the job operation is lower than the hardware health score corresponding to the fault type that does not affect the job operation.

The fault types influencing the operation of the job comprise a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the influence degree of the first fault type on the operation of the job is smaller than the influence degree of the second fault type on the operation of the job.
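The ordering of scores described above can be sketched as a small lookup. The numeric values below are placeholders invented for this example and are not the values of Table 1; only their ordering (no impact > first fault type > second fault type) follows the text. Taking the minimum when several faults are active at once is also an assumption.

```python
from enum import Enum


class Impact(Enum):
    """How a fault type relates to job running."""
    NONE = 0    # fault type does not affect job running
    MINOR = 1   # "first fault type": smaller influence on the job
    SEVERE = 2  # "second fault type": larger influence on the job


# Hypothetical scores: only the ordering is grounded in the description.
SCORE_BY_IMPACT = {Impact.NONE: 90, Impact.MINOR: 60, Impact.SEVERE: 20}
HEALTHY_SCORE = 100  # a node with no fault scores 100


def node_health_score(active_impacts):
    """Return the node's current score given a list of active fault impacts.

    Assumption: with several simultaneous faults, the worst one decides.
    """
    if not active_impacts:
        return HEALTHY_SCORE
    return min(SCORE_BY_IMPACT[impact] for impact in active_impacts)
```

For example, `node_health_score([Impact.NONE, Impact.SEVERE])` yields the score of the severe fault under these placeholder values.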

S14 may specifically include feeding back the lowest health score during the operation of the job to the user.
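One way to read "the lowest health score during the operation of the job" is as the minimum over the score samples taken between the job's start and exit. The sketch below assumes samples arrive as `(timestamp, score)` pairs and that a job with no samples in its window is reported as fully healthy; both the function name and these conventions are illustrative assumptions.

```python
def lowest_score_during_job(samples, start, end, default=100):
    """Return the minimum score among samples with start <= timestamp <= end.

    samples: iterable of (timestamp, score) pairs for one node.
    default: score reported when no sample falls inside the window
             (assumed to be 100, i.e. no fault observed).
    """
    in_window = [score for timestamp, score in samples if start <= timestamp <= end]
    return min(in_window, default=default)
```

Samples taken before the job started or after it exited are ignored, so a fault that was already cleared does not lower the score reported for a later job.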

The job scheduling management method provided by the invention may further include: S15, monitoring the hardware fault content occurring on the computing nodes, and adjusting the scheduling strategy used for distribution according to the monitoring result. By monitoring the occurrence of hardware faults on each computing node in the job scheduling system, the method provides a decision reference for adjusting the system's resource allocation and scheduling strategy.
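The description does not say how the scheduling strategy is adjusted. Purely as one possible illustration, a scheduler could skip nodes whose latest health score falls below a threshold and prefer healthier nodes among the rest. The function name `select_nodes`, the threshold of 60, and the treatment of unscored nodes as healthy are all assumptions of this sketch.

```python
def select_nodes(candidates, latest_scores, needed, min_score=60):
    """Pick `needed` nodes, healthiest first, skipping unhealthy ones.

    candidates:    list of idle node names.
    latest_scores: dict mapping node name to its most recent health score;
                   nodes without an entry are assumed healthy (100).
    Returns a list of node names, or None if too few healthy nodes exist
    (the job would then keep waiting in the queue).
    """
    eligible = [n for n in candidates if latest_scores.get(n, 100) >= min_score]
    # sort() is stable, so nodes with equal scores keep their candidate order.
    eligible.sort(key=lambda n: latest_scores.get(n, 100), reverse=True)
    if len(eligible) < needed:
        return None
    return eligible[:needed]
```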

In some embodiments, based on the influence that a node's hardware fault condition has on the jobs running on that node, the technical scheme of the invention provides a "node job association health degree" algorithm, and integrates the algorithm into a job scheduling management platform so that node fault information is pushed to the jobs affected by a node fault.

A typical job scheduling management platform is shown in FIG. 2. Different users submit their jobs through a "job submission platform". The job submission platform adds each job to the scheduling queue, and the job scheduling module requests resources (including computing resources and storage resources) from the resource allocation module according to the resource requirements of the job. After the resources have been granted, the job scheduling module deploys the job to a corresponding container (Caffe, TensorFlow, Ansys, Fluent), monitors the job's execution progress in the container, and feeds the progress back to the user. When the job exits, whether successfully or with a failure, scheduling of the job is finished.

In the invention, a "node job association health degree" algorithm is added to the original framework to monitor information on the jobs affected by a node after a hardware fault occurs, as shown in FIG. 3.

In one embodiment, the overall flow of the node job association health degree algorithm is as follows:

(1) Acquire the information of all computing nodes of the job scheduling management system.

(2) Monitor the job scheduling state of the job scheduling system, and establish the mapping between nodes and job id whenever a new job is distributed to a plurality of nodes and starts running.

(3) Collect the hardware fault content occurring on each computing node at a certain frequency. Hardware fault content is collected mainly from the power supply, CPU, memory, hard disk, network, fan, and similar components.

(4) For the faults of each computing node, assign the node its current health score according to whether the fault type affects the running of the job. As shown in Table 1, the health score ranges from 0 to 100; the lower the score, the lower the health, and the higher the score, the higher the health. If none of the faults listed in Table 1 occurs on the node while the job is running, the health score of the node for that job is 100.

TABLE 1

(5) When a job running on the node exits, feed the node's lowest health score during the running of that job back to the job scheduling system, which in turn feeds the node-associated health score back to the user together with the job's other exit information.
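Steps (1) to (5) can be sketched as a small in-memory tracker that keeps the node-to-job mapping and the lowest score seen per job and node. This is a minimal sketch under stated assumptions, not the patented implementation: the class `NodeJobHealth`, its method names, and the shape of the exit message are hypothetical, and the scores passed to `sample` are assumed to have been computed elsewhere from the collected fault content.

```python
from collections import defaultdict


class NodeJobHealth:
    def __init__(self, nodes):
        self.nodes = set(nodes)               # step (1): all computing nodes
        self.jobs_on_node = defaultdict(set)  # step (2): node -> running job ids
        self.lowest = {}                      # (job_id, node) -> lowest score so far

    def job_started(self, job_id, nodes):
        """Step (2): record which nodes a newly started job runs on."""
        unknown = set(nodes) - self.nodes
        if unknown:
            raise ValueError(f"unknown nodes: {sorted(unknown)}")
        for node in nodes:
            self.jobs_on_node[node].add(job_id)
            self.lowest[(job_id, node)] = 100  # no fault seen yet

    def sample(self, node, score):
        """Steps (3)-(4): apply a periodic score sample to every job on the node."""
        for job_id in self.jobs_on_node[node]:
            key = (job_id, node)
            self.lowest[key] = min(self.lowest[key], score)

    def job_exited(self, job_id, exit_code):
        """Step (5): build the exit information, including per-node lowest scores."""
        node_health = {}
        for node, jobs in self.jobs_on_node.items():
            if job_id in jobs:
                jobs.discard(job_id)
                node_health[node] = self.lowest.pop((job_id, node))
        return {"job_id": job_id, "exit_code": exit_code, "node_health": node_health}
```

A short usage example: after `job_started("job-1", ["n1", "n2"])` and samples of 60 and then 90 on `n1`, `job_exited("job-1", 1)` reports a lowest score of 60 for `n1` and 100 for `n2`, since a later recovery does not erase the earlier low reading.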

The invention also provides a job scheduling management device, comprising: a job submitting and distributing module, used for submitting a job and distributing the job to computing nodes to run; a hardware fault collection module, used for collecting the hardware fault content occurring on the computing nodes; a hardware health score obtaining module, used for obtaining a hardware health score of each computing node according to the hardware fault content; and a feedback module, used for feeding the hardware health score back to the user who submitted the job when the job exits on a computing node.

In an embodiment, the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type may affect the job operation, where the hardware health score corresponding to the fault type that affects the job operation is lower than the hardware health score corresponding to the fault type that does not affect the job operation.

In one embodiment, the types of faults affecting the operation of the job include a first fault type and a second fault type, wherein the first fault type corresponds to a higher hardware health score than the second fault type, and the first fault type affects the operation of the job to a lesser extent than the second fault type.

In one embodiment, the hardware failure collection module is further configured to collect hardware failure content from a power supply, a CPU, a memory, a hard disk, a network, and a fan unit.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A job scheduling management method is characterized by comprising the following steps:

submitting a job and distributing the job to a computing node for running;

collecting the content of hardware faults occurring on the computing nodes;

obtaining a hardware health score of the computing node according to the hardware fault content;

and when the job exits on the computing node, feeding back the hardware health score to the user who submitted the job.

2. The job scheduling management method according to claim 1, wherein obtaining the hardware health score of the computing node according to the hardware fault content comprises obtaining the hardware health score of the node according to whether a fault type affects job running, wherein the hardware health score corresponding to the fault type affecting job running is lower than the hardware health score corresponding to the fault type not affecting job running.

3. The job scheduling management method according to claim 2, wherein the fault types affecting job running include a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the degree of the effect of the first fault type on job running is smaller than the degree of the effect of the second fault type on job running.

4. The job scheduling management method according to claim 1, wherein collecting the content of hardware faults occurring on the computing nodes comprises: collecting the hardware fault content from a power supply, a CPU, a memory, a hard disk, a network and a fan component.

5. The job scheduling management method according to claim 1, wherein feeding back the hardware health score to the user who submitted the job comprises feeding back to the user the lowest health score recorded while the job was running.

6. The job scheduling management method according to claim 1, further comprising:

monitoring the content of hardware faults occurring on the computing nodes;

and adjusting the scheduling strategy used for distribution according to the monitoring result.

7. A job scheduling management apparatus comprising:

the job submitting and distributing module is used for submitting a job and distributing the job to the computing nodes to run;

the hardware fault acquisition module is used for acquiring hardware fault content generated on the computing node;

the hardware health degree score obtaining module is used for obtaining a hardware health degree score of the computing node according to the hardware fault content;

and the feedback module is used for feeding back the hardware health score to the user who submitted the job when the job exits on the computing node.

8. The job scheduling management device according to claim 7, wherein the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type may affect the job running, where the hardware health score corresponding to the fault type that affects the job running is lower than the hardware health score corresponding to the fault type that does not affect the job running.

9. The job scheduling management apparatus according to claim 8, wherein the fault types affecting job execution include a first fault type and a second fault type, wherein a hardware health score corresponding to the first fault type is higher than a hardware health score corresponding to the second fault type, and a degree of influence of the first fault type on job execution is smaller than a degree of influence of the second fault type on job execution.

10. The job scheduling management device according to claim 7, wherein the hardware failure collection module is further configured to collect the hardware failure content from a power supply, a CPU, a memory, a hard disk, a network, and a fan unit.