CN111190713A - Job scheduling management method and device - Google Patents
- Publication number
- CN111190713A
- Authority
- CN
- China
- Prior art keywords
- job
- hardware
- fault type
- fault
- health score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0715—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3017—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is implementing multitasking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention provides a job scheduling management method and device. The method comprises: submitting a job and distributing the job to computing nodes for running; collecting the content of hardware faults occurring on the computing nodes; obtaining a hardware health score of the computing node according to the hardware fault content; and, when the job running on the computing node exits, feeding back the hardware health score to the user who submitted the job. Adding the hardware health score of each node associated with the job to the job exit information helps the user analyze how to improve job running efficiency and determine the cause of an abnormal exit.
Description
Technical Field
The invention relates to a job scheduling management method and device.
Background
Cluster computing systems are low in cost and high in performance, provide strong batch processing and parallel computing capability, and represent the mainstream direction of high-performance computer development. In such systems, the complex and varied requirements of users, especially those of large-scale scientific computing and commercial applications, cannot be fully satisfied by improving hardware performance alone; the computing resources must also be managed efficiently.
Cluster job management systems emerged in response to this need and have developed rapidly. Such a system uniformly manages and schedules the software and hardware resources of the cluster according to user requirements, ensures that user jobs share cluster resources fairly and reasonably, and improves system utilization and throughput. Popular job management systems currently include PBS and Slurm.
Users convert their computing requirements into individual jobs and hand the jobs to the job scheduling system for scheduling. The job scheduling system first places a newly submitted job in a job queue and determines whether the user's currently idle resources satisfy the hardware resources the job requires. If they do, the job is distributed to several nodes to run; if not, the job waits. Once the job is in the running state, it exits (successfully or with a failure) after a period of time that depends on the size of the task.
Existing job scheduling management systems mainly manage the states of a job (queued, running, completed, suspended, and so on) from the perspective of software resources. When a node suffers a hardware fault (such as a network failure, a sudden power loss, or an excessive CPU temperature), the jobs running on that node may exit directly. These hardware faults are not included in the error information that the scheduling management system feeds back to the user. The user can only re-check the logic of the job, resubmit it, and try running it again.
The prior art has the following defects:
1. Users of the cluster's computing power cannot tell whether a node has failed, which leaves a blind spot when they analyze why a job exited abnormally.
2. Operation and maintenance personnel can only trace and resolve node faults at the infrastructure layer, and cannot provide accurate hardware fault notifications to users of upper-layer applications.
Disclosure of Invention
In view of the problems in the related art, an object of the present invention is to provide a job scheduling management method and device that add the hardware health score information of each node associated with a job to the job exit information, thereby helping the user analyze how to improve job running efficiency and determine the cause of an abnormal exit.
According to an embodiment of the present invention, there is provided a job scheduling management method including: submitting a job and distributing the job to computing nodes for running; collecting the content of hardware faults occurring on the computing nodes; obtaining a hardware health score of the computing node according to the hardware fault content; and, when the job running on the computing node exits, feeding back the hardware health score to the user who submitted the job.
According to an embodiment of the invention, obtaining the hardware health score of the computing node according to the hardware fault content includes obtaining the hardware health score of the node according to whether the fault type affects job running, wherein the hardware health score corresponding to a fault type that affects job running is lower than the hardware health score corresponding to a fault type that does not affect job running.
According to an embodiment of the invention, the fault types affecting job running include a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the first fault type affects job running to a lesser degree than the second fault type.
According to an embodiment of the invention, collecting the content of hardware faults occurring on the computing nodes includes collecting hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
According to an embodiment of the invention, feeding back the hardware health score to the user who submitted the job includes feeding back to the user the lowest health score recorded while the job was running.
According to an embodiment of the invention, the job scheduling management method further includes: monitoring the content of hardware faults occurring on the computing nodes; and adjusting the job distribution scheduling strategy according to the monitoring result.
According to an embodiment of the present invention, there is provided a job scheduling management device including: a job submitting and distributing module for submitting a job and distributing the job to computing nodes for running; a hardware fault collection module for collecting the content of hardware faults occurring on the computing nodes; a hardware health score obtaining module for obtaining a hardware health score of the computing node according to the hardware fault content; and a feedback module for feeding back the hardware health score to the user who submitted the job when the job running on the computing node exits.
According to the embodiment of the invention, the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type affects the operation of the job, wherein the hardware health score corresponding to the fault type affecting the operation of the job is lower than the hardware health score corresponding to the fault type not affecting the operation of the job.
According to the embodiment of the invention, the fault types influencing the operation of the job comprise a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the influence degree of the first fault type on the operation of the job is smaller than the influence degree of the second fault type on the operation of the job.
According to an embodiment of the invention, the hardware fault collection module is further configured to collect the hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
The invention has the beneficial technical effects that:
According to the invention, the hardware health score information of each node associated with a job is added to the job exit information, which helps the user analyze how to improve job running efficiency and determine the cause of an abnormal exit. Moreover, after a node fails, the scores provide data support for cluster maintenance by operation and maintenance administrators and a reference for users of the cluster's computing power when they analyze why a job behaved abnormally. In addition, the invention can deepen understanding of the relationship between the node fault curve and the job anomaly curve, and build up experience regarding different types of nodes and different types of job characteristics.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of a job scheduling management method according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a typical job scheduling management platform architecture according to the prior art;
FIG. 3 is a schematic diagram of the position of the present solution in a typical job scheduling management platform according to one embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
As shown in FIG. 1, the present invention provides a job scheduling management method, including:
S11, submitting a job and distributing the job to computing nodes for running;
S12, collecting the content of hardware faults occurring on the computing nodes;
S13, obtaining a hardware health score of the computing node according to the hardware fault content;
S14, when the job running on the computing node exits, feeding back the hardware health score to the user who submitted the job.
According to this technical solution, the hardware health score information of each node associated with a job is added to the job exit information, which helps the user analyze how to improve job running efficiency and determine the cause of an abnormal exit. Moreover, after a node fails, the scores provide data support for cluster maintenance by operation and maintenance administrators and a reference for users of the cluster's computing power when they analyze why a job behaved abnormally. In addition, the solution can deepen understanding of the relationship between the node fault curve and the job anomaly curve, and build up experience regarding different types of nodes and different types of job characteristics.
S12 may specifically include collecting hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
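As an illustration only, the collected fault content could be represented as one record per fault, tagged with the component it came from. The Python sketch below assumes this representation; the names `Component`, `FaultRecord`, and `group_by_component`, the record fields, and the sample faults are hypothetical and do not appear in the patent.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    """The six hardware sources named in the text."""
    POWER = "power supply"
    CPU = "CPU"
    MEMORY = "memory"
    DISK = "hard disk"
    NETWORK = "network"
    FAN = "fan"


@dataclass(frozen=True)
class FaultRecord:
    """One collected hardware fault (field names are assumptions)."""
    node: str
    component: Component
    description: str
    timestamp: float


def group_by_component(records):
    """Group fault records by the component that reported them."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.component].append(record)
    return dict(grouped)


# Made-up sample data for illustration.
records = [
    FaultRecord("node01", Component.CPU, "temperature above threshold", 10.0),
    FaultRecord("node01", Component.FAN, "fan speed low", 12.0),
    FaultRecord("node02", Component.CPU, "temperature above threshold", 15.0),
]
by_component = group_by_component(records)
```

Keeping the component as an explicit field makes it straightforward to classify each fault later by whether it affects job running.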
S13 may specifically include obtaining the hardware health score of the node according to whether the fault type may affect the job operation, where the hardware health score corresponding to the fault type that affects the job operation is lower than the hardware health score corresponding to the fault type that does not affect the job operation.
The fault types influencing the operation of the job comprise a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the influence degree of the first fault type on the operation of the job is smaller than the influence degree of the second fault type on the operation of the job.
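The ordering described above can be sketched in Python as follows. Only the ordering of the scores comes from the text; the concrete score values, the fault names, their classification, and the rule of taking the lowest score when several faults are active are assumptions made for this example.

```python
from enum import IntEnum


class Impact(IntEnum):
    NONE = 0    # fault type that does not affect job running
    MINOR = 1   # "first fault type": smaller effect on job running
    SEVERE = 2  # "second fault type": larger effect on job running


# Hypothetical score values; the text only requires NONE > MINOR > SEVERE.
SCORE_BY_IMPACT = {Impact.NONE: 90, Impact.MINOR: 60, Impact.SEVERE: 20}

# Hypothetical fault names and classification, for illustration only.
IMPACT_BY_FAULT = {
    "fan_speed_low": Impact.NONE,
    "cpu_over_temperature": Impact.MINOR,
    "power_loss": Impact.SEVERE,
}


def node_health_score(active_faults):
    """Return a 0-100 health score for a node given its active fault names.

    A node with no faults scores 100. With several faults, this sketch
    reports the lowest applicable score (an assumption, not from the text).
    """
    if not active_faults:
        return 100
    return min(SCORE_BY_IMPACT[IMPACT_BY_FAULT[name]] for name in active_faults)
```

For example, `node_health_score(["fan_speed_low", "power_loss"])` yields the score assigned to the more severe of the two faults.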
S14 may specifically include feeding back to the user the lowest health score recorded while the job was running.
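One way to obtain that lowest score, assuming the node's health is sampled periodically and stored as (timestamp, score) pairs, is to take the minimum over the samples that fall inside the job's running window. The function name, the sample format, and the default of 100 when no sample falls in the window are assumptions of this Python sketch.

```python
def lowest_score_during(samples, start, end):
    """Return the lowest health score sampled between start and end (inclusive).

    samples: iterable of (timestamp, score) pairs for one node.
    If no sample falls inside the window, the node is treated as fully
    healthy and 100 is returned (an assumption of this sketch).
    """
    in_window = [score for timestamp, score in samples if start <= timestamp <= end]
    return min(in_window, default=100)


# Made-up samples: the job runs from t=10 to t=20.
samples = [(5, 40), (12, 90), (18, 60), (30, 20)]
lowest = lowest_score_during(samples, start=10, end=20)
```

Samples taken before the job started or after it exited (here the scores 40 and 20) are ignored, so the user only sees faults that overlapped the job.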
The job scheduling management method provided by the invention may further include: S15, monitoring the content of hardware faults occurring on the computing nodes, and adjusting the job distribution scheduling strategy according to the monitoring result. Monitoring the occurrence of hardware faults on each computing node in the job scheduling system provides a decision reference for adjusting the system's resource allocation and scheduling strategy.
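The patent does not specify how the scheduling strategy is adjusted. As one possible illustration, a scheduler could skip nodes whose latest health score is below a threshold and prefer the healthiest of the remaining nodes. The function `pick_nodes`, the threshold of 50, and the treatment of unmonitored nodes as fully healthy are all assumptions of this Python sketch.

```python
def pick_nodes(candidates, latest_score, needed, min_score=50):
    """Choose `needed` nodes for a job, preferring healthier nodes.

    candidates: list of idle node names.
    latest_score: dict mapping node name to its most recent health score;
        nodes without an entry are assumed healthy (score 100).
    Returns a list of node names, or None if too few nodes qualify,
    in which case the job would keep waiting in the queue.
    """
    healthy = [n for n in candidates if latest_score.get(n, 100) >= min_score]
    # Stable sort: nodes with equal scores keep their original order.
    healthy.sort(key=lambda n: latest_score.get(n, 100), reverse=True)
    if len(healthy) < needed:
        return None
    return healthy[:needed]


chosen = pick_nodes(["a", "b", "c"], {"a": 20, "b": 80}, needed=2)
```

Here node "a" is excluded because its score is below the threshold, and "c" is ranked ahead of "b" because it has no recorded faults.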
In some embodiments, based on the influence that a node's hardware faults have on the jobs running on that node, the technical solution of the invention provides a "node job association health" algorithm and integrates it into the job scheduling management platform, so that node fault information is pushed for the jobs affected by a node fault.
A typical job scheduling management platform is shown in FIG. 2. Different users submit their jobs through a "job submission platform". The job submission platform adds each job to the scheduling queue, and the job scheduling module requests resources (including computing resources and storage resources) from the resource allocation module according to the job's resource requirements. After the resource request completes, the job scheduling module deploys the job to a corresponding container (Caffe, TensorFlow, Ansys, Fluent), monitors the job's execution progress in the container, and feeds the progress back to the user. When the job exits, whether successfully or not, scheduling of the job is finished.
In the invention, the node job association health algorithm is added to the original architecture to monitor information on the jobs affected by a node after a hardware fault occurs, as shown in FIG. 3.
In one embodiment, the overall flow of the node job association health algorithm is as follows:
(1) Acquire the information of all computing nodes in the job scheduling management system.
(2) Monitor the job scheduling state of the job scheduling system and, when a new job is distributed to several nodes and starts running, establish the mapping between those nodes and the job ID.
(3) Collect the content of hardware faults occurring on each computing node at a certain frequency. Hardware fault content is collected mainly from the power supply, CPU, memory, hard disk, network, fan, and similar components.
(4) For each fault on a computing node, assign the node's current health score according to whether the fault type affects job running. As shown in Table 1, the health score ranges from 0 to 100; the lower the score, the lower the health, and the higher the score, the higher the health. If none of the faults listed in Table 1 occurs on the node while the job is running, the node's health with respect to the job is 100.
TABLE 1
(5) When a job running on the node exits, the lowest health score the node recorded while the job was running is fed back to the job scheduling system, and the job scheduling system feeds back the node-associated health score to the user together with the job's other exit information.
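Steps (1) to (5) can be sketched in Python as a small bookkeeping class. This is a minimal sketch under stated assumptions, not the patented implementation: the class and method names, the dictionary-based exit report, and the idea that a separately computed score is passed in for each collection cycle are all hypothetical.

```python
class NodeJobHealth:
    """Tracks, per (job, node) pair, the lowest health score seen so far."""

    def __init__(self, nodes):
        # Step (1): register every computing node known to the scheduler.
        self.jobs_on_node = {node: set() for node in nodes}
        self.lowest = {}  # (job_id, node) -> lowest score observed

    def job_started(self, job_id, nodes):
        # Step (2): map the job ID to each node it was distributed to.
        for node in nodes:
            self.jobs_on_node[node].add(job_id)
            self.lowest[(job_id, node)] = 100

    def record_sample(self, node, score):
        # Steps (3)-(4): called once per collection cycle with the node's
        # current health score; updates every job running on that node.
        for job_id in self.jobs_on_node[node]:
            key = (job_id, node)
            self.lowest[key] = min(self.lowest[key], score)

    def job_exited(self, job_id, exit_info):
        # Step (5): attach the lowest per-node scores to the exit information.
        scores = {}
        for node, jobs in self.jobs_on_node.items():
            if job_id in jobs:
                jobs.discard(job_id)
                scores[node] = self.lowest.pop((job_id, node))
        return {**exit_info, "node_health": scores}


tracker = NodeJobHealth(["n1", "n2", "n3"])
tracker.job_started("j1", ["n1", "n2"])
tracker.record_sample("n1", 60)   # a fault on n1 lowers its score
tracker.record_sample("n1", 90)   # later recovery does not raise the minimum
tracker.record_sample("n3", 10)   # n3 does not run j1, so j1 is unaffected
report = tracker.job_exited("j1", {"exit_code": 1})
```

Because the mapping is kept per node, a fault on a node only affects the reports of jobs that were actually running on it.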
The invention also provides a job scheduling management device, including: a job submitting and distributing module for submitting a job and distributing the job to computing nodes for running; a hardware fault collection module for collecting the content of hardware faults occurring on the computing nodes; a hardware health score obtaining module for obtaining a hardware health score of the computing node according to the hardware fault content; and a feedback module for feeding back the hardware health score to the user who submitted the job when the job running on the computing node exits.
In an embodiment, the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type may affect the job operation, where the hardware health score corresponding to the fault type that affects the job operation is lower than the hardware health score corresponding to the fault type that does not affect the job operation.
In one embodiment, the types of faults affecting the operation of the job include a first fault type and a second fault type, wherein the first fault type corresponds to a higher hardware health score than the second fault type, and the first fault type affects the operation of the job to a lesser extent than the second fault type.
In one embodiment, the hardware fault collection module is further configured to collect hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A job scheduling management method is characterized by comprising the following steps:
submitting a job and distributing the job to a computing node for running;
collecting the content of hardware faults occurring on the computing nodes;
obtaining a hardware health score of the computing node according to the hardware fault content;
and when the job running on the computing node exits, feeding back the hardware health score to the user who submitted the job.
2. The job scheduling management method according to claim 1, wherein obtaining the hardware health score of the computing node according to the hardware fault content comprises obtaining the hardware health score of the node according to whether a fault type affects job running, wherein the hardware health score corresponding to the fault type affecting job running is lower than the hardware health score corresponding to the fault type not affecting job running.
3. The job scheduling management method according to claim 2, wherein the fault types affecting job running include a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the degree of the effect of the first fault type on job running is smaller than the degree of the effect of the second fault type on job running.
4. The job scheduling management method according to claim 1, wherein collecting the content of hardware faults occurring on the computing nodes comprises: collecting the hardware fault content from a power supply, a CPU, a memory, a hard disk, a network, and a fan component.
5. The job scheduling management method according to claim 1, wherein feeding back the hardware health score to the user who submitted the job comprises feeding back to the user the lowest health score recorded while the job was running.
6. The job scheduling management method according to claim 1, further comprising:
monitoring the content of hardware faults occurring on the computing nodes;
and adjusting the job distribution scheduling strategy according to the monitoring result.
7. A job scheduling management apparatus comprising:
a job submitting and distributing module for submitting a job and distributing the job to computing nodes for running;
a hardware fault collection module for collecting the content of hardware faults occurring on the computing nodes;
a hardware health score obtaining module for obtaining a hardware health score of the computing node according to the hardware fault content;
and a feedback module for feeding back the hardware health score to the user who submitted the job when the job running on the computing node exits.
8. The job scheduling management device according to claim 7, wherein the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type may affect the job running, where the hardware health score corresponding to the fault type that affects the job running is lower than the hardware health score corresponding to the fault type that does not affect the job running.
9. The job scheduling management apparatus according to claim 8, wherein the fault types affecting job execution include a first fault type and a second fault type, wherein a hardware health score corresponding to the first fault type is higher than a hardware health score corresponding to the second fault type, and a degree of influence of the first fault type on job execution is smaller than a degree of influence of the second fault type on job execution.
10. The job scheduling management device according to claim 7, wherein the hardware fault collection module is further configured to collect the hardware fault content from a power supply, a CPU, a memory, a hard disk, a network, and a fan component.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911370441.XA CN111190713A (en) | 2019-12-26 | 2019-12-26 | Job scheduling management method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911370441.XA CN111190713A (en) | 2019-12-26 | 2019-12-26 | Job scheduling management method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111190713A (en) | 2020-05-22 |
Family
ID=70709582
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911370441.XA Pending CN111190713A (en) | 2019-12-26 | 2019-12-26 | Job scheduling management method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111190713A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101739292A (en) * | 2009-12-04 | 2010-06-16 | 曙光信息产业(北京)有限公司 | Application characteristic-based isomeric group operation self-adapting dispatching method and system |
CN107358338A (en) * | 2017-06-09 | 2017-11-17 | 国网冀北电力有限公司 | A kind of multi-service and the D5000 system healths degree layering evaluation of priorities method of hardware fusion |
CN108632086A (en) * | 2018-04-19 | 2018-10-09 | 山东省计算中心(国家超级计算济南中心) | A kind of concurrent job operation troubles localization method |
CN109086134A (en) * | 2018-07-19 | 2018-12-25 | 郑州云海信息技术有限公司 | A kind of operation method and device of deep learning operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200522 |