CN111190713A - Job scheduling management method and device - Google Patents

Job scheduling management method and device

Info

Publication number
CN111190713A
Authority
CN
China
Prior art keywords
job
hardware
fault type
fault
health score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911370441.XA
Other languages
Chinese (zh)
Inventor
王雄斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Beijing Co Ltd
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806: Task transfer initiation or dispatching
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0715: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3017: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is implementing multitasking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a job scheduling management method and device. The method comprises: submitting a job and distributing it to computing nodes for running; collecting the content of hardware faults occurring on the computing nodes; obtaining a hardware health score of each computing node according to the hardware fault content; and, when the job running on a computing node exits, feeding the hardware health score back to the user who submitted the job. Adding the hardware health score of each node associated with the job to the job exit information helps the user find ways to improve the job's running efficiency and analyze the cause of an abnormal exit.

Description

Job scheduling management method and device
Technical Field
The invention relates to a job scheduling management method and device.
Background
Cluster computing systems combine low cost with high performance, provide strong batch-processing and parallel-computing capability, and represent the mainstream direction of high-performance computer development. In such systems, the complex and varied requirements of users, especially those of large-scale scientific computing and commercial applications, cannot be met by improving hardware performance alone; the computing resources must also be managed efficiently.
Cluster job management systems emerged and developed rapidly in response to this need. Such a system manages and schedules the software and hardware resources of the cluster in a unified way according to user requirements, ensures that user jobs share cluster resources fairly and reasonably, and improves system utilization and throughput. Popular job management systems currently include PBS and Slurm.
Users convert their computing requirements into individual jobs and hand them to the job scheduling system for scheduling. The job scheduling system first places a newly submitted job in a job queue and checks whether the idle resources currently available to the user satisfy the hardware resources the job needs. If they do, the job is distributed to a plurality of nodes to run; if not, the job waits. Once running, the job exits (successfully or with a failure) after a period of time that depends on the size of the task.
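The queue-then-dispatch behavior described above can be sketched as follows. This is a minimal illustration, not the behavior of PBS or Slurm: the `Job` class, the `nodes_needed` field, the `dispatch` function, and the strict first-in-first-out policy are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    nodes_needed: int  # hypothetical resource requirement: number of nodes


def dispatch(queue: deque, idle_nodes: list) -> dict:
    """Start queued jobs in FIFO order while enough idle nodes remain.

    Returns a mapping of job_id -> nodes assigned. A job that does not fit
    stays at the head of the queue and waits for the next scheduling pass.
    """
    started = {}
    while queue and queue[0].nodes_needed <= len(idle_nodes):
        job = queue.popleft()
        started[job.job_id] = [idle_nodes.pop() for _ in range(job.nodes_needed)]
    return started


queue = deque([Job("job-1", 2), Job("job-2", 3)])
idle = ["n1", "n2", "n3"]
started = dispatch(queue, idle)  # job-1 starts; job-2 waits for more idle nodes
```

Real schedulers usually add priorities and backfilling on top of this basic loop; the sketch only shows the "run if resources suffice, otherwise wait" decision.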
Existing job scheduling management systems mainly manage the states of a job (queued, running, completed, suspended, and so on) from the software-resource perspective. When a node suffers a hardware fault (such as a network failure, a sudden power loss, or an excessive CPU temperature), the jobs running on that node may exit directly. These hardware faults are not included in the error information that the scheduling management system feeds back to the user. The user can only recheck the logic of the job and resubmit it to try again.
The prior art has the following defects:
1. Users of the cluster's computing power cannot tell whether a node has failed, which leaves a blind spot when they analyze why a job exited abnormally.
2. Operation and maintenance personnel can only trace and resolve node faults at the infrastructure layer and cannot give users of upper-layer applications accurate hardware fault notifications.
Disclosure of Invention
In view of the problems in the related art, an object of the present invention is to provide a job scheduling management method and device that, by adding the hardware health score of each node associated with a job to the job exit information, help a user find ways to improve the job's running efficiency and analyze the cause of an abnormal exit.
According to an embodiment of the present invention, there is provided a job scheduling management method including: submitting a job and distributing the job to computing nodes for running; collecting the content of hardware faults occurring on the computing nodes; obtaining a hardware health score of each computing node according to the hardware fault content; and, when the job running on a computing node exits, feeding the hardware health score back to the user who submitted the job.
According to an embodiment of the invention, obtaining the hardware health score of the computing node according to the hardware fault content includes obtaining the hardware health score of the node according to whether the fault type affects the running of the job, wherein the hardware health score corresponding to a fault type that affects the running of the job is lower than the hardware health score corresponding to a fault type that does not.
According to an embodiment of the invention, the fault types that affect the running of the job include a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the first fault type affects the running of the job to a lesser degree than the second fault type.
According to an embodiment of the invention, collecting the content of hardware faults occurring on the computing nodes includes collecting hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
According to an embodiment of the invention, feeding the hardware health score back to the user who submitted the job includes feeding back to the user the lowest health score recorded while the job was running.
According to an embodiment of the invention, the job scheduling management method further includes: monitoring the content of hardware faults occurring on the computing nodes; and adjusting the distribution scheduling policy according to the monitoring result.
According to an embodiment of the present invention, there is provided a job scheduling management device including: a job submission and distribution module for submitting a job and distributing the job to computing nodes to run; a hardware fault collection module for collecting the content of hardware faults occurring on the computing nodes; a hardware health score obtaining module for obtaining a hardware health score of each computing node according to the hardware fault content; and a feedback module for feeding the hardware health score back to the user who submitted the job when the job running on a computing node exits.
According to an embodiment of the invention, the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type affects the running of the job, wherein the hardware health score corresponding to a fault type that affects the running of the job is lower than the hardware health score corresponding to a fault type that does not.
According to an embodiment of the invention, the fault types that affect the running of the job include a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the first fault type affects the running of the job to a lesser degree than the second fault type.
According to an embodiment of the invention, the hardware fault collection module is further configured to collect the hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
The invention has the beneficial technical effects that:
By adding the hardware health score of each node associated with a job to the job exit information, the invention helps a user find ways to improve the job's running efficiency and analyze the cause of an abnormal exit. Moreover, after a node fails, the invention provides data to support cluster maintenance by operation and maintenance administrators and gives users of the cluster's computing power a reference for analyzing why a job behaved abnormally. In addition, it improves understanding of the relationship between node fault curves and job anomaly curves, and builds experience across different node types and different job characteristics.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of a job scheduling management method according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a typical job scheduling management platform architecture in the prior art;
FIG. 3 is a schematic diagram of the position of the present solution in a typical job scheduling management platform according to one embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
As shown in FIG. 1, the present invention provides a job scheduling management method, including:
S11, submitting a job and distributing the job to computing nodes for running;
S12, collecting the content of hardware faults occurring on the computing nodes;
S13, obtaining a hardware health score of each computing node according to the hardware fault content;
S14, when the job running on a computing node exits, feeding the hardware health score back to the user who submitted the job.
In this technical solution, the hardware health score of each node associated with the job is added to the job exit information, which helps a user find ways to improve the job's running efficiency and analyze the cause of an abnormal exit. Moreover, after a node fails, the solution provides data to support cluster maintenance by operation and maintenance administrators and gives users of the cluster's computing power a reference for analyzing why a job behaved abnormally. In addition, it improves understanding of the relationship between node fault curves and job anomaly curves, and builds experience across different node types and different job characteristics.
S12 may specifically include collecting hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
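One possible way to represent the collected fault content is sketched below. The six component categories come from the text; the `Component` enum, the `FaultRecord` fields, and the `collect_faults` helper are hypothetical names introduced only for illustration, and the raw event format is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    POWER = "power supply"
    CPU = "cpu"
    MEMORY = "memory"
    DISK = "hard disk"
    NETWORK = "network"
    FAN = "fan"


@dataclass(frozen=True)
class FaultRecord:
    node: str
    component: Component
    description: str
    timestamp: float


def collect_faults(node: str, raw_events: list, now: float) -> list:
    """Turn raw (component_name, description) pairs into FaultRecord objects.

    Events from components outside the six monitored categories are ignored.
    """
    known = {c.value: c for c in Component}
    return [
        FaultRecord(node, known[name], text, now)
        for name, text in raw_events
        if name in known
    ]


records = collect_faults(
    "node01",
    [("cpu", "temperature too high"), ("gpu", "not a monitored component")],
    now=0.0,
)
```

How the raw events are obtained (for example from a baseboard management controller or system logs) is not specified in the text and is left out of the sketch.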
S13 may specifically include obtaining the hardware health score of the node according to whether the fault type affects the running of the job, where the hardware health score corresponding to a fault type that affects the running of the job is lower than the hardware health score corresponding to a fault type that does not.
The fault types that affect the running of the job include a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the first fault type affects the running of the job to a lesser degree than the second fault type.
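The ordering constraints on the scores can be illustrated with a small sketch. The numeric values 90, 60, and 20 are placeholders chosen only to satisfy the ordering stated above; they are not the values of Table 1. Taking the worst active fault as the node score is likewise an assumption, and `Impact`, `SCORE_BY_IMPACT`, and `node_health` are hypothetical names.

```python
from enum import IntEnum


class Impact(IntEnum):
    NONE = 0    # fault type that does not affect the running of the job
    MINOR = 1   # "first fault type": smaller effect on the job
    SEVERE = 2  # "second fault type": larger effect on the job


# Placeholder scores (assumed): only their relative order follows the text.
SCORE_BY_IMPACT = {Impact.NONE: 90, Impact.MINOR: 60, Impact.SEVERE: 20}
HEALTHY = 100  # a node with no fault at all


def node_health(active_faults: list) -> int:
    """Score a node by its worst active fault; 100 when there are no faults."""
    if not active_faults:
        return HEALTHY
    return min(SCORE_BY_IMPACT[impact] for impact in active_faults)


score = node_health([Impact.MINOR, Impact.SEVERE])  # the severe fault dominates
```

Other aggregation rules, such as subtracting a penalty per fault, would also respect the stated ordering; the minimum is used here because it keeps the score easy to interpret.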
S14 may specifically include feeding back to the user the lowest health score recorded while the job was running.
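A minimal sketch of selecting the lowest score within a job's run window follows. It assumes that health scores are sampled as (timestamp, score) pairs per node and that a window without samples counts as fully healthy (100); the function name and its parameters are hypothetical.

```python
def lowest_score_during(samples, start, end, default=100):
    """Return the lowest health score sampled in the closed interval [start, end].

    `samples` is an iterable of (timestamp, score) pairs for one node.
    If no sample falls inside the job's run window, `default` is returned.
    """
    return min(
        (score for t, score in samples if start <= t <= end),
        default=default,
    )


timeline = [(0, 100), (10, 60), (20, 100), (30, 40)]
lowest = lowest_score_during(timeline, start=5, end=25)
```

Reporting the minimum rather than the final score means a fault that occurred and cleared while the job was running still shows up in the exit information.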
The job scheduling management method provided by the invention may further include: S15, monitoring the content of hardware faults occurring on the computing nodes, and adjusting the distribution scheduling policy according to the monitoring result. By monitoring the occurrence of hardware faults on each computing node in the job scheduling system, the method provides a reference for decisions on adjusting the system's resource allocation and scheduling policy.
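The text does not say how the scheduling policy is adjusted. One plausible adjustment, shown purely as an assumption, is to exclude nodes whose health score falls below a threshold and to prefer healthier nodes among the rest. The function name `rank_candidates` and the threshold of 50 are hypothetical.

```python
def rank_candidates(health_by_node: dict, min_score: int = 50) -> list:
    """Return eligible nodes, healthiest first, ties broken by node name.

    Nodes whose score is below `min_score` are left out of scheduling.
    """
    eligible = [
        (score, node)
        for node, score in health_by_node.items()
        if score >= min_score
    ]
    return [node for score, node in sorted(eligible, key=lambda p: (-p[0], p[1]))]


order = rank_candidates({"node01": 100, "node02": 40, "node03": 80})
```

In this example `node02` is skipped because its score is under the threshold, and `node01` is offered before `node03`.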
In some embodiments, based on how a node's hardware fault condition affects the jobs running on that node, the technical solution of the invention provides a "node job associated health" algorithm and integrates it into the job scheduling management platform, so that node fault information is pushed to the jobs affected by the node fault.
A typical job scheduling management platform is shown in FIG. 2. Different users submit their jobs through a "job submission platform". The job submission platform adds each job to the scheduling queue, and the job scheduling module requests resources (including computing resources and storage resources) from the resource allocation module according to the job's resource requirements. Once the resources are granted, the job scheduling module deploys the job to the corresponding container (Caffe, TensorFlow, Ansys, Fluent), monitors the progress of the job in the container, and feeds the progress back to the user. When the job exits, whether successfully or with a failure, scheduling of that job ends.
In the invention, the node job associated health algorithm is added to the original architecture to monitor the jobs affected by a node after a hardware fault occurs, as shown in FIG. 3.
In one embodiment, the overall flow of the node job associated health algorithm is as follows:
(1) Acquire information on all computing nodes of the job scheduling management system.
(2) Monitor the job scheduling state of the job scheduling system and, when a new job is distributed to a plurality of nodes and starts running, establish the mapping between the nodes and the job id.
(3) Collect the hardware fault content generated on each computing node at a set frequency. Hardware fault content is collected mainly from the power supply, CPU, memory, hard disk, network, fan, and similar components.
(4) For each fault on a computing node, assign the node's current health score according to whether the fault type affects the running of the job. As shown in Table 1, the health score ranges from 0 to 100; the lower the score, the lower the health, and the higher the score, the higher the health. If a node experiences none of the faults listed in the table while a job is running, the node's health with respect to that job is 100.
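TABLE 1
[Table 1 is included only as an image in the original publication (Figure BDA0002339524740000061); its contents are not available in the text.]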
(5) When a job on a node finishes running and exits, the node's lowest health score during the running of that job is fed back to the job scheduling system, which returns the node-associated health score to the user together with the job's other exit information.
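Steps (2), (3), and (5) above can be sketched together as a small bookkeeping class. This is an illustration under assumptions, not the patented implementation: the class `NodeJobHealthMonitor`, its method names, and the shape of the exit report are hypothetical, and the scores passed to `on_sample` stand in for values that step (4) would derive from Table 1.

```python
from collections import defaultdict


class NodeJobHealthMonitor:
    def __init__(self):
        self.jobs_on_node = defaultdict(set)  # node -> ids of jobs running there
        self.lowest = {}                      # (job_id, node) -> lowest score seen

    def on_job_start(self, job_id, nodes):
        """Step (2): record which nodes the job runs on."""
        for node in nodes:
            self.jobs_on_node[node].add(job_id)
            self.lowest[(job_id, node)] = 100  # no fault observed yet

    def on_sample(self, node, score):
        """Step (3): a periodic health sample updates every job on that node."""
        for job_id in self.jobs_on_node[node]:
            key = (job_id, node)
            self.lowest[key] = min(self.lowest[key], score)

    def on_job_exit(self, job_id, exit_status):
        """Step (5): build exit information that includes per-node lowest scores."""
        report = {"job_id": job_id, "exit_status": exit_status, "node_health": {}}
        for node, jobs in self.jobs_on_node.items():
            if job_id in jobs:
                jobs.discard(job_id)
                report["node_health"][node] = self.lowest.pop((job_id, node))
        return report


monitor = NodeJobHealthMonitor()
monitor.on_job_start("job-42", ["node01", "node02"])
monitor.on_sample("node01", 100)
monitor.on_sample("node02", 60)   # a fault appears on node02
monitor.on_sample("node02", 100)  # and later clears
report = monitor.on_job_exit("job-42", "failed")
```

Keying the lowest score by the (job, node) pair lets several jobs share a node while each job only sees the faults that occurred during its own run.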
The invention also provides a job scheduling management device, including: a job submission and distribution module for submitting a job and distributing the job to computing nodes to run; a hardware fault collection module for collecting the content of hardware faults occurring on the computing nodes; a hardware health score obtaining module for obtaining a hardware health score of each computing node according to the hardware fault content; and a feedback module for feeding the hardware health score back to the user who submitted the job when the job running on a computing node exits.
In one embodiment, the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type affects the running of the job, where the hardware health score corresponding to a fault type that affects the running of the job is lower than the hardware health score corresponding to a fault type that does not.
In one embodiment, the fault types that affect the running of the job include a first fault type and a second fault type, wherein the first fault type corresponds to a higher hardware health score than the second fault type, and the first fault type affects the running of the job to a lesser degree than the second fault type.
In one embodiment, the hardware fault collection module is further configured to collect hardware fault content from the power supply, CPU, memory, hard disk, network, and fan components.
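The four-module structure of the device can be pictured as four pluggable functions wired together. The sketch below is only one way to express that composition: the class `JobSchedulingDevice`, the callable signatures, and the stub lambdas are all assumptions, and the score of 70 is a placeholder rather than a Table 1 value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JobSchedulingDevice:
    submit_and_dispatch: Callable[[str], List[str]]  # job id -> assigned nodes
    collect_faults: Callable[[str], List[str]]       # node -> fault descriptions
    score: Callable[[List[str]], int]                # faults -> health score
    feedback: Callable[[str, Dict[str, int]], str]   # job id, scores -> message

    def run(self, job_id: str) -> str:
        nodes = self.submit_and_dispatch(job_id)
        scores = {n: self.score(self.collect_faults(n)) for n in nodes}
        return self.feedback(job_id, scores)


# Stub modules for demonstration only.
device = JobSchedulingDevice(
    submit_and_dispatch=lambda job_id: ["node01", "node02"],
    collect_faults=lambda node: ["fan fault"] if node == "node02" else [],
    score=lambda faults: 100 if not faults else 70,
    feedback=lambda job_id, scores: f"{job_id} exited; node health: {scores}",
)
message = device.run("job-7")
```

Keeping each module behind a plain callable makes it possible to swap the scoring rule or the fault source without touching the other three modules.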
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A job scheduling management method is characterized by comprising the following steps:
submitting a job and distributing the job to a computing node for running;
collecting the content of hardware faults occurring on the computing nodes;
obtaining a hardware health score of the computing node according to the hardware fault content;
and when the job running on the computing node exits, feeding back the hardware health score to the user who submitted the job.
2. The job scheduling management method according to claim 1, wherein obtaining the hardware health score of the computing node according to the hardware fault content comprises obtaining the hardware health score of the node according to whether a fault type affects job running, wherein the hardware health score corresponding to the fault type affecting job running is lower than the hardware health score corresponding to the fault type not affecting job running.
3. The job scheduling management method according to claim 2, wherein the fault types affecting job running include a first fault type and a second fault type, wherein the hardware health score corresponding to the first fault type is higher than the hardware health score corresponding to the second fault type, and the degree of the effect of the first fault type on job running is smaller than the degree of the effect of the second fault type on job running.
4. The job scheduling management method according to claim 1, wherein collecting the content of hardware faults occurring on the computing nodes comprises: collecting the hardware fault content from power supply, CPU, memory, hard disk, network, and fan components.
5. The job scheduling management method according to claim 1, wherein feeding back the hardware health score to a user who submitted the job comprises feeding back a lowest health score of the job in operation to the user.
6. The job scheduling management method according to claim 1, further comprising:
monitoring the content of hardware faults occurring on the computing nodes;
and adjusting the distributed scheduling strategy according to the monitoring result.
7. A job scheduling management apparatus comprising:
a job submission and distribution module for submitting a job and distributing the job to computing nodes to run;
a hardware fault collection module for collecting the content of hardware faults occurring on the computing nodes;
a hardware health score obtaining module for obtaining a hardware health score of the computing node according to the hardware fault content;
and a feedback module for feeding back the hardware health score to the user who submitted the job when the job running on the computing node exits.
8. The job scheduling management device according to claim 7, wherein the hardware health score obtaining module is further configured to obtain the hardware health score of the node according to whether the fault type may affect the job running, where the hardware health score corresponding to the fault type that affects the job running is lower than the hardware health score corresponding to the fault type that does not affect the job running.
9. The job scheduling management apparatus according to claim 8, wherein the fault types affecting job execution include a first fault type and a second fault type, wherein a hardware health score corresponding to the first fault type is higher than a hardware health score corresponding to the second fault type, and a degree of influence of the first fault type on job execution is smaller than a degree of influence of the second fault type on job execution.
10. The job scheduling management device according to claim 7, wherein the hardware fault collection module is further configured to collect the hardware fault content from power supply, CPU, memory, hard disk, network, and fan components.
CN201911370441.XA 2019-12-26 2019-12-26 Job scheduling management method and device Pending CN111190713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911370441.XA CN111190713A (en) 2019-12-26 2019-12-26 Job scheduling management method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911370441.XA CN111190713A (en) 2019-12-26 2019-12-26 Job scheduling management method and device

Publications (1)

Publication Number Publication Date
CN111190713A true CN111190713A (en) 2020-05-22

Family

ID=70709582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911370441.XA Pending CN111190713A (en) 2019-12-26 2019-12-26 Job scheduling management method and device

Country Status (1)

Country Link
CN (1) CN111190713A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739292A (en) * 2009-12-04 2010-06-16 曙光信息产业(北京)有限公司 Application characteristic-based isomeric group operation self-adapting dispatching method and system
CN107358338A (en) * 2017-06-09 2017-11-17 国网冀北电力有限公司 A kind of multi-service and the D5000 system healths degree layering evaluation of priorities method of hardware fusion
CN108632086A (en) * 2018-04-19 2018-10-09 山东省计算中心(国家超级计算济南中心) A kind of concurrent job operation troubles localization method
CN109086134A (en) * 2018-07-19 2018-12-25 郑州云海信息技术有限公司 A kind of operation method and device of deep learning operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200522