CN113065848A

CN113065848A - Deep learning scheduling system and scheduling method supporting multi-class cluster back end

Info

Publication number: CN113065848A
Application number: CN202110360064.2A
Authority: CN
Inventors: 黄进军; 谢冬鸣; 林健
Original assignee: Dongyun Ruilian Wuhan Computing Technology Co ltd
Current assignee: Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority date: 2021-04-02
Filing date: 2021-04-02
Publication date: 2021-07-02

Abstract

The application provides a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends. The system comprises a job management component, a cluster management component and at least one back-end cluster, and each back-end cluster corresponds to one job scheduling component and a plurality of computing nodes. The cluster management component is responsible for connecting the various types of back-end clusters. The job management component assigns deep learning jobs to suitable clusters according to user requirements, after which the job scheduling component of the chosen cluster distributes the jobs to the computing nodes for execution. The job management component also monitors and records the execution status and resource usage of the jobs so that users can query and analyze them afterwards. The invention can provide a smooth transition scheme for architecture evolution and transformation of an enterprise platform, and can also fully utilize the computing resources of various types of clusters to improve the efficiency of distributed deep learning.

Description

Deep learning scheduling system and scheduling method supporting multi-class cluster back end

Technical Field

The application relates to the technical field of deep learning, in particular to a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends.

Background

Artificial intelligence and cloud computing technology have developed vigorously since the start of the 21st century. Deep learning is a foundation of artificial intelligence research: it builds neural networks that imitate how the human brain analyzes and learns, and uses that mechanism to interpret data such as images, sounds and text. Deep learning services fall into two levels. One level is oriented to artificial intelligence developers and provides the infrastructure services (hardware, software, algorithms, computing power and the like) required for algorithm development, model training, training visualization, model verification, service release and data inference. The other level is oriented to end users, such as mass consumers or technicians in specific industries, and mainly provides application-layer services with data inference at their core. By operation mode, deep learning services can be divided into a micro-service mode and a batch job mode. The micro-service mode naturally supports servitization. Batch jobs have several different service modes depending on the scenario; the following table lists the scheduling frameworks used by the common service modes of deep learning batch jobs and their main applicable scenarios.

Common service modes for deep learning batch jobs

Big data scheduling framework: has a mature ecosystem, interacts well with big data components, and makes it easy to build data-centric workflows; its scalability and fault tolerance design is comparatively complete, and it is suitable for deployment on an existing big data cluster.

High performance scheduling framework: interacts well with high-performance computing, communication and storage components, meets the requirements of deep learning training for large-scale matrix operations and distributed communication, and is particularly suitable for deep learning engines optimized on MPI; its stability and scalability have been verified in large-scale supercomputing environments, and it is suitable for deployment on existing supercomputing infrastructure.

Containerized scheduling framework: designed specifically around the requirements and characteristics of cloud services, interacts well with cloud service infrastructure, and makes service delivery in the cloud very convenient; resource elasticity and fault tolerance are its main advantages, and it is suitable for deployment in an existing cloud computing environment.

Traditional scheduling frameworks are each designed around the characteristics of their own field and operating environment. Although each can handle its own workloads, their operating principles and usage differ greatly, which hinders environment migration, resource integration and expansion into new application fields. How to fully utilize the specific capabilities of the various cluster types (containerized, high-performance and big data clusters) and combine their respective advantages, thereby expanding the application field of the deep learning platform and improving the utilization efficiency of cluster resources, has become a problem to be solved urgently.

Disclosure of Invention

To address the defects in the prior art, a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends are provided, which can provide a smooth transition scheme for architecture evolution and transformation of an enterprise platform, fully utilize the computing resources of various types of clusters, and improve the efficiency of distributed deep learning.

The system comprises a job management component, a cluster management component and at least one back-end cluster;

the job management component is used for receiving a deep learning job request that is submitted by an end user through a preset interface and conforms to a uniform abstract data format, and for parsing the job information according to the uniform abstract data format of deep learning jobs;

the job management component is further used for acquiring, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;

the job management component is further used for converting the unified job format data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that the matched target back-end cluster can accept;

the job management component is further used for invoking the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;

the job management component is further used for converting the target job response result into the uniform abstract data format;

the job management component is further used for returning the response result in the uniform abstract data format to the end user.

Preferably, the types of the backend cluster include at least one of a high performance cluster, a containerized cluster, and a big data cluster.

Preferably, the high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes back-end cluster is accessed through the REST API provided by Kubernetes; the Slurm back-end cluster is accessed through the command line tools provided by Slurm.

Preferably, the job management component is configured to provide a REST API for submitting deep learning jobs in a uniform abstract data format;

the job management component is further configured to provide a REST API that obtains a state of the deep learning job in a uniform abstract data format;

the job management component is further configured to provide a REST API to stop deep learning jobs in a uniform abstract data format;

the job management component is also used for internally handling the conversion from the external uniform abstract job format to the concrete format of the cluster-side driver;

the job management component is also used for sending the unified job request to the back-end job cluster.

Preferably, the cluster management component is configured to add a back-end job cluster;

the cluster management component is also used for querying the metadata information of the back-end job cluster.

Preferably, the cluster management component is configured to access one or more back-end clusters simultaneously, and the types of back-end cluster that can be accessed depend on the adaptation support provided by the component.

The cluster management component is further configured to provide a uniform abstract description of multiple types of backend clusters, where the description content at least includes: cluster name, cluster type, cluster access address and cluster authentication information;

the cluster management component is also used for providing a method for inquiring the information of all back-end clusters;

the cluster management component is also used for providing a method for monitoring the state of a back-end cluster and canceling back-end cluster monitoring, wherein the cluster management component acquires the latest state information and the related runtime information of the deep learning operation by monitoring the cluster;

the cluster management component is also used for providing an API (application programming interface) for the client to perform cluster management and query cluster information.
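The cluster management capabilities listed above (a uniform cluster description, querying all clusters, and starting or canceling monitoring) can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions: the names `ClusterInfo`, `ClusterRegistry`, `watch`, `unwatch` and `notify` are hypothetical, the credential is treated as an opaque string, and the embodiment itself is described as Java middleware backed by a database rather than Python.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ClusterInfo:
    """Uniform abstract description of one back-end cluster."""
    name: str
    cluster_type: str      # e.g. "kubernetes" or "slurm"
    access_address: str    # API endpoint or login host
    auth_info: str         # opaque credential reference (hypothetical)


class ClusterRegistry:
    """In-memory stand-in for the cluster management component."""

    def __init__(self) -> None:
        self._clusters: Dict[str, ClusterInfo] = {}
        self._watchers: Dict[str, Callable[[str], None]] = {}

    def add_cluster(self, info: ClusterInfo) -> None:
        """Register a back-end cluster; names must be unique."""
        if info.name in self._clusters:
            raise ValueError(f"cluster {info.name!r} already registered")
        self._clusters[info.name] = info

    def list_clusters(self) -> List[ClusterInfo]:
        """Return all registered clusters in registration order."""
        return list(self._clusters.values())

    def find_by_type(self, cluster_type: str) -> List[ClusterInfo]:
        """Return the clusters of one type, e.g. every Slurm cluster."""
        return [c for c in self._clusters.values()
                if c.cluster_type == cluster_type]

    def watch(self, name: str, callback: Callable[[str], None]) -> None:
        """Start monitoring a cluster; callback receives state updates."""
        if name not in self._clusters:
            raise KeyError(name)
        self._watchers[name] = callback

    def unwatch(self, name: str) -> None:
        """Cancel monitoring of a cluster (no error if not watched)."""
        self._watchers.pop(name, None)

    def notify(self, name: str, state: str) -> bool:
        """Deliver a state update; returns whether a watcher received it."""
        callback = self._watchers.get(name)
        if callback is None:
            return False
        callback(state)
        return True
```

In this sketch a real deployment would call `notify` from whatever mechanism observes the cluster; here it is invoked by hand so the behavior can be checked without any cluster.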

Preferably, the cluster management component is configured to provide unified entry points for creating, stopping and deleting jobs on multiple types of clusters;

the cluster management component is also used for implementing a uniform, abstract job data interface;

the cluster management component is also used for implementing life cycle management of the uniform abstract job;

the cluster management component is also used for providing a uniform access interface for end users;

the cluster management component is further configured to support scheduling of deep learning jobs in multiple operating modes, including but not limited to: a single-process mode, a multi-process mode, a PS-Worker distributed mode, a Master-Worker distributed mode, and an MPI distributed mode;

the cluster management component is further configured to provide cluster-side driver adaptation for each type of cluster environment, including but not limited to: support for submitting jobs, stopping jobs, and acquiring job state;

the cluster management component is also used for providing a uniform method for querying job state, job logs and job resource usage across the multiple types of clusters.
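The operating modes and the three driver capabilities named above (submit, stop, acquire state) suggest a common driver interface that every cluster type implements. The following is a minimal sketch of that idea, assuming hypothetical names (`JobMode`, `ClusterDriver`, `FakeDriver`) and a hypothetical `mode` key in the job; the source does not specify how the drivers are structured internally.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict


class JobMode(Enum):
    """Operating modes listed in the text."""
    SINGLE_PROCESS = "single-process"
    MULTI_PROCESS = "multi-process"
    PS_WORKER = "ps-worker"
    MASTER_WORKER = "master-worker"
    MPI = "mpi"


class ClusterDriver(ABC):
    """Operations every cluster-side driver is expected to adapt."""

    @abstractmethod
    def submit(self, job: dict) -> str:
        """Submit a job and return a cluster-specific job id."""

    @abstractmethod
    def stop(self, job_id: str) -> None:
        """Stop a previously submitted job."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return the current state of a job."""


class FakeDriver(ClusterDriver):
    """Test double that keeps job states in memory instead of
    contacting a real cluster (illustration only)."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._states: Dict[str, str] = {}

    def submit(self, job: dict) -> str:
        JobMode(job["mode"])  # raises ValueError for an unsupported mode
        job_id = f"{self._prefix}-{len(self._states) + 1}"
        self._states[job_id] = "RUNNING"
        return job_id

    def stop(self, job_id: str) -> None:
        self._states[job_id] = "STOPPED"

    def status(self, job_id: str) -> str:
        return self._states[job_id]
```

Keeping the interface this small means a new cluster type only has to supply three methods before the rest of the system can schedule onto it.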

In addition, in order to achieve the above object, the present invention further provides a scheduling method based on a deep learning scheduling system supporting multiple types of cluster backend, where the system includes a job management component, a cluster management component, and at least one backend cluster;

correspondingly, the scheduling method comprises the following steps:

receiving, by the job management component, a deep learning job request that is submitted by an end user through a preset interface and conforms to a uniform abstract data format, and parsing the job information according to the uniform abstract data format of deep learning jobs;

acquiring, by the job management component, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;

converting, by the job management component, the unified job format data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that the matched target back-end cluster can accept;

invoking, by the job management component, the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;

converting, by the job management component, the target job response result into the uniform abstract data format, and returning the response result in the uniform abstract data format to the end user.

The invention has the beneficial effects that: the method can provide a smooth transition scheme for architecture evolution and transformation of an enterprise platform, and can also fully utilize computing resources of various types of clusters to improve the efficiency of distributed deep learning.

Specifically: (1) for the end user, deep learning jobs can be run on a plurality of different types of back-end clusters in a unified way, which avoids coupling jobs to resources and yields high job scheduling efficiency; (2) for a cluster operator, the utilization efficiency of cluster hardware resources can be improved, and existing cluster investment is fully utilized to reduce the cost of building a deep learning cluster.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments of the present application or the background art will be briefly described below.

FIG. 1 is a schematic diagram of an architecture of a deep learning scheduling system supporting multiple types of cluster backend according to the present invention;

FIG. 2 is a schematic diagram of the structure of the job management component of the scheduling system of the present invention;

FIG. 3 is a schematic flow chart of a scheduling method of a deep learning scheduling system supporting multiple types of cluster backend according to the present invention;

FIG. 4 is a schematic structural diagram of a cluster management component of the scheduling system of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a deep learning scheduling system supporting multiple types of cluster backend provided in the present application, where the system includes a job management component, a cluster management component, and at least one backend cluster;

Specifically, the uniform abstract data format is a JSON format; the preset interface is an REST API interface;

the types of the back-end clusters include at least one of high performance clusters, containerized clusters, and big data clusters. The high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster;

the Kubernetes back-end cluster is accessed through the REST API provided by Kubernetes; the Slurm back-end cluster is accessed through the command line tools provided by Slurm.

It can be understood that the user submits a unified job request through the preset interface provided by the job management component, and the job request carries the basic information of the job. The job management component performs back-end cluster type adaptation and format processing according to the job information carried by the request, queries the cluster management component for back-end cluster information, sends the job request to the corresponding back-end cluster, and returns the obtained response to the user after converting it to the uniform format.

In a specific implementation, each back-end cluster corresponds to one job scheduling component and a plurality of computing nodes. The cluster management component is responsible for connecting the multiple types of back-end clusters, the job management component assigns deep learning jobs to appropriate clusters according to user requirements, and the job scheduling component then distributes the jobs to the computing nodes for execution. Meanwhile, the job management component monitors and records the execution status and resource usage of the jobs so that users can query and analyze them afterwards. The invention can provide a smooth transition scheme for architecture evolution and transformation of an enterprise platform, and can also fully utilize the computing resources of various types of clusters to improve the efficiency of distributed deep learning.

As shown in FIG. 1, an important feature of the embodiment of the present application is support for multiple types of back-end job clusters; this embodiment implements support for two types of job cluster, Kubernetes and Slurm.

The following table outlines the uniform abstract data format of a deep learning job according to an embodiment of the present application. The deep learning job data format in the embodiment of the present application includes, but is not limited to, the following fields:

Field name	Field type	Field description
displayName	String	Job name
imageSpec	Object	Job image
programSpec	Object	Program configuration
resourceSpec	Object	Resource configuration
logSpec	Object	Log configuration
renderSpec	Object	Rendering configuration
runtimeInfo	Object	Runtime information
createTime	DateTime	Creation time

The specific uniform abstract format for deep learning in the embodiment of the present application is described in JSON as follows:
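The JSON listing itself is not reproduced in this text. As a rough illustration only, a job document carrying the top-level fields from the table above might be parsed and checked as in the following sketch. The top-level field names and types come from the table; everything nested inside the `...Spec` objects (`image`, `command`, `cpu`, `memoryGiB`, `gpu`, `path`), the example values, and the `parse_job` helper are assumptions made for the example.

```python
import json
from datetime import datetime

# Top-level fields and their JSON types, taken from the field table.
TOP_LEVEL_FIELDS = {
    "displayName": str,
    "imageSpec": dict,
    "programSpec": dict,
    "resourceSpec": dict,
    "logSpec": dict,
    "renderSpec": dict,
    "runtimeInfo": dict,
    "createTime": str,   # DateTime carried as an ISO 8601 string (assumed)
}

# Hypothetical job document; nested keys are illustrative only.
EXAMPLE_JOB = """
{
  "displayName": "mnist-train",
  "imageSpec": {"image": "example/trainer:latest"},
  "programSpec": {"command": ["python", "train.py"]},
  "resourceSpec": {"cpu": 4, "memoryGiB": 16, "gpu": 1},
  "logSpec": {"path": "/logs"},
  "renderSpec": {},
  "runtimeInfo": {},
  "createTime": "2021-04-02T10:00:00+08:00"
}
"""


def parse_job(text: str) -> dict:
    """Parse a job document and verify the top-level fields and types."""
    job = json.loads(text)
    for name, expected in TOP_LEVEL_FIELDS.items():
        if name not in job:
            raise ValueError(f"missing field: {name}")
        if not isinstance(job[name], expected):
            raise ValueError(f"field {name} must be {expected.__name__}")
    # Convert the timestamp string into a datetime for later use.
    job["createTime"] = datetime.fromisoformat(job["createTime"])
    return job
```

Validating only the top level keeps the check aligned with what the table actually specifies and leaves the nested objects free to vary per deployment.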

Referring to FIG. 2, FIG. 2 is a schematic diagram of the structure of the job management component of the scheduling system of the present invention;

the job management component of the scheduling system of the present invention is a first Java middleware developed with Spring Boot, which provides an access interface to end users in the form of a REST API, wherein:

the first Java middleware is used for providing a REST API for submitting deep learning jobs in the uniform abstract data format;

the first Java middleware is also used for providing a REST API for acquiring the state of deep learning jobs in the uniform abstract data format;

the first Java middleware is further used for providing a REST API for stopping deep learning jobs in the uniform abstract data format;

the first Java middleware is also used for internally handling the conversion from the external uniform abstract job format to the concrete format of the cluster-side driver;

the first Java middleware is also used for sending the unified job request to the back-end job cluster.
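The three REST operations described above (submit, get state, stop) can be pictured as a small routing table. The sketch below is an in-memory stand-in written in Python for brevity, not the Spring Boot middleware of the embodiment; the class name `JobApi`, the paths `/jobs`, `/jobs/{id}` and `/jobs/{id}/stop`, the status codes and the state names are all assumptions, since the source does not give the actual routes.

```python
import itertools
from typing import Dict, Optional, Tuple


class JobApi:
    """Dispatches (method, path) pairs the way a REST layer might."""

    def __init__(self) -> None:
        self._jobs: Dict[str, dict] = {}
        self._ids = itertools.count(1)

    def handle(self, method: str, path: str,
               body: Optional[dict] = None) -> Tuple[int, dict]:
        parts = [p for p in path.split("/") if p]

        # POST /jobs : submit a job in the uniform abstract format.
        if method == "POST" and parts == ["jobs"]:
            job_id = str(next(self._ids))
            self._jobs[job_id] = {
                "id": job_id,
                "displayName": (body or {}).get("displayName", ""),
                "state": "SUBMITTED",
            }
            return 201, dict(self._jobs[job_id])

        if len(parts) >= 2 and parts[0] == "jobs" and parts[1] in self._jobs:
            job = self._jobs[parts[1]]
            # GET /jobs/{id} : acquire the job state.
            if method == "GET" and len(parts) == 2:
                return 200, dict(job)
            # POST /jobs/{id}/stop : stop the job.
            if method == "POST" and parts[2:] == ["stop"]:
                job["state"] = "STOPPED"
                return 200, dict(job)

        return 404, {"error": "not found"}
```

Every response uses the same dictionary shape, which mirrors the idea that callers only ever see the uniform format regardless of which back end runs the job.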

Further, with continued reference to FIG. 2, FIG. 2 illustrates how the job management component interacts with the multiple types of back-end cluster. The job management component in the embodiment of the application comprises multi-class cluster drivers implemented in Java, each of which communicates with a back-end cluster through the interface that the cluster provides, so as to submit job requests and acquire the job running state. The job management component in the embodiment of the present application includes drivers for Kubernetes and Slurm clusters, which interact with the specific back-end cluster using the REST API provided by Kubernetes and the command line tools provided by Slurm, respectively.
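To make the difference between the two transports concrete, the sketch below builds, without executing anything, the HTTP request a Kubernetes driver could send and the command lines a Slurm driver could run. The function names are hypothetical, and the choice of the Kubernetes `batch/v1` Job resource and of the `sbatch`, `squeue` and `scancel` tools is an assumption about how such drivers might work; the source only says that the Kubernetes REST API and the Slurm command line tools are used.

```python
from typing import List, Tuple


def kubernetes_submit_request(api_server: str, namespace: str,
                              manifest: dict) -> Tuple[str, str, dict]:
    """Describe the REST call that creates a batch/v1 Job.

    Returns (HTTP method, URL, JSON body); sending it is left to the caller.
    """
    base = api_server.rstrip("/")
    url = f"{base}/apis/batch/v1/namespaces/{namespace}/jobs"
    return "POST", url, manifest


def slurm_submit_command(script_path: str) -> List[str]:
    """Command line that submits a batch script; --parsable makes
    sbatch print the job id in a machine-readable form."""
    return ["sbatch", "--parsable", script_path]


def slurm_status_command(job_id: str) -> List[str]:
    """Command line that prints only the state of one job
    (-h: no header, -j: job id, -o %T: state column)."""
    return ["squeue", "-h", "-j", job_id, "-o", "%T"]


def slurm_stop_command(job_id: str) -> List[str]:
    """Command line that cancels one job."""
    return ["scancel", job_id]
```

Returning plain data instead of performing the call keeps the drivers easy to test; a real driver would pass the request to an HTTP client and the argument lists to a process runner, together with the cluster's authentication information.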

Referring to FIG. 3, FIG. 3 is a schematic flow chart of a scheduling method of a deep learning scheduling system supporting multi-class cluster back ends according to the present invention, where the scheduling method includes:

step S10, the job management component receives a deep learning job request that is submitted by an end user through a preset interface and conforms to the uniform abstract data format, and parses the job information according to the uniform abstract data format of deep learning jobs;

step S20, the job management component acquires, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;

step S30, the job management component converts the unified job format data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that the matched target back-end cluster can accept;

step S40, the job management component invokes the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;

step S50, the job management component converts the target job response result into the uniform abstract data format and returns it to the end user.
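Steps S10 to S50 can be read as a single pipeline. The following sketch wires them together with pluggable converters and drivers so the flow can be followed end to end. It is an illustration under assumptions: the `schedule` function, the `clusterType` key inside `runtimeInfo`, the rule of matching a cluster purely by its type, and the shape of the native response (`id`, `state`) are all hypothetical, since the source does not define the matching criteria or the response fields.

```python
import json
from typing import Callable, Dict, List


def schedule(request_text: str,
             clusters: List[dict],
             converters: Dict[str, Callable[[dict], object]],
             drivers: Dict[str, Callable[[dict, object], dict]]) -> str:
    """Run one job request through steps S10 to S50."""
    # S10: parse the request in the uniform abstract data format.
    job = json.loads(request_text)

    # S20: find a target cluster. Matching by type is an assumption.
    wanted = job["runtimeInfo"]["clusterType"]
    target = next((c for c in clusters if c["type"] == wanted), None)
    if target is None:
        return json.dumps({"displayName": job["displayName"],
                           "state": "REJECTED",
                           "reason": "no matching cluster"})

    # S30: convert the unified job into the cluster's own format.
    native_job = converters[target["type"]](job)

    # S40: submit through the driver for that cluster type.
    native_response = drivers[target["type"]](target, native_job)

    # S50: convert the native response back into the uniform format.
    unified = {
        "displayName": job["displayName"],
        "cluster": target["name"],
        "jobId": str(native_response["id"]),
        "state": str(native_response["state"]).upper(),
    }
    return json.dumps(unified)
```

Because converters and drivers are looked up by cluster type, supporting another back end in this sketch means adding one entry to each dictionary, with no change to the pipeline itself.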

Further, referring to FIG. 4, FIG. 4 is a schematic structural diagram of the cluster management component of the scheduling system of the present invention;

the cluster management component is a second Java middleware developed with Spring Boot;

the cluster management component is used for adding a back-end job cluster;

Specifically, the cluster management component is configured to access one or more back-end clusters simultaneously, where the types of back-end cluster that can be accessed depend on the adaptation support provided by the component.

Specifically, the second Java middleware is used for providing unified entry points for creating, stopping and deleting jobs on the multiple types of clusters;

the second Java middleware is also used for implementing a uniform, abstract job data interface;

the second Java middleware is also used for implementing life cycle management of the uniform abstract job;

the second Java middleware is also used for providing a uniform access interface for the end user;

the second Java middleware is also used for supporting scheduling of deep learning jobs in multiple operating modes, including but not limited to: a single-process mode, a multi-process mode, a PS-Worker distributed mode, a Master-Worker distributed mode, and an MPI distributed mode;

the second Java middleware is further configured to provide cluster-side driver adaptation for each type of cluster environment, including but not limited to: support for submitting jobs, stopping jobs, and acquiring job state;

the second Java middleware is also used for providing a uniform method for querying job state, job logs and job resource usage across the multiple types of clusters.

It can be understood that the cluster management component of this embodiment stores the metadata information of the multiple clusters in its own database and, in addition to providing the basic management capability, exposes a REST API for the job management component of this embodiment to call; in other embodiments, the cluster management component may be deployed on its own as a separate component or may be included as a module in the job management component.

In a specific implementation, the following illustrates the job format after the job management component performs format conversion for a back-end Kubernetes job cluster according to an embodiment of the present application.

The following illustrates the job format after the job management component performs format conversion for a back-end Slurm job cluster according to an embodiment of the present application.

As can be seen by comparing the job format data of the Kubernetes and Slurm job clusters, in the embodiment of the present application the same job is described differently when it is submitted to different types of back-end job cluster.
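The contrast between the two target formats can be illustrated by converting one hypothetical unified job into a Kubernetes Job manifest and into a Slurm batch script. This is a sketch, not the conversion used in the embodiment: the nested keys of the unified job (`image`, `command`, `cpu`, `memoryGiB`), the function names, and the selection of manifest fields and `#SBATCH` options are assumptions chosen for the example.

```python
import shlex


def to_kubernetes_job(job: dict) -> dict:
    """Describe the job as a Kubernetes batch/v1 Job manifest."""
    resources = job["resourceSpec"]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job["displayName"]},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": job["imageSpec"]["image"],
                        "command": list(job["programSpec"]["command"]),
                        "resources": {"limits": {
                            "cpu": str(resources["cpu"]),
                            "memory": f"{resources['memoryGiB']}Gi",
                        }},
                    }],
                }
            }
        },
    }


def to_slurm_script(job: dict) -> str:
    """Describe the same job as an sbatch script.

    The container image is not used here; how a Slurm cluster would
    provide the software environment is outside this sketch.
    """
    resources = job["resourceSpec"]
    command = " ".join(shlex.quote(arg)
                       for arg in job["programSpec"]["command"])
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job['displayName']}",
        f"#SBATCH --cpus-per-task={resources['cpu']}",
        f"#SBATCH --mem={resources['memoryGiB']}G",
        command,
    ]
    return "\n".join(lines) + "\n"
```

The same four inputs (name, command, CPU count, memory) end up as nested JSON fields in one case and as script directives in the other, which is the difference in description that the comparison above points to.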

Advantageous effects: the invention provides a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends, in which a single set of software can simultaneously support job scheduling on containerized, high-performance and big data clusters. The beneficial effects are as follows: (1) for the end user, deep learning jobs can be run on a plurality of different types of back-end clusters in a unified way, which avoids coupling jobs to resources and yields high job scheduling efficiency; (2) for a cluster operator, the utilization efficiency of cluster hardware resources can be improved, and existing cluster investment is fully utilized to reduce the cost of building a deep learning cluster.

Claims

1. A deep learning scheduling system supporting multi-class cluster back ends, characterized by comprising a job management component, a cluster management component and at least one back-end cluster;

2. The scheduling system of claim 1 wherein the types of the back-end clusters comprise at least one of high performance clusters, containerized clusters, and big data clusters.

3. The scheduling system of claim 2 wherein the high performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes back-end cluster is accessed through the REST API provided by Kubernetes; the Slurm back-end cluster is accessed through the command line tools provided by Slurm.

4. The scheduling system of claim 1 wherein

the job management component is used for providing a REST API for submitting deep learning jobs in a uniform abstract data format;

the job management component is used for providing a REST API for acquiring the state of the deep learning job in a uniform abstract data format;

the job management component is used for providing a REST API for stopping deep learning jobs in a uniform abstract data format;

the job management component is further used for sending the unified job request to the back-end job cluster.

5. The scheduling system of any one of claims 1-4 wherein,

6. The scheduling system of claim 5 wherein the cluster management component is configured to access one or more back-end clusters simultaneously, the back-end clusters being of a type related to adaptation support provided by the component.

7. The scheduling system of claim 5 wherein the cluster management component is configured to provide unified job creation, stop, and deletion operation entries for multiple types of clusters;

8. A scheduling method based on a deep learning scheduling system supporting multi-class cluster back ends, characterized in that the system comprises a job management component, a cluster management component and at least one back-end cluster;

correspondingly, the scheduling method comprises the following steps:

9. The scheduling method of claim 8, wherein the type of the back-end cluster comprises at least one of a high performance cluster, a containerized cluster, and a big data cluster;

the high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes cluster uses an REST API interface to interact with a back-end cluster; the Slurm cluster interacts with the back-end cluster using a command line tool provided by Slurm.

10. The scheduling method according to any one of claims 1-4,