CN115048216B - Resource management and scheduling method, device, and equipment for an artificial intelligence cluster - Google Patents


Info

Publication number
CN115048216B
Authority
CN
China
Prior art keywords
gpu
node
resource
scheduling
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210609937.3A
Other languages
Chinese (zh)
Other versions
CN115048216A (en)
Inventor
李铭琨
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210609937.3A priority Critical patent/CN115048216B/en
Publication of CN115048216A publication Critical patent/CN115048216A/en
Application granted granted Critical
Publication of CN115048216B publication Critical patent/CN115048216B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 9/5027: Allocation of resources (e.g., of the CPU) to service a request, the resource being a machine (CPUs, servers, terminals)
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 9/5077: Logical partitioning of resources; management or configuration of virtualized resources
    • G06N 20/00: Machine learning
    • G06F 2009/45562: Creating, deleting, cloning virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a resource management and scheduling method, device, and equipment for an artificial intelligence cluster. The resource management and scheduling method comprises the following steps: after the GPU management module deploys the GPU driver installation service of a GPU node to that node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module, where the GPU driver installation service includes installing the GPU driver on the physical machine in a containerized manner; the node management module sends the GPU resource configuration information of the node to the information storage module for storage; when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes. This technical scheme solves the problem that GPU resources and network resources in existing artificial intelligence clusters cannot be effectively configured and utilized.

Description

Resource management and scheduling method, device, and equipment for an artificial intelligence cluster
Technical Field
The invention relates to the technical field of artificial intelligence clusters, and in particular to a resource management and scheduling method, device, and equipment for an artificial intelligence cluster.
Background
A graphics processing unit (GPU), also known as a display core, vision processor, or display chip, is a microprocessor that performs image- and graphics-related operations on personal computers, workstations, game consoles, and some mobile devices (e.g., tablets and smartphones).
During the development of artificial intelligence, continuous GPU iteration has accelerated the speed and scale of deep learning training, and the traditional single-node training mode is gradually being replaced by multi-machine, multi-card training.
In artificial intelligence clusters, "GPU" generally refers to the GPU accelerator cards used for deep learning. In large-scale artificial intelligence clusters, efficient configuration and utilization of GPU resources is often not achieved. Ensuring a high utilization rate of GPU resources has gradually become a key problem in deep learning training, since it improves both cluster resource utilization and training efficiency.
Meanwhile, network transmission speed has a growing influence on artificial intelligence training tasks. How to reasonably manage and schedule GPU resources and network resources so that both are effectively configured and utilized is a problem to be solved in the prior art.
Disclosure of Invention
To solve the above technical problems, the invention provides a resource management and scheduling method, device, and equipment for an artificial intelligence cluster, which address the problem that GPU resources and network resources in existing artificial intelligence clusters cannot be effectively configured and utilized.
To achieve the above purpose, the present invention provides a resource management and scheduling method for an artificial intelligence cluster, where an information storage module, a resource scheduling module, and a plurality of GPU nodes are provided in the cluster, and each GPU node is provided with a node management module and a GPU management module.
The resource management and scheduling method comprises the following steps:
after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
Further, the resource management and scheduling method further includes:
after the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module acquires the network resource configuration information of the node and sends it to the node management module; the network card driver installation service includes installing the network card driver of the node on the physical machine in a containerized manner;
the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to the preset scheduling policy, based on the resource information requested by the task and the network resource configuration information of all GPU nodes in the information storage module.
Further, before the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management and scheduling method further includes:
the resource scheduling module screens out a plurality of candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects a candidate GPU node with GPU resource affinity as the target GPU node, where all GPUs in the target GPU node share the same communication connection mode;
the resource scheduling module then selects, within the target GPU node, network cards with that same communication connection mode for network resource scheduling.
Further, before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, the resource management and scheduling method further includes:
the resource scheduling module deploys the GPU virtualization service of the GPU node to the GPU node;
and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
Further, before the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management and scheduling method further includes:
the resource scheduling module selects, from a candidate GPU node, a plurality of virtual resource candidate groups with resource affinity, where the virtual GPU and the virtual network card in each candidate group share the same communication connection mode;
when the number of virtual resource candidate groups reaches the number of resources demanded by the deep learning task, the resource scheduling module takes that candidate GPU node as the target GPU node.
Further, the preset scheduling policy includes at least one of the following:
sorting and scheduling all deep learning tasks according to task scheduling priority;
scheduling all deep learning tasks on a first-in, first-out basis;
scheduling all deep learning tasks on the principle that the highest-priority queue, and the highest-priority task within it, is scheduled first.
Further, after the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management and scheduling method further includes:
the node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.
The invention also provides a resource management and scheduling device for an artificial intelligence cluster, which implements the above resource management and scheduling method and comprises:
a GPU management module, configured to deploy the GPU driver installation service of the GPU node to the GPU node, acquire the GPU resource configuration information of the node, and send it to the node management module, where the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
a node management module, configured to send the GPU resource configuration information of the GPU node to the information storage module for storage;
a resource scheduling module, configured to, when a deep learning task is received, send the task to a target GPU node according to the preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
The invention also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
The present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of:
after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
The artificial intelligence cluster is provided with a plurality of GPU nodes, an information storage module, and a resource scheduling module, and each GPU node is provided with a node management module and a GPU management module. The node management module manages its GPU node; the GPU management module manages the GPU resources within that node; the information storage module uniformly stores the resource configuration information of the cluster; and the resource scheduling module uniformly manages and schedules each resource.
First, within a single GPU node, the GPU management module deploys the GPU driver installation service of that node to the node, so that the GPU driver is installed on the physical machine in a containerized manner.
After installation is complete, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the node management module sends this information to the information storage module in the cluster for storage.
Each GPU node sends its own GPU resource configuration information to the information storage module for unified storage, so that the information storage module ultimately holds the GPU resource configuration information of all GPU nodes in the artificial intelligence cluster.
When the cluster receives a deep learning task, the resource scheduling module first acquires the resource information requested by the task; then, combining the GPU resource configuration information of all GPU nodes in the information storage module, it schedules the task according to the preset scheduling policy and sends it to a target GPU node, which processes it.
In this way, installing the GPU drivers of the GPU nodes on the physical machines through containers allows the GPU resources of the nodes to be shared, improving their utilization efficiency.
At the same time, since the information storage module uniformly stores all GPU resource configuration information and the resource scheduling module uniformly schedules the GPU resources of all nodes in the cluster, GPU resource configuration efficiency and cluster resource utilization are both improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a resource management scheduling method of an artificial intelligent cluster in a first embodiment of the invention;
FIG. 2 is a block diagram of a resource management scheduling device of an artificial intelligent cluster in a practical embodiment of the invention;
FIG. 3 is a flowchart of a method for resource management scheduling of an artificial intelligence cluster in a practical embodiment of the invention;
fig. 4 is an internal structure diagram of a computer device in the second embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment one:
As shown in FIG. 1, an embodiment of the present invention provides a resource management and scheduling method for an artificial intelligence cluster, where an information storage module, a resource scheduling module, and a plurality of GPU nodes are disposed in the cluster, and each GPU node is provided with a node management module and a GPU management module.
The resource management and scheduling method comprises the following steps:
S1: after the GPU management module deploys the GPU driver installation service of a GPU node to that node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
S2: the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
S3: when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
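The report-then-schedule flow of steps S1 to S3 can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`InfoStore`, `report`, `schedule`); the patent does not specify any concrete API.

```python
# Sketch of S1-S3: each node reports its GPU configuration to a
# cluster-wide store (S1/S2); the scheduler then picks a target node
# whose free GPUs satisfy the task's request (S3). All names here are
# illustrative assumptions, not interfaces defined by the patent.

class InfoStore:
    """Information storage module: per-node GPU resource configuration."""
    def __init__(self):
        self.gpu_config = {}                   # node name -> free GPU count

    def report(self, node, gpu_count):
        self.gpu_config[node] = gpu_count      # S2: node management reports


def schedule(store, requested_gpus):
    """S3: return the first node whose free GPUs satisfy the request."""
    for node, free in store.gpu_config.items():
        if free >= requested_gpus:
            return node
    return None                                # no node can host the task


store = InfoStore()
store.report("gpu-node-1", 2)                  # S1/S2: nodes register
store.report("gpu-node-2", 8)
target = schedule(store, 4)                    # S3: task requests 4 GPUs
print(target)                                  # -> gpu-node-2
```

A real implementation would also track network resources and apply the preset scheduling policy; this sketch only shows the data flow between the three modules.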
In a specific embodiment, a plurality of GPU nodes, an information storage module, and a resource scheduling module are arranged in the artificial intelligence cluster, and each GPU node is provided with a node management module and a GPU management module. The node management module manages its GPU node; the GPU management module manages the GPU resources within that node; the information storage module uniformly stores the resource configuration information of the cluster; and the resource scheduling module uniformly manages and schedules each resource.
First, within a single GPU node, the GPU management module deploys the GPU driver installation service of that node to the node, so that the GPU driver is installed on the physical machine in a containerized manner.
After installation is complete, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module, which forwards it to the information storage module in the cluster for storage.
Each GPU node sends its own GPU resource configuration information to the information storage module for unified storage, so that the information storage module ultimately holds the GPU resource configuration information of all GPU nodes in the cluster.
When the cluster receives a deep learning task, the resource scheduling module first acquires the resource information requested by the task; then, combining the GPU resource configuration information of all GPU nodes in the information storage module, it schedules the task according to the preset scheduling policy and sends it to a target GPU node, which processes it.
In this way, installing the GPU drivers of the GPU nodes on the physical machines through containers allows the GPU resources of the nodes to be shared, improving their utilization efficiency.
At the same time, since the information storage module uniformly stores all GPU resource configuration information and the resource scheduling module uniformly schedules the GPU resources of all nodes in the cluster, GPU resource configuration efficiency and cluster resource utilization are both improved.
As shown in FIG. 2, in an actual embodiment, a deployment module is further provided in the artificial intelligence cluster; it is configured to deploy the GPU management module, the network management module, the node management module, the information storage module, and the scheduling module across the whole cluster.
The deployment module may also deploy Kubernetes into the entire cluster. Kubernetes (K8s) is a container orchestration and scheduling engine open-sourced by Google and derived from its internal Borg system. A K8s cluster is generally distributed and comprises master nodes and worker nodes: the master node is mainly responsible for cluster control and for scheduling tasks and resources, while the worker nodes carry the workloads.
In addition, the storage module can be run as a single-point service or as a highly available service to ensure functional stability.
In a preferred embodiment, in S3, the preset scheduling policy includes at least one of the following:
sorting and scheduling all deep learning tasks according to task scheduling priority;
scheduling all deep learning tasks on a first-in, first-out basis;
scheduling all deep learning tasks on the principle that the highest-priority queue, and the highest-priority task within it, is scheduled first.
In a specific embodiment, the scheduling scheme may be selected according to actual requirements, for example a priority queue, a first-in-first-out queue, maximum resource utilization, and the like.
The three scheduling cases are as follows:
the scheduling module may put tasks into a scheduling queue, rank them by scheduling priority, and select the task with the highest priority;
if the scheduling queue is a first-in-first-out queue, all tasks are scheduled on a first-in, first-scheduled basis;
if high-priority-queue-first processing is adopted, the queue with the highest priority is selected according to queue priority, and then the highest-priority task within that queue is selected for scheduling.
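The first two queue disciplines described above can be illustrated with Python's `heapq`. This is a hedged sketch, not the patent's implementation; the class and parameter names are assumptions.

```python
import heapq
from itertools import count

# Illustrative sketch of two of the queue disciplines above: a priority
# queue (highest priority scheduled first) and a FIFO queue. Names such
# as TaskQueue and submit() are hypothetical.

class TaskQueue:
    def __init__(self, fifo=False):
        self._heap, self._seq, self._fifo = [], count(), fifo

    def submit(self, name, priority=0):
        # FIFO mode ignores priority; otherwise negate it so the heap
        # (a min-heap) pops the highest priority first. Submission order
        # (self._seq) breaks ties, which also yields first-in, first-out.
        key = 0 if self._fifo else -priority
        heapq.heappush(self._heap, (key, next(self._seq), name))

    def next_task(self):
        return heapq.heappop(self._heap)[2]


q = TaskQueue()
q.submit("train-a", priority=1)
q.submit("train-b", priority=5)
print(q.next_task())   # -> train-b (highest priority first)

f = TaskQueue(fifo=True)
f.submit("t1")
f.submit("t2")
print(f.next_task())   # -> t1 (first in, first scheduled)
```

The third discipline (highest-priority queue first, then highest-priority task within it) would nest one such queue per priority class.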
Meanwhile, to meet the affinity requirements of the GPU or the network card, the scheduling scheme may preferentially use GPUs and network cards with affinity, thereby increasing the invocation speed and efficiency of GPU, network card, and other resources.
In a preferred embodiment, the resource management and scheduling method further includes:
S5: after the GPU management module deploys the network card driver installation service of the GPU node to the node, the GPU management module acquires the network resource configuration information of the node and sends it to the node management module; the network card driver installation service includes installing the network card driver of the node on the physical machine in a containerized manner;
S6: the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
S7: when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to the preset scheduling policy, based on the resource information requested by the task and the network resource configuration information of all GPU nodes in the information storage module.
In a specific embodiment, installing the network card driver of the GPU node on the physical machine through a container likewise allows the network card and network resources of the node to be shared, improving their utilization efficiency.
At the same time, since the information storage module uniformly stores all network resource configuration information and the resource scheduling module uniformly schedules the network resources of all GPU nodes in the cluster, network resource configuration efficiency and cluster resource utilization are both improved.
In a preferred embodiment, in S3, before the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management and scheduling method further includes:
S311: the resource scheduling module screens out a plurality of candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects a candidate GPU node with GPU resource affinity as the target GPU node, where all GPUs in the target GPU node share the same communication connection mode;
S312: the resource scheduling module selects, within the target GPU node, network cards with that same communication connection mode for network resource scheduling.
In specific embodiments, after determining a particular scheduling task, the scheduling module may select candidate nodes based on their remaining resources and traverse them.
To improve the invocation speed and efficiency of resources, the scheduling module may preferentially select a node with GPU and network card affinity as the target GPU node. GPU resource affinity means that the GPUs share the same communication connection mode; preferentially using GPUs with affinity makes communication between them faster. Network resource affinity means that the network card's communication connection mode matches that of the GPUs, which further improves communication efficiency between the GPUs.
Therefore, taking a candidate GPU node with GPU and network card affinity as the target GPU node, and processing the deep learning task on it, effectively improves task processing efficiency and the utilization efficiency of each resource in the cluster.
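The affinity selection described above (screen by remaining GPUs, prefer nodes whose GPUs share one interconnect, then pick matching network cards) can be sketched as follows. The node data layout and link-type labels (`nvlink`, `pcie`) are illustrative assumptions.

```python
# Sketch of S311/S312: screen candidate nodes by remaining GPU count,
# prefer a node whose GPUs all share one communication connection mode
# ("GPU resource affinity"), then select NICs of that same mode.
# The data layout below is an assumption for illustration.

nodes = {
    "node-a": {"free_gpus": 4, "gpu_links": {"nvlink"},
               "nics": {"nic0": "nvlink", "nic1": "pcie"}},
    "node-b": {"free_gpus": 4, "gpu_links": {"nvlink", "pcie"},
               "nics": {"nic0": "pcie"}},
}

def pick_target(nodes, requested_gpus):
    # S311, first step: screen candidates by remaining GPU resources.
    candidates = {n: d for n, d in nodes.items()
                  if d["free_gpus"] >= requested_gpus}
    for name, d in candidates.items():
        if len(d["gpu_links"]) == 1:           # all GPUs share one link type
            link = next(iter(d["gpu_links"]))
            # S312: NICs with the same communication connection mode.
            nics = [n for n, l in d["nics"].items() if l == link]
            return name, nics
    return None, []

target, nics = pick_target(nodes, 4)
print(target, nics)    # -> node-a ['nic0']
```

Here `node-b` is rejected despite having enough free GPUs, because its GPUs mix two connection modes and therefore lack affinity.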
In a preferred embodiment, before S1, the resource management scheduling method further includes:
the resource scheduling module deploys GPU virtualization services of the GPU nodes to the GPU nodes;
And/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
GPU virtualization can divide GPU resources into a plurality of virtual resources so as to allocate configuration and improve the utilization rate of the GPU resources;
Similarly, the network card virtualization can divide the network card/network resource into a plurality of virtual resources so as to allocate and configure and improve the utilization rate of the network resource.
Through GPU virtualization and network card virtualization, the resources in each node can be configured and utilized more effectively, thereby improving the utilization rate of cluster resources.
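A minimal sketch of the virtualization step described above, assuming a fixed number of slices per physical unit; the slice granularity and the naming scheme are illustrative assumptions, not part of the patent.

```python
def virtualize(physical_units, slices_per_unit):
    """Carve each physical unit (GPU or network card) into named virtual slices."""
    return [f"{unit}/v{i}"
            for unit in physical_units
            for i in range(slices_per_unit)]

# Two physical GPUs split into four virtual GPUs each,
# one physical network card split into two virtual network cards.
vgpus = virtualize(["gpu0", "gpu1"], slices_per_unit=4)
vnics = virtualize(["nic0"], slices_per_unit=2)
```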
In a preferred embodiment, in S4, before the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management scheduling method further includes:
S321, a resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
S322, when the number of the virtual resource candidate groups reaches the number of resource demands of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
When the GPUs or network cards in a candidate node have been virtualized in advance, virtual GPU and virtual network card resources belonging to the same communication connection mode can be selected within that node, and virtual resource groups matching the required quantity are chosen from them to process the deep learning task.
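The grouping in S321/S322 can be sketched as follows; the tuple layout and the connection-mode labels are assumptions made for illustration.

```python
def candidate_groups(vgpus, vnics):
    """vgpus/vnics: lists of (resource_id, connection_mode) tuples.
    Pair each virtual GPU with a free virtual network card of the same mode."""
    groups, free_nics = [], list(vnics)
    for gpu_id, gpu_mode in vgpus:
        for nic in free_nics:
            if nic[1] == gpu_mode:
                groups.append((gpu_id, nic[0]))  # one affinity candidate group
                free_nics.remove(nic)
                break
    return groups

def is_target_node(vgpus, vnics, demand):
    """S322: the node qualifies once enough candidate groups exist."""
    return len(candidate_groups(vgpus, vnics)) >= demand

vgpus = [("vgpu0", "NVLink"), ("vgpu1", "NVLink"), ("vgpu2", "PCIe")]
vnics = [("vnic0", "NVLink"), ("vnic1", "NVLink")]
```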
In a preferred embodiment, after S4, the resource management scheduling method further includes:
and the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
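The residual-resource update above can be modeled minimally as below; the information storage module is represented by an in-memory dictionary purely for illustration, whereas a real cluster would typically use a distributed store.

```python
class InfoStore:
    """Stand-in for the information storage module."""
    def __init__(self):
        self._remaining = {}

    def update(self, node, remaining_gpus):
        # node management module reports the node's remaining GPU resources
        self._remaining[node] = remaining_gpus

    def remaining(self, node):
        return self._remaining.get(node)

store = InfoStore()
store.update("node-b", 8)      # initial report: 8 GPUs free
store.update("node-b", 8 - 2)  # after a 2-GPU task is scheduled onto the node
```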
As shown in fig. 3, in a practical embodiment, the implementation process of the resource management scheduling method of the artificial intelligence cluster is as follows:
The deployment module deploys Kubernetes into the entire cluster, together with the other related modules.
The GPU management module deploys relevant services to GPU nodes according to whether each node is a GPU node. Related services include, but are not limited to: the GPU driver, a container toolkit, monitoring, GPU virtualization, and the like. The collected information is then reported to the node management module.
Meanwhile, the network management module configures the network on each node according to the node's network card type and configuration file, and deploys relevant services, such as network card virtualization, to the relevant nodes. This information is likewise reported to the node management module.
The node management module may store all of the information described above to the information storage module.
After the information storage module has stored all relevant GPU and network information, the scheduling module can schedule tasks according to the resource information it stores and the resources requested by the deep learning task. Scheduling may follow a common scheme, for example a priority queue, a first-in-first-out queue, or maximum resource utilization. Meanwhile, to meet the GPU and network card affinity requirements, the scheduler can adopt a scheme that preferentially uses GPUs and network cards with affinity.
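The common scheduling schemes just mentioned (a first-in-first-out queue and a priority queue) can be sketched as follows; the task fields are illustrative assumptions.

```python
import heapq
from collections import deque

def fifo_order(tasks):
    """First-in-first-out: dispatch tasks in arrival order."""
    q = deque(tasks)
    return [q.popleft()["name"] for _ in range(len(q))]

def priority_order(tasks):
    """Priority queue: lower number = higher priority; heapq pops smallest first.
    The arrival index breaks ties so equal-priority tasks stay in FIFO order."""
    heap = [(t["priority"], i, t["name"]) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

tasks = [
    {"name": "train-a", "priority": 2},
    {"name": "train-b", "priority": 0},
    {"name": "train-c", "priority": 1},
]
```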
The scheduling module issues the scheduling task to the node management module of the node to be scheduled; the node management module starts the relevant training task according to the resource consumption and updates the remaining resource information to the information storage module for subsequent use.
It should be noted that, although the steps in the flowchart are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
Embodiment two:
The embodiment of the invention also provides a resource management scheduling device of the artificial intelligent cluster, which is used for realizing the above resource management scheduling method of the artificial intelligent cluster, and which comprises:
The GPU management module is used for deploying the GPU driving installation service of the GPU node to the GPU node, acquiring the GPU resource configuration information of the GPU node and sending the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing a GPU driver of a GPU node on a physical machine in a containerization mode;
the node management module is used for sending the GPU resource configuration information of the GPU node to the information storage module for storage;
and the resource scheduling module is used for sending the deep learning task to the target GPU node according to a preset scheduling strategy according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module when the deep learning task is received.
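A hedged sketch of how the three modules above might cooperate; all class and method names here are illustrative placeholders rather than the patent's actual interfaces, and the scheduling policy is reduced to a simple capacity check.

```python
class InformationStorage:
    """Holds per-node GPU resource configuration information."""
    def __init__(self):
        self.config = {}

class NodeManagement:
    """Forwards a node's GPU configuration to the information storage module."""
    def __init__(self, store):
        self.store = store
    def report(self, node, gpu_config):
        self.store.config[node] = gpu_config

class GpuManagement:
    """Deploys the (containerized) driver install service, then reports config."""
    def __init__(self, node_mgmt):
        self.node_mgmt = node_mgmt
    def deploy_driver_and_report(self, node, gpu_config):
        # a containerized GPU driver install would run on the physical machine here
        self.node_mgmt.report(node, gpu_config)

class ResourceScheduler:
    """Picks any node whose stored configuration satisfies the request."""
    def __init__(self, store):
        self.store = store
    def schedule(self, requested_gpus):
        for node, cfg in self.store.config.items():
            if cfg["gpus"] >= requested_gpus:
                return node
        return None

store = InformationStorage()
gpu_mgmt = GpuManagement(NodeManagement(store))
gpu_mgmt.deploy_driver_and_report("node-1", {"gpus": 4})
target = ResourceScheduler(store).schedule(requested_gpus=2)
```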
In a preferred embodiment, the GPU management module is further configured to: deploying the network card drive installation service of the GPU node to the GPU node, acquiring network resource configuration information of the GPU node, and sending the network resource configuration information to the node management module; the network card driver installation service comprises the steps of installing a network card driver of the GPU node on a physical machine in a containerization mode;
The node management module is further configured to: the network resource configuration information of the GPU node is sent to an information storage module for storage;
The resource scheduling module is further configured to: and when the deep learning task is received, the deep learning task is sent to the target GPU node according to a preset scheduling strategy according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the resource scheduling module is further configured to:
screening out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selecting the candidate GPU nodes with GPU resource affinity from the candidate GPU nodes as target GPU nodes; wherein, the communication connection modes of all GPUs in the target GPU node are the same;
And selecting a network card with the same communication connection mode from the target GPU node for scheduling network resources.
In a preferred embodiment, the resource scheduling module is further configured to:
Deploying GPU virtualization services of the GPU nodes to the GPU nodes;
and/or deploying the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the resource scheduling module is further configured to:
Selecting a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
And when the number of the virtual resource candidate groups reaches the number of resource demands of the deep learning task, taking the candidate GPU node as a target GPU node.
In a preferred embodiment, the node management module is further configured to: and sending the residual GPU resource information of the target GPU node to an information storage module for updating.
For specific limitations of the above apparatus, reference may be made to the limitations of the method described above, which are not repeated here.
Each of the modules in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
The computer device may be a terminal which, as shown in fig. 4, includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The network interface of the computer device is used for communicating with an external terminal through a network connection. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, keys, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
It is to be understood that the structures shown in the above figures are merely block diagrams of some of the structures associated with the present invention and are not limiting of the computer devices to which the present invention may be applied; a particular computer device may include more or fewer components than those shown, or combine some of the components, or have a different arrangement of components.
Embodiment III:
The embodiment of the invention also provides a computer device, which comprises a memory, a processor and a computer program, wherein the computer program is stored on the memory and can run on the processor, and the following steps are realized when the processor executes the computer program:
S1, after a GPU management module deploys a GPU driving installation service of a GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing a GPU driver of a GPU node on a physical machine in a containerization mode;
S2, the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
And S4, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and GPU resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
S5, after the GPU management module deploys the network card drive installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service comprises the steps of installing a network card driver of the GPU node on a physical machine in a containerization mode;
S6, the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
And S7, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
Before sending the deep learning task to the target GPU node according to a preset scheduling strategy, the resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity from the candidate GPU nodes as the target GPU node; wherein, the communication connection modes of all GPUs in the target GPU node are the same; the resource scheduling module selects network cards with the same communication connection mode in the target GPU node and is used for scheduling network resources.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
Before a GPU management module obtains GPU resource configuration information of a GPU node and sends the GPU resource configuration information to a node management module, a resource scheduling module deploys GPU virtualization services of the GPU node on the GPU node; and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
Before sending the deep learning task to the target GPU node according to a preset scheduling strategy, the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode; when the number of virtual resource candidate groups reaches the number of resource demands of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
after the deep learning task is sent to the target GPU node according to a preset scheduling strategy, a node management module in the target GPU node sends the residual GPU resource information of the target GPU node to an information storage module for updating.
Embodiment four:
the embodiment of the invention further provides a computer readable storage medium storing a computer program which when executed by a processor realizes the following steps:
S1, after a GPU management module deploys a GPU driving installation service of a GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing a GPU driver of a GPU node on a physical machine in a containerization mode;
S2, the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
And S4, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and GPU resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
S5, after the GPU management module deploys the network card drive installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service comprises the steps of installing a network card driver of the GPU node on a physical machine in a containerization mode;
S6, the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
And S7, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
Before sending the deep learning task to the target GPU node according to a preset scheduling strategy, the resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity from the candidate GPU nodes as the target GPU node; wherein, the communication connection modes of all GPUs in the target GPU node are the same; the resource scheduling module selects network cards with the same communication connection mode in the target GPU node and is used for scheduling network resources.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
Before a GPU management module obtains GPU resource configuration information of a GPU node and sends the GPU resource configuration information to a node management module, a resource scheduling module deploys GPU virtualization services of the GPU node on the GPU node; and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
Before sending the deep learning task to the target GPU node according to a preset scheduling strategy, the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode; when the number of virtual resource candidate groups reaches the number of resource demands of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
after the deep learning task is sent to the target GPU node according to a preset scheduling strategy, a node management module in the target GPU node sends the residual GPU resource information of the target GPU node to an information storage module for updating.
It will be appreciated that all or part of the flows of the above method embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the above method embodiments.
Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. The resource management scheduling method of the artificial intelligent cluster is characterized in that an information storage module, a resource scheduling module and a plurality of GPU nodes are arranged in the artificial intelligent cluster; the GPU node is provided with a node management module and a GPU management module;
the resource management scheduling method comprises the following steps:
after the GPU management module deploys the GPU drive installation service of the GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing the GPU driver of the GPU node on a physical machine in a containerization mode;
The node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and GPU resource configuration information of all GPU nodes in the information storage module.
2. The method for resource management scheduling of an artificial intelligence cluster according to claim 1, further comprising:
After the GPU management module deploys the network card drive installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service comprises the step of installing the network card driver of the GPU node on a physical machine in a containerization mode;
the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module and the preset scheduling strategy.
3. The method according to claim 2, wherein before the deep learning task is sent to the target GPU node according to a preset scheduling policy, the method further comprises:
The resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity from the candidate GPU nodes as the target GPU node; wherein, the communication connection modes of all GPUs in the target GPU node are the same;
And the resource scheduling module selects network cards with the same communication connection mode from the target GPU node and is used for scheduling network resources.
4. The method according to claim 3, wherein before the GPU management module obtains GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module, the method further comprises:
the resource scheduling module deploys GPU virtualization services of the GPU node to the GPU node;
And/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
5. The method according to claim 4, wherein before the deep learning task is sent to the target GPU node according to a preset scheduling policy, the method further comprises:
The resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
And when the number of the virtual resource candidate groups reaches the number of resource demands of the deep learning task, the resource scheduling module takes the candidate GPU node as the target GPU node.
6. The method for resource management scheduling of an artificial intelligence cluster according to claim 1, wherein the preset scheduling policy comprises at least one of:
sequencing and scheduling all the deep learning tasks according to the task scheduling priority level;
scheduling all deep learning tasks according to a first-in first-out principle;
and scheduling all the deep learning tasks according to a high-priority queue and a high-priority task priority scheduling principle.
7. The method according to claim 1, further comprising, after the deep learning task is sent to the target GPU node according to a preset scheduling policy:
and the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
8. A resource management scheduling device of an artificial intelligence cluster, for implementing a resource management scheduling method of an artificial intelligence cluster according to any one of claims 1-7, the resource management scheduling device comprising:
The GPU management module is used for deploying the GPU driving installation service of the GPU node to the GPU node, acquiring the GPU resource configuration information of the GPU node and sending the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing the GPU driver of the GPU node on a physical machine in a containerization mode;
The node management module is used for sending the GPU resource configuration information of the GPU node to the information storage module for storage;
And the resource scheduling module is used for sending the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module when the deep learning task is received and the preset scheduling strategy.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the resource management scheduling method of an artificial intelligence cluster according to any one of claims 1-7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the resource management scheduling method of an artificial intelligence cluster according to any one of claims 1-7.
CN202210609937.3A 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster Active CN115048216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210609937.3A CN115048216B (en) 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster


Publications (2)

Publication Number Publication Date
CN115048216A CN115048216A (en) 2022-09-13
CN115048216B true CN115048216B (en) 2024-06-04

Family

ID=83158949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210609937.3A Active CN115048216B (en) 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster

Country Status (1)

Country Link
CN (1) CN115048216B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20240082474A (en) * 2022-12-01 2024-06-11 삼성전자주식회사 Electronic device to provide artificial intelligence service and method for controlling thereof
CN115617364B (en) * 2022-12-20 2023-03-14 中化现代农业有限公司 GPU virtualization deployment method, system, computer equipment and storage medium
CN115965517B (en) * 2023-01-09 2023-10-20 摩尔线程智能科技(北京)有限责任公司 Graphics processor resource management method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885762A (en) * 2017-09-19 2018-04-06 北京百度网讯科技有限公司 Intelligent big data system, the method and apparatus that intelligent big data service is provided
CN112346859A (en) * 2020-10-26 2021-02-09 北京市商汤科技开发有限公司 Resource scheduling method and device, electronic equipment and storage medium
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN113301102A (en) * 2021-02-03 2021-08-24 阿里巴巴集团控股有限公司 Resource scheduling method, device, edge cloud network, program product and storage medium
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
WO2022033024A1 (en) * 2020-08-12 2022-02-17 ***股份有限公司 Distributed training method and apparatus of deep learning model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220083389A1 (en) * 2020-09-16 2022-03-17 Nutanix, Inc. Ai inference hardware resource scheduling



Similar Documents

Publication Publication Date Title
CN115048216B (en) Resource management scheduling method, device and equipment of artificial intelligent cluster
CN110888743B (en) GPU resource using method, device and storage medium
CN110489213B (en) Task processing method and processing device and computer system
CN108337109B (en) Resource allocation method and device and resource allocation system
CN110704186A (en) Computing resource allocation method and device based on hybrid distribution architecture and storage medium
US20130219385A1 (en) Batch scheduler management of virtual machines
CN111597042A (en) Service thread running method and device, storage medium and electronic equipment
CN112905326B (en) Task processing method and device
CN111274033B (en) Resource deployment method, device, server and storage medium
CN111190712A (en) Task scheduling method, device, equipment and medium
CN109117244B (en) Method for implementing virtual machine resource application queuing mechanism
CN111338779A (en) Resource allocation method, device, computer equipment and storage medium
CN115686805A (en) GPU resource sharing method and device, and GPU resource sharing scheduling method and device
US20220229695A1 (en) System and method for scheduling in a computing system
CN117311990B (en) Resource adjustment method and device, electronic equipment, storage medium and training platform
CN114598665A (en) Resource scheduling method and device, computer readable storage medium and electronic equipment
CN112148481B (en) Method, system, equipment and medium for executing simulation test task
CN113986539A (en) Method, device, electronic equipment and readable storage medium for realizing pod fixed IP
CN112114958A (en) Resource isolation method, distributed platform, computer device, and storage medium
CN116578416A (en) Signal-level simulation acceleration method based on GPU virtualization
CN115564635A (en) GPU resource scheduling method and device, electronic equipment and storage medium
CN115756756A (en) Video memory resource allocation method, device and equipment based on GPU virtualization technology
CN112114959B (en) Resource scheduling method, distributed system, computer device and storage medium
CN113742059A (en) Task allocation method and device, computer equipment and storage medium
CN106055410A (en) Cloud computing memory resource allocation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant