CN115048216B - Resource management and scheduling method, device, and equipment for an artificial intelligence cluster - Google Patents


Info

Publication number
CN115048216B
Authority
CN
China
Prior art keywords
gpu
node
resource
scheduling
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210609937.3A
Other languages
Chinese (zh)
Other versions
CN115048216A (en)
Inventor
李铭琨
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210609937.3A priority Critical patent/CN115048216B/en
Publication of CN115048216A publication Critical patent/CN115048216A/en
Application granted granted Critical
Publication of CN115048216B publication Critical patent/CN115048216B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 9/5027: Allocation of resources (e.g., of the CPU) to service a request, the resource being a machine (CPUs, servers, terminals)
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 9/5077: Logical partitioning of resources; management or configuration of virtualized resources
    • G06N 20/00: Machine learning
    • G06F 2009/45562: Creating, deleting, cloning virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a resource management and scheduling method, device, and equipment for an artificial intelligence cluster. The resource management and scheduling method comprises the following steps: after the GPU management module deploys the GPU driver installation service of a GPU node to that node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module, where the GPU driver installation service includes installing the GPU driver on the physical machine in a containerized manner; the node management module sends the GPU resource configuration information of the node to the information storage module for storage; when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes. This technical scheme solves the problem that GPU resources and network resources in existing artificial intelligence clusters cannot be effectively configured and utilized.

Description

Resource management and scheduling method, device, and equipment for an artificial intelligence cluster
Technical Field
The invention relates to the technical field of artificial intelligence clusters, and in particular to a resource management and scheduling method, device, and equipment for an artificial intelligence cluster.
Background
A graphics processing unit (GPU), also known as a display core, vision processor, or display chip, is a microprocessor that performs image- and graphics-related operations on personal computers, workstations, game consoles, and some mobile devices (e.g., tablets and smartphones).
During the development of artificial intelligence, continuous GPU iteration has accelerated the speed and scale of deep learning training, and the traditional single-node training mode is gradually being replaced by multi-machine, multi-card training.
In artificial intelligence clusters, "GPU" generally refers to the GPU accelerator cards used for deep learning. In large-scale artificial intelligence clusters, efficient configuration and utilization of GPU resources is often not achieved. Ensuring a high utilization rate of GPU resources has gradually become a key problem in deep learning training, since it improves both cluster resource utilization and training efficiency.
Meanwhile, network transmission speed has a growing influence on artificial intelligence training tasks. How to reasonably manage and schedule GPU resources and network resources so that both are effectively configured and utilized is a problem to be solved in the prior art.
Disclosure of Invention
To solve the above technical problems, the invention provides a resource management and scheduling method, device, and equipment for an artificial intelligence cluster, which address the problem that GPU resources and network resources in existing artificial intelligence clusters cannot be effectively configured and utilized.
To achieve the above purpose, the present invention provides a resource management and scheduling method for an artificial intelligence cluster, where an information storage module, a resource scheduling module, and a plurality of GPU nodes are provided in the cluster, and each GPU node is provided with a node management module and a GPU management module.
The resource management and scheduling method comprises the following steps:
after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
Further, the resource management and scheduling method further includes:
after the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module acquires the network resource configuration information of the node and sends it to the node management module; the network card driver installation service includes installing the network card driver of the node on the physical machine in a containerized manner;
the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to the preset scheduling policy, based on the resource information requested by the task and the network resource configuration information of all GPU nodes in the information storage module.
Further, before the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management and scheduling method further includes:
the resource scheduling module screens out a plurality of candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects a candidate GPU node with GPU resource affinity as the target GPU node, where all GPUs in the target GPU node share the same communication connection mode;
the resource scheduling module then selects, within the target GPU node, network cards with that same communication connection mode for network resource scheduling.
Further, before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, the resource management and scheduling method further includes:
the resource scheduling module deploys the GPU virtualization service of the GPU node to the GPU node;
and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
Further, before the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management and scheduling method further includes:
the resource scheduling module selects, from a candidate GPU node, a plurality of virtual resource candidate groups with resource affinity, where the virtual GPU and the virtual network card in each candidate group share the same communication connection mode;
when the number of virtual resource candidate groups reaches the number of resources demanded by the deep learning task, the resource scheduling module takes that candidate GPU node as the target GPU node.
Further, the preset scheduling policy includes at least one of the following:
sorting and scheduling all deep learning tasks according to task scheduling priority;
scheduling all deep learning tasks on a first-in, first-out basis;
scheduling all deep learning tasks on the principle that the highest-priority queue, and the highest-priority task within it, is scheduled first.
Further, after the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management and scheduling method further includes:
the node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.
The invention also provides a resource management and scheduling device for an artificial intelligence cluster, which implements the above resource management and scheduling method and comprises:
a GPU management module, configured to deploy the GPU driver installation service of the GPU node to the GPU node, acquire the GPU resource configuration information of the node, and send it to the node management module, where the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
a node management module, configured to send the GPU resource configuration information of the GPU node to the information storage module for storage;
a resource scheduling module, configured to, when a deep learning task is received, send the task to a target GPU node according to the preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
The invention also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
The present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of:
after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
The artificial intelligence cluster is provided with a plurality of GPU nodes, an information storage module, and a resource scheduling module, and each GPU node is provided with a node management module and a GPU management module. The node management module manages its GPU node; the GPU management module manages the GPU resources within that node; the information storage module uniformly stores the resource configuration information of the cluster; and the resource scheduling module uniformly manages and schedules each resource.
First, within a single GPU node, the GPU management module deploys the GPU driver installation service of that node to the node, so that the GPU driver is installed on the physical machine in a containerized manner.
After installation is complete, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the node management module sends this information to the information storage module in the cluster for storage.
Each GPU node sends its own GPU resource configuration information to the information storage module for unified storage, so that the information storage module ultimately holds the GPU resource configuration information of all GPU nodes in the artificial intelligence cluster.
When the cluster receives a deep learning task, the resource scheduling module first acquires the resource information requested by the task; then, combining the GPU resource configuration information of all GPU nodes in the information storage module, it schedules the task according to the preset scheduling policy and sends it to a target GPU node, which processes it.
In this way, installing the GPU drivers of the GPU nodes on the physical machines through containers allows the GPU resources of the nodes to be shared, improving their utilization efficiency.
At the same time, since the information storage module uniformly stores all GPU resource configuration information and the resource scheduling module uniformly schedules the GPU resources of all nodes in the cluster, GPU resource configuration efficiency and cluster resource utilization are both improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a resource management scheduling method of an artificial intelligent cluster in a first embodiment of the invention;
FIG. 2 is a block diagram of a resource management scheduling device of an artificial intelligent cluster in a practical embodiment of the invention;
FIG. 3 is a flowchart of a method for resource management scheduling of an artificial intelligence cluster in a practical embodiment of the invention;
fig. 4 is an internal structure diagram of a computer device in the second embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment one:
As shown in FIG. 1, an embodiment of the present invention provides a resource management and scheduling method for an artificial intelligence cluster, where an information storage module, a resource scheduling module, and a plurality of GPU nodes are disposed in the cluster, and each GPU node is provided with a node management module and a GPU management module.
The resource management and scheduling method comprises the following steps:
S1: after the GPU management module deploys the GPU driver installation service of a GPU node to that node, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module; the GPU driver installation service includes installing the GPU driver of the node on the physical machine in a containerized manner;
S2: the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
S3: when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
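The report-then-schedule flow of steps S1 to S3 can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`InfoStore`, `report`, `schedule`); the patent does not specify any concrete API.

```python
# Sketch of S1-S3: each node reports its GPU configuration to a
# cluster-wide store (S1/S2); the scheduler then picks a target node
# whose free GPUs satisfy the task's request (S3). All names here are
# illustrative assumptions, not interfaces defined by the patent.

class InfoStore:
    """Information storage module: per-node GPU resource configuration."""
    def __init__(self):
        self.gpu_config = {}                   # node name -> free GPU count

    def report(self, node, gpu_count):
        self.gpu_config[node] = gpu_count      # S2: node management reports


def schedule(store, requested_gpus):
    """S3: return the first node whose free GPUs satisfy the request."""
    for node, free in store.gpu_config.items():
        if free >= requested_gpus:
            return node
    return None                                # no node can host the task


store = InfoStore()
store.report("gpu-node-1", 2)                  # S1/S2: nodes register
store.report("gpu-node-2", 8)
target = schedule(store, 4)                    # S3: task requests 4 GPUs
print(target)                                  # -> gpu-node-2
```

A real implementation would also track network resources and apply the preset scheduling policy; this sketch only shows the data flow between the three modules.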
In a specific embodiment, a plurality of GPU nodes, an information storage module, and a resource scheduling module are arranged in the artificial intelligence cluster, and each GPU node is provided with a node management module and a GPU management module. The node management module manages its GPU node; the GPU management module manages the GPU resources within that node; the information storage module uniformly stores the resource configuration information of the cluster; and the resource scheduling module uniformly manages and schedules each resource.
First, within a single GPU node, the GPU management module deploys the GPU driver installation service of that node to the node, so that the GPU driver is installed on the physical machine in a containerized manner.
After installation is complete, the GPU management module acquires the GPU resource configuration information of the node and sends it to the node management module, which forwards it to the information storage module in the cluster for storage.
Each GPU node sends its own GPU resource configuration information to the information storage module for unified storage, so that the information storage module ultimately holds the GPU resource configuration information of all GPU nodes in the cluster.
When the cluster receives a deep learning task, the resource scheduling module first acquires the resource information requested by the task; then, combining the GPU resource configuration information of all GPU nodes in the information storage module, it schedules the task according to the preset scheduling policy and sends it to a target GPU node, which processes it.
In this way, installing the GPU drivers of the GPU nodes on the physical machines through containers allows the GPU resources of the nodes to be shared, improving their utilization efficiency.
At the same time, since the information storage module uniformly stores all GPU resource configuration information and the resource scheduling module uniformly schedules the GPU resources of all nodes in the cluster, GPU resource configuration efficiency and cluster resource utilization are both improved.
As shown in FIG. 2, in an actual embodiment, a deployment module is further provided in the artificial intelligence cluster; it is configured to deploy the GPU management module, the network management module, the node management module, the information storage module, and the scheduling module across the whole cluster.
The deployment module may also deploy Kubernetes into the entire cluster. Kubernetes (K8s) is a container orchestration and scheduling engine open-sourced by Google and derived from its internal Borg system. A K8s cluster is generally distributed and comprises master nodes and worker nodes: the master node is mainly responsible for cluster control and for scheduling tasks and resources, while the worker nodes carry the workloads.
In addition, the storage module can be run as a single-point service or as a highly available service to ensure functional stability.
In a preferred embodiment, in S3, the preset scheduling policy includes at least one of the following:
sorting and scheduling all deep learning tasks according to task scheduling priority;
scheduling all deep learning tasks on a first-in, first-out basis;
scheduling all deep learning tasks on the principle that the highest-priority queue, and the highest-priority task within it, is scheduled first.
In a specific embodiment, the scheduling scheme may be selected according to actual requirements, for example a priority queue, a first-in-first-out queue, maximum resource utilization, and the like.
The three scheduling cases are as follows:
the scheduling module may put tasks into a scheduling queue, rank them by scheduling priority, and select the task with the highest priority;
if the scheduling queue is a first-in-first-out queue, all tasks are scheduled on a first-in, first-scheduled basis;
if high-priority-queue-first processing is adopted, the queue with the highest priority is selected according to queue priority, and then the highest-priority task within that queue is selected for scheduling.
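The first two queue disciplines described above can be illustrated with Python's `heapq`. This is a hedged sketch, not the patent's implementation; the class and parameter names are assumptions.

```python
import heapq
from itertools import count

# Illustrative sketch of two of the queue disciplines above: a priority
# queue (highest priority scheduled first) and a FIFO queue. Names such
# as TaskQueue and submit() are hypothetical.

class TaskQueue:
    def __init__(self, fifo=False):
        self._heap, self._seq, self._fifo = [], count(), fifo

    def submit(self, name, priority=0):
        # FIFO mode ignores priority; otherwise negate it so the heap
        # (a min-heap) pops the highest priority first. Submission order
        # (self._seq) breaks ties, which also yields first-in, first-out.
        key = 0 if self._fifo else -priority
        heapq.heappush(self._heap, (key, next(self._seq), name))

    def next_task(self):
        return heapq.heappop(self._heap)[2]


q = TaskQueue()
q.submit("train-a", priority=1)
q.submit("train-b", priority=5)
print(q.next_task())   # -> train-b (highest priority first)

f = TaskQueue(fifo=True)
f.submit("t1")
f.submit("t2")
print(f.next_task())   # -> t1 (first in, first scheduled)
```

The third discipline (highest-priority queue first, then highest-priority task within it) would nest one such queue per priority class.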
Meanwhile, to meet the affinity requirements of the GPU or the network card, the scheduling scheme may preferentially use GPUs and network cards with affinity, thereby increasing the invocation speed and efficiency of GPU, network card, and other resources.
In a preferred embodiment, the resource management and scheduling method further includes:
S5: after the GPU management module deploys the network card driver installation service of the GPU node to the node, the GPU management module acquires the network resource configuration information of the node and sends it to the node management module; the network card driver installation service includes installing the network card driver of the node on the physical machine in a containerized manner;
S6: the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
S7: when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to the preset scheduling policy, based on the resource information requested by the task and the network resource configuration information of all GPU nodes in the information storage module.
In a specific embodiment, installing the network card driver of the GPU node on the physical machine through a container likewise allows the network card and network resources of the node to be shared, improving their utilization efficiency.
At the same time, since the information storage module uniformly stores all network resource configuration information and the resource scheduling module uniformly schedules the network resources of all GPU nodes in the cluster, network resource configuration efficiency and cluster resource utilization are both improved.
In a preferred embodiment, in S3, before the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management and scheduling method further includes:
S311: the resource scheduling module screens out a plurality of candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects a candidate GPU node with GPU resource affinity as the target GPU node, where all GPUs in the target GPU node share the same communication connection mode;
S312: the resource scheduling module selects, within the target GPU node, network cards with that same communication connection mode for network resource scheduling.
In specific embodiments, after determining a particular scheduling task, the scheduling module may select candidate nodes based on their remaining resources and traverse them.
To improve the invocation speed and efficiency of resources, the scheduling module may preferentially select a node with GPU and network card affinity as the target GPU node. GPU resource affinity means that the GPUs share the same communication connection mode; preferentially using GPUs with affinity makes communication between them faster. Network resource affinity means that the network card's communication connection mode matches that of the GPUs, which further improves communication efficiency between the GPUs.
Therefore, taking a candidate GPU node with GPU and network card affinity as the target GPU node, and processing the deep learning task on it, effectively improves task processing efficiency and the utilization efficiency of each resource in the cluster.
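The affinity selection described above (screen by remaining GPUs, prefer nodes whose GPUs share one interconnect, then pick matching network cards) can be sketched as follows. The node data layout and link-type labels (`nvlink`, `pcie`) are illustrative assumptions.

```python
# Sketch of S311/S312: screen candidate nodes by remaining GPU count,
# prefer a node whose GPUs all share one communication connection mode
# ("GPU resource affinity"), then select NICs of that same mode.
# The data layout below is an assumption for illustration.

nodes = {
    "node-a": {"free_gpus": 4, "gpu_links": {"nvlink"},
               "nics": {"nic0": "nvlink", "nic1": "pcie"}},
    "node-b": {"free_gpus": 4, "gpu_links": {"nvlink", "pcie"},
               "nics": {"nic0": "pcie"}},
}

def pick_target(nodes, requested_gpus):
    # S311, first step: screen candidates by remaining GPU resources.
    candidates = {n: d for n, d in nodes.items()
                  if d["free_gpus"] >= requested_gpus}
    for name, d in candidates.items():
        if len(d["gpu_links"]) == 1:           # all GPUs share one link type
            link = next(iter(d["gpu_links"]))
            # S312: NICs with the same communication connection mode.
            nics = [n for n, l in d["nics"].items() if l == link]
            return name, nics
    return None, []

target, nics = pick_target(nodes, 4)
print(target, nics)    # -> node-a ['nic0']
```

Here `node-b` is rejected despite having enough free GPUs, because its GPUs mix two connection modes and therefore lack affinity.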
In a preferred embodiment, before S1, the resource management scheduling method further includes:
the resource scheduling module deploys GPU virtualization services of the GPU nodes to the GPU nodes;
And/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
GPU virtualization can divide GPU resources into a plurality of virtual resources so as to allocate configuration and improve the utilization rate of the GPU resources;
Similarly, the network card virtualization can divide the network card/network resource into a plurality of virtual resources so as to allocate and configure and improve the utilization rate of the network resource.
Through GPU virtualization and network card virtualization, the resources in each node can be configured and utilized more effectively, thereby improving the utilization rate of cluster resources.
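A minimal sketch of the virtualization step described above, assuming a fixed number of slices per physical unit; the slice granularity and the naming scheme are illustrative assumptions, not part of the patent.

```python
def virtualize(physical_units, slices_per_unit):
    """Carve each physical unit (GPU or network card) into named virtual slices."""
    return [f"{unit}/v{i}"
            for unit in physical_units
            for i in range(slices_per_unit)]

# Two physical GPUs split into four virtual GPUs each,
# one physical network card split into two virtual network cards.
vgpus = virtualize(["gpu0", "gpu1"], slices_per_unit=4)
vnics = virtualize(["nic0"], slices_per_unit=2)
```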
In a preferred embodiment, in S4, before the deep learning task is sent to the target GPU node according to the preset scheduling policy, the resource management scheduling method further includes:
S321, a resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
S322, when the number of the virtual resource candidate groups reaches the number of resource demands of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
When the GPUs or network cards in a candidate node have been virtualized in advance, virtual GPU and virtual network card resources belonging to the same communication connection mode can be selected within that node, and virtual resource groups matching the required quantity are chosen from them to process the deep learning task.
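The grouping in S321/S322 can be sketched as follows; the tuple layout and the connection-mode labels are assumptions made for illustration.

```python
def candidate_groups(vgpus, vnics):
    """vgpus/vnics: lists of (resource_id, connection_mode) tuples.
    Pair each virtual GPU with a free virtual network card of the same mode."""
    groups, free_nics = [], list(vnics)
    for gpu_id, gpu_mode in vgpus:
        for nic in free_nics:
            if nic[1] == gpu_mode:
                groups.append((gpu_id, nic[0]))  # one affinity candidate group
                free_nics.remove(nic)
                break
    return groups

def is_target_node(vgpus, vnics, demand):
    """S322: the node qualifies once enough candidate groups exist."""
    return len(candidate_groups(vgpus, vnics)) >= demand

vgpus = [("vgpu0", "NVLink"), ("vgpu1", "NVLink"), ("vgpu2", "PCIe")]
vnics = [("vnic0", "NVLink"), ("vnic1", "NVLink")]
```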
In a preferred embodiment, after S4, the resource management scheduling method further includes:
and the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
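The residual-resource update above can be modeled minimally as below; the information storage module is represented by an in-memory dictionary purely for illustration, whereas a real cluster would typically use a distributed store.

```python
class InfoStore:
    """Stand-in for the information storage module."""
    def __init__(self):
        self._remaining = {}

    def update(self, node, remaining_gpus):
        # node management module reports the node's remaining GPU resources
        self._remaining[node] = remaining_gpus

    def remaining(self, node):
        return self._remaining.get(node)

store = InfoStore()
store.update("node-b", 8)      # initial report: 8 GPUs free
store.update("node-b", 8 - 2)  # after a 2-GPU task is scheduled onto the node
```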
As shown in fig. 3, in a practical embodiment, the implementation process of the resource management scheduling method of the artificial intelligence cluster is as follows:
The deployment module deploys Kubernetes into the entire cluster, together with the other related modules.
The GPU management module deploys relevant services to GPU nodes according to whether each node is a GPU node. Related services include, but are not limited to: the GPU driver, a container toolkit, monitoring, GPU virtualization, and the like. The collected information is then reported to the node management module.
Meanwhile, the network management module configures the network on each node according to the node's network card type and configuration file, and deploys relevant services, such as network card virtualization, to the relevant nodes. This information is likewise reported to the node management module.
The node management module may store all of the information described above to the information storage module.
After the information storage module has stored all relevant GPU and network information, the scheduling module can schedule tasks according to the resource information it stores and the resources requested by the deep learning task. Scheduling may follow a common scheme, for example a priority queue, a first-in-first-out queue, or maximum resource utilization. Meanwhile, to meet the GPU and network card affinity requirements, the scheduler can adopt a scheme that preferentially uses GPUs and network cards with affinity.
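The common scheduling schemes just mentioned (a first-in-first-out queue and a priority queue) can be sketched as follows; the task fields are illustrative assumptions.

```python
import heapq
from collections import deque

def fifo_order(tasks):
    """First-in-first-out: dispatch tasks in arrival order."""
    q = deque(tasks)
    return [q.popleft()["name"] for _ in range(len(q))]

def priority_order(tasks):
    """Priority queue: lower number = higher priority; heapq pops smallest first.
    The arrival index breaks ties so equal-priority tasks stay in FIFO order."""
    heap = [(t["priority"], i, t["name"]) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

tasks = [
    {"name": "train-a", "priority": 2},
    {"name": "train-b", "priority": 0},
    {"name": "train-c", "priority": 1},
]
```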
The scheduling module issues the scheduling task to the node management module of the node to be scheduled; the node management module starts the relevant training task according to the resource consumption and updates the remaining resource information to the information storage module for subsequent use.
It should be noted that, although the steps in the flowchart are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
Embodiment two:
The embodiment of the invention also provides a resource management scheduling device of the artificial intelligent cluster, which is used for realizing the above resource management scheduling method of the artificial intelligent cluster, and which comprises:
The GPU management module is used for deploying the GPU driving installation service of the GPU node to the GPU node, acquiring the GPU resource configuration information of the GPU node and sending the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing a GPU driver of a GPU node on a physical machine in a containerization mode;
the node management module is used for sending the GPU resource configuration information of the GPU node to the information storage module for storage;
and the resource scheduling module is used for sending the deep learning task to the target GPU node according to a preset scheduling strategy according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module when the deep learning task is received.
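A hedged sketch of how the three modules above might cooperate; all class and method names here are illustrative placeholders rather than the patent's actual interfaces, and the scheduling policy is reduced to a simple capacity check.

```python
class InformationStorage:
    """Holds per-node GPU resource configuration information."""
    def __init__(self):
        self.config = {}

class NodeManagement:
    """Forwards a node's GPU configuration to the information storage module."""
    def __init__(self, store):
        self.store = store
    def report(self, node, gpu_config):
        self.store.config[node] = gpu_config

class GpuManagement:
    """Deploys the (containerized) driver install service, then reports config."""
    def __init__(self, node_mgmt):
        self.node_mgmt = node_mgmt
    def deploy_driver_and_report(self, node, gpu_config):
        # a containerized GPU driver install would run on the physical machine here
        self.node_mgmt.report(node, gpu_config)

class ResourceScheduler:
    """Picks any node whose stored configuration satisfies the request."""
    def __init__(self, store):
        self.store = store
    def schedule(self, requested_gpus):
        for node, cfg in self.store.config.items():
            if cfg["gpus"] >= requested_gpus:
                return node
        return None

store = InformationStorage()
gpu_mgmt = GpuManagement(NodeManagement(store))
gpu_mgmt.deploy_driver_and_report("node-1", {"gpus": 4})
target = ResourceScheduler(store).schedule(requested_gpus=2)
```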
In a preferred embodiment, the GPU management module is further configured to: deploying the network card drive installation service of the GPU node to the GPU node, acquiring network resource configuration information of the GPU node, and sending the network resource configuration information to the node management module; the network card driver installation service comprises the steps of installing a network card driver of the GPU node on a physical machine in a containerization mode;
The node management module is further configured to: the network resource configuration information of the GPU node is sent to an information storage module for storage;
The resource scheduling module is further configured to: and when the deep learning task is received, the deep learning task is sent to the target GPU node according to a preset scheduling strategy according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the resource scheduling module is further configured to:
screening out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selecting the candidate GPU nodes with GPU resource affinity from the candidate GPU nodes as target GPU nodes; wherein, the communication connection modes of all GPUs in the target GPU node are the same;
And selecting a network card with the same communication connection mode from the target GPU node for scheduling network resources.
In a preferred embodiment, the resource scheduling module is further configured to:
Deploying GPU virtualization services of the GPU nodes to the GPU nodes;
and/or deploying the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the resource scheduling module is further configured to:
Selecting a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
And when the number of the virtual resource candidate groups reaches the number of resource demands of the deep learning task, taking the candidate GPU node as a target GPU node.
In a preferred embodiment, the node management module is further configured to: and sending the residual GPU resource information of the target GPU node to an information storage module for updating.
For specific limitations of the above apparatus, reference may be made to the limitations of the method described above, which are not repeated here.
Each of the modules in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
The computer device may be a terminal which, as shown in fig. 4, includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The network interface of the computer device is used for communicating with an external terminal through a network connection. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, keys, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
It is to be understood that the structures shown in the above figures are merely block diagrams of some of the structures associated with the present invention and are not limiting of the computer devices to which the present invention may be applied; a particular computer device may include more or fewer components than those shown, or combine some of the components, or have a different arrangement of components.
Embodiment III:
The embodiment of the invention also provides a computer device, which comprises a memory, a processor and a computer program, wherein the computer program is stored on the memory and can run on the processor, and the following steps are realized when the processor executes the computer program:
S1, after a GPU management module deploys a GPU driving installation service of a GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing a GPU driver of a GPU node on a physical machine in a containerization mode;
S2, the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
And S4, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and GPU resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
S5, after the GPU management module deploys the network card drive installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service comprises the steps of installing a network card driver of the GPU node on a physical machine in a containerization mode;
S6, the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
And S7, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
Before sending the deep learning task to the target GPU node according to a preset scheduling strategy, the resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity from the candidate GPU nodes as the target GPU node; wherein, the communication connection modes of all GPUs in the target GPU node are the same; the resource scheduling module selects network cards with the same communication connection mode in the target GPU node and is used for scheduling network resources.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
Before a GPU management module obtains GPU resource configuration information of a GPU node and sends the GPU resource configuration information to a node management module, a resource scheduling module deploys GPU virtualization services of the GPU node on the GPU node; and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
Before sending the deep learning task to the target GPU node according to a preset scheduling strategy, the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode; when the number of virtual resource candidate groups reaches the number of resource demands of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
after the deep learning task is sent to the target GPU node according to a preset scheduling strategy, a node management module in the target GPU node sends the residual GPU resource information of the target GPU node to an information storage module for updating.
Embodiment four:
the embodiment of the invention further provides a computer readable storage medium storing a computer program which when executed by a processor realizes the following steps:
S1, after a GPU management module deploys a GPU driving installation service of a GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing a GPU driver of a GPU node on a physical machine in a containerization mode;
S2, the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
And S4, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and GPU resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
S5, after the GPU management module deploys the network card drive installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service comprises the steps of installing a network card driver of the GPU node on a physical machine in a containerization mode;
S6, the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
And S7, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
Before sending the deep learning task to the target GPU node according to a preset scheduling strategy, the resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity from the candidate GPU nodes as the target GPU node; wherein, the communication connection modes of all GPUs in the target GPU node are the same; the resource scheduling module selects network cards with the same communication connection mode in the target GPU node and is used for scheduling network resources.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
Before a GPU management module obtains GPU resource configuration information of a GPU node and sends the GPU resource configuration information to a node management module, a resource scheduling module deploys GPU virtualization services of the GPU node on the GPU node; and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
Before sending the deep learning task to the target GPU node according to a preset scheduling strategy, the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode; when the number of virtual resource candidate groups reaches the number of resource demands of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
In a preferred embodiment, the computer program when executed by the processor further implements the steps of:
after the deep learning task is sent to the target GPU node according to a preset scheduling strategy, a node management module in the target GPU node sends the residual GPU resource information of the target GPU node to an information storage module for updating.
It will be appreciated that all or part of the flows of the above method embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the above method embodiments.
Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. The resource management scheduling method of the artificial intelligent cluster is characterized in that an information storage module, a resource scheduling module and a plurality of GPU nodes are arranged in the artificial intelligent cluster; the GPU node is provided with a node management module and a GPU management module;
the resource management scheduling method comprises the following steps:
after the GPU management module deploys the GPU drive installation service of the GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing the GPU driver of the GPU node on a physical machine in a containerization mode;
The node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to a preset scheduling strategy according to resource information requested by the deep learning task and GPU resource configuration information of all GPU nodes in the information storage module.
2. The method for resource management scheduling of an artificial intelligence cluster according to claim 1, further comprising:
After the GPU management module deploys the network card drive installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service comprises the step of installing the network card driver of the GPU node on a physical machine in a containerization mode;
the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module and the preset scheduling strategy.
3. The method according to claim 2, wherein before the deep learning task is sent to the target GPU node according to a preset scheduling policy, the method further comprises:
The resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity from the candidate GPU nodes as the target GPU node; wherein, the communication connection modes of all GPUs in the target GPU node are the same;
And the resource scheduling module selects network cards with the same communication connection mode from the target GPU node and is used for scheduling network resources.
4. The method according to claim 3, wherein before the GPU management module obtains GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module, the method further comprises:
the resource scheduling module deploys GPU virtualization services of the GPU node to the GPU node;
And/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
5. The method according to claim 4, wherein before the deep learning task is sent to the target GPU node according to a preset scheduling policy, the method further comprises:
The resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein, the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
And when the number of the virtual resource candidate groups reaches the number of resource demands of the deep learning task, the resource scheduling module takes the candidate GPU node as the target GPU node.
6. The method for resource management scheduling of an artificial intelligence cluster according to claim 1, wherein the preset scheduling policy comprises at least one of:
sequencing and scheduling all the deep learning tasks according to the task scheduling priority level;
scheduling all deep learning tasks according to a first-in first-out principle;
and scheduling all the deep learning tasks according to a high-priority queue and a high-priority task priority scheduling principle.
7. The method according to claim 1, further comprising, after the deep learning task is sent to the target GPU node according to a preset scheduling policy:
and the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
8. A resource management scheduling device of an artificial intelligence cluster, for implementing a resource management scheduling method of an artificial intelligence cluster according to any one of claims 1-7, the resource management scheduling device comprising:
The GPU management module is used for deploying the GPU driving installation service of the GPU node to the GPU node, acquiring the GPU resource configuration information of the GPU node and sending the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps of installing the GPU driver of the GPU node on a physical machine in a containerization mode;
The node management module is used for sending the GPU resource configuration information of the GPU node to the information storage module for storage;
And the resource scheduling module is used for sending the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module when the deep learning task is received and the preset scheduling strategy.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the resource management scheduling method of an artificial intelligence cluster according to any one of claims 1-7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the resource management scheduling method of an artificial intelligence cluster according to any one of claims 1-7.
CN202210609937.3A 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster Active CN115048216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210609937.3A CN115048216B (en) 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster


Publications (2)

Publication Number Publication Date
CN115048216A CN115048216A (en) 2022-09-13
CN115048216B true CN115048216B (en) 2024-06-04

Family

ID=83158949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210609937.3A Active CN115048216B (en) 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster

Country Status (1)

Country Link
CN (1) CN115048216B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20240082474A (en) * 2022-12-01 2024-06-11 삼성전자주식회사 Electronic device to provide artificial intelligence service and method for controlling thereof
CN115617364B (en) * 2022-12-20 2023-03-14 中化现代农业有限公司 GPU virtualization deployment method, system, computer equipment and storage medium
CN115965517B (en) * 2023-01-09 2023-10-20 摩尔线程智能科技(北京)有限责任公司 Graphics processor resource management method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885762A (en) * 2017-09-19 2018-04-06 北京百度网讯科技有限公司 Intelligent big data system, the method and apparatus that intelligent big data service is provided
CN112346859A (en) * 2020-10-26 2021-02-09 北京市商汤科技开发有限公司 Resource scheduling method and device, electronic equipment and storage medium
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN113301102A (en) * 2021-02-03 2021-08-24 阿里巴巴集团控股有限公司 Resource scheduling method, device, edge cloud network, program product and storage medium
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
WO2022033024A1 (en) * 2020-08-12 2022-02-17 ***股份有限公司 Distributed training method and apparatus of deep learning model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220083389A1 (en) * 2020-09-16 2022-03-17 Nutanix, Inc. Ai inference hardware resource scheduling



Similar Documents

Publication Publication Date Title
CN115048216B (en) Resource management scheduling method, device and equipment of artificial intelligent cluster
CN110888743B (en) GPU resource using method, device and storage medium
CN110489213B (en) Task processing method and processing device and computer system
CN108337109B (en) Resource allocation method and device and resource allocation system
CN110704186A (en) Computing resource allocation method and device based on hybrid distribution architecture and storage medium
US20130219385A1 (en) Batch scheduler management of virtual machines
CN111597042A (en) Service thread running method and device, storage medium and electronic equipment
CN112905326B (en) Task processing method and device
CN111274033B (en) Resource deployment method, device, server and storage medium
CN111190712A (en) Task scheduling method, device, equipment and medium
CN109117244B (en) Method for implementing virtual machine resource application queuing mechanism
CN111338779A (en) Resource allocation method, device, computer equipment and storage medium
CN115686805A (en) GPU resource sharing method and device, and GPU resource sharing scheduling method and device
US20220229695A1 (en) System and method for scheduling in a computing system
CN117311990B (en) Resource adjustment method and device, electronic equipment, storage medium and training platform
CN114598665A (en) Resource scheduling method and device, computer readable storage medium and electronic equipment
CN112148481B (en) Method, system, equipment and medium for executing simulation test task
CN113986539A (en) Method, device, electronic equipment and readable storage medium for realizing pod fixed IP
CN112114958A (en) Resource isolation method, distributed platform, computer device, and storage medium
CN116578416A (en) Signal-level simulation acceleration method based on GPU virtualization
CN115564635A (en) GPU resource scheduling method and device, electronic equipment and storage medium
CN115756756A (en) Video memory resource allocation method, device and equipment based on GPU virtualization technology
CN112114959B (en) Resource scheduling method, distributed system, computer device and storage medium
CN113742059A (en) Task allocation method and device, computer equipment and storage medium
CN106055410A (en) Cloud computing memory resource allocation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant