CN115048216A - Resource management scheduling method, device and equipment for artificial intelligence cluster - Google Patents


Info

Publication number: CN115048216A (application CN202210609937.3A); granted as CN115048216B
Authority: CN (China)
Prior art keywords: gpu, node, resource, scheduling, module
Legal status: Granted; Active
Inventor: 李铭琨
Original and current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Other languages: Chinese (zh)
Priority application: CN202210609937.3A

Classifications

    • G06F 9/5027: Allocation of resources (e.g. of the CPU) to service a request, the resource being a machine (e.g. CPUs, servers, terminals)
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 9/5077: Logical partitioning of resources; management or configuration of virtualized resources
    • G06N 20/00: Machine learning
    • G06F 2009/45562: Creating, deleting, cloning virtual machine instances


Abstract

The invention relates to a resource management and scheduling method, apparatus, and device for an artificial intelligence cluster. The method comprises the following steps: after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, it obtains the node's GPU resource configuration information and sends it to the node management module; the GPU driver installation service installs the GPU driver on the physical machine in a containerized manner. The node management module forwards the GPU resource configuration information to the information storage module for storage. When a deep learning task is received, the resource scheduling module dispatches it to a target GPU node according to a preset scheduling policy, based on the resource information the task requests and the GPU resource configuration information of all GPU nodes. This scheme addresses the problem that GPU resources and network resources in existing artificial intelligence clusters cannot be effectively configured and utilized.

Description

Resource management scheduling method, device and equipment for artificial intelligence cluster
Technical Field
The present invention relates to the technical field of artificial intelligence clusters, and in particular, to a method, an apparatus, and a device for resource management scheduling of an artificial intelligence cluster.
Background
A Graphics Processing Unit (GPU), also called a display core, visual processor, or display chip, is a microprocessor dedicated to image- and graphics-related computation on personal computers, workstations, game consoles, and mobile devices (e.g., tablets and smartphones).
As artificial intelligence has developed, continuous GPU iteration has accelerated the speed and scale of deep learning training, and the traditional single-node training mode is gradually being replaced by multi-machine, multi-card training.
In an artificial intelligence cluster, a GPU generally refers to a GPU accelerator card used for deep learning. In a large-scale artificial intelligence cluster, GPU resources often cannot be effectively configured and utilized. Ensuring efficient GPU utilization has gradually become a key problem in deep learning training: higher cluster resource utilization translates directly into higher training efficiency.
Meanwhile, network transmission speed has an increasingly large influence on artificial intelligence training tasks. How to reasonably manage and schedule GPU resources and network resources, and achieve effective configuration and utilization of both, is an urgent problem in the prior art.
Disclosure of Invention
In order to solve the technical problems, the invention provides a resource management scheduling method, device and equipment of an artificial intelligence cluster, which are used for solving the problem that GPU resources and network resources in the existing artificial intelligence cluster cannot be effectively configured and utilized.
In order to achieve the above object, the present invention provides a resource management scheduling method for an artificial intelligence cluster, wherein the artificial intelligence cluster is provided with an information storage module, a resource scheduling module, and a plurality of GPU nodes; the GPU node is internally provided with a node management module and a GPU management module;
the resource management scheduling method comprises the following steps:
after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module; the GPU driver installation service installs the node's GPU driver on the physical machine in a containerized manner;
the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
and when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module and a preset scheduling strategy.
Further, the resource management scheduling method further includes:
after the GPU management module deploys the network card driver installation service of the GPU node onto that node, the GPU management module obtains the node's network resource configuration information and sends it to the node management module; the network card driver installation service installs the node's network card driver on the physical machine in a containerized manner;
the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
and when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to the preset scheduling policy, based on the resource information requested by the task and the network resource configuration information of all GPU nodes in the information storage module.
Further, before sending the deep learning task to the target GPU node according to a preset scheduling policy, the resource management scheduling method further includes:
the resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; the communication connection modes of all GPUs in the target GPU node are the same;
and the resource scheduling module selects network cards with the same communication connection mode in the target GPU node to schedule network resources.
Further, before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, the resource management scheduling method further includes:
the resource scheduling module deploys GPU virtualization services of the GPU nodes on the GPU nodes;
and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
Further, before sending the deep learning task to the target GPU node according to a preset scheduling policy, the resource management scheduling method further includes:
the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
and when the number of the virtual resource candidate groups reaches the resource demand number of the deep learning task, the resource scheduling module takes the candidate GPU node as the target GPU node.
Further, the preset scheduling policy includes at least one of:
sequencing and scheduling all deep learning tasks according to task scheduling priority levels;
scheduling all deep learning tasks according to a first-in first-out principle;
and scheduling all deep learning tasks according to the high-priority queue and the high-priority task priority scheduling principle.
Further, after the deep learning task is sent to the target GPU node according to a preset scheduling policy, the resource management scheduling method further includes:
and the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
The invention also provides a resource management scheduling device of the artificial intelligence cluster, which is used for realizing the resource management scheduling method of the artificial intelligence cluster, and the resource management scheduling device comprises:
the GPU management module is used for deploying the GPU drive installation service of the GPU node to the GPU node, acquiring GPU resource configuration information of the GPU node and sending the GPU resource configuration information to the node management module; the GPU driver installation service comprises the step of installing a GPU driver of the GPU node on a physical machine in a containerization mode;
the node management module is used for sending GPU resource configuration information of the GPU node to the information storage module for storage;
and the resource scheduling module is used for, when a deep learning task is received, sending the deep learning task to the target GPU node according to the preset scheduling policy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
The present invention also provides a computer device, comprising a memory, a processor and a computer program, wherein the computer program is stored in the memory and can run on the processor, and the processor executes the computer program to realize the following steps:
after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module; the GPU driver installation service installs the node's GPU driver on the physical machine in a containerized manner;
the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
and when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module and a preset scheduling strategy.
The present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of:
after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module; the GPU driver installation service installs the node's GPU driver on the physical machine in a containerized manner;
the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
and when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module and a preset scheduling strategy.
Compared with the prior art, the technical scheme of the invention has the following technical effects:
the artificial intelligence cluster is provided with a plurality of GPU nodes, an information storage module and a resource scheduling module; the GPU node is provided with a node management module and a GPU management module; the node management module is used for managing corresponding GPU nodes, and the GPU management module is used for managing GPU resources in the GPU nodes; the information storage module is used for uniformly storing resource configuration information in the cluster, and the resource scheduling module is used for uniformly managing and scheduling each resource;
firstly, within a single GPU node, the GPU management module deploys the GPU driver installation service of that node onto the node, so that the node's GPU driver is installed on the physical machine in a containerized manner, realizing GPU driver mounting;
after the mounting is finished, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the node management module sends the GPU resource configuration information of the GPU node to the information storage module in the cluster for storage;
each GPU node can send own GPU resource configuration information to the information storage module for unified storage, and the GPU resource configuration information of all GPU nodes in the artificial intelligence cluster can be stored in the information storage module finally;
when the artificial intelligence cluster receives the deep learning task, the resource scheduling module firstly acquires resource information requested by the deep learning task, then schedules and manages the deep learning task according to a preset scheduling strategy by combining GPU resource configuration information of all GPU nodes in the information storage module, sends the deep learning task to a target GPU node, and processes the deep learning task through the target GPU node;
therefore, installing the GPU driver of the GPU node on the physical machine via a container to implement mounting allows the node's GPU resources to be shared and improves GPU resource utilization efficiency;
meanwhile, all GPU resource configuration information is uniformly stored through the information storage module, GPU resources of all GPU nodes in the cluster are uniformly scheduled through the resource scheduling module, and therefore GPU resource configuration efficiency is improved, and the utilization rate of the cluster resources is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a resource management scheduling method for an artificial intelligence cluster according to an embodiment of the present invention;
FIG. 2 is a block diagram of a resource management scheduling apparatus of an artificial intelligence cluster in an embodiment of the present invention;
FIG. 3 is a flow chart of a method for resource management scheduling for artificial intelligence clusters in an actual embodiment of the present invention;
fig. 4 is an internal structural diagram of a computer device according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment is as follows:
as shown in fig. 1, an embodiment of the present invention provides a resource management scheduling method for an artificial intelligence cluster, where an information storage module, a resource scheduling module, and multiple GPU nodes are arranged in the artificial intelligence cluster; a node management module and a GPU management module are arranged in the GPU node;
the resource management scheduling method comprises the following steps:
s1, after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service comprises the steps that GPU drivers of GPU nodes are installed on a physical machine in a containerization mode;
s2, the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
and S3, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module and a preset scheduling strategy.
In a specific embodiment, a plurality of GPU nodes, an information storage module and a resource scheduling module are arranged in an artificial intelligence cluster; the GPU node is provided with a node management module and a GPU management module; the node management module is used for managing corresponding GPU nodes, and the GPU management module is used for managing GPU resources in the GPU nodes; the information storage module is used for uniformly storing resource configuration information in the cluster, and the resource scheduling module is used for uniformly managing and scheduling each resource;
firstly, within a single GPU node, the GPU management module deploys the GPU driver installation service of that node onto the node, so that the node's GPU driver is installed on the physical machine in a containerized manner, realizing GPU driver mounting;
after the mounting is finished, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the node management module sends the GPU resource configuration information of the GPU node to an information storage module in the cluster for storage;
each GPU node can send own GPU resource configuration information to the information storage module for unified storage, and GPU resource configuration information of all GPU nodes in the artificial intelligence cluster can be stored in the information storage module;
when the artificial intelligence cluster receives the deep learning task, the resource scheduling module firstly acquires resource information requested by the deep learning task, then schedules and manages the deep learning task according to a preset scheduling strategy by combining GPU resource configuration information of all GPU nodes in the information storage module, sends the deep learning task to a target GPU node, and processes the deep learning task through the target GPU node;
therefore, installing the GPU driver of the GPU node on the physical machine via a container to implement mounting allows the node's GPU resources to be shared and improves GPU resource utilization efficiency;
meanwhile, all GPU resource configuration information is uniformly stored through the information storage module, GPU resources of all GPU nodes in the cluster are uniformly scheduled through the resource scheduling module, and therefore GPU resource configuration efficiency is improved, and the utilization rate of the cluster resources is improved.
As shown in fig. 2, in an actual embodiment, a deployment module is further disposed in the artificial intelligence cluster; the deployment module is configured to deploy the GPU management module, the network management module, the node management module, the information storage module, and the scheduling module across the whole cluster.
The deployment module may also deploy Kubernetes into the entire cluster. Kubernetes (K8s) is Google's open-source container orchestration and scheduling engine, derived from Borg. A K8s cluster is typically distributed and comprises master nodes and worker nodes: the master is mainly responsible for cluster control, task scheduling, and resource scheduling, while the worker nodes carry the workloads.
In addition, the storage module can adopt a single-point service or a high-availability service to ensure the stability of the function.
In a preferred embodiment, in S4, the preset scheduling policy includes at least one of:
sequencing and scheduling all deep learning tasks according to task scheduling priority levels;
scheduling all deep learning tasks according to a first-in first-out principle;
and scheduling all deep learning tasks according to the high-priority queue and the high-priority task priority scheduling principle.
In a specific embodiment, the scheduling policy may be selected according to actual requirements, for example a priority queue, a first-in first-out queue, or maximum resource utilization.
The three scheduling policies work as follows:
the scheduling module may place scheduling tasks into a scheduling queue, order them by scheduling priority, and select the highest-priority task first;
if the scheduling queue is a first-in first-out queue, all tasks are scheduled in arrival order;
if per-queue priorities with high-priority task processing are adopted, the queue with the highest priority is selected first, and then the highest-priority task within that queue is scheduled.
Meanwhile, to satisfy GPU or network card affinity requirements, the scheduling scheme may preferentially use GPUs and network cards that have affinity, increasing the speed and efficiency of calling resources such as GPUs and network cards.
In a preferred embodiment, the resource management scheduling method further includes:
s5, after the GPU management module deploys the network card drive installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service comprises the steps that a network card driver of a GPU node is installed on a physical machine in a containerization mode;
s6, the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
and S7, when the deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module and a preset scheduling strategy.
In a specific embodiment, similarly, installing the network card driver of the GPU node on the physical machine via a container to implement mounting allows the node's network card/network resources to be shared and improves their utilization efficiency.
Meanwhile, all network resource configuration information is uniformly stored through the information storage module, and network resources of all GPU nodes in the cluster are uniformly scheduled through the resource scheduling module, so that the network resource configuration efficiency is improved, and the utilization rate of the cluster resources is improved.
In a preferred embodiment, in S4, before sending the deep learning task to the target GPU node according to the preset scheduling policy, the resource management scheduling method further includes:
s311, screening out a plurality of candidate GPU nodes by the resource scheduling module according to the residual GPU resource information of each GPU node, and selecting the candidate GPU node with GPU resource affinity as a target GPU node; the communication connection modes of all GPUs in the target GPU node are the same;
s312, the resource scheduling module selects network cards with the same communication connection mode from the target GPU nodes to schedule network resources.
In a particular embodiment, after determining the particular scheduling task, the scheduling module may select candidate nodes according to the remaining amount of node resources, and traverse the nodes.
To increase resource calling speed and efficiency, the scheduling module may preferentially select nodes whose GPU and network card resources have affinity as target GPU nodes. GPU resource affinity means the GPUs share the same communication connection mode; preferentially using GPUs with affinity makes inter-GPU communication faster. Network resource affinity means the network card's communication connection mode matches the GPUs', which further improves inter-GPU communication efficiency.
Therefore, after the candidate GPU node with the affinity of the GPU and the network card is used as the target GPU node, the deep learning task is processed through the target GPU node, the task processing efficiency can be effectively improved, and the utilization efficiency of all resources in the cluster can be improved.
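The S311/S312 selection can be sketched as follows. The node dictionaries and link labels (`"nvlink"`, `"pcie"`) are an assumed data model for illustration; the patent does not specify concrete interconnect names.

```python
# Sketch of S311/S312: filter nodes by remaining GPU count, then prefer
# a node whose GPUs all share one communication connection mode (GPU
# affinity) and whose NICs use that same mode (network affinity).

def pick_target_node(nodes, gpus_needed):
    # S311: candidates are nodes with enough remaining GPUs.
    candidates = [n for n in nodes if n["free_gpus"] >= gpus_needed]
    for node in candidates:
        links = set(node["gpu_links"])
        if len(links) == 1:                 # all GPUs share one link mode
            mode = links.pop()
            # S312: select NICs with the same connection mode.
            nics = [nic for nic in node["nics"] if nic["link"] == mode]
            if nics:
                return node["name"], mode, nics
    return None

nodes = [
    {"name": "n1", "free_gpus": 4, "gpu_links": ["pcie", "nvlink"], "nics": []},
    {"name": "n2", "free_gpus": 4, "gpu_links": ["nvlink", "nvlink"],
     "nics": [{"name": "ib0", "link": "nvlink"}]},
]
print(pick_target_node(nodes, 4)[0])        # -> n2
```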
In a preferred embodiment, before S1, the method for scheduling resource management further includes:
the resource scheduling module deploys GPU virtualization services of the GPU nodes to the GPU nodes;
and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
GPU virtualization divides GPU resources into multiple virtual resources that can be allocated and configured independently, improving GPU resource utilization;
similarly, network card virtualization divides network card/network resources into multiple virtual resources for allocation and configuration, improving network resource utilization.
Through GPU virtualization and network card virtualization, the resources in each node can be configured and used more effectively, thereby improving cluster resource utilization.
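As a toy illustration of the partitioning idea, one physical resource can be split into equal virtual slices; the slicing granularity (`slice_mb`) and the `-vgpu` naming are assumptions for illustration, not specified by the patent:

```python
def virtualize_gpu(gpu_id, total_memory_mb, slice_mb):
    """Divide one physical GPU's memory into equal virtual slices that the
    scheduler can allocate to different tasks independently, raising
    overall utilization; the same idea applies to network card bandwidth."""
    count = total_memory_mb // slice_mb              # whole slices only
    return [f"{gpu_id}-vgpu{i}" for i in range(count)]
```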
In a preferred embodiment, in S4, before sending the deep learning task to the target GPU node according to the preset scheduling policy, the resource management scheduling method further includes:
S321, the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; the virtual GPU and the virtual network card in each virtual resource candidate group share the same communication connection mode;
and S322, when the number of the virtual resource candidate groups reaches the resource demand number of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
When the GPUs or network cards in the candidate nodes have been virtualized in advance, virtual GPU and virtual network card resources belonging to the same communication connection mode can be selected from the candidate nodes, and virtual resource groups matching the required number are chosen from them to process the deep learning task.
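A minimal sketch of this group-matching step (S321/S322) follows; the function names and the `(id, mode)` tuple representation of virtual resources are illustrative assumptions:

```python
def build_candidate_groups(virtual_gpus, virtual_nics):
    """Greedily pair each virtual GPU with an unused virtual network card of
    the same communication connection mode; each pair is one candidate group."""
    nics_by_mode = {}
    for nic, mode in virtual_nics:
        nics_by_mode.setdefault(mode, []).append(nic)
    groups = []
    for gpu, mode in virtual_gpus:
        if nics_by_mode.get(mode):                   # a matching NIC is still free
            groups.append((gpu, nics_by_mode[mode].pop()))
    return groups

def node_is_target(virtual_gpus, virtual_nics, demand):
    """S322: the candidate node becomes the target node only when enough
    affinity groups exist to cover the task's resource demand."""
    return len(build_candidate_groups(virtual_gpus, virtual_nics)) >= demand
```

A node with two NVLink virtual GPUs but only one NVLink virtual network card, for example, yields a single group and is rejected for a task demanding two.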
In a preferred embodiment, after S4, the resource management scheduling method further includes:
and the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
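The update step might look like the following sketch, where `InformationStore` and its methods are a hypothetical stand-in for the information storage module's interface, not an API defined by the patent:

```python
class InformationStore:
    """Toy stand-in for the information storage module (assumed interface)."""
    def __init__(self):
        self._free_gpus = {}

    def update(self, node, free_gpus):
        self._free_gpus[node] = free_gpus

    def free_gpus(self, node):
        return self._free_gpus.get(node, 0)

def report_after_dispatch(store, node, free_before, gpus_used):
    """The node management module reports the node's remaining GPU count back
    to the information store once the task has claimed its resources, so that
    subsequent scheduling decisions see up-to-date capacity."""
    store.update(node, free_before - gpus_used)
```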
As shown in fig. 3, in an actual embodiment, a specific implementation process of the resource management scheduling method for an artificial intelligence cluster is as follows:
the deployment module deploys kubernets into the whole cluster and deploys other related modules into the cluster.
The GPU management module determines whether each node is a GPU node and deploys the related services to the GPU nodes accordingly. Related services include, but are not limited to: the GPU driver, the container runtime, monitoring, GPU virtualization, and the like. The collected information is then reported to the node management module.
Meanwhile, the network management module configures the network on each node according to the node's network card type and configuration file, and deploys related services, such as network card virtualization, to the relevant nodes. This information is likewise reported to the node management module.
The node management module may store all the information described above in the information storage module.
After the information storage module has stored all relevant GPU and network information, the scheduling module can schedule tasks according to the stored resource information and the resources requested by the deep learning task. Scheduling may follow a commonly used scheme, such as a priority queue, a first-in first-out queue, or maximum resource utilization. Meanwhile, to meet the affinity requirement of the GPU and the network card, the scheduling scheme may preferentially use GPUs and network cards that have affinity.
The scheduling module then issues the scheduling task to the node management module of the node to be scheduled; the node management module starts the corresponding training task according to the resource usage and updates the remaining resource information to the information storage module for subsequent use.
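The "priority queue" policy named above can be sketched with Python's standard `heapq`; the class and method names are illustrative, not from the patent:

```python
import heapq

class PriorityScheduler:
    """Sketch of a priority-queue scheduling policy: tasks with a smaller
    priority number run first; equal priorities run first-in first-out."""
    def __init__(self):
        self._heap = []
        self._counter = 0            # tie-breaker preserving submission order

    def submit(self, task, priority):
        heapq.heappush(self._heap, (priority, self._counter, task))
        self._counter += 1

    def next_task(self):
        """Pop the highest-priority (then oldest) task, or None when idle."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Dropping the `priority` key and keeping only the counter would reduce this to the first-in first-out policy also mentioned in the text.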
It should be noted that, although the steps in the flowchart are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in the flowchart may include multiple sub-steps or stages, which need not be completed at the same time but may be performed at different times; their execution order is not necessarily sequential, and they may be performed in turns or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Example two:
the embodiment of the present invention further provides a resource management scheduling device for an artificial intelligence cluster, which is used for implementing the resource management scheduling method for the artificial intelligence cluster, and the resource management scheduling device includes:
the GPU management module is used for deploying the GPU driver installation service of the GPU node to the GPU node, acquiring GPU resource configuration information of the GPU node, and sending the GPU resource configuration information to the node management module; the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;
the node management module is used for sending the GPU resource configuration information of the GPU node to the information storage module for storage;
and the resource scheduling module is used for, when a deep learning task is received, sending the deep learning task to the target GPU node according to a preset scheduling strategy based on the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the GPU management module is further configured to: deploy the network card driver installation service of the GPU node to the GPU node, acquire network resource configuration information of the GPU node, and send the network resource configuration information to the node management module; the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;
the node management module is further configured to: send the network resource configuration information of the GPU node to the information storage module for storage;
the resource scheduling module is further configured to: when a deep learning task is received, send the deep learning task to the target GPU node according to a preset scheduling strategy based on the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the resource scheduling module is further configured to:
screening out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selecting the candidate GPU node with GPU resource affinity as a target GPU node; the communication connection modes of all GPUs in the target GPU node are the same;
and selecting network cards with the same communication connection mode from the target GPU nodes to schedule network resources.
In a preferred embodiment, the resource scheduling module is further configured to:
deploying GPU virtualization services of GPU nodes to the GPU nodes;
and/or deploying the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the resource scheduling module is further configured to:
selecting a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
and when the number of the virtual resource candidate groups reaches the resource demand number of the deep learning task, taking the candidate GPU node as a target GPU node.
In a preferred embodiment, the node management module is further configured to: and sending the residual GPU resource information of the target GPU node to an information storage module for updating.
For the specific limitations of the above apparatus, reference may be made to the limitations of the above method, which are not described herein again.
The modules in the above apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. Each module may be embedded in hardware form in, or independent of, a processor of the computer device, or stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
As shown in fig. 4, the computer device may be a terminal including a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, a key, trackball, or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
It will be appreciated that the structure shown in the figure is merely a block diagram of a portion of the structure relevant to the present solution and does not limit the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Example three:
an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program, where the computer program is stored in the memory and can be run on the processor, and the processor implements the following steps when executing the computer program:
S1, after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;
S2, the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
and S4, when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy based on the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
S5, after the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;
S6, the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
and S7, when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy based on the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
before the deep learning task is sent to the target GPU node according to a preset scheduling strategy, the resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; the communication connection modes of all GPUs in the target GPU node are the same; and the resource scheduling module selects network cards with the same communication connection mode in the target GPU node to schedule network resources.
In a preferred embodiment, the processor, when executing the computer program, further performs the steps of:
before the GPU management module acquires GPU resource configuration information of a GPU node and sends the GPU resource configuration information to the node management module, the resource scheduling module deploys GPU virtualization services of the GPU node to the GPU node; and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the processor, when executing the computer program, further performs the steps of:
before the deep learning task is sent to a target GPU node according to a preset scheduling strategy, a resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from candidate GPU nodes; the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode; and when the number of the virtual resource candidate groups reaches the resource demand number of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
In a preferred embodiment, the processor when executing the computer program further performs the steps of:
after the deep learning task is sent to the target GPU node according to a preset scheduling strategy, the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
Example four:
an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps:
S1, after the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;
S2, the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;
and S4, when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy based on the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the computer program when executed by the processor further performs the steps of:
S5, after the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;
S6, the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
and S7, when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to a preset scheduling strategy based on the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module.
In a preferred embodiment, the computer program when executed by the processor further performs the steps of:
before the deep learning task is sent to the target GPU node according to a preset scheduling strategy, the resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; the communication connection modes of all GPUs in the target GPU node are the same; and the resource scheduling module selects network cards with the same communication connection mode in the target GPU node to schedule network resources.
In a preferred embodiment, the computer program when executed by the processor further performs the steps of:
before the GPU management module acquires GPU resource configuration information of a GPU node and sends the GPU resource configuration information to the node management module, the resource scheduling module deploys GPU virtualization services of the GPU node to the GPU node; and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
In a preferred embodiment, the computer program when executed by the processor further performs the steps of:
before the deep learning task is sent to a target GPU node according to a preset scheduling strategy, a resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from candidate GPU nodes; the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode; and when the number of the virtual resource candidate groups reaches the resource demand number of the deep learning task, the resource scheduling module takes the candidate GPU node as a target GPU node.
In a preferred embodiment, the computer program, when executed by the processor, further performs the steps of:
after the deep learning task is sent to the target GPU node according to a preset scheduling strategy, the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
It is understood that all or part of the flows in the above method embodiments may be implemented by a computer program; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the above method embodiments.
Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that the foregoing is only a preferred embodiment of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A resource management scheduling method of an artificial intelligence cluster is characterized in that an information storage module, a resource scheduling module and a plurality of GPU nodes are arranged in the artificial intelligence cluster; the GPU node is internally provided with a node management module and a GPU management module;
the resource management scheduling method comprises the following steps:
after the GPU management module deploys the GPU driver installation service of the GPU node on the GPU node, the GPU management module acquires GPU resource configuration information of the GPU node and sends the GPU resource configuration information to the node management module; the GPU driver installation service comprises the step of installing a GPU driver of the GPU node on a physical machine in a containerized manner;
the node management module sends GPU resource configuration information of the GPU node to the information storage module for storage;
and when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module and a preset scheduling strategy.
2. The method of claim 1, wherein the method further comprises:
after the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module acquires network resource configuration information of the GPU node and sends the network resource configuration information to the node management module; the network card driver installation service comprises the step of installing a network card driver of the GPU node on a physical machine in a containerized manner;
the node management module sends the network resource configuration information of the GPU node to the information storage module for storage;
and when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the preset scheduling strategy according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module.
3. The method of claim 2, wherein before sending the deep learning task to the target GPU node according to a preset scheduling policy, the method further comprises:
the resource scheduling module screens out a plurality of candidate GPU nodes according to the residual GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; the communication connection modes of all GPUs in the target GPU node are the same;
and the resource scheduling module selects network cards with the same communication connection mode in the target GPU node to schedule network resources.
4. The method of claim 3, wherein before the GPU management module obtains GPU resource configuration information of the GPU nodes and sends the GPU resource configuration information to the node management module, the method further comprises:
the resource scheduling module deploys GPU virtualization services of the GPU nodes to the GPU nodes;
and/or the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.
5. The method of claim 4, wherein before sending the deep learning task to the target GPU node according to a preset scheduling policy, the method further comprises:
the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; the virtual GPU and the virtual network card in the virtual resource candidate group belong to the same communication connection mode;
and when the number of the virtual resource candidate groups reaches the resource demand number of the deep learning task, the resource scheduling module takes the candidate GPU node as the target GPU node.
6. The method of claim 1, wherein the predetermined scheduling policy comprises at least one of:
sequencing and scheduling all deep learning tasks according to task scheduling priority levels;
scheduling all deep learning tasks according to a first-in first-out principle;
and scheduling all deep learning tasks according to the principle that high-priority queues and high-priority tasks are scheduled first.
7. The method of claim 1, wherein after sending the deep learning task to the target GPU node according to a preset scheduling policy, the method further comprises:
and the node management module in the target GPU node sends the residual GPU resource information of the target GPU node to the information storage module for updating.
8. A resource management scheduling apparatus of an artificial intelligence cluster, for implementing the resource management scheduling method of the artificial intelligence cluster according to any one of claims 1 to 7, the resource management scheduling apparatus comprising:
the GPU management module is used for deploying the GPU driver installation service of the GPU node to the GPU node, acquiring GPU resource configuration information of the GPU node and sending the GPU resource configuration information to the node management module; the GPU driver installation service comprises the step of installing a GPU driver of the GPU node on a physical machine in a containerized manner;
the node management module is used for sending GPU resource configuration information of the GPU node to the information storage module for storage;
and the resource scheduling module is used for sending the deep learning task to the target GPU node according to the preset scheduling strategy according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module when the deep learning task is received.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the resource management scheduling method of an artificial intelligence cluster according to any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when being executed by a processor, performs the steps of the method for resource management scheduling of an artificial intelligence cluster according to any one of claims 1-7.
CN202210609937.3A 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster Active CN115048216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210609937.3A CN115048216B (en) 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster


Publications (2)

Publication Number Publication Date
CN115048216A true CN115048216A (en) 2022-09-13
CN115048216B CN115048216B (en) 2024-06-04

Family

ID=83158949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210609937.3A Active CN115048216B (en) 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster

Country Status (1)

Country Link
CN (1) CN115048216B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885762A (en) * 2017-09-19 2018-04-06 北京百度网讯科技有限公司 Intelligent big data system, the method and apparatus that intelligent big data service is provided
CN112346859A (en) * 2020-10-26 2021-02-09 北京市商汤科技开发有限公司 Resource scheduling method and device, electronic equipment and storage medium
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN113301102A (en) * 2021-02-03 2021-08-24 阿里巴巴集团控股有限公司 Resource scheduling method, device, edge cloud network, program product and storage medium
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
WO2022033024A1 (en) * 2020-08-12 2022-02-17 ***股份有限公司 Distributed training method and apparatus of deep learning model
US20220083389A1 (en) * 2020-09-16 2022-03-17 Nutanix, Inc. Ai inference hardware resource scheduling


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024117552A1 * 2022-12-01 2024-06-06 Samsung Electronics Co., Ltd. Electronic device providing artificial intelligence service and control method therefor
CN115617364A * 2022-12-20 2023-01-17 Sinochem Modern Agriculture Co., Ltd. GPU virtualization deployment method, system, computer equipment and storage medium
CN115965517A * 2023-01-09 2023-04-14 Moore Threads Intelligent Technology (Beijing) Co., Ltd. Graphics processor resource management method and device, electronic device and storage medium
CN115965517B * 2023-01-09 2023-10-20 Moore Threads Intelligent Technology (Beijing) Co., Ltd. Graphics processor resource management method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115048216B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN115048216B (en) Resource management scheduling method, device and equipment of artificial intelligence cluster
CN110704186B (en) Computing resource allocation method and device based on hybrid distribution architecture and storage medium
CN110888743B (en) GPU resource using method, device and storage medium
CN110489213B (en) Task processing method and processing device and computer system
CN108337109B (en) Resource allocation method and device and resource allocation system
CN111274033B (en) Resource deployment method, device, server and storage medium
CN111597042A (en) Service thread running method and device, storage medium and electronic equipment
CN110389843B (en) Service scheduling method, device, equipment and readable storage medium
CN110162397B (en) Resource allocation method, device and system
CN112486642B (en) Resource scheduling method, device, electronic equipment and computer readable storage medium
CN111090456A (en) Construction method, device, equipment and medium for deep learning development environment
CN115292014A (en) Image rendering method and device and server
CN111338779A (en) Resource allocation method, device, computer equipment and storage medium
CN115686805A (en) GPU resource sharing method and device, and GPU resource sharing scheduling method and device
CN117311990B (en) Resource adjustment method and device, electronic equipment, storage medium and training platform
CN113986539A (en) Method, device, electronic equipment and readable storage medium for realizing pod fixed IP
CN112148481B (en) Method, system, equipment and medium for executing simulation test task
CN112114958A (en) Resource isolation method, distributed platform, computer device, and storage medium
CN116578416A (en) Signal-level simulation acceleration method based on GPU virtualization
CN115809126A (en) Job scheduling method and device in mixed deployment scene and electronic equipment
CN115686782A (en) Resource scheduling method and device based on solid state disk, electronic equipment and storage medium
CN112114959B (en) Resource scheduling method, distributed system, computer device and storage medium
CN106055410A (en) Cloud computing memory resource allocation method
CN112817691B (en) Resource allocation method, device, equipment and medium
CN117992239B (en) Resource management allocation method, intelligent computing cloud operating system and computing platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant