WO2024113836A1 - A resource allocation method and apparatus, and an artificial intelligence training system - Google Patents

A resource allocation method and apparatus, and an artificial intelligence training system

Info

Publication number
WO2024113836A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
graphics processor
graphics
development environment
task
Application number
PCT/CN2023/104043
Other languages
English (en)
French (fr)
Inventor
刘慧兴
陈培
Original Assignee
苏州元脑智能科技有限公司
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024113836A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00: General purpose image data processing
    • G06T 1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of artificial intelligence training, and in particular to a resource allocation method and device and an artificial intelligence training system.
  • k8s: Kubernetes, an open-source system for automatically deploying, scaling and managing containerized applications
  • In the related art, container management systems are used to manage GPU resources, in combination with pods or containers.
  • the management system will first determine whether the GPU idle resources meet the GPU resources required by the task. If they do, it will create a running container on the corresponding node and allocate GPU resources to the container according to the scheduling strategy, and then run the GPU task. After the task is completed, the container and the corresponding GPU resources will be recycled.
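The check-allocate-run-recycle flow just described can be sketched as follows. This is a minimal illustration, assuming hypothetical names (`Node`, `run_gpu_task`); it is not an actual container-management API such as the Kubernetes scheduler.

```python
from dataclasses import dataclass

@dataclass
class Node:
    total_gpus: int
    used_gpus: int = 0

    @property
    def idle_gpus(self) -> int:
        return self.total_gpus - self.used_gpus

def run_gpu_task(node: Node, required_gpus: int, task) -> bool:
    # 1. Check whether the node's idle GPU resources satisfy the task.
    if node.idle_gpus < required_gpus:
        return False  # scheduling fails
    # 2. Create a running container on the node and bind GPUs to it.
    node.used_gpus += required_gpus
    try:
        task()  # 3. Run the GPU task inside the container.
    finally:
        # 4. After the task completes, recycle the container and its GPUs.
        node.used_gpus -= required_gpus
    return True

node = Node(total_gpus=8)
run_gpu_task(node, required_gpus=2, task=lambda: None)
```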
  • This container-creation mode can only start tasks through the interface, not through the underlying command line. Once a user operates directly at the underlying level, GPU resource accounting becomes confused. In addition, each time a task is started in this mode, a container needs to be created, which is inconvenient for editing and debugging scripts. For this reason, the industry usually uses the development-environment method to share GPUs.
  • In this method, the user needs to create a development environment first, and the CPU (Central Processing Unit) and GPU resources are bound to the user's development environment. Once the development environment is created successfully, the corresponding GPU resources are bound; this is the pre-allocation mode. In this mode, the user can log in to the development environment in shell mode (i.e., through shell scripts on the command line), so development and debugging are relatively convenient. However, the biggest disadvantage of this mode is its poor flexibility and the high probability of GPU resources sitting idle: even if a user does not use the pre-allocated GPU resources, other users cannot use them, resulting in low utilization of the node's overall GPU resources.
  • CPU: Central Processing Unit
  • a resource allocation method, device and artificial intelligence training system are proposed to overcome the above problems or at least partially solve the above problems, including:
  • a resource allocation method is applied to an artificial intelligence training system, wherein the artificial intelligence training system includes a client plug-in library and at least one node, and multiple development environments are created in each node.
  • the method includes:
  • a target graphics processor request is obtained from a client plug-in library; wherein the target graphics processor request is generated by the client plug-in library after redirecting a loading process of a target deep learning framework of the target training task when it is detected that a target training task in a target development environment is started;
  • GPU resources are allocated to the target training task.
  • allocating graphics processor resources to a target training task includes:
  • Allocate GPU resources for the target training task based on the target GPU quota and target GPU request.
  • the method further comprises:
  • a target node corresponding to the target development environment deploys corresponding graphics processor resources, and the target graphics processor quota includes a target graphics processor quota number; allocating graphics processor resources to the target training task according to the target graphics processor quota and the target graphics processor request includes:
  • GPU resources are allocated to the target training task from the GPU resources corresponding to the target node that are not currently used by the target development environment.
  • GPU resources are allocated to the target training task from the GPUs currently used by the target development environment.
  • allocating graphics processor resources to a target training task from a graphics processor currently used by a target development environment includes:
  • the GPUs currently used by the target development environment are sorted according to the number of tasks, and GPU resources are allocated to the target training task from the top N GPUs with the smallest number of tasks;
  • N is the number of GPUs required for the target training task.
  • allocating a graphics processor resource to a target training task from a graphics processor currently not used by a target development environment among graphics processor resources corresponding to a target node includes:
  • graphics processor resources are allocated to the target training task from the graphics processors with a current task number less than the task number threshold and the currently used graphics processors of the target development environment.
  • GPU resources are allocated to the target training task from the GPUs whose current task number is less than the task number threshold.
  • Allocating GPU resources to the target training task from the GPUs whose current task number is less than the task number threshold and the currently used GPUs of the target development environment includes:
  • GPU resources are allocated to the target training task from the first M GPUs with the smallest number of tasks among the currently used GPUs of the target development environment, together with the GPUs whose current number of tasks is less than the task number threshold;
  • M is the number of graphics processors reused.
  • the method further comprises:
  • the target GPU quota also includes the target GPU memory quota capacity, and the GPU resources are allocated to the target training task according to the target GPU quota and the target GPU request, further comprising:
  • Allocate GPU memory capacity to the target training task based on the target GPU memory quota capacity.
  • allocating graphics processor memory capacity to a target training task according to the target graphics processor memory quota capacity includes:
  • if there are no other training tasks in the target development environment, the GPU memory capacity is allocated to the target training task according to the target GPU memory quota capacity;
  • if there are other training tasks in the target development environment, the remaining GPU memory quota capacity of the target development environment is determined according to the target training task, and the allocation information is updated according to the remaining GPU memory quota capacity of the target development environment.
  • the method further comprises:
  • when it is detected that the target training task has finished, the GPU resources allocated to the target training task are recovered.
  • the method further comprises:
  • when no heartbeat information for the target training task is received within a preset period, the graphics processor resources allocated to the target training task are reclaimed.
  • the embodiment of the present application also provides an artificial intelligence training system, including at least one node, a node manager and a client plug-in library, wherein multiple development environments are created in each node;
  • a client plug-in library is used to generate a target graphics processor request after redirecting the loading process of the target deep learning framework of the target training task when detecting that the target training task in the target development environment is started; and send the target graphics processor request to the node manager;
  • a node manager configured to allocate graphics processor resources to a target training task in response to a target graphics processor request.
  • the target deep learning framework uses GPU resources allocated by the node manager for AI training.
  • the node manager is used to determine a target graphics processor quota pre-configured for a target development environment; and allocate graphics processor resources to a target training task according to the target graphics processor quota and a target graphics processor request.
  • the node manager includes a graphics processor management module
  • the graphics processor management module is used to store the graphics processor quotas entered by the user for each development environment.
  • a target node corresponding to a target development environment deploys corresponding graphics processor resources, and the target graphics processor quota includes the number of target graphics processor quotas;
  • the node manager is used to determine the number of graphics processors currently in use in the target development environment; if the number of graphics processors currently in use is less than the target graphics processor quota number, then the graphics processor resources are allocated to the target training task from the graphics processors currently not in use by the target development environment in the graphics processor resources corresponding to the target node; if the number of graphics processors currently in use is equal to the target graphics processor quota number, then the graphics processor resources are allocated to the target training task from the graphics processors currently in use by the target development environment.
  • the node manager is used to determine the number of graphics processors required for a target training task; if the required number of graphics processors is greater than the number of graphics processors currently in use, a scheduling failure message is generated; if the required number of graphics processors is not greater than the number of graphics processors currently in use, the graphics processors currently used by the target development environment are sorted according to the number of tasks, and graphics processor resources are allocated to the target training task from the top N graphics processors with the smallest number of tasks; wherein N is the number of graphics processors required for the target training task.
  • the node manager is used to determine the current number of tasks of a graphics processor that is not currently used by a target development environment, and determine the number of graphics processors whose current task number is less than a task number threshold; if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for a target training task, then allocate graphics processor resources to the target training task from the graphics processors whose current task number is less than the task number threshold and the currently used graphics processors of the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, then allocate graphics processor resources to the target training task from the graphics processors whose current task number is less than the task number threshold.
  • the node manager is used to determine the number of graphics processors to be reused based on the number of graphics processors required for the target training task and the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment; if the number of graphics processors currently used is less than the number of graphics processors to be reused, a scheduling failure message is generated; if the number of graphics processors currently used is not less than the number of graphics processors to be reused, the first M graphics processors with the smallest number of tasks among the currently used graphics processors of the target development environment and the graphics processors whose current number of tasks is less than a task number threshold are allocated graphics processor resources for the target training task; wherein M is the number of graphics processors to be reused.
  • the node manager includes an auxiliary module
  • the auxiliary module is used to record the allocation process of the node manager, generate and store the allocation information of the graphics processor resources for each development environment.
  • the target graphics processor quota also includes the target graphics processor video memory quota capacity
  • the node manager is also used to allocate the graphics processor memory capacity to the target training task according to the target graphics processor memory quota capacity.
  • the node manager is used to determine whether there are other training tasks in the target development environment when starting the target training task; if there are no other training tasks in the target development environment, the graphics processor video memory capacity is allocated to the target training task according to the target graphics processor video memory quota capacity; if there are other training tasks in the target development environment, the remaining graphics processor video memory quota capacity of the target development environment is determined according to the target training task, and the allocation information is updated according to the remaining graphics processor video memory quota capacity of the target development environment.
  • the node manager is further configured to reclaim the graphics processor resources allocated to the target training task when it is detected that the target training task is finished.
  • the node manager is further used to reclaim the graphics processor resources allocated to the target training task when no heartbeat information for the target training task is received within a preset period.
  • the embodiment of the present application also provides a resource allocation device, which is applied to an artificial intelligence training system.
  • the artificial intelligence training system includes a client plug-in library and at least one node. Multiple development environments are created in each node.
  • the device includes:
  • a request module is used to obtain a target graphics processor request from a client plug-in library when it is detected that a target training task in a target development environment is started; wherein the target graphics processor request is generated by the client plug-in library after redirecting the loading process of the target deep learning framework of the target training task when it is detected that a target training task in a target development environment is started;
  • the allocation module is used to allocate graphics processor resources to the target training task in response to the target graphics processor request.
  • the allocation module is used to determine a target graphics processor quota pre-configured for a target development environment; and allocate graphics processor resources to a target training task according to the target graphics processor quota and a target graphics processor request.
  • the device further comprises:
  • the quota receiving module is used to receive the GPU quota input by the user for each development environment.
  • the target node corresponding to the target development environment deploys corresponding graphics processor resources
  • the target graphics processor quota includes the target graphics processor quota number
  • an allocation module is used to determine the number of graphics processors currently used by the target development environment; if the number of graphics processors currently used is less than the target graphics processor quota number, then the graphics processor resources are allocated to the target training task from the graphics processors currently not used by the target development environment in the graphics processor resources corresponding to the target node; if the number of graphics processors currently used is equal to the target graphics processor quota number, then the graphics processor resources are allocated to the target training task from the graphics processors currently used by the target development environment.
  • an allocation module is used to determine the number of graphics processors required for a target training task; if the required number of graphics processors is greater than the number of graphics processors currently in use, a scheduling failure message is generated; if the required number of graphics processors is not greater than the number of graphics processors currently in use, the graphics processors currently used by the target development environment are sorted according to the number of tasks, and graphics processor resources are allocated to the target training task from the top N graphics processors with the smallest number of tasks; wherein N is the number of graphics processors required for the target training task.
  • an allocation module is used to determine the current number of tasks of a graphics processor that is not currently used by a target development environment, and to determine the number of graphics processors whose current task number is less than a task number threshold; if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for a target training task, then graphics processor resources are allocated to the target training task from the graphics processors whose current task number is less than the task number threshold and the currently used graphics processors of the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, then graphics processor resources are allocated to the target training task from the graphics processors whose current task number is less than the task number threshold.
  • an allocation module is used to determine the number of graphics processors to be reused based on the number of graphics processors required for the target training task and the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment; if the number of graphics processors currently used is less than the number of graphics processors to be reused, a scheduling failure message is generated; if the number of graphics processors currently used is not less than the number of graphics processors to be reused, the first M graphics processors with the smallest number of tasks among the currently used graphics processors of the target development environment and the graphics processors whose current number of tasks is less than a task number threshold are allocated graphics processor resources for the target training task; wherein M is the number of graphics processors to be reused.
  • the device further comprises:
  • the allocation information generation module is used to generate and store allocation information of graphics processor resources for each development environment.
  • the target graphics processor quota also includes a target graphics processor video memory quota capacity, and the allocation module is further used to allocate the graphics processor video memory capacity to the target training task according to the target graphics processor video memory quota capacity.
  • an allocation module is used to determine whether there are other training tasks in the target development environment when starting the target training task; if there are no other training tasks in the target development environment, the graphics processor video memory capacity is allocated to the target training task according to the target graphics processor video memory quota capacity; if there are other training tasks in the target development environment, the remaining graphics processor video memory quota capacity of the target development environment is determined according to the target training task, and the allocation information is updated according to the remaining graphics processor video memory quota capacity of the target development environment.
  • the device further comprises:
  • the first recycling module is used to recycle the graphics processor resources allocated to the target training task when it is detected that the target training task is finished.
  • the device further comprises:
  • the second recycling module is used to recycle the graphics processor resources allocated to the target training task when no heartbeat information for the target training task is received within a preset period.
  • An embodiment of the present application also provides an electronic device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and the computer program implements the above resource allocation method when executed by the processor.
  • the embodiment of the present application also provides a non-volatile computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the above resource allocation method is implemented.
  • the embodiment of the present application when the target training task in the target development environment is detected to be started, the loading of the client plug-in library will be triggered; thus, the client plug-in library can redirect the loading process of the target deep learning framework of the target training task to hijack the startup process of the deep learning framework, and generate a target graphics processor request in this process to request the allocation of graphics processor resources for the target training task.
  • the embodiment of the present application starts from the perspective of the deep learning framework, analyzes the loading logic of the deep learning framework when the training task is started, and realizes dynamic graphics processor sharing by hijacking the framework.
  • the method is simple to implement and does not require modification of the framework.
  • the user has no perception when using it, and it is as flexible as the default graphics processor sharing mode.
  • the graphics processor resources are untied from the development environment, and the graphics processor resources are allocated only when the user actually starts the training task, which solves the problem of users occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node graphics processor resources, and improves the utilization rate of the overall graphics processor resources of the node.
  • FIG1 is a flowchart of a method for allocating resources according to an embodiment of the present application.
  • FIG2 is a flowchart of another method for allocating resources according to an embodiment of the present application.
  • FIG3 is a flowchart of a step of scheduling a graphics processor according to an embodiment of the present application.
  • FIG4 is a flowchart of another step of scheduling a graphics processor according to an embodiment of the present application.
  • FIG5 is a schematic flowchart of another graphics processor scheduling process according to an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an artificial intelligence training system according to an embodiment of the present application.
  • FIG7 is a partial structural diagram of an artificial intelligence training system according to an embodiment of the present application.
  • FIG9 is a schematic diagram of the structure of an electronic device according to an embodiment of the present application.
  • FIG10 is a schematic diagram of a non-volatile computer-readable storage medium according to an embodiment of the present application.
  • an embodiment of the present application provides a resource allocation method, which can be applied to an artificial intelligence training system; the artificial intelligence training system can be used for task training, such as image recognition training, etc.
  • When the start of a target training task in the target development environment is detected, the loading of the client plug-in library will be triggered; thus, the client plug-in library can redirect the loading process of the target deep learning framework of the target training task to hijack the startup process of the deep learning framework, and generate a target graphics processor request in this process to request the allocation of graphics processor resources for the target training task.
  • the embodiment of the present application starts from the perspective of the deep learning framework, analyzes the loading logic of the deep learning framework when the training task is started, and realizes dynamic graphics processor sharing by hijacking the framework.
  • the method is simple to implement and does not require modifying the framework.
  • the user has no perception when using it, and it is as flexible as the default graphics processor sharing mode.
  • the graphics processor resources are untied from the development environment.
  • the graphics processor resources are allocated only when the user actually starts the training task. This solves the problem of users occupying graphics processor resources in vain in the pre-allocation mode, efficiently utilizes the node graphics processor resources, and improves the overall utilization rate of the node's graphics processor resources.
  • the method can be applied to an artificial intelligence training system.
  • the artificial intelligence training system may include a client plug-in library and at least one node, and multiple development environments may be created in each node.
  • Step 101: When it is detected that a target training task in a target development environment is started, a target graphics processor request is obtained from a client plug-in library; wherein the target graphics processor request is generated by the client plug-in library after redirecting the loading process of the target deep learning framework of the target training task when it is detected that the target training task in the target development environment is started.
  • the client plug-in library can be set according to actual conditions. It can be used to request graphics processor resources and keep tasks alive, and can also be used for other functions, such as sending training task completion messages, etc.
  • the embodiments of the present application do not limit this.
  • the target training task may be a training task initiated by a user in a target development environment, for example, it may be an image recognition training task.
  • the target deep learning framework can be a deep learning framework selected by the user according to need, such as: Caffe (a deep learning framework written in C++), TensorFlow (an open-source deep learning framework designed around Python), PyTorch (a full-featured deep learning framework for building deep learning models; deep learning is a type of machine learning commonly used in applications such as image recognition and language processing), or MXNet (an open-source deep learning framework that allows users to define, train and deploy deep neural networks on a variety of devices, whether cloud infrastructure or mobile devices), etc.; the embodiments of the present application are not limited to this.
  • a client plug-in library can be set up in the artificial intelligence training system; when the target training task in the target development environment is detected to be started, the loading of the client plug-in library will be triggered.
  • the client plugin library can redirect the loading process of the target deep learning framework and generate the corresponding target GPU request.
  • the target graphics processor request may include graphics processor resources required for the target training task, an identifier of the target development environment, etc., which is not limited in the embodiments of the present application.
  • the client plug-in library can send it to the node manager in the artificial intelligence training system; the node manager can be used to allocate GPU resources.
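As an illustration of the behavior described above, the sketch below intercepts the first import of a deep learning framework from Python and issues a graphics processor request before the framework finishes loading. The patent does not specify the redirection mechanism; the import-hook approach, `request_gpus`, and the environment identifier are assumptions made for this example.

```python
import importlib.abc
import sys

FRAMEWORKS = {"torch", "tensorflow", "mxnet"}

def request_gpus(env_id: str, framework: str) -> None:
    # Hypothetical: send a "target graphics processor request" (required GPU
    # count, development-environment identifier, ...) to the node manager
    # and wait for its allocation response.
    print(f"requesting GPUs for {framework} in environment {env_id}")

class GpuRequestHook(importlib.abc.MetaPathFinder):
    """Fires a GPU request the first time a framework module is imported."""

    def __init__(self) -> None:
        self._seen = set()

    def find_spec(self, fullname, path, target=None):
        if fullname in FRAMEWORKS and fullname not in self._seen:
            self._seen.add(fullname)
            request_gpus("dev-env-1", fullname)
        return None  # returning None lets the normal import machinery load it

sys.meta_path.insert(0, GpuRequestHook())
```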
  • Step 102: In response to the target GPU request, allocate GPU resources for the target training task.
  • the node manager can respond to the target graphics processor request and allocate graphics processor resources to the target training task; for example: which graphics processors, and how many, to allocate to the target training task, how much video memory capacity to allocate to the target training task, etc., which is not limited in this embodiment of the present application.
  • GPU resources are allocated only when the user actually starts a training task, which solves the problem of users occupying GPU resources in vain in the pre-allocation mode, efficiently utilizes the node GPU resources, and improves the overall GPU resource utilization of the node.
  • the target deep learning framework may use the graphics processor resources allocated to the target training task by the node manager to perform artificial intelligence training.
  • the embodiment of the present application starts from the perspective of the deep learning framework, analyzes the loading logic of the deep learning framework when the training task is started, and realizes dynamic graphics processor sharing by hijacking the framework.
  • the method is simple to implement and does not require modification of the framework. The user has no perception when using it, and it is as flexible as the default graphics processor sharing mode.
  • the graphics processor resources are untied from the development environment, and the graphics processor resources are allocated only when the user actually starts the training task, which solves the problem of users occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node graphics processor resources, and improves the utilization rate of the overall graphics processor resources of the node.
  • Step 201: When it is detected that a target training task in a target development environment is started, a target graphics processor request is obtained from a client plug-in library.
  • a client plug-in library can be set up in the artificial intelligence training system; when the target training task in the target development environment is detected to be started, the loading of the client plug-in library will be triggered.
  • the client plug-in library can send it to the node manager in the artificial intelligence training system.
  • Step 202: Determine the target graphics processor quota pre-configured for the target development environment.
  • the target GPU quota may be the maximum GPU resources that can be used in the development environment and are set by the user in advance for the target development environment, such as the maximum number of GPUs that can be used, the size of video memory, etc., and this embodiment of the present application does not limit this.
  • a graphics processor quota input by a user for each development environment may be received in advance.
  • one of the prerequisites for GPU scheduling is to know the GPU resources required for the training task, such as the number of GPUs and the size of the GPU memory, etc.
  • this information may not be known to the node manager, especially the required GPU memory size.
  • the node manager can store it, so that when it is necessary to allocate graphics processor resources for the target training task in the target development environment in the future, the node manager can also allocate graphics processor resources based on the graphics processor quota pre-configured for the development environment.
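A minimal sketch of how the node manager might store per-development-environment quotas; the field names (`gpu_count`, `memory_gb`) and the registry layout are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class GpuQuota:
    gpu_count: int     # target graphics processor quota number
    memory_gb: float   # target graphics processor video memory quota capacity

# Registry keyed by development-environment identifier.
quotas: dict = {}

def set_quota(env_id: str, gpu_count: int, memory_gb: float) -> None:
    quotas[env_id] = GpuQuota(gpu_count, memory_gb)

set_quota("dev-env-1", gpu_count=2, memory_gb=5.0)
```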
  • Step 203: Allocate GPU resources for the target training task according to the target GPU quota and the target GPU request.
  • GPU resources may be allocated to the target training task based on the target GPU quota and the target GPU request.
  • the target graphics processor request includes information about the graphics processor resources required for the target training task
  • the graphics processor resources can be directly allocated to the target training task based on the target graphics processor request; however, it should be noted that if the graphics processor resources required for the target training task exceed the target graphics processor quota, it may indicate that the target development environment is unable to perform the target training task. At this time, a scheduling failure message may be generated and returned to the user.
  • the GPU resources may be allocated to the target training task based on the target GPU quota, which is not limited in this embodiment of the present application.
  • the target node corresponding to the target development environment may be deployed with corresponding graphics processor resources (for example, 3 graphics processors are deployed, and the video memory size is 5 GB), and the target graphics processor quota includes the target graphics processor quota number.
  • the allocation process may be implemented through the following sub-steps:
  • Sub-step 11: Determine the number of currently used graphics processors in the target development environment.
  • the target development environment may not only be performing the target training task, but may also be performing other training tasks; in this case, the number of graphics processors already used by the target development environment can first be determined, that is, the number of graphics processors used by the training tasks currently being performed by the target development environment.
  • Sub-step 12: If the number of currently used GPUs is less than the target GPU quota number, GPU resources are allocated to the target training task from the GPU resources corresponding to the target node that are not currently used by the target development environment.
  • After determining the number of GPUs currently used by the target development environment, it can be compared with the target GPU quota number. If the number of currently used GPUs is less than the target GPU quota number, it means that the target development environment still has GPU quota remaining.
  • a graphics processor resource can be allocated to the target training task from the graphics processor resources corresponding to the target node that are not currently used by the target development environment, so that the target training task uses a new graphics processor for training.
  • the allocation process of sub-step 12 may be implemented by the following steps:
  • the current task number of the graphics processor that is not currently used by the target development environment can be determined first, and the number of graphics processors whose current task number is less than the task number threshold can be determined; then, the number of graphics processors whose current task number is less than the task number threshold can be compared with the number of graphics processors required for the target training task.
  • the task number threshold can be set according to actual conditions, and the embodiments of the present application do not limit this.
  • If the number of GPUs whose current task count is less than the task number threshold is less than the number of GPUs required for the target training task, this may indicate that the GPUs not used by the target development environment cannot meet the GPU resource requirements of the target training task.
  • graphics processor resources can be allocated to the target training task from the graphics processors whose current task number is less than the task number threshold and the currently used graphics processors of the target development environment, so as to complete the target training task through multiple graphics processors as much as possible.
  • If the number of GPUs whose current task count is less than the task number threshold is not less than the number of GPUs required for the target training task, it means that the GPUs not used by the target development environment can meet the GPU resource requirements of the target training task.
  • GPU resources can be allocated to the target training task from GPUs whose current task number is less than the task number threshold, so that the new GPU is allocated to the target training task.
  • Sub-step 13: If the number of currently used graphics processors is equal to the target graphics processor quota number, then allocate graphics processor resources to the target training task from the graphics processors currently used by the target development environment.
  • If the number of currently used graphics processors is equal to the target graphics processor quota number, this indicates that the target development environment has no graphics processor quota remaining.
  • GPU resources can be allocated to the target training task from the GPU currently used by the target development environment.
  • the allocation process of sub-step 13 may be implemented by the following steps:
  • the number of graphics processors required for the target training task may be determined first; then, the number of graphics processors required for the target training task and the number of graphics processors currently used in the target development environment may be compared.
  • If the number of graphics processors required for the target training task is greater than the number of graphics processors currently in use in the target development environment, it may indicate that the target development environment cannot meet the graphics processor resource requirements of the target training task.
  • scheduling failure information may be generated and returned to the user.
  • If the number of graphics processors required for the target training task is not greater than the number of graphics processors currently in use in the target development environment, it indicates that the target development environment can meet the graphics processor resource requirements of the target training task.
  • the graphics processors currently used by the target development environment can be sorted according to the number of tasks, and graphics processor resources can be allocated to the target training task from the top N graphics processors with the smallest number of tasks; where N refers to the number of graphics processors required for the target training task.
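The least-loaded selection in this sub-step can be sketched as follows, assuming a hypothetical `task_counts` map from GPU id to its current number of tasks.

```python
def reuse_least_loaded(task_counts: dict, n_required: int) -> list:
    # Scheduling fails if the task needs more GPUs than the environment uses.
    if n_required > len(task_counts):
        raise RuntimeError("scheduling failed")
    # Sort currently used GPUs by ascending task count and take the first N.
    ordered = sorted(task_counts, key=lambda gpu: task_counts[gpu])
    return ordered[:n_required]

# Example: three GPUs in use; a task needing 2 GPUs reuses the least loaded.
print(reuse_least_loaded({0: 3, 1: 1, 2: 2}, 2))  # -> [1, 2]
```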
  • the allocation process may be performed through the following steps:
  • the number of graphics processors to be reused is determined based on the number of graphics processors required for the target training task and the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment; if the number of graphics processors currently in use is less than the number of graphics processors to be reused, a scheduling failure message is generated; if the number of graphics processors currently in use is not less than the number of graphics processors to be reused, the first M graphics processors with the smallest number of tasks among the currently used graphics processors of the target development environment and the graphics processors whose current number of tasks is less than the task number threshold are allocated graphics processor resources for the target training task; wherein M is the number of graphics processors to be reused.
  • the number of graphics processors that need to be reused can be determined based on the number of graphics processors required for the target training task and the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment. For example, the number of graphics processors required for the target training task can be subtracted from the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment to obtain the number of graphics processors that need to be reused.
  • If the number of graphics processors currently used is less than the number of graphics processors that need to be reused, it may indicate that the target development environment cannot perform the target training task; at this time, a scheduling failure message may be generated and sent to the user.
  • If the number of graphics processors currently used is not less than the number of graphics processors that need to be reused, the target development environment can perform the target training task after the new graphics processors are allocated and some of its current graphics processors are reused.
  • the first M graphics processors with the smallest number of tasks among the graphics processors currently used in the target development environment can be determined first; then, graphics processor resources can be allocated to the target training task from the first M graphics processors with the smallest number of tasks among the graphics processors currently used in the target development environment, and the graphics processors whose current task numbers are less than the task number threshold.
  • allocation information of the graphics processor resources for each development environment may also be generated and stored, so that the node manager may determine the allocation status of the graphics processor for each development environment based on the allocation information.
  • FIG3 shows a flowchart of the steps of scheduling a graphics processor according to an embodiment of the present application:
  • If the first number (the number of graphics processors currently used by the target development environment) is less than the third number (the number of graphics processors required for the training task), the scheduling fails. If the first number is not less than the third number, the GPUs currently used by the target development environment are sorted from small to large according to the number of tasks to obtain the first set; then, a number of GPUs equal to the third number is selected from the front of the first set to allocate GPU resources for the training task; at this point, the scheduling is successful.
  • the new GPUs that meet the condition (a current task count below the task number threshold) can be obtained and recorded as the second set, their count being the second number. If the number of GPUs in the second set is greater than the third number, a number of GPUs equal to the third number is selected directly from the front of the second set to allocate GPU resources for the training task; at this point, the scheduling is successful. If the number of GPUs in the second set is not greater than the third number, GPU resources are allocated to the training task from the GPUs whose current number of tasks is less than the task number threshold and the currently used GPUs in the development environment.
  • the graphics processors currently used by the target development environment can be sorted from small to large according to the number of tasks to obtain a third set; then, based on the number of graphics processors required for the training task and the number of graphics processors in the graphics processor resources corresponding to the node that are not currently used by the development environment, the number of graphics processors to be reused (recorded as the fourth number) is determined as the third number minus the second number.
  • If the fourth number is greater than the number of GPUs in the third set, the scheduling fails; if the fourth number is not greater than the number of GPUs in the third set, GPU resources are allocated to the training task from the GPUs in the second set together with a number of GPUs equal to the fourth number taken from the front of the third set.
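Putting the branches above together, here is a sketch of the scheduling flow in terms of the "numbers" defined in the text: the second number counts qualifying new GPUs, the third number is the GPUs the task requires, and the fourth number (third minus second) is the GPUs to reuse. The data structures are illustrative assumptions.

```python
def schedule(used: dict, unused: dict, required: int, threshold: int) -> list:
    # Second set: new GPUs whose current task count is below the threshold.
    second_set = [g for g, tasks in unused.items() if tasks < threshold]
    if len(second_set) >= required:
        return second_set[:required]        # allocate new GPUs only
    # Third set: currently used GPUs sorted ascending by task count.
    third_set = sorted(used, key=lambda g: used[g])
    n_reuse = required - len(second_set)    # fourth number = third - second
    if n_reuse > len(third_set):
        raise RuntimeError("scheduling failed")
    return second_set + third_set[:n_reuse]  # mix new and reused GPUs

# GPU 2 qualifies as new (0 tasks < threshold); GPU 1 is the least-loaded
# currently used GPU, so a 2-GPU task gets [2, 1].
print(schedule(used={0: 2, 1: 1}, unused={2: 0, 3: 5}, required=2, threshold=4))
```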
  • the target graphics processor quota also includes the target graphics processor video memory quota capacity, and the target graphics processor video memory quota capacity may be pre-input by the user.
  • Sub-step 21: Allocate GPU memory capacity to the target training task according to the target GPU memory quota capacity.
  • the GPU memory capacity can be directly allocated to the target training task based on the target GPU memory quota capacity.
  • the allocation process can be implemented through the following steps:
  • if there are no other training tasks in the target development environment, the GPU memory capacity can be directly allocated to the target training task according to the target GPU memory quota capacity.
  • Since this allocation is based directly on the target GPU video memory quota capacity, and the target GPU video memory quota capacity is the maximum GPU video memory capacity that the target development environment can use, no further video memory can be allocated beyond it.
  • the graphics processor memory capacity of the target development environment is fixed; therefore, after the graphics processor memory capacity of the target development environment is allocated to the target training task, the remaining graphics processor memory quota capacity of the target development environment can be determined according to the target training task, and the allocation information can be updated according to the remaining graphics processor memory quota capacity of the target development environment.
  • the remaining GPU memory quota capacity of the target development environment can be determined based on the allocation information, and based on the remaining GPU memory quota capacity, it can be determined whether the target development environment can run the new training task normally.
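The video-memory accounting described above might look like the sketch below, where `remaining_gb` stands in for the per-environment allocation information; all names are assumptions.

```python
remaining_gb: dict = {}  # development environment id -> unallocated quota

def allocate_memory(env_id: str, quota_gb: float, request_gb: float) -> float:
    if env_id not in remaining_gb:       # no other training task yet:
        remaining_gb[env_id] = quota_gb  # start from the full quota capacity
    if request_gb > remaining_gb[env_id]:
        raise RuntimeError("insufficient video memory quota")
    remaining_gb[env_id] -= request_gb   # update the allocation information
    return request_gb

allocate_memory("dev-env-1", quota_gb=5.0, request_gb=3.0)  # 2.0 GB remains
```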
  • the GPUs can be sorted from large to small according to the remaining resources of the GPUs, and the task information can be added to the timeout mechanism to determine whether the training task has timed out.
  • one or more of the following steps of recovering GPU resources may be included:
  • The first is: when it is detected that the target training task has finished, the GPU resources allocated to the target training task are recovered.
  • the GPU resources allocated to the target training task can be recovered; at the same time, the allocation information corresponding to the target development environment can be updated.
  • the second method is to recycle the GPU resources allocated to the target training task when no heartbeat information for the target training task is received within a preset period.
  • When the client plug-in library detects that the target training task has started running, it can send heartbeat information for the target training task to the node manager according to a preset period.
  • If the node manager does not receive the heartbeat information for the target training task within the preset period, it may indicate that the target training task is abnormal.
  • the graphics processor resources allocated to the target training task can be recovered; at the same time, the allocation information corresponding to the target development environment can be updated.
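A minimal sketch of the heartbeat timeout just described; the preset period, the polling structure, and the `reclaim` callback are illustrative assumptions.

```python
import time

HEARTBEAT_PERIOD_S = 30.0
last_heartbeat: dict = {}  # task id -> timestamp of the last heartbeat

def on_heartbeat(task_id: str) -> None:
    last_heartbeat[task_id] = time.monotonic()

def reap_stale_tasks(reclaim) -> None:
    now = time.monotonic()
    for task_id, seen in list(last_heartbeat.items()):
        if now - seen > HEARTBEAT_PERIOD_S:
            reclaim(task_id)             # recycle the task's GPU resources
            del last_heartbeat[task_id]  # and drop its allocation record
```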
  • FIG. 5 shows a schematic diagram of another process flow of scheduling a graphics processor according to an embodiment of the present application:
  • If the GPU quota is 0, it means that training tasks in the development environment cannot use GPU resources, and operations on the GPU will exit with an error; alternatively, only the CPU can be used for training.
  • the number of GPUs allocated for the training task can be determined.
  • the client plug-in library can establish a communication and heartbeat reporting handle with the node manager, request a graphics processor from the node manager, and wait for a response from the node manager.
  • If the allocation is successful, the training task is run, and when the training task ends, a task completion message is sent to the node manager.
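The client-side lifecycle in this flow (establish communication, request graphics processors, run the task, report completion) might be sketched as follows; `NodeManagerClient` and its methods are hypothetical stubs, not an API defined by the application.

```python
class NodeManagerClient:
    """Hypothetical stand-in for the plug-in's channel to the node manager."""

    def request_gpus(self, env_id: str, n: int) -> list:
        return list(range(n))  # pretend GPUs 0..n-1 were allocated

    def heartbeat(self, task_id: str) -> None:
        pass  # periodic keep-alive report

    def task_finished(self, task_id: str) -> None:
        pass  # task completion message

def run_training_task(client: NodeManagerClient, env_id: str,
                      task_id: str, n_gpus: int, train) -> None:
    gpus = client.request_gpus(env_id, n_gpus)  # wait for the allocation
    try:
        train(gpus)  # run the training task (heartbeats would be periodic)
    finally:
        client.task_finished(task_id)  # notify the node manager at the end
```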
  • When it is detected that a target training task in a target development environment is started, a target graphics processor request is obtained from the client plug-in library; a target graphics processor quota pre-configured for the target development environment is determined; and graphics processor resources are allocated to the target training task based on the target graphics processor quota and the target graphics processor request. Based on over-subscription of the node's graphics processor resources (i.e., pre-configuring graphics processor quotas for development environments), the problem of user development environments idly occupying graphics processor resources is solved, and the utilization of node graphics processor resources is improved.
  • Referring to FIG6, a schematic diagram of the structure of an artificial intelligence training system according to an embodiment of the present application is shown; the system may include at least one node, a node manager, and a client plug-in library, and multiple development environments are created in each node.
  • the client plug-in library is used to generate a target graphics processor request after redirecting the loading process of the target deep learning framework of the target training task when detecting that the target training task in the target development environment is started; and send the target graphics processor request to the node manager.
  • a client plug-in library can be set up in the artificial intelligence training system; when the target training task in the target development environment is detected to be started, the loading of the client plug-in library will be triggered.
  • the client plugin library can redirect the loading process of the target deep learning framework and generate the corresponding target GPU request.
  • the client plug-in library can send it to the node manager in the artificial intelligence training system.
  • the node manager is used to allocate graphics processor resources to a target training task in response to a target graphics processor request.
  • the node manager may allocate GPU resources for the target training task in response to the target GPU request.
  • the node manager can also be used to determine a target graphics processor quota pre-configured for a target development environment; and allocate graphics processor resources to a target training task based on the target graphics processor quota and the target graphics processor request.
  • the node manager may include an auxiliary module; the auxiliary module is used to record the allocation process of the node manager. Specifically, the auxiliary module can generate and store allocation information of the graphics processor resources for each development environment, so that the node manager can determine the allocation of the graphics processor of each development environment based on the allocation information.
  • the target deep learning framework uses graphics processor resources allocated by the node manager to perform artificial intelligence training.
  • the target deep learning framework can use the graphics processor resources allocated to the target training task by the node manager for artificial intelligence training.
  • one of the prerequisites for GPU scheduling is to know the GPU resources required for the training task, such as the number of GPUs and the size of the GPU memory, etc.
  • this information may not be known to the node manager, especially the required GPU memory size.
  • the node manager can store it, so that when it is necessary to allocate graphics processor resources for the target training task in the target development environment in the future, the node manager can also allocate graphics processor resources based on the graphics processor quota pre-configured for the development environment.
  • the node manager may include a graphics processor management module; the graphics processor management module is used to store the graphics processor quota input by the user for each development environment.
  • the node manager can allocate GPU resources for the target training task based on the target GPU quota and the target GPU request.
  • the node manager may allocate GPU resources to the target training task based on the target GPU quota.
  • the target node corresponding to the target development environment deploys corresponding graphics processor resources
  • the target graphics processor quota includes the target graphics processor quota number
  • the node manager is used to determine the current number of used graphics processors of the target development environment; if the current number of used graphics processors is less than the target graphics processor quota number, then the graphics processor resources are allocated to the target training task from the graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment; if the current number of used graphics processors is equal to the target graphics processor quota number, then the graphics processor resources are allocated to the target training task from the graphics processors currently used by the target development environment.
  • the target development environment may not only be performing the target training task, but may also be performing other training tasks; in this case, the node manager can first determine the number of graphics processors already used by the target development environment, that is, the number of graphics processors used by the training task currently being performed by the target development environment.
  • the node manager can compare it with the target graphics processor quota number; if the current number of used graphics processors is less than the target graphics processor quota number, it can indicate that the target development environment still has graphics processor quota remaining.
  • the node manager can allocate graphics processor resources to the target training task from the graphics processor resources corresponding to the target node that are not currently used by the target development environment, so that the target training task uses the new graphics processor for training.
  • the node manager is used to determine the current number of tasks of the graphics processors that are not currently used by the target development environment, and to determine the number of graphics processors whose current number of tasks is less than the task number threshold; if the number of graphics processors whose current number of tasks is less than the task number threshold is less than the number of graphics processors required for the target training task, then the graphics processor resources are allocated to the target training task from the graphics processors whose current task number is less than the task number threshold and the currently used graphics processors in the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, then the graphics processor resources are allocated to the target training task from the graphics processors whose current task number is less than the task number threshold.
  • the node manager can determine the current number of tasks of the graphics processor that is not currently used by the target development environment, and determine the number of graphics processors whose current number of tasks is less than the task number threshold; if the number of graphics processors whose current number of tasks is less than the task number threshold is less than the number of graphics processors required for the target training task, then the graphics processor resources are allocated to the target training task from the graphics processors whose current number of tasks is less than the task number threshold and the currently used graphics processors of the target development environment.
  • if the number of graphics processors whose current number of tasks is less than the task number threshold is not less than the number of graphics processors required for the target training task, the graphics processor resources are allocated to the target training task from the graphics processors whose current number of tasks is less than the task number threshold.
  • the node manager can first determine the current task number of the graphics processors that are not currently used by the target development environment, and determine the number of graphics processors whose current task number is less than the task number threshold; then, the node manager can compare the number of graphics processors whose current task number is less than the task number threshold with the number of graphics processors required for the target training task.
  • if the number of GPUs whose current number of tasks is less than the task number threshold is less than the number of GPUs required for the target training task, this may indicate that the GPUs not used by the target development environment cannot meet the GPU resource requirements of the target training task.
  • the node manager can allocate graphics processor resources to the target training task from the graphics processors whose current task number is less than the task number threshold and the currently used graphics processors of the target development environment, so as to complete the target training task through multiple graphics processors as much as possible.
  • if the number of GPUs whose current number of tasks is less than the task number threshold is not less than the number of GPUs required for the target training task, it means that the GPUs not used by the target development environment can meet the GPU resource requirements of the target training task.
  • the node manager can allocate graphics processor resources to the target training task from the graphics processors whose current task number is less than the task number threshold, so as to allocate the new graphics processor to the target training task.
  • the number of currently used graphics processors is equal to the target graphics processor quota number, it can be indicated that there is no quota left for the number of graphics processors of the target development node.
  • the node manager can allocate GPU resources for the target training task from the GPUs currently used by the target development environment.
  • the node manager is used to determine the number of graphics processors required for a target training task; if the required number of graphics processors is greater than the number of graphics processors currently in use, a scheduling failure message is generated; if the required number of graphics processors is not greater than the number of graphics processors currently in use, the graphics processors currently used by the target development environment are sorted according to the number of tasks, and graphics processor resources are allocated to the target training task from the top N graphics processors with the smallest number of tasks; wherein N is the number of graphics processors required for the target training task.
  • the node manager can determine the number of GPUs required for the target training task; if the required number of GPUs is greater than the number of GPUs currently in use, a scheduling failure message is generated; if the required number of GPUs is not greater than the number of GPUs currently in use,
  • the graphics processors currently used by the target development environment are sorted according to the number of tasks, and graphics processor resources are allocated to the target training task from the top N graphics processors with the smallest number of tasks; where N is the number of graphics processors required for the target training task.
  • the node manager may first determine the number of graphics processors required for the target training task; and then compare the number of graphics processors required for the target training task with the number of graphics processors currently in use in the target development environment.
  • the number of graphics processors required for the target training task is greater than the number of graphics processors currently in use in the target development environment, it may indicate that the target development environment cannot meet the resource requirements of the graphics processor of the target training task.
  • the node manager may generate scheduling failure information and return it to the user.
  • the number of graphics processors required for the target training task is not greater than the number of graphics processors currently in use in the target development environment, it can be indicated that the target development environment can meet the resource requirements of the graphics processor of the target training task.
  • the node manager can sort the graphics processors currently used by the target development environment according to the number of tasks, and allocate graphics processor resources to the target training task from the top N graphics processors with the smallest number of tasks; where N may refer to the number of graphics processors required for the target training task.
  • the node manager is used to determine the number of graphics processors to be reused based on the number of graphics processors required for the target training task and the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment; if the number of graphics processors currently used is less than the number of graphics processors to be reused, a scheduling failure message is generated; if the number of graphics processors currently used is not less than the number of graphics processors to be reused, the first M graphics processors with the smallest number of tasks among the currently used graphics processors of the target development environment and the graphics processors whose current number of tasks is less than a task number threshold are allocated graphics processor resources for the target training task; wherein M is the number of graphics processors to be reused.
  • the node manager can first determine the number of graphics processors that need to be reused based on the number of graphics processors required for the target training task and the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment. For example, the number of graphics processors required for the target training task can be subtracted from the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment to obtain the number of graphics processors that need to be reused.
  • the node manager may compare the relationship between the number of graphics processors that need to be reused and the number of currently used graphics processors of the target development environment.
  • the node manager can generate a scheduling failure message and send it to the user.
  • if the current number of used graphics processors is not less than the number of graphics processors to be reused, the target development environment can perform the target training task once new graphics processors have been allocated to it.
  • the node manager can first determine the top M graphics processors with the smallest number of tasks among the graphics processors currently used in the target development environment; then, the node manager can allocate graphics processor resources to the target training task from the top M graphics processors with the smallest number of tasks among the graphics processors currently used in the target development environment, and the graphics processors whose current number of tasks is less than the task number threshold.
  • the node manager may include an auxiliary module; the auxiliary module is used to record the allocation process of the node manager, generate and store allocation information of graphics processor resources for each development environment.
  • the auxiliary module can generate and store allocation information of the graphics processor resources for each development environment, so that the node manager can determine the allocation status of the graphics processor of each development environment based on the allocation information.
  • the target GPU quota also includes the target GPU memory quota capacity, which can be pre-entered by the user.
  • the node manager is further configured to allocate GPU memory capacity to the target training task according to the target GPU memory quota capacity.
  • when allocating graphics processor resources for a target training task according to a target graphics processor quota and a target graphics processor request, the following sub-step may also be included: allocating graphics processor memory capacity for the target training task according to the target graphics processor memory quota capacity.
  • the GPU memory capacity can be directly allocated to the target training task based on the target GPU memory quota capacity.
  • the allocation process can be implemented through the following steps:
  • the node manager is used to determine whether there are other training tasks in the target development environment when starting the target training task; if there are no other training tasks in the target development environment, the graphics processor video memory capacity is allocated to the target training task according to the target graphics processor video memory quota capacity; if there are other training tasks in the target development environment, the remaining graphics processor video memory quota capacity of the target development environment is determined according to the target training task, and the allocation information is updated according to the remaining graphics processor video memory quota capacity of the target development environment.
  • the node manager may first determine whether there are other training tasks in the target development environment.
  • the node manager can directly allocate GPU memory capacity to the target training task based on the target GPU memory quota capacity.
  • since the allocation is made directly against the target GPU video memory quota capacity, and the target GPU video memory quota capacity is the maximum GPU video memory capacity that the target development environment can use, no further video memory can be allocated.
  • the node manager may no longer allocate new GPU memory capacity for the target training task, but directly call the GPU memory capacity that has been previously allocated to the target development environment to run the target training task.
  • the graphics processor memory capacity of the target development environment is fixed; therefore, after the graphics processor memory capacity of the target development environment is allocated to the target training task, the remaining graphics processor memory quota capacity of the target development environment can be determined according to the target training task, and the allocation information can be updated according to the remaining graphics processor memory quota capacity of the target development environment.
  • the node manager can first determine the remaining GPU memory quota capacity of the target development environment based on the allocation information, and determine whether the target development environment can run the new training task normally based on the remaining GPU memory quota capacity.
  • the node manager is also used to recover the graphics processor resources allocated to the target training task when it detects that the target training task has ended.
  • the node manager can reclaim the graphics processor resources allocated for the target training task; at the same time, the auxiliary module can update the allocation information corresponding to the target development environment.
  • the node manager is further used to recycle the graphics processor resources allocated to the target training task when no heartbeat information for the target training task is received within a preset period.
  • when the client plug-in library detects that the target training task has started running, it can send heartbeat information for the target training task to the node manager according to a preset period.
  • the node manager does not receive the heartbeat information for the target training task within the preset period, it may indicate that the target training task may have an abnormality.
  • the graphics processor resources allocated to the target training task can be recovered; at the same time, the allocation information corresponding to the target development environment can be updated.
  • FIG. 7 shows a partial structural diagram of an artificial intelligence training system according to an embodiment of the present application:
  • the node manager and container communicate through the communication module; the two can communicate via UDP (User Datagram Protocol) or IPC (Inter-Process Communication).
  • after receiving a task sent by Jobs (a distributed task scheduling platform), the deep learning framework (for example: tf, pytorch) is started; the startup of the deep learning framework will trigger the loading of the client-plugin (client plug-in library); the client-plugin can send a message to the node manager to request graphics processors and wait for the node manager to allocate them.
  • the training task can be trained based on the allocated GPU resources; at this time, the client-plugin can report the heartbeat to the node manager and continuously update the duration of the task.
  • the client-plugin can perform subsequent operations of the task, such as sending a task done message to the node manager.
  • after receiving the message requesting graphics processors, the node manager can process the information and manage the graphics processor resources so as to allocate graphics processor resources for training tasks; after obtaining the allocation strategy, it can respond to the message and send a response to the client-plugin's graphics processor request.
  • the node manager can also provide timeout management. Specifically, the node manager can determine whether it is necessary to reclaim the graphics processor resources allocated for a training task based on whether it receives heartbeat information for the training task within the preset period. For example, if the node manager sets the preset period to 5s and detects that no heartbeat information reported by the training task has been received within 5s, the task is considered to have timed out, and the graphics processor resources allocated for it can be reclaimed.
  • the embodiment of the present application provides an artificial intelligence training system, including at least one node, a node manager and a client plug-in library, wherein multiple development environments are created in each node; the client plug-in library is used to generate a target graphics processor request after redirecting the loading process of the target deep learning framework of the target training task when detecting the start of the target training task in the target development environment; the target graphics processor request is sent to the node manager; the node manager is used to allocate graphics processor resources to the target training task in response to the target graphics processor request; the target deep learning framework uses the graphics processor resources allocated by the node manager to perform artificial intelligence training.
  • through the present application, it is realized that, by starting from the perspective of the deep learning framework and analyzing the loading logic of the deep learning framework when a training task is started, dynamic graphics processor sharing is achieved by hijacking the framework.
  • the method is simple to implement and does not require framework modification. Users will not notice it when using it. It is as flexible as the default GPU sharing mode.
  • the graphics processor resources are untied from the development environment.
  • the graphics processor resources are allocated only when the user actually starts the training task. This solves the problem of users occupying graphics processor resources in vain in the pre-allocation mode, efficiently utilizes the node graphics processor resources, and improves the overall utilization rate of the node's graphics processor resources.
  • the artificial intelligence training system includes a client plug-in library and at least one node, and multiple development environments are created in each node.
  • modules may be included:
  • the request module 801 is used to obtain a target graphics processor request from the client plug-in library when it is detected that a target training task in the target development environment is started; wherein the target graphics processor request is generated by the client plug-in library after redirecting the loading process of the target deep learning framework of the target training task when it is detected that the target training task in the target development environment is started;
  • the allocation module 802 is used to allocate GPU resources to the target training task in response to the target GPU request.
  • the allocation module 802 is used to determine the target GPU quota pre-configured for the target development environment; and allocate GPU resources to the target training task according to the target GPU quota and the target GPU request.
  • the device further comprises:
  • the quota receiving module is used to receive the GPU quota input by the user for each development environment.
  • the target node corresponding to the target development environment deploys corresponding graphics processor resources
  • the target graphics processor quota includes the target graphics processor quota number
  • the allocation module 802 is used to determine the number of graphics processors currently used by the target development environment; if the number of graphics processors currently used is less than the target graphics processor quota number, then the graphics processor resources are allocated to the target training task from the graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment; if the number of graphics processors currently used is equal to the target graphics processor quota number, then the graphics processor resources are allocated to the target training task from the graphics processors currently used by the target development environment.
  • the allocation module 802 is used to determine the number of graphics processors required for the target training task; if the required number of graphics processors is greater than the number of graphics processors currently in use, a scheduling failure message is generated; if the required number of graphics processors is not greater than the number of graphics processors currently in use, the graphics processors currently used by the target development environment are sorted according to the number of tasks, and graphics processor resources are allocated to the target training task from the top N graphics processors with the smallest number of tasks; wherein N is the number of graphics processors required for the target training task.
  • the allocation module 802 is used to determine the current task number of the graphics processor that is not currently used by the target development environment, and determine the number of graphics processors whose current task number is less than the task number threshold; if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, then allocate graphics processor resources to the target training task from the graphics processors whose current task number is less than the task number threshold and the currently used graphics processors of the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, then allocate graphics processor resources to the target training task from the graphics processors whose current task number is less than the task number threshold.
  • the allocation module 802 is used to determine the number of graphics processors to be reused based on the number of graphics processors required for the target training task and the number of graphics processors in the graphics processor resources corresponding to the target node that are not currently used by the target development environment; if the number of graphics processors currently used is less than the number of graphics processors to be reused, a scheduling failure message is generated; if the number of graphics processors currently used is not less than the number of graphics processors to be reused, the first M graphics processors with the smallest number of tasks among the currently used graphics processors of the target development environment and the graphics processors whose current number of tasks is less than the task number threshold are allocated graphics processor resources for the target training task; wherein M is the number of graphics processors to be reused.
  • the device further comprises:
  • the allocation information generation module is used to generate and store allocation information of graphics processor resources for each development environment.
  • the target graphics processor quota also includes the target graphics processor video memory quota capacity, and the allocation module 802 is further used to allocate the graphics processor video memory capacity to the target training task according to the target graphics processor video memory quota capacity.
  • the allocation module 802 is used to determine whether there are other training tasks in the target development environment when the target training task is started; if there are no other training tasks in the target development environment, the graphics processor video memory capacity is allocated to the target training task according to the target graphics processor video memory quota capacity; if there are other training tasks in the target development environment, the remaining graphics processor video memory quota capacity of the target development environment is determined according to the target training task, and the allocation information is updated according to the remaining graphics processor video memory quota capacity of the target development environment.
  • the device further comprises:
  • the first recycling module is used to recycle the graphics processor resources allocated to the target training task when it is detected that the target training task is finished.
  • the device further comprises:
  • the second recycling module is used to recycle the graphics processor resources allocated to the target training task when no heartbeat information for the target training task is received within a preset period.
  • in the embodiment of the present application, when the target training task in the target development environment is detected to be started, the loading of the client plug-in library will be triggered; thus, the client plug-in library can redirect the loading process of the target deep learning framework of the target training task to hijack the startup process of the deep learning framework, and generate a target graphics processor request in this process to request the allocation of graphics processor resources for the target training task.
  • the embodiment of the present application starts from the perspective of the deep learning framework, analyzes the loading logic of the deep learning framework when the training task is started, and realizes dynamic graphics processor sharing by hijacking the framework.
  • the method is simple to implement and does not require modification of the framework.
  • the user has no perception when using it, and it is as flexible as the default graphics processor sharing mode.
  • the graphics processor resources are untied from the development environment, and the graphics processor resources are allocated only when the user actually starts the training task, which solves the problem of users occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node graphics processor resources, and improves the utilization rate of the overall graphics processor resources of the node.
  • An embodiment of the present application also provides an electronic device, as shown in Figure 9, the electronic device 9 includes a processor 901, a memory 902, and a computer program stored in the memory 902 and capable of running on the processor, and the computer program implements the above resource allocation method when executed by the processor.
  • the embodiment of the present application further provides a non-volatile computer-readable storage medium, as shown in FIG. 10, on which a computer program 1001 is stored, and when the computer program 1001 is executed by a processor, the resource allocation method described above is implemented.
  • as for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple; for relevant details, please refer to the partial description of the method embodiment.
  • the embodiments of the present application may be provided as methods, devices, or computer program products. Therefore, the embodiments of the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment in combination with software and hardware. Moreover, the embodiments of the present application may adopt the form of a computer program product implemented in one or more non-volatile computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • each process and/or box in the flowchart and/or block diagram, and the combination of the process and/or box in the flowchart and/or block diagram can be realized by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a device for realizing the function specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device so that a series of operating steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Generation (AREA)

Abstract

Embodiments of the present application provide a resource allocation method and apparatus and an artificial intelligence training system. The method includes: when it is detected that a target training task in a target development environment is started, the loading of a client plug-in library is triggered; the client plug-in library can then redirect the loading process of the target deep learning framework of the target training task so as to hijack the startup process of the deep learning framework, and in this process generate a target graphics processor request to request the allocation of graphics processor resources for the target training task. Compared with the prior art, the embodiments of the present application start from the perspective of the deep learning framework, analyze the loading logic of the deep learning framework when a training task is started, and achieve dynamic graphics processor sharing by hijacking the framework; the method is simple to implement, requires no modification of the framework, is imperceptible to the user, and is as flexible as the default graphics processor sharing mode.

Description

A resource allocation method and apparatus, and an artificial intelligence training system
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the China Patent Office on November 28, 2022, with application number 202211498123.3 and entitled "Resource allocation method and apparatus, and artificial intelligence training system", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of artificial intelligence training, and in particular to a resource allocation method and apparatus and an artificial intelligence training system.
Background
In the era of artificial intelligence and big data, with the continuous growth of data volume and the development of deep learning algorithms, the demand for computing power is ever higher, and various categories of high-performance devices have emerged, such as general-purpose graphics processors. Graphics processors offer strong computing power but are expensive, and in real applications various constraints keep their overall utilization low; compared with the high price of graphics processors, enterprises and users care more about utilization, so the industry has explored this problem extensively, with graphics processor sharing (reuse) being the most prominent direction.
In the default bare-metal dynamic graphics processor sharing mode, when users submit tasks, no check is made on whether resources are sufficient, and tasks of multiple users preempt graphics processor resources from one another. Taking graphics processor video memory as an example, if the remaining video memory is insufficient, an OOM (Out Of Memory, the error code produced when memory or video memory runs out) occurs, which may cause all of the users' tasks to fail and exit.
To solve the OOM problem caused by resource preemption, the industry usually uses k8s (Kubernetes, an open-source system for automatically deploying, scaling, and managing containerized applications) or container management systems to manage graphics processor resources, combined with pods (nodes) or containers. When a graphics processor task is started, the management system first judges whether the idle graphics processor resources satisfy the graphics processor resources required by the task; if so, according to the scheduling strategy, it creates a running container on the corresponding node, allocates graphics processor resources to that container, and then runs the graphics processor task; after the task ends, the container and the corresponding graphics processor resources are reclaimed.
In this mode, tasks can only be started through the interface and not from the underlying command line; once a user operates at the underlying level, the graphics processor resources fall into confusion. Moreover, this mode requires creating a container each time a task is started, which is inconvenient for users editing and debugging scripts. For this reason, the industry usually adopts the development environment approach to share graphics processors.
A user first needs to create a development environment, binding CPU (Central Processing Unit) and graphics processor resources to the user's development environment; once the development environment is created, the corresponding graphics processor resources are bound — this is the pre-allocation mode. In this mode the user can log in to the development environment in a shell manner (referring to script programs written in shell), which makes development and debugging convenient; but the biggest drawback of this mode is poor flexibility and a high probability of idle graphics processor resources, because pre-allocated graphics processor resources cannot be used by other users even when their owner is not using them, resulting in low utilization of the node's overall graphics processor resources.
Summary
In view of the above problems, a resource allocation method and apparatus and an artificial intelligence training system are proposed in order to overcome the above problems or at least partially solve them, including:
A resource allocation method, applied to an artificial intelligence training system, the artificial intelligence training system including a client plug-in library and at least one node, with multiple development environments created in each node, the method including:
when it is detected that a target training task in a target development environment is started, obtaining a target graphics processor request from the client plug-in library; wherein the target graphics processor request is generated by the client plug-in library, upon detecting that the target training task in the target development environment is started, after redirecting the loading process of the target deep learning framework of the target training task;
in response to the target graphics processor request, allocating graphics processor resources for the target training task.
In some embodiments of the present application, allocating graphics processor resources for the target training task in response to the target graphics processor request includes:
determining a target graphics processor quota pre-configured for the target development environment;
allocating graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request.
In some embodiments of the present application, the method further includes:
receiving the graphics processor quota input by the user for each development environment.
In some embodiments of the present application, the target node corresponding to the target development environment is deployed with corresponding graphics processor resources, the target graphics processor quota includes a target graphics processor quota number, and allocating graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request includes:
determining the current number of used graphics processors of the target development environment;
if the current number of used graphics processors is less than the target graphics processor quota number, allocating graphics processor resources for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment;
if the current number of used graphics processors is equal to the target graphics processor quota number, allocating graphics processor resources for the target training task from the graphics processors currently used by the target development environment.
In some embodiments of the present application, allocating graphics processor resources for the target training task from the graphics processors currently used by the target development environment includes:
determining the number of graphics processors required for the target training task;
if the required number of graphics processors is greater than the current number of used graphics processors, generating scheduling failure information;
if the required number of graphics processors is not greater than the current number of used graphics processors, sorting the graphics processors currently used by the target development environment by task number, and allocating graphics processor resources for the target training task from the top N graphics processors with the smallest number of tasks;
wherein N is the number of graphics processors required for the target training task.
In some embodiments of the present application, allocating graphics processor resources for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment includes:
determining the current task number of the graphics processors not currently used by the target development environment, and determining the number of graphics processors whose current task number is less than a task number threshold;
if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, allocating graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment;
if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, allocating graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold.
In some embodiments of the present application, allocating graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment includes:
determining a reuse graphics processor number according to the number of graphics processors required for the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment;
if the current number of used graphics processors is less than the reuse graphics processor number, generating scheduling failure information;
if the current number of used graphics processors is not less than the reuse graphics processor number, allocating graphics processor resources for the target training task from the top M graphics processors with the smallest number of tasks among the graphics processors currently used by the target development environment and from the graphics processors whose current task number is less than the task number threshold;
wherein M is the reuse graphics processor number.
In some embodiments of the present application, the method further includes:
generating and storing allocation information of graphics processor resources for each development environment.
In some embodiments of the present application, the target graphics processor quota further includes a target graphics processor video memory quota capacity, and allocating graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request further includes:
allocating graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity.
In some embodiments of the present application, allocating graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity includes:
judging whether other training tasks already exist in the target development environment when the target training task is started;
if there is no other training task in the target development environment, allocating graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity;
if other training tasks already exist in the target development environment, determining the remaining graphics processor video memory quota capacity of the target development environment according to the target training task, and updating the allocation information according to the remaining graphics processor video memory quota capacity of the target development environment.
In some embodiments of the present application, the method further includes:
when it is detected that the target training task has ended, reclaiming the graphics processor resources allocated for the target training task.
In some embodiments of the present application, the method further includes:
when no heartbeat information for the target training task is received within a preset period, reclaiming the graphics processor resources allocated for the target training task.
An embodiment of the present application further provides an artificial intelligence training system, including at least one node, a node manager, and a client plug-in library, with multiple development environments created in each node;
the client plug-in library is configured to, upon detecting that a target training task in a target development environment is started, generate a target graphics processor request after redirecting the loading process of the target deep learning framework of the target training task, and send the target graphics processor request to the node manager;
the node manager is configured to allocate graphics processor resources for the target training task in response to the target graphics processor request;
the target deep learning framework performs artificial intelligence training using the graphics processor resources allocated by the node manager.
In some embodiments of the present application, the node manager is configured to determine a target graphics processor quota pre-configured for the target development environment, and to allocate graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request.
In some embodiments of the present application, the node manager includes a graphics processor management module;
the graphics processor management module is configured to store the graphics processor quota input by the user for each development environment.
In some embodiments of the present application, the target node corresponding to the target development environment is deployed with corresponding graphics processor resources, and the target graphics processor quota includes a target graphics processor quota number;
the node manager is configured to determine the current number of used graphics processors of the target development environment; if the current number of used graphics processors is less than the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is equal to the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors currently used by the target development environment.
In some embodiments of the present application, the node manager is configured to determine the number of graphics processors required for the target training task; if the required number of graphics processors is greater than the current number of used graphics processors, generate scheduling failure information; if the required number of graphics processors is not greater than the current number of used graphics processors, sort the graphics processors currently used by the target development environment by task number and allocate graphics processor resources for the target training task from the top N graphics processors with the smallest number of tasks, where N is the number of graphics processors required for the target training task.
In some embodiments of the present application, the node manager is configured to determine the current task number of the graphics processors not currently used by the target development environment, and determine the number of graphics processors whose current task number is less than a task number threshold; if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold.
In some embodiments of the present application, the node manager is configured to determine a reuse graphics processor number according to the number of graphics processors required for the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is less than the reuse graphics processor number, generate scheduling failure information; if the current number of used graphics processors is not less than the reuse graphics processor number, allocate graphics processor resources for the target training task from the top M graphics processors with the smallest number of tasks among the graphics processors currently used by the target development environment and from the graphics processors whose current task number is less than the task number threshold, where M is the reuse graphics processor number.
In some embodiments of the present application, the node manager includes an auxiliary module;
the auxiliary module is configured to record the node manager's allocation process, and to generate and store allocation information of graphics processor resources for each development environment.
In some embodiments of the present application, the target graphics processor quota further includes a target graphics processor video memory quota capacity;
the node manager is further configured to allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity.
In some embodiments of the present application, the node manager is configured to judge whether other training tasks already exist in the target development environment when the target training task is started; if there is no other training task in the target development environment, allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity; if other training tasks already exist in the target development environment, determine the remaining graphics processor video memory quota capacity of the target development environment according to the target training task, and update the allocation information according to the remaining graphics processor video memory quota capacity of the target development environment.
In some embodiments of the present application, the node manager is further configured to reclaim the graphics processor resources allocated for the target training task when it detects that the target training task has ended.
In some embodiments of the present application, the node manager is further configured to reclaim the graphics processor resources allocated for the target training task when no heartbeat information for the target training task is received within a preset period.
An embodiment of the present application further provides a resource allocation apparatus, applied to an artificial intelligence training system, the artificial intelligence training system including a client plug-in library and at least one node, with multiple development environments created in each node, the apparatus including:
a request module, configured to, when it is detected that a target training task in a target development environment is started, obtain a target graphics processor request from the client plug-in library; wherein the target graphics processor request is generated by the client plug-in library, upon detecting that the target training task in the target development environment is started, after redirecting the loading process of the target deep learning framework of the target training task;
an allocation module, configured to allocate graphics processor resources for the target training task in response to the target graphics processor request.
In some embodiments of the present application, the allocation module is configured to determine a target graphics processor quota pre-configured for the target development environment, and to allocate graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request.
In some embodiments of the present application, the apparatus further includes:
a quota receiving module, configured to receive the graphics processor quota input by the user for each development environment.
In some embodiments of the present application, the target node corresponding to the target development environment is deployed with corresponding graphics processor resources, the target graphics processor quota includes a target graphics processor quota number, and the allocation module is configured to determine the current number of used graphics processors of the target development environment; if the current number of used graphics processors is less than the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is equal to the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors currently used by the target development environment.
In some embodiments of the present application, the allocation module is configured to determine the number of graphics processors required for the target training task; if the required number of graphics processors is greater than the current number of used graphics processors, generate scheduling failure information; if the required number of graphics processors is not greater than the current number of used graphics processors, sort the graphics processors currently used by the target development environment by task number and allocate graphics processor resources for the target training task from the top N graphics processors with the smallest number of tasks, where N is the number of graphics processors required for the target training task.
In some embodiments of the present application, the allocation module is configured to determine the current task number of the graphics processors not currently used by the target development environment, and determine the number of graphics processors whose current task number is less than a task number threshold; if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold.
In some embodiments of the present application, the allocation module is configured to determine a reuse graphics processor number according to the number of graphics processors required for the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is less than the reuse graphics processor number, generate scheduling failure information; if the current number of used graphics processors is not less than the reuse graphics processor number, allocate graphics processor resources for the target training task from the top M graphics processors with the smallest number of tasks among the graphics processors currently used by the target development environment and from the graphics processors whose current task number is less than the task number threshold, where M is the reuse graphics processor number.
In some embodiments of the present application, the apparatus further includes:
an allocation information generation module, configured to generate and store allocation information of graphics processor resources for each development environment.
In some embodiments of the present application, the target graphics processor quota further includes a target graphics processor video memory quota capacity, and the allocation module is further configured to allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity.
In some embodiments of the present application, the allocation module is configured to judge whether other training tasks already exist in the target development environment when the target training task is started; if there is no other training task in the target development environment, allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity; if other training tasks already exist in the target development environment, determine the remaining graphics processor video memory quota capacity of the target development environment according to the target training task, and update the allocation information according to the remaining graphics processor video memory quota capacity of the target development environment.
In some embodiments of the present application, the apparatus further includes:
a first reclamation module, configured to reclaim the graphics processor resources allocated for the target training task when it is detected that the target training task has ended.
In some embodiments of the present application, the apparatus further includes:
a second reclamation module, configured to reclaim the graphics processor resources allocated for the target training task when no heartbeat information for the target training task is received within a preset period.
An embodiment of the present application further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor; when executed by the processor, the computer program implements the above resource allocation method.
An embodiment of the present application further provides a non-volatile computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the above resource allocation method.
The embodiments of the present application have the following advantages:
In the embodiments of the present application, when it is detected that a target training task in a target development environment is started, the loading of the client plug-in library is triggered; the client plug-in library can then redirect the loading process of the target deep learning framework of the target training task so as to hijack the startup process of the deep learning framework, and in this process generate a target graphics processor request to request the allocation of graphics processor resources for the target training task. Compared with the prior art, the embodiments of the present application start from the perspective of the deep learning framework, analyze the loading logic of the deep learning framework when a training task is started, and achieve dynamic graphics processor sharing by hijacking the framework; the method is simple to implement, requires no modification of the framework, is imperceptible to the user, and is as flexible as the default graphics processor sharing mode. Moreover, based on the dynamic graphics processor sharing logic, graphics processor resources are unbound from development environments, and graphics processor resources are allocated only when the user actually starts a training task, which solves the problem of users idly occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node's graphics processor resources, and improves the utilization of the node's overall graphics processor resources.
Brief description of the drawings
In order to explain the technical solutions of the present application more clearly, the drawings needed in the description of the present application are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a step flowchart of a resource allocation method according to an embodiment of the present application;
FIG. 2 is a step flowchart of another resource allocation method according to an embodiment of the present application;
FIG. 3 is a step flowchart of graphics processor scheduling according to an embodiment of the present application;
FIG. 4 is a step flowchart of another graphics processor scheduling according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of yet another graphics processor scheduling according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an artificial intelligence training system according to an embodiment of the present application;
FIG. 7 is a partial schematic structural diagram of an artificial intelligence training system according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a resource allocation apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a non-volatile computer-readable storage medium according to an embodiment of the present application.
Detailed description
To make the above objects, features, and advantages of the present application more obvious and understandable, the present application is further described in detail below in combination with the drawings and specific embodiments. Obviously, the described embodiments are some rather than all of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
To improve the utilization of graphics processor resources, an embodiment of the present application provides a resource allocation method that can be applied to an artificial intelligence training system; the artificial intelligence training system can be used for task training, for example, training for image recognition.
The artificial intelligence training system may include a client plug-in library and at least one node, with multiple development environments created in each node; the client plug-in library can be used for functions such as requesting graphics processor resources and keeping tasks alive.
When it is detected that a target training task in a target development environment is started, the loading of the client plug-in library is triggered; the client plug-in library can then redirect the loading process of the target deep learning framework of the target training task so as to hijack the startup process of the deep learning framework, and in this process generate a target graphics processor request to request the allocation of graphics processor resources for the target training task.
Compared with the prior art, the embodiments of the present application start from the perspective of the deep learning framework, analyze the loading logic of the deep learning framework when a training task is started, and achieve dynamic graphics processor sharing by hijacking the framework; the method is simple to implement, requires no modification of the framework, is imperceptible to the user, and is as flexible as the default graphics processor sharing mode.
Moreover, based on the dynamic graphics processor sharing logic, graphics processor resources are unbound from development environments, and graphics processor resources are allocated only when the user actually starts a training task, which solves the problem of users idly occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node's graphics processor resources, and improves the utilization of the node's overall graphics processor resources.
Referring to FIG. 1, a step flowchart of a resource allocation method according to an embodiment of the present application is shown; the method may be applied to an artificial intelligence training system, which may include a client plug-in library and at least one node, with multiple development environments created in each node.
Specifically, the method may include the following steps:
Step 101: when it is detected that a target training task in a target development environment is started, obtain a target graphics processor request from the client plug-in library; wherein the target graphics processor request is generated by the client plug-in library, upon detecting that the target training task in the target development environment is started, after redirecting the loading process of the target deep learning framework of the target training task.
The client plug-in library can be set according to the actual situation; it can be used for functions such as requesting graphics processor resources and keeping tasks alive, and can also be used for other functions, for example, sending training task completion messages, which is not limited in the embodiments of the present application.
The target training task may be a training task initiated by the user in the target development environment, for example, a training task for image recognition.
The target deep learning framework may be a deep learning framework chosen by the user according to requirements, for example, caffe (a deep learning framework written in the C++ language), TensorFlow (a deep learning framework, an open-source software designed entirely on the Python language), pytorch (a fully featured framework for building deep learning models, a kind of machine learning commonly used in applications such as image recognition and language processing), or mxnet (an open-source deep learning framework that allows users to define, train, and deploy deep neural networks on a variety of devices, whether cloud infrastructure or mobile devices), which is not limited in the embodiments of the present application.
In practical applications, to improve the utilization of the node's overall graphics processor resources, a client plug-in library can be set in the artificial intelligence training system; when it is detected that the target training task in the target development environment is started, the loading of this client plug-in library is triggered.
At this point, the client plug-in library can redirect the loading process of the target deep learning framework and generate a corresponding target graphics processor request.
In some embodiments of the present application, the target graphics processor request may include the graphics processor resources required by the target training task, the identifier of the target development environment, and the like, which is not limited in the embodiments of the present application.
After generating the target graphics processor request, the client plug-in library can send it to the node manager in the artificial intelligence training system; the node manager can be used for the allocation of graphics processor resources.
Step 102: in response to the target graphics processor request, allocate graphics processor resources for the target training task.
After receiving the target graphics processor request, the node manager can, in response to it, allocate graphics processor resources for the target training task, for example, the graphics processors and the number of graphics processors to be allocated for the target training task, the video memory capacity to be allocated for the target training task, and the like, which is not limited in the embodiments of the present application.
In this way, graphics processor resources are allocated only when the user actually starts a training task, which solves the problem of users idly occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node's graphics processor resources, and improves the utilization of the node's overall graphics processor resources.
In some embodiments of the present application, after graphics processor resources are allocated for the target training task, the target deep learning framework can perform artificial intelligence training using the graphics processor resources allocated by the node manager for the target training task.
In some embodiments of the present application, when it is detected that the target training task in the target development environment is started, the loading of the client plug-in library is triggered; the client plug-in library can then redirect the loading process of the target deep learning framework of the target training task so as to hijack the startup process of the deep learning framework, and in this process generate a target graphics processor request to request the allocation of graphics processor resources for the target training task. Compared with the prior art, the embodiments of the present application start from the perspective of the deep learning framework, analyze the loading logic of the deep learning framework when a training task is started, and achieve dynamic graphics processor sharing by hijacking the framework; the method is simple to implement, requires no modification of the framework, is imperceptible to the user, and is as flexible as the default graphics processor sharing mode. Moreover, based on the dynamic graphics processor sharing logic, graphics processor resources are unbound from development environments, and graphics processor resources are allocated only when the user actually starts a training task, which solves the problem of users idly occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node's graphics processor resources, and improves the utilization of the node's overall graphics processor resources.
Referring to FIG. 2, a step flowchart of another resource allocation method according to an embodiment of the present application is shown, which may include the following steps:
Step 201: when it is detected that a target training task in a target development environment is started, obtain a target graphics processor request from the client plug-in library.
In practical applications, to improve the utilization of the node's overall graphics processor resources, a client plug-in library can be set in the artificial intelligence training system; when it is detected that the target training task in the target development environment is started, the loading of this client plug-in library is triggered.
At this point, the client plug-in library can redirect the loading process of the target deep learning framework and generate a corresponding target graphics processor request.
After generating the target graphics processor request, the client plug-in library can send it to the node manager in the artificial intelligence training system.
Step 202: determine a target graphics processor quota pre-configured for the target development environment.
The target graphics processor quota may be the maximum graphics processor resources usable by the development environment, set in advance by the user for the target development environment, for example, the maximum usable number of graphics processors, video memory size, and the like, which is not limited in the embodiments of the present application.
In some embodiments of the present application, the graphics processor quota input by the user for each development environment can be received in advance.
In practical applications, one of the preconditions for graphics processor scheduling is knowing the graphics processor resources required by the training task, for example, the number of graphics processors and the video memory size. However, when the training task is started, such information may be unknown to the node manager, especially the required graphics processor video memory size.
Based on this, when the development environment is created, the graphics processor quota input by the user for the development environment can be received; the node manager can then store it, so that when graphics processor resources subsequently need to be allocated for the target training task in the target development environment, the node manager can also allocate graphics processor resources based on the graphics processor quota pre-configured for the development environment.
Step 203: allocate graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request.
After the target graphics processor quota for the target development environment and the target graphics processor request are obtained, graphics processor resources can be allocated for the target training task based on the two.
For example, if the target graphics processor request includes information about the graphics processor resources required by the target training task, graphics processor resources can be allocated for the target training task directly according to the target graphics processor request; note, however, that if the graphics processor resources required by the target training task exceed the target graphics processor quota, this indicates that the target development environment cannot perform the target training task, in which case scheduling failure information can be generated and returned to the user.
If the target graphics processor request does not include information about the graphics processor resources required by the target training task, graphics processor resources can be allocated for the target training task based on the target graphics processor quota, which is not limited in the embodiments of the present application.
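As an illustration only, the top-level decision of step 203 might look like the following Python sketch; the GpuRequest/GpuQuota structures and their field names are assumptions introduced for this example, not part of the present application.

```python
# Minimal sketch of the step-203 decision, assuming a simple in-memory model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GpuRequest:
    env_id: str
    num_gpus: Optional[int] = None   # None: the request carries no resource info

@dataclass
class GpuQuota:
    max_gpus: int
    max_mem_gb: int

def handle_request(req: GpuRequest, quota: GpuQuota) -> dict:
    if req.num_gpus is not None:
        # The request names its own GPU demand: reject it outright if the
        # demand exceeds the quota pre-configured for the environment.
        if req.num_gpus > quota.max_gpus:
            return {"ok": False, "reason": "scheduling failed: quota exceeded"}
        return {"ok": True, "num_gpus": req.num_gpus}
    # Otherwise fall back to the environment's pre-configured quota.
    return {"ok": True, "num_gpus": quota.max_gpus}

print(handle_request(GpuRequest("env-1", 4), GpuQuota(max_gpus=2, max_mem_gb=5)))
```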
In an embodiment of the present application, the target node corresponding to the target development environment may be deployed with corresponding graphics processor resources (for example: three graphics processors, with a video memory size of 5G), and the target graphics processor quota includes a target graphics processor quota number. When allocating graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request, the allocation process can be implemented through the following sub-steps:
Sub-step 11: determine the current number of used graphics processors of the target development environment.
In practical applications, the target development environment may be performing not only the target training task but also other training tasks; in this case, the number of graphics processors already used by the target development environment can first be determined, that is, the number of graphics processors used by the training tasks currently being performed by the target development environment.
Sub-step 12: if the current number of used graphics processors is less than the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment.
After the current number of used graphics processors of the target development environment is determined, it can be compared with the target graphics processor quota number; if the current number of used graphics processors is less than the target graphics processor quota number, this indicates that some of the quota on the number of graphics processors for the target development node remains.
In this case, graphics processor resources can be allocated for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment, so that the target training task trains on new graphics processors.
In some embodiments of the present application, the allocation process of sub-step 12 can be implemented through the following steps:
Determine the current task number of the graphics processors not currently used by the target development environment, and determine the number of graphics processors whose current task number is less than a task number threshold; if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold.
Specifically, the current task number of the graphics processors not currently used by the target development environment can first be determined, along with the number of graphics processors whose current task number is less than the task number threshold; then, the number of graphics processors whose current task number is less than the task number threshold can be compared with the number of graphics processors required for the target training task. The task number threshold can be set according to the actual situation, which is not limited in the embodiments of the present application.
If the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, this indicates that the graphics processors not used by the target development environment cannot meet the graphics processor resource requirements of the target training task.
In this case, graphics processor resources can be allocated for the target training task both from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment, so that the target training task is completed through multiple graphics processors as far as possible.
If instead the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, this indicates that the graphics processors not used by the target development environment can meet the graphics processor resource requirements of the target training task.
In this case, graphics processor resources can be allocated for the target training task from the graphics processors whose current task number is less than the task number threshold, so that new graphics processors are allocated to the target training task.
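A minimal sketch of the sub-step-12 branching just described, assuming each graphics processor is represented as a (gpu_id, task_count) pair; the threshold value and all names are illustrative assumptions.

```python
# Illustrative sketch of sub-step 12: prefer lightly loaded new GPUs, and top
# up from the environment's own GPUs only when the new pool falls short.
def pick_unused_gpus(unused, used, needed, task_threshold):
    """unused/used: lists of (gpu_id, task_count) tuples."""
    light = [g for g in unused if g[1] < task_threshold]
    if len(light) >= needed:
        # Enough lightly loaded new GPUs: hand the task entirely new devices.
        return [g[0] for g in light[:needed]]
    # Not enough new GPUs below the threshold: top up from the GPUs the
    # development environment is already using (detailed in a later sketch).
    shortfall = needed - len(light)
    reused = sorted(used, key=lambda g: g[1])[:shortfall]
    return [g[0] for g in light] + [g[0] for g in reused]

print(pick_unused_gpus([("g0", 0), ("g1", 3)], [("g2", 1)], 2, task_threshold=2))
```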
Sub-step 13: if the current number of used graphics processors is equal to the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors currently used by the target development environment.
If the current number of used graphics processors is equal to the target graphics processor quota number, this indicates that none of the quota on the number of graphics processors for the target development node remains.
In this case, graphics processor resources can be allocated for the target training task from the graphics processors currently used by the target development environment.
In some embodiments of the present application, the allocation process of sub-step 13 can be implemented through the following steps:
Determine the number of graphics processors required for the target training task; if the required number of graphics processors is greater than the current number of used graphics processors, generate scheduling failure information; if the required number of graphics processors is not greater than the current number of used graphics processors, sort the graphics processors currently used by the target development environment by task number and allocate graphics processor resources for the target training task from the top N graphics processors with the smallest number of tasks, where N is the number of graphics processors required for the target training task.
Specifically, the number of graphics processors required for the target training task can first be determined; then, it is compared with the current number of used graphics processors of the target development environment.
If the number of graphics processors required for the target training task is greater than the current number of used graphics processors of the target development environment, this indicates that the target development environment cannot meet the graphics processor resource requirements of the target training task.
In this case, scheduling failure information can be generated and returned to the user.
If instead the number of graphics processors required for the target training task is not greater than the current number of used graphics processors of the target development environment, this indicates that the target development environment can meet the graphics processor resource requirements of the target training task.
In this case, the graphics processors currently used by the target development environment can be sorted by task number, and graphics processor resources can be allocated for the target training task from the top N graphics processors with the smallest number of tasks, where N may refer to the number of graphics processors required for the target training task.
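The quota-exhausted branch of sub-step 13 can be pictured with the following sketch; the tuple representation and function name are assumptions for illustration.

```python
# Sketch of the quota-exhausted branch: sort the environment's own GPUs by
# task count and reuse the N least-loaded ones.
def pick_from_used(used, needed):
    """used: list of (gpu_id, task_count); needed: N GPUs for the task."""
    if needed > len(used):
        raise RuntimeError("scheduling failed: environment has too few GPUs")
    ranked = sorted(used, key=lambda g: g[1])   # fewest tasks first
    return [g[0] for g in ranked[:needed]]

print(pick_from_used([("g0", 5), ("g1", 1), ("g2", 3)], needed=2))  # ['g1', 'g2']
```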
In an embodiment of the present application, when allocating graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment, the allocation process can be carried out through the following steps:
Determine a reuse graphics processor number according to the number of graphics processors required for the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is less than the reuse graphics processor number, generate scheduling failure information; if the current number of used graphics processors is not less than the reuse graphics processor number, allocate graphics processor resources for the target training task from the top M graphics processors with the smallest number of tasks among the graphics processors currently used by the target development environment and from the graphics processors whose current task number is less than the task number threshold, where M is the reuse graphics processor number.
Specifically, the number of graphics processors that need to be reused can first be determined according to the number of graphics processors required for the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; for example, the number of graphics processors not currently used by the target development environment among the graphics processor resources corresponding to the target node can be subtracted from the number of graphics processors required for the target training task, yielding the number of graphics processors that need to be reused.
Then, the number of graphics processors that need to be reused can be compared with the current number of used graphics processors of the target development environment.
If the current number of used graphics processors of the target development environment is less than the number of graphics processors that need to be reused, this indicates that the target development environment cannot perform the target training task; in this case, scheduling failure information can be generated and sent to the user.
If instead the current number of used graphics processors of the target development environment is not less than the number of graphics processors that need to be reused, this indicates that the target development environment can perform the target training task once new graphics processors are allocated.
Specifically, the top M graphics processors with the smallest number of tasks among the graphics processors currently used by the target development environment can first be determined; then, graphics processor resources can be allocated for the target training task from those top M graphics processors and from the graphics processors whose current task number is less than the task number threshold.
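A hedged sketch of this mixed branch, in which M graphics processors must be reused because the pool of new ones is too small; the data layout is assumed for illustration, not prescribed by the present application.

```python
# Sketch of the mixed branch: combine all lightly loaded new GPUs with the
# M least-loaded GPUs the environment is already using.
def pick_mixed(light_unused, used, needed):
    """light_unused: new GPUs below the task threshold; used: (id, tasks)."""
    reuse_count = needed - len(light_unused)          # M in the text
    if reuse_count > len(used):
        raise RuntimeError("scheduling failed: not enough GPUs to reuse")
    reused = sorted(used, key=lambda g: g[1])[:reuse_count]
    return [g[0] for g in light_unused] + [g[0] for g in reused]

print(pick_mixed([("g3", 0)], [("g0", 2), ("g1", 1)], needed=3))
```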
In practical applications, for the allocation process, allocation information of graphics processor resources for each development environment can also be generated and stored, so that the node manager can determine the graphics processor allocation status of each development environment based on the allocation information.
As shown in FIG. 3, a step flowchart of graphics processor scheduling according to an embodiment of the present application is shown:
When processing a graphics processor request, first obtain the number of graphics processors currently used by the development environment, denoted the first number; then, based on the development environment's quota, compute the number of new graphics processors currently available to the development environment (denoted the second number) = the development environment's quota − the first number.
If the second number = 0, it can further be judged whether the first number is less than the number of graphics processors required by the training task (denoted the third number). If the first number is less than the third number, scheduling fails. If the first number is not less than the third number, the graphics processors currently used by the target development environment are sorted by task number in ascending order to obtain a first set; then the first "third number" of graphics processors are selected from the first set to allocate graphics processor resources for the training task; at this point, scheduling succeeds.
If the second number is not equal to 0, the first "second number" of new graphics processors satisfying the condition (current task number less than the task number threshold) can be obtained, denoted a second set. If the number of graphics processors in the second set is greater than the third number, the first "third number" of graphics processors are selected directly from the second set to allocate graphics processor resources for the training task; at this point, scheduling succeeds. If the number of graphics processors in the second set is not greater than the third number, graphics processor resources are allocated for the training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the development environment.
Specifically, the graphics processors currently used by the target development environment can first be sorted by task number in ascending order to obtain a third set; then, according to the number of graphics processors required by the training task and the number of graphics processors, among the graphics processor resources corresponding to the node, not currently used by the development environment, the reuse graphics processor number is determined (denoted the fourth number) = the third number − the second number.
If the fourth number is greater than the number of graphics processors in the third set, scheduling fails; if the fourth number is not greater than the number of graphics processors in the third set, graphics processor resources are allocated for the training task from the second set and from the first "fourth number" of graphics processors of the third set.
In another embodiment of the present application, the target graphics processor quota further includes a target graphics processor video memory quota capacity, which may be input in advance by the user.
When allocating graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request, the following sub-step may also be included:
Sub-step 21: allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity.
In practical applications, the graphics processor video memory capacity required by the target training task is difficult to know; therefore, when allocating graphics processor video memory capacity for the target training task, the allocation can be made directly based on the target graphics processor video memory quota capacity.
To avoid repeated allocation, in an embodiment of the present application, when allocating graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity, the allocation process can be implemented through the following steps:
Judge whether other training tasks already exist in the target development environment when the target training task is started; if there is no other training task in the target development environment, allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity; if other training tasks already exist in the target development environment, determine the remaining graphics processor video memory quota capacity of the target development environment according to the target training task, and update the allocation information according to the remaining graphics processor video memory quota capacity of the target development environment.
When allocating graphics processor video memory capacity for the target training task, it can first be judged whether other training tasks already exist in the target development environment.
If, when the target training task is started, there is no other training task in the target development environment, graphics processor video memory capacity can be allocated for the target training task directly according to the target graphics processor video memory quota capacity.
If, when the target training task is started, other training tasks already exist in the target development environment, then since allocation is made directly according to the target graphics processor video memory quota capacity, which is the maximum graphics processor video memory capacity the target development environment can use, no further allocation can be made.
In this case, no new graphics processor video memory capacity is allocated for the target training task; instead, the graphics processor video memory capacity previously allocated to the target development environment is used directly to run the target training task.
In some embodiments of the present application, the graphics processor video memory capacity of the target development environment is fixed; therefore, after the graphics processor video memory capacity of the target development environment is allocated to the target training task, the remaining graphics processor video memory quota capacity of the target development environment can be determined according to the target training task, and the allocation information can be updated according to it.
Then, when the target development environment subsequently starts a new training task, the remaining graphics processor video memory quota capacity of the target development environment can first be determined according to the allocation information, and whether the target development environment can run the new training task normally can be judged based on the remaining capacity.
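The video memory accounting described above might be kept in a small ledger like the sketch below; granting the full quota capacity to the first task and nothing extra to later ones is a simplification of the scheme, and all names are assumptions.

```python
# Minimal sketch of the video-memory accounting; not the patent's data model.
class EnvMemoryLedger:
    def __init__(self, mem_quota_gb: int):
        self.mem_quota_gb = mem_quota_gb      # user-configured quota capacity
        self.remaining_gb = mem_quota_gb      # updated allocation information
        self.running_tasks = 0

    def start_task(self) -> int:
        if self.running_tasks == 0:
            # First task: grant the full quota capacity in one shot.
            self.remaining_gb = 0
            granted = self.mem_quota_gb
        else:
            # Later tasks share the memory already granted to the environment.
            granted = 0
        self.running_tasks += 1
        return granted

ledger = EnvMemoryLedger(5)
print(ledger.start_task(), ledger.start_task())  # 5 0
```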
As shown in FIG. 4, when allocating graphics processors, it is determined whether the currently allocated graphics processor is new; if so, the remaining graphics processor video memory quota capacity of the development environment is obtained from the video memory quota capacity corresponding to the development environment and the currently scheduled video memory capacity, the graphics processor is added to the development environment, and the training task's information is added under the corresponding graphics processor.
If it is not new, the training task's information is added under the corresponding graphics processor.
Then, it continues to judge whether the next allocated graphics processor is new, until there are no more allocated graphics processors. At this point, the graphics processors can be sorted in descending order of remaining resources, and the task information can be added to the timeout mechanism to judge whether the training task times out.
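A rough sketch of the FIG. 4 bookkeeping under stated assumptions: "remaining resources" is approximated by task count, and the record layout is invented purely for illustration.

```python
# Hedged sketch of per-allocation bookkeeping; field names are assumptions.
def record_allocation(env, allocated_gpus, task_id, sched_mem_gb, timeouts):
    """env: dict with 'gpus' {gpu_id: [task ids]} and 'mem_left_gb'."""
    for gpu_id in allocated_gpus:
        if gpu_id not in env["gpus"]:               # a newly granted GPU
            env["mem_left_gb"] -= sched_mem_gb      # deduct scheduled memory
            env["gpus"][gpu_id] = []
        env["gpus"][gpu_id].append(task_id)         # file the task under its GPU
    # Approximate "most remaining resources first" by fewest tasks first.
    env["gpus"] = dict(sorted(env["gpus"].items(), key=lambda kv: len(kv[1])))
    timeouts[task_id] = 0.0                         # register with the watchdog

env, timeouts = {"gpus": {}, "mem_left_gb": 5}, {}
record_allocation(env, ["g0"], "task-1", 5, timeouts)
print(env, timeouts)
```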
To ensure effective reclamation of graphics processor resources for use by other development environments or training tasks, an embodiment of the present application may further include one or more of the following steps of reclaiming graphics processor resources:
First: when it is detected that the target training task has ended, reclaim the graphics processor resources allocated for the target training task.
In practical applications, if it is detected that the target training task has ended, the graphics processor resources allocated for the target training task can be reclaimed; at the same time, the allocation information corresponding to the target development environment can be updated.
Second: when no heartbeat information for the target training task is received within a preset period, reclaim the graphics processor resources allocated for the target training task.
When the client plug-in library detects that the target training task has started running, it can send heartbeat information for the target training task to the node manager according to a preset period.
If the node manager does not receive heartbeat information for the target training task within the preset period, this may indicate that the target training task has become abnormal; in this case, the graphics processor resources allocated for the target training task can be reclaimed, and the allocation information corresponding to the target development environment can be updated.
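Both recovery paths can be pictured with a small watchdog, sketched below under assumptions (a monotonic clock and the 5 s period used as an example later in the text; the class and method names are illustrative).

```python
# Hedged sketch of the two recovery paths: explicit task-end and timeout.
import time

class Watchdog:
    def __init__(self, period_s: float = 5.0):
        self.period_s = period_s
        self.last_beat: dict[str, float] = {}    # task_id -> last heartbeat

    def heartbeat(self, task_id: str) -> None:
        self.last_beat[task_id] = time.monotonic()

    def task_done(self, task_id: str) -> None:
        self.last_beat.pop(task_id, None)        # reclaim on normal completion

    def reap_timed_out(self) -> list[str]:
        now = time.monotonic()
        dead = [t for t, ts in self.last_beat.items()
                if now - ts > self.period_s]
        for t in dead:
            del self.last_beat[t]                # reclaim GPU resources here
        return dead

wd = Watchdog()
wd.heartbeat("task-1")
print(wd.reap_timed_out())  # [] (the task is still alive)
```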
As shown in FIG. 5, a schematic flowchart of yet another graphics processor scheduling according to an embodiment of the present application is shown:
After a training task is started, the deep learning framework runs and triggers the client plug-in library logic; at this point, the graphics processor quota of the development environment can be obtained.
If the graphics processor quota is 0, it means that no training task in this development environment may use graphics processor resources, and this graphics processor operation exits with an error; alternatively, only the central processing unit can be used for training.
If the graphics processor quota is not 0, the number of graphics processors to be allocated for the training task can be determined.
If that number of graphics processors is 0, it means the training task cannot use graphics processors, and this graphics processor operation exits with an error; alternatively, only the central processing unit can be used for training.
If that number of graphics processors is not 0, the client plug-in library can create communication and heartbeat-reporting handles with the node manager, request graphics processors from the node manager, and wait for the node manager's response.
If there is a response, allocation succeeds, the training task runs, and when the training task ends, a task completion message is sent to the node manager.
If there is no response, allocation has failed; in this case, it can be judged whether the failure was due to insufficient resources. If so, the graphics processor request continues; if not, allocation fails and the task exits.
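The client-side loop of FIG. 5 might be approximated as follows; the message shapes, retry policy, and callables are assumptions made for illustration only.

```python
# Sketch of the client-plugin flow: request GPUs, retry on resource shortage,
# run the task, and report completion to the node manager.
import time

def run_training_task(send, recv, gpu_quota: int, gpus_needed: int) -> bool:
    """send/recv: callables abstracting the channel to the node manager."""
    if gpu_quota == 0 or gpus_needed == 0:
        return False                      # CPU-only training or error exit
    while True:
        send({"type": "request_gpu", "count": gpus_needed})
        reply = recv()
        if reply.get("ok"):
            break                         # allocation succeeded
        if reply.get("reason") != "insufficient_resources":
            return False                  # non-retryable failure: task exits
        time.sleep(1)                     # resources busy: keep requesting
    # ... training runs here, with periodic heartbeats to the node manager ...
    send({"type": "task_done"})
    return True

# Toy channel that grants the request immediately:
print(run_training_task(lambda m: None, lambda: {"ok": True}, 2, 1))
```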
In the embodiments of the present application, when it is detected that a target training task in a target development environment is started, a target graphics processor request is obtained from the client plug-in library; a target graphics processor quota pre-configured for the target development environment is determined; and graphics processor resources are allocated for the target training task according to the target graphics processor quota and the target graphics processor request. Based on oversubscription of the node's graphics processor resources (that is, pre-configuring graphics processor quotas for development environments), the problem of user development environments idly occupying graphics processor resources is solved, and the utilization of the node's graphics processor resources is improved.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations; however, those skilled in the art should know that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
Referring to FIG. 6, a schematic structural diagram of an artificial intelligence training system according to an embodiment of the present application is shown; the system may include at least one node, a node manager, and a client plug-in library, with multiple development environments created in each node.
The client plug-in library is configured to, upon detecting that a target training task in a target development environment is started, generate a target graphics processor request after redirecting the loading process of the target deep learning framework of the target training task, and send the target graphics processor request to the node manager.
In practical applications, to improve the utilization of the node's overall graphics processor resources, a client plug-in library can be set in the artificial intelligence training system; when it is detected that the target training task in the target development environment is started, the loading of this client plug-in library is triggered.
At this point, the client plug-in library can redirect the loading process of the target deep learning framework and generate a corresponding target graphics processor request.
After generating the target graphics processor request, the client plug-in library can send it to the node manager in the artificial intelligence training system.
In an embodiment of the present application, the node manager is configured to allocate graphics processor resources for the target training task in response to the target graphics processor request.
After receiving the target graphics processor request, the node manager can, in response to it, allocate graphics processor resources for the target training task.
As an example, the node manager can also be configured to determine a target graphics processor quota pre-configured for the target development environment, and to allocate graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request.
As another example, the node manager may include an auxiliary module configured to record the node manager's allocation process. Specifically, for the allocation process, the auxiliary module can generate and store allocation information of graphics processor resources for each development environment, so that the node manager can determine the graphics processor allocation status of each development environment based on the allocation information.
In an embodiment of the present application, the target deep learning framework performs artificial intelligence training using the graphics processor resources allocated by the node manager.
After graphics processor resources are allocated for the target training task, the target deep learning framework can perform artificial intelligence training using the graphics processor resources allocated by the node manager for the target training task.
In practical applications, one of the preconditions for graphics processor scheduling is knowing the graphics processor resources required by the training task, for example, the number of graphics processors and the video memory size. However, when the training task is started, such information may be unknown to the node manager, especially the required graphics processor video memory size.
Based on this, when the development environment is created, the graphics processor quota input by the user for the development environment can be received; the node manager can then store it, so that when graphics processor resources subsequently need to be allocated for the target training task in the target development environment, the node manager can also allocate graphics processor resources based on the graphics processor quota pre-configured for the development environment.
As an example, the node manager may include a graphics processor management module configured to store the graphics processor quota input by the user for each development environment.
After obtaining the target graphics processor quota for the target development environment and the target graphics processor request, the node manager can allocate graphics processor resources for the target training task based on the two.
If the target graphics processor request does not include information about the graphics processor resources required by the target training task, the node manager can allocate graphics processor resources for the target training task based on the target graphics processor quota.
In an embodiment of the present application, the target node corresponding to the target development environment is deployed with corresponding graphics processor resources, and the target graphics processor quota includes a target graphics processor quota number; the node manager is configured to determine the current number of used graphics processors of the target development environment; if the current number of used graphics processors is less than the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is equal to the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors currently used by the target development environment.
In practical applications, the target development environment may be performing not only the target training task but also other training tasks; in this case, the node manager can first determine the number of graphics processors already used by the target development environment, that is, the number of graphics processors used by the training tasks currently being performed by the target development environment.
After determining the current number of used graphics processors of the target development environment, the node manager can compare it with the target graphics processor quota number; if the current number of used graphics processors is less than the target graphics processor quota number, this indicates that some of the quota on the number of graphics processors for the target development node remains.
In this case, the node manager can allocate graphics processor resources for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment, so that the target training task trains on new graphics processors.
In an embodiment of the present application, the node manager is configured to determine the current task number of the graphics processors not currently used by the target development environment, and determine the number of graphics processors whose current task number is less than a task number threshold; if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold.
In practical applications, the node manager can determine the current task number of the graphics processors not currently used by the target development environment and the number of graphics processors whose current task number is less than the task number threshold, and allocate according to the two branches just stated.
Specifically, the node manager can first determine the current task number of the graphics processors not currently used by the target development environment, along with the number of graphics processors whose current task number is less than the task number threshold; then, the node manager can compare that number with the number of graphics processors required for the target training task.
If the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, this indicates that the graphics processors not used by the target development environment cannot meet the graphics processor resource requirements of the target training task.
In this case, the node manager can allocate graphics processor resources for the target training task both from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment, so that the target training task is completed through multiple graphics processors as far as possible.
If instead the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, this indicates that the graphics processors not used by the target development environment can meet the graphics processor resource requirements of the target training task.
In this case, the node manager can allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold, so that new graphics processors are allocated to the target training task.
If the current number of used graphics processors is equal to the target graphics processor quota number, the node manager can allocate graphics processor resources for the target training task from the graphics processors currently used by the target development environment.
Such equality indicates that none of the quota on the number of graphics processors for the target development node remains.
In this case, the node manager can allocate graphics processor resources for the target training task from the graphics processors currently used by the target development environment.
In an embodiment of the present application, the node manager is configured to determine the number of graphics processors required for the target training task; if the required number of graphics processors is greater than the current number of used graphics processors, generate scheduling failure information; if the required number of graphics processors is not greater than the current number of used graphics processors, sort the graphics processors currently used by the target development environment by task number and allocate graphics processor resources for the target training task from the top N graphics processors with the smallest number of tasks, where N is the number of graphics processors required for the target training task.
In practical applications, the node manager can determine the number of graphics processors required for the target training task and allocate according to the two branches just stated.
Specifically, the node manager can first determine the number of graphics processors required for the target training task; then, it compares that number with the current number of used graphics processors of the target development environment.
If the number of graphics processors required for the target training task is greater than the current number of used graphics processors of the target development environment, this indicates that the target development environment cannot meet the graphics processor resource requirements of the target training task.
In this case, the node manager can generate scheduling failure information and return it to the user.
If instead the number of graphics processors required for the target training task is not greater than the current number of used graphics processors of the target development environment, this indicates that the target development environment can meet the graphics processor resource requirements of the target training task.
In this case, the node manager can sort the graphics processors currently used by the target development environment by task number and allocate graphics processor resources for the target training task from the top N graphics processors with the smallest number of tasks, where N may refer to the number of graphics processors required for the target training task.
In an embodiment of the present application, the node manager is configured to determine a reuse graphics processor number according to the number of graphics processors required for the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is less than the reuse graphics processor number, generate scheduling failure information; if the current number of used graphics processors is not less than the reuse graphics processor number, allocate graphics processor resources for the target training task from the top M graphics processors with the smallest number of tasks among the graphics processors currently used by the target development environment and from the graphics processors whose current task number is less than the task number threshold, where M is the reuse graphics processor number.
Specifically, the node manager can first determine the number of graphics processors that need to be reused according to the number of graphics processors required for the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; for example, the latter can be subtracted from the former to obtain the number of graphics processors that need to be reused.
Then, the node manager can compare the number of graphics processors that need to be reused with the current number of used graphics processors of the target development environment.
If the current number of used graphics processors of the target development environment is less than the number of graphics processors that need to be reused, this indicates that the target development environment cannot perform the target training task; in this case, the node manager can generate scheduling failure information and send it to the user.
If instead the current number of used graphics processors of the target development environment is not less than the number of graphics processors that need to be reused, this indicates that the target development environment can perform the target training task once new graphics processors are allocated.
Specifically, the node manager can first determine the top M graphics processors with the smallest number of tasks among the graphics processors currently used by the target development environment; then, the node manager can allocate graphics processor resources for the target training task from those top M graphics processors and from the graphics processors whose current task number is less than the task number threshold.
In an embodiment of the present application, the node manager may include an auxiliary module; the auxiliary module is configured to record the node manager's allocation process, and to generate and store allocation information of graphics processor resources for each development environment.
Specifically, for the allocation process, the auxiliary module can generate and store allocation information of graphics processor resources for each development environment, so that the node manager can determine the graphics processor allocation status of each development environment based on the allocation information.
In another embodiment of the present application, the target graphics processor quota further includes a target graphics processor video memory quota capacity, which may be input in advance by the user; the node manager is further configured to allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity.
In practical applications, when allocating graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request, the following sub-step may also be included: allocating graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity.
In practical applications, the graphics processor video memory capacity required by the target training task is difficult to know; therefore, when allocating graphics processor video memory capacity for the target training task, the allocation can be made directly based on the target graphics processor video memory quota capacity.
To avoid repeated allocation, in an embodiment of the present application, when allocating graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity, the allocation process can be implemented through the following steps:
In an embodiment of the present application, the node manager is configured to judge whether other training tasks already exist in the target development environment when the target training task is started; if there is no other training task in the target development environment, allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity; if other training tasks already exist in the target development environment, determine the remaining graphics processor video memory quota capacity of the target development environment according to the target training task, and update the allocation information according to the remaining graphics processor video memory quota capacity of the target development environment.
In practical applications, the node manager judges whether other training tasks already exist in the target development environment when the target training task is started, and proceeds according to the two branches just stated.
When allocating graphics processor video memory capacity for the target training task, the node manager can first judge whether other training tasks already exist in the target development environment.
If, when the target training task is started, there is no other training task in the target development environment, the node manager can allocate graphics processor video memory capacity for the target training task directly according to the target graphics processor video memory quota capacity.
If, when the target training task is started, other training tasks already exist in the target development environment, then since allocation is made directly according to the target graphics processor video memory quota capacity, which is the maximum graphics processor video memory capacity the target development environment can use, no further allocation can be made.
In this case, the node manager can allocate no new graphics processor video memory capacity for the target training task, and instead directly use the graphics processor video memory capacity previously allocated to the target development environment to run the target training task.
In some embodiments of the present application, the graphics processor video memory capacity of the target development environment is fixed; therefore, after the graphics processor video memory capacity of the target development environment is allocated to the target training task, the remaining graphics processor video memory quota capacity of the target development environment can be determined according to the target training task, and the allocation information can be updated according to it.
Then, when the target development environment subsequently starts a new training task, the node manager can first determine the remaining graphics processor video memory quota capacity of the target development environment according to the allocation information, and judge, based on the remaining capacity, whether the target development environment can run the new training task normally.
To ensure effective reclamation of graphics processor resources for use by other development environments or training tasks, in an embodiment of the present application, the node manager is further configured to reclaim the graphics processor resources allocated for the target training task when it detects that the target training task has ended.
In practical applications, if it is detected that the target training task has ended, the node manager can reclaim the graphics processor resources allocated for the target training task; at the same time, the auxiliary module can update the allocation information corresponding to the target development environment.
In another embodiment of the present application, the node manager is further configured to reclaim the graphics processor resources allocated for the target training task when no heartbeat information for the target training task is received within a preset period.
When the client plug-in library detects that the target training task has started running, it can send heartbeat information for the target training task to the node manager according to a preset period.
If the node manager does not receive heartbeat information for the target training task within the preset period, this may indicate that the target training task has become abnormal; in this case, the graphics processor resources allocated for the target training task can be reclaimed, and the allocation information corresponding to the target development environment can be updated.
As shown in FIG. 7, a partial schematic structural diagram of an artificial intelligence training system according to an embodiment of the present application is shown:
The node manager and the container (development environment) communicate through a communication module; the two can communicate via UDP (User Datagram Protocol) or via IPC (Inter-Process Communication).
After receiving a task sent by Jobs (a distributed task scheduling platform), the deep learning framework (for example: tf, pytorch) is triggered to start; the startup of the deep learning framework triggers the loading of the client-plugin (client plug-in library); the client-plugin can send a message requesting graphics processors to the node manager and wait for the node manager's allocation.
After allocation is completed, the training task can train on the allocated graphics processor resources; at this point, the client-plugin can report heartbeats to the node manager and continuously update the task's duration.
When the training task completes, the client-plugin can perform the task's follow-up operations, for example, sending a task done message to the node manager.
After receiving the message requesting graphics processors, the node manager can process the message and manage the graphics processor resources so as to allocate graphics processor resources for the training task; after obtaining the allocation strategy, it can respond to the message and send a response to the client-plugin's graphics processor request.
In addition, the node manager can also provide timeout management. Specifically, the node manager can judge whether the graphics processor resources allocated for a training task need to be reclaimed based on whether heartbeat information for the training task is received within the preset period. For example, the node manager sets the preset period to 5s; if the node manager detects that no heartbeat information reported by the training task has been received within 5s, a timeout is deemed to have occurred, and the graphics processor resources allocated for the training task can be reclaimed.
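For the UDP transport mentioned above, a minimal exchange might look like the following sketch; the port, JSON message shape, and reply fields are assumptions, and the function blocks until a single datagram arrives.

```python
# Hedged sketch of a UDP exchange between client-plugin and node manager.
import json, socket

def serve_once(port: int = 9999) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    data, addr = sock.recvfrom(4096)            # blocking: waits for one datagram
    msg = json.loads(data.decode())
    if msg.get("type") == "request_gpu":
        reply = {"ok": True, "gpus": ["g0"]}    # result of the allocation strategy
    elif msg.get("type") == "heartbeat":
        reply = {"ok": True}                    # refresh the timeout timer
    else:
        reply = {"ok": False, "reason": "unknown message"}
    sock.sendto(json.dumps(reply).encode(), addr)
    sock.close()
```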
It should be noted that the above resource allocation method can be applied to the above artificial intelligence training system.
An embodiment of the present application provides an artificial intelligence training system, including at least one node, a node manager, and a client plug-in library, with multiple development environments created in each node; the client plug-in library is configured to, upon detecting that a target training task in a target development environment is started, generate a target graphics processor request after redirecting the loading process of the target deep learning framework of the target training task, and send the target graphics processor request to the node manager; the node manager is configured to allocate graphics processor resources for the target training task in response to the target graphics processor request; the target deep learning framework performs artificial intelligence training using the graphics processor resources allocated by the node manager. Through the embodiments of the present application, it is realized that, starting from the perspective of the deep learning framework and analyzing the loading logic of the deep learning framework when a training task is started, dynamic graphics processor sharing is achieved by hijacking the framework; the method is simple to implement, requires no modification of the framework, is imperceptible to the user, and is as flexible as the default graphics processor sharing mode.
Moreover, based on the dynamic graphics processor sharing logic, graphics processor resources are unbound from development environments, and graphics processor resources are allocated only when the user actually starts a training task, which solves the problem of users idly occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node's graphics processor resources, and improves the utilization of the node's overall graphics processor resources.
Referring to FIG. 8, a schematic structural diagram of a resource allocation apparatus according to an embodiment of the present application is shown; the apparatus is applied to an artificial intelligence training system including a client plug-in library and at least one node, with multiple development environments created in each node.
Specifically, the apparatus may include the following modules:
a request module 801, configured to, when it is detected that a target training task in a target development environment is started, obtain a target graphics processor request from the client plug-in library; wherein the target graphics processor request is generated by the client plug-in library, upon detecting that the target training task in the target development environment is started, after redirecting the loading process of the target deep learning framework of the target training task;
an allocation module 802, configured to allocate graphics processor resources for the target training task in response to the target graphics processor request.
In some embodiments of the present application, the allocation module 802 is configured to determine a target graphics processor quota pre-configured for the target development environment, and to allocate graphics processor resources for the target training task according to the target graphics processor quota and the target graphics processor request.
In some embodiments of the present application, the apparatus further includes:
a quota receiving module, configured to receive the graphics processor quota input by the user for each development environment.
In some embodiments of the present application, the target node corresponding to the target development environment is deployed with corresponding graphics processor resources, the target graphics processor quota includes a target graphics processor quota number, and the allocation module 802 is configured to determine the current number of used graphics processors of the target development environment; if the current number of used graphics processors is less than the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is equal to the target graphics processor quota number, allocate graphics processor resources for the target training task from the graphics processors currently used by the target development environment.
In some embodiments of the present application, the allocation module 802 is configured to determine the number of graphics processors required for the target training task; if the required number of graphics processors is greater than the current number of used graphics processors, generate scheduling failure information; if the required number of graphics processors is not greater than the current number of used graphics processors, sort the graphics processors currently used by the target development environment by task number and allocate graphics processor resources for the target training task from the top N graphics processors with the smallest number of tasks, where N is the number of graphics processors required for the target training task.
In some embodiments of the present application, the allocation module 802 is configured to determine the current task number of the graphics processors not currently used by the target development environment, and determine the number of graphics processors whose current task number is less than a task number threshold; if the number of graphics processors whose current task number is less than the task number threshold is less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold and from the graphics processors currently used by the target development environment; if the number of graphics processors whose current task number is less than the task number threshold is not less than the number of graphics processors required for the target training task, allocate graphics processor resources for the target training task from the graphics processors whose current task number is less than the task number threshold.
In an optional embodiment of the present application, the allocation module 802 is configured to determine a reuse graphics processor number according to the number of graphics processors required for the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the current number of used graphics processors is less than the reuse graphics processor number, generate scheduling failure information; if the current number of used graphics processors is not less than the reuse graphics processor number, allocate graphics processor resources for the target training task from the top M graphics processors with the smallest number of tasks among the graphics processors currently used by the target development environment and from the graphics processors whose current task number is less than the task number threshold, where M is the reuse graphics processor number.
In some embodiments of the present application, the apparatus further includes:
an allocation information generation module, configured to generate and store allocation information of graphics processor resources for each development environment.
In some embodiments of the present application, the target graphics processor quota further includes a target graphics processor video memory quota capacity, and the allocation module 802 is further configured to allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity.
In some embodiments of the present application, the allocation module 802 is configured to judge whether other training tasks already exist in the target development environment when the target training task is started; if there is no other training task in the target development environment, allocate graphics processor video memory capacity for the target training task according to the target graphics processor video memory quota capacity; if other training tasks already exist in the target development environment, determine the remaining graphics processor video memory quota capacity of the target development environment according to the target training task, and update the allocation information according to the remaining graphics processor video memory quota capacity of the target development environment.
In some embodiments of the present application, the apparatus further includes:
a first reclamation module, configured to reclaim the graphics processor resources allocated for the target training task when it is detected that the target training task has ended.
In some embodiments of the present application, the apparatus further includes:
a second reclamation module, configured to reclaim the graphics processor resources allocated for the target training task when no heartbeat information for the target training task is received within a preset period.
In the embodiments of the present application, when it is detected that a target training task in a target development environment is started, the loading of the client plug-in library is triggered; the client plug-in library can then redirect the loading process of the target deep learning framework of the target training task so as to hijack the startup process of the deep learning framework, and in this process generate a target graphics processor request to request the allocation of graphics processor resources for the target training task. Compared with the prior art, the embodiments of the present application start from the perspective of the deep learning framework, analyze the loading logic of the deep learning framework when a training task is started, and achieve dynamic graphics processor sharing by hijacking the framework; the method is simple to implement, requires no modification of the framework, is imperceptible to the user, and is as flexible as the default graphics processor sharing mode. Moreover, based on the dynamic graphics processor sharing logic, graphics processor resources are unbound from development environments, and graphics processor resources are allocated only when the user actually starts a training task, which solves the problem of users idly occupying graphics processor resources in the pre-allocation mode, efficiently utilizes the node's graphics processor resources, and improves the utilization of the node's overall graphics processor resources.
本申请实施例还提供了一种电子设备,如图9所示,该电子设备9包括处理器901、存储器902及存储在存储器902上并能够在处理器上运行的计算机程序,计算机程序被处理器执行时实现如上的资源的分配方法。
本申请实施例还提供了一种非易失性计算机可读存储介质,如图10所示,该非易失性计算机可读存储介质10上存储计算机程序1001,计算机程序1001被处理器执行时实现如上的资源的分配方法。
As for the apparatus embodiment, since it is substantially similar to the method embodiment, its description is relatively brief; for relevant details, reference may be made to the description of the method embodiment.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more non-volatile computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, such that a series of operational steps are performed on the computer or other programmable terminal device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as covering the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.
Finally, it should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, such that a process, method, article or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article or terminal device. In the absence of further limitations, an element defined by the phrase "comprising a ..." does not exclude the existence of additional identical elements in the process, method, article or terminal device that includes the element.
The resource allocation method and apparatus and the artificial intelligence training system provided above have been described in detail. Specific examples have been used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is intended only to help understand the method and core ideas of the present application. Meanwhile, for those of ordinary skill in the art, changes may be made in the specific implementations and the scope of application in accordance with the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (30)

  1. A resource allocation method, characterized in that it is applied to an artificial intelligence training system, the artificial intelligence training system comprising a client plugin library and at least one node, with multiple development environments created in each node, the method comprising:
    when it is detected that a target training task in a target development environment has started, obtaining a target graphics processor request from the client plugin library; wherein the target graphics processor request is generated by the client plugin library, upon detecting that the target training task in the target development environment has started, after redirecting a loading process of a target deep learning framework of the target training task; and
    in response to the target graphics processor request, allocating graphics processor resources to the target training task.
  2. The method according to claim 1, characterized in that each development environment is provided with a corresponding graphics processor quota, the graphics processor quota being the maximum graphics processor resources usable by the development environment, as input by a user for the corresponding development environment when each development environment is created.
  3. The method according to claim 1, characterized in that the allocating graphics processor resources to the target training task in response to the target graphics processor request comprises:
    determining a target graphics processor quota pre-configured for the target development environment; and
    allocating graphics processor resources to the target training task according to the target graphics processor quota and the target graphics processor request.
  4. The method according to claim 3, characterized in that the method further comprises:
    receiving graphics processor quotas input by a user for each development environment.
  5. The method according to claim 4, characterized in that a target node corresponding to the target development environment is deployed with corresponding graphics processor resources, the target graphics processor quota comprises a target graphics processor quota count, and the allocating graphics processor resources to the target training task according to the target graphics processor quota and the target graphics processor request comprises:
    determining a number of graphics processors currently used by the target development environment;
    if the number of currently used graphics processors is less than the target graphics processor quota count, allocating graphics processor resources to the target training task from graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; and
    if the number of currently used graphics processors equals the target graphics processor quota count, allocating graphics processor resources to the target training task from graphics processors currently used by the target development environment.
  6. The method according to claim 5, characterized in that the allocating graphics processor resources to the target training task from graphics processors currently used by the target development environment comprises:
    determining a number of graphics processors required by the target training task;
    if the required number of graphics processors is greater than the number of currently used graphics processors, generating scheduling failure information; and
    if the required number of graphics processors is not greater than the number of currently used graphics processors, sorting the graphics processors currently used by the target development environment by task count, and allocating graphics processor resources to the target training task from the top N graphics processors with the fewest tasks;
    wherein N is the number of graphics processors required by the target training task.
  7. The method according to claim 5, characterized in that the allocating graphics processor resources to the target training task from graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment comprises:
    determining current task counts of the graphics processors not currently used by the target development environment, and determining a number of graphics processors whose current task count is less than a task count threshold;
    if the number of graphics processors whose current task count is less than the task count threshold is less than the number of graphics processors required by the target training task, allocating graphics processor resources to the target training task from the graphics processors whose current task count is less than the task count threshold and the graphics processors currently used by the target development environment; and
    if the number of graphics processors whose current task count is less than the task count threshold is not less than the number of graphics processors required by the target training task, allocating graphics processor resources to the target training task from the graphics processors whose current task count is less than the task count threshold.
  8. The method according to claim 7, characterized in that the allocating graphics processor resources to the target training task from the graphics processors whose current task count is less than the task count threshold and the graphics processors currently used by the target development environment comprises:
    determining a reuse graphics processor count according to the number of graphics processors required by the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment;
    if the number of currently used graphics processors is less than the reuse graphics processor count, generating scheduling failure information; and
    if the number of currently used graphics processors is not less than the reuse graphics processor count, allocating graphics processor resources to the target training task from the top M graphics processors with the fewest tasks among the graphics processors currently used by the target development environment, and from the graphics processors whose current task count is less than the task count threshold;
    wherein M is the reuse graphics processor count.
  9. The method according to any one of claims 5-8, characterized in that the method further comprises:
    generating and storing allocation information of graphics processor resources for each development environment.
  10. The method according to claim 9, characterized in that the target graphics processor quota further comprises a target graphics processor memory quota capacity, and the allocating graphics processor resources to the target training task according to the target graphics processor quota and the target graphics processor request further comprises:
    allocating graphics processor memory capacity to the target training task according to the target graphics processor memory quota capacity.
  11. The method according to claim 10, characterized in that the allocating graphics processor memory capacity to the target training task according to the target graphics processor memory quota capacity comprises:
    determining whether, when the target training task is started, there are already other training tasks in the target development environment;
    if there are no other training tasks in the target development environment, allocating graphics processor memory capacity to the target training task according to the target graphics processor memory quota capacity; and
    if there are already other training tasks in the target development environment, determining, according to the target training task, a remaining graphics processor memory quota capacity of the target development environment, and updating allocation information according to the remaining graphics processor memory quota capacity of the target development environment.
  12. The method according to claim 1, characterized in that the method further comprises:
    when it is detected that the target training task has ended, reclaiming the graphics processor resources allocated to the target training task.
  13. The method according to claim 1, characterized in that the method further comprises:
    when no heartbeat information for the target training task is received within a preset period, reclaiming the graphics processor resources allocated to the target training task.
  14. An artificial intelligence training system, characterized in that it comprises at least one node, a node manager and a client plugin library, with multiple development environments created in each node;
    the client plugin library is configured to, upon detecting that a target training task in a target development environment has started, redirect a loading process of a target deep learning framework of the target training task and then generate a target graphics processor request, and to send the target graphics processor request to the node manager;
    the node manager is configured to allocate graphics processor resources to the target training task in response to the target graphics processor request; and
    the target deep learning framework performs artificial intelligence training using the graphics processor resources allocated by the node manager.
  15. The system according to claim 14, characterized in that each development environment is provided with a corresponding graphics processor quota, the graphics processor quota being the maximum graphics processor resources usable by the development environment, as input by a user for the corresponding development environment when each development environment is created.
  16. The system according to claim 14, characterized in that
    the node manager is configured to determine a target graphics processor quota pre-configured for the target development environment, and to allocate graphics processor resources to the target training task according to the target graphics processor quota and the target graphics processor request.
  17. The system according to claim 16, characterized in that the node manager comprises a graphics processor management module;
    the graphics processor management module is configured to store graphics processor quotas input by a user for each development environment.
  18. The system according to claim 17, characterized in that a target node corresponding to the target development environment is deployed with corresponding graphics processor resources, and the target graphics processor quota comprises a target graphics processor quota count;
    the node manager is configured to determine a number of graphics processors currently used by the target development environment; if the number of currently used graphics processors is less than the target graphics processor quota count, allocate graphics processor resources to the target training task from graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; and if the number of currently used graphics processors equals the target graphics processor quota count, allocate graphics processor resources to the target training task from graphics processors currently used by the target development environment.
  19. The system according to claim 18, characterized in that
    the node manager is configured to determine a number of graphics processors required by the target training task; if the required number of graphics processors is greater than the number of currently used graphics processors, generate scheduling failure information; if the required number of graphics processors is not greater than the number of currently used graphics processors, sort the graphics processors currently used by the target development environment by task count, and allocate graphics processor resources to the target training task from the top N graphics processors with the fewest tasks; wherein N is the number of graphics processors required by the target training task.
  20. The system according to claim 18, characterized in that
    the node manager is configured to determine current task counts of graphics processors not currently used by the target development environment, and to determine a number of graphics processors whose current task count is less than a task count threshold; if the number of graphics processors whose current task count is less than the task count threshold is less than the number of graphics processors required by the target training task, allocate graphics processor resources to the target training task from the graphics processors whose current task count is less than the task count threshold and the graphics processors currently used by the target development environment; if the number of graphics processors whose current task count is less than the task count threshold is not less than the number of graphics processors required by the target training task, allocate graphics processor resources to the target training task from the graphics processors whose current task count is less than the task count threshold.
  21. The system according to claim 20, characterized in that
    the node manager is configured to determine a reuse graphics processor count according to the number of graphics processors required by the target training task and the number of graphics processors, among the graphics processor resources corresponding to the target node, that are not currently used by the target development environment; if the number of currently used graphics processors is less than the reuse graphics processor count, generate scheduling failure information; if the number of currently used graphics processors is not less than the reuse graphics processor count, allocate graphics processor resources to the target training task from the top M graphics processors with the fewest tasks among the graphics processors currently used by the target development environment, and from the graphics processors whose current task count is less than the task count threshold; wherein M is the reuse graphics processor count.
  22. The system according to any one of claims 18-21, characterized in that the node manager comprises an auxiliary module;
    the auxiliary module is configured to record the allocation process of the node manager, and to generate and store allocation information of graphics processor resources for each development environment.
  23. The system according to claim 22, characterized in that the target graphics processor quota further comprises a target graphics processor memory quota capacity;
    the node manager is further configured to allocate graphics processor memory capacity to the target training task according to the target graphics processor memory quota capacity.
  24. The system according to claim 23, characterized in that
    the node manager is configured to determine whether, when the target training task is started, there are already other training tasks in the target development environment; if there are no other training tasks in the target development environment, allocate graphics processor memory capacity to the target training task according to the target graphics processor memory quota capacity; if there are already other training tasks in the target development environment, determine, according to the target training task, a remaining graphics processor memory quota capacity of the target development environment, and update allocation information according to the remaining graphics processor memory quota capacity of the target development environment.
  25. The system according to claim 14, characterized in that
    the node manager is further configured to reclaim the graphics processor resources allocated to the target training task when it is detected that the target training task has ended.
  26. The system according to claim 14, characterized in that
    the node manager is further configured to reclaim the graphics processor resources allocated to the target training task when no heartbeat information for the target training task is received within a preset period.
  27. A resource allocation apparatus, characterized in that it is applied to an artificial intelligence training system, the artificial intelligence training system comprising a client plugin library and at least one node, with multiple development environments created in each node, the apparatus comprising:
    a request module, configured to, when it is detected that a target training task in a target development environment has started, obtain a target graphics processor request from the client plugin library; wherein the target graphics processor request is generated by the client plugin library, upon detecting that the target training task in the target development environment has started, after redirecting a loading process of a target deep learning framework of the target training task; and
    an allocation module, configured to allocate graphics processor resources to the target training task in response to the target graphics processor request.
  28. The apparatus according to claim 27, characterized in that each development environment is provided with a corresponding graphics processor quota, the graphics processor quota being the maximum graphics processor resources usable by the development environment, as input by a user for the corresponding development environment when each development environment is created.
  29. An electronic device, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the resource allocation method according to any one of claims 1 to 13.
  30. A non-volatile computer-readable storage medium, characterized in that the non-volatile computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the resource allocation method according to any one of claims 1 to 13.

Similar Documents

Publication Publication Date Title
CN109104467B (zh) 开发环境构建方法、装置以及平台***和存储介质
US9661071B2 (en) Apparatus, systems and methods for deployment and management of distributed computing systems and applications
CN111052086B (zh) 一种云托管函数的暖启动技术
WO2020177564A1 (zh) Vnf的生命周期管理方法及装置
CN103064742A (zh) 一种hadoop集群的自动部署***及方法
WO2024113836A1 (zh) 一种资源的分配方法、装置和一种人工智能训练***
CN112416585A (zh) 面向深度学习的gpu资源管理与智能化调度方法
EP4050482A1 (en) Resource deployment system and method based on cloud cost
CN109347716B (zh) 消费者vnf的实例化方法及装置
CN115242877B (zh) 面向多K8s集群的Spark协同计算、作业方法及装置
CN112463290A (zh) 动态调整计算容器的数量的方法、***、装置和存储介质
CN110580195A (zh) 一种基于内存热插拔的内存分配方法和装置
CN116048825A (zh) 容器集群构建方法及***
CN112003931B (zh) 一种编排控制器部署方法、***及相关组件
CN113472557A (zh) 一种虚拟网元处理方法、装置及电子设备
CN112445602A (zh) 资源调度方法、装置、***及电子设备
CN114546648A (zh) 任务处理方法及任务处理平台
US11954525B1 (en) Method and apparatus of executing collaborative job for spark faced to multiple K8s clusters
CN113590415B (zh) 深度学***台的端口管理***、方法、设备及介质
CN112596855B (zh) 一种容器创建方法及装置
US20240127111A1 (en) Internet-of-things-oriented machine learning container image download method and system
CN112825044B (zh) 任务执行方法、装置及计算机存储介质
CN112148348B (zh) 任务处理方法、装置及存储介质
CN114327752A (zh) 一种微服务配置方法、装置及设备
CN117369942A (zh) 一种应用服务资源编排及自动化部署方法及***