CN114595058A - Model training method and device based on GPU (Graphics Processing Unit) resources, electronic equipment and storage medium


Info

Publication number: CN114595058A
Application number: CN202210199265.3A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: target, trained, GPU, training, model
Inventors: Ren Wenlong (任文龙), Xu Xuemei (徐雪梅)
Current and original assignee: Beijing Kingsoft Cloud Network Technology Co., Ltd.
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The application provides a model training method and device based on GPU resources, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring development environment creation information; creating a target development environment by adopting target CPU resources according to the development environment creation information; acquiring a target model to be trained through the target development environment; determining the GPU resource demand required by the target model to be trained; and under the condition that available GPU resources meeting the GPU resource demand exist in a GPU resource pool, training the target model to be trained according to the available GPU resources, wherein the available GPU resources are GPU resources which are not executing training tasks. According to the method and the device, the target model to be trained is developed using CPU resources, and GPU resources are used only for training the target model to be trained, so that the development of the target model to be trained is decoupled from GPU resources, thereby solving the technical problems in the related art that GPU resources are used inefficiently and tend to be insufficient.

Description

Model training method and device based on GPU (graphics processing Unit) resources, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a model training method and device based on GPU resources, electronic equipment and a storage medium.
Background
With the rise of artificial intelligence, the construction of artificial intelligence platforms has become an important link in building the intelligent internet. An artificial intelligence platform is a comprehensive service platform that integrates the development and training of artificial intelligence algorithms with the storage and release of models. When using such a platform, an algorithm engineer first independently creates a development environment on the platform that supports code development and real-time execution. Since more and more artificial intelligence technologies are implemented based on the Graphics Processing Unit (GPU), in order to ensure that training code written on the platform runs efficiently, algorithm engineers often use GPUs in the development environments they create, which requires the platform to have enough GPU machines and thereby increases its operation and maintenance cost.
In the related art, as algorithm engineers create more and more development environments on a platform, the practice of binding each development environment to GPUs leads to inefficient use of GPU resources, making a shortage of GPU cards likely.
Therefore, the related art has the problem of inefficient use of GPU resources.
Disclosure of Invention
The application provides a model training method and device based on GPU resources, electronic equipment and a storage medium, and aims to at least solve the problem of low GPU resource utilization efficiency in the related art.
According to an aspect of the embodiments of the present application, there is provided a model training method based on GPU resources, including:
acquiring development environment creation information;
creating a target development environment by adopting target CPU resources according to the development environment creation information;
acquiring a target model to be trained through the target development environment;
determining the GPU resource demand required by the target model to be trained;
and under the condition that available GPU resources meeting the GPU resource demand exist in a GPU resource pool, training the target model to be trained according to the available GPU resources, wherein the available GPU resources are GPU resources which do not execute training tasks.
Optionally, as in the foregoing method, the creating a target development environment by using target CPU resources according to the development environment creation information includes:
responding to a creating instruction in the development environment creating information, and acquiring a target CPU resource in a CPU resource pool according to a target CPU resource parameter in the development environment creating information;
and creating to obtain the target development environment by installing the target software in the development environment creating information in the target CPU resource.
Optionally, as in the foregoing method, in a case that there are available GPU resources in the GPU resource pool that meet the GPU resource demand, before training the model to be trained according to the available GPU resources, the method further includes:
obtaining all candidate models to be trained currently, wherein the candidate models to be trained comprise the target models to be trained;
sequencing all the candidate models to be trained, and determining a training sequence corresponding to each candidate model to be trained;
and under the condition that the training sequence corresponding to the target model to be trained is first, jumping to the step of training the target model to be trained according to the available GPU resources under the condition that available GPU resources meeting the GPU resource demand of the target model to be trained exist in the GPU resource pool.
Optionally, as in the foregoing method, in the case that there are available GPU resources in the GPU resource pool that meet the GPU resource demand, training the target model to be trained according to the available GPU resources includes:
under the condition that available GPU resources meeting the GPU resource demand of the target model to be trained exist in a GPU resource pool, acquiring data addresses in the development environment creation information, wherein the data addresses are addresses of training parameters for training the target model to be trained;
acquiring the training parameters according to the data address;
selecting a target GPU resource consistent with the GPU resource demand from the available GPU resources;
and training the target model to be trained by adopting the target GPU resources and the training parameters.
Optionally, as in the foregoing method, in a case that there are available GPU resources in the GPU resource pool that meet the GPU resource demand, after the training of the target model to be trained according to the available GPU resources, the method further includes:
acquiring target load information corresponding to the target GPU resource according to a fixed period, wherein the target load information is used for indicating the completion degree of the target GPU resource for training the target model to be trained;
under the condition that the target load information indicates that the target model to be trained is trained, acquiring a training result of the target model to be trained;
and returning the training result to the target development environment according to address information corresponding to the target model to be trained, wherein the address information is an address for acquiring the target model to be trained from the target development environment.
Optionally, as in the foregoing method, after the training result is returned to the target development environment according to the target identifier corresponding to the target model to be trained, the method further includes:
deleting all data in the target GPU resource, and determining the state of the target GPU resource to be available.
Optionally, as in the foregoing method, the obtaining, by the target development environment, a model to be trained includes:
receiving a compilation operation of a target object through the target development environment;
and obtaining the target model to be trained obtained after the target object completes the compiling operation.
According to another aspect of the embodiments of the present application, there is also provided a GPU resource-based model training apparatus, including:
the first acquisition module is used for acquiring development environment creation information;
the creating module is used for creating a target development environment by adopting target CPU resources according to the development environment creating information;
the second acquisition module is used for acquiring a target model to be trained through the target development environment;
the determining module is used for determining the GPU resource demand required by the target model to be trained;
and the training module is used for training the target model to be trained according to the available GPU resources under the condition that the available GPU resources meeting the GPU resource demand exist in the GPU resource pool, wherein the available GPU resources are GPU resources which do not execute the training task.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus; wherein the memory is used for storing the computer program; a processor for performing the method steps in any of the above embodiments by running the computer program stored on the memory.
According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the method steps of any of the above embodiments when the computer program is executed.
In the embodiment of the application, CPU resources are used for developing the model to be trained, and GPU resources are used only for training the model to be trained, so that the development of the model to be trained is decoupled from GPU resources. Developing the model to be trained no longer occupies GPU resources, which improves the utilization rate of GPU resources; training efficiency can be effectively improved even when there are many models to be trained and the GPU resources they require exceed all the GPU resources in the GPU resource pool. This solves the technical problems in the related art that GPU resources are used inefficiently and tend to be insufficient.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can also obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart diagram illustrating an alternative GPU resource-based model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating an alternative GPU resource-based model training method according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an alternative GPU resource-based model training method according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a time division multiplexing service in accordance with an application example of the present application;
FIG. 5 is a block diagram of an alternative GPU resource-based model training apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the accompanying drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one aspect of the embodiment of the application, a model training method based on GPU resources is provided. Optionally, in this embodiment, the model training method based on GPU resources may be applied to a hardware environment formed by a terminal and a server. The server is connected with the terminal through a network, can be used for providing services (such as advertisement push services and application services) for the terminal or a client installed on the terminal, and can be provided with a database on the server or independent of the server for providing data storage services for the server.
The network may include, but is not limited to, at least one of: wired network, wireless network. The wired network may include, but is not limited to, at least one of: wide area networks, metropolitan area networks, local area networks, which may include, but are not limited to, at least one of the following: WIFI (Wireless Fidelity ), bluetooth. The terminal may not be limited to a PC, a mobile phone, a tablet computer, and the like.
The model training method based on GPU resources may be executed by a server, by a terminal, or by the server and the terminal together. When the terminal executes the GPU resource-based model training method of the embodiment of the present application, the method may also be executed by a client installed on the terminal.
Taking the method being executed by a server as an example, fig. 1 shows the GPU resource-based model training method provided in this embodiment, which includes the following steps:
step S101, acquiring development environment creation information.
The model training method based on GPU resources in this embodiment may be applied to scenarios in which a model needs to be trained with GPU resources, for example, a scenario in which the model is trained using GPU cards, or scenarios in which other training is performed using GPU resources. In the embodiments of the present application, the method is described by taking GPU cards as an example; where there is no contradiction, the method is also applicable to other types of GPU resources.
Taking a scenario in which the model is trained using the GPU cards as an example, the idle states of all the GPU cards (i.e., GPU resources) are determined to determine the available GPU cards (i.e., available GPU resources) in all the GPU cards, and the available GPU cards are used to train the target model to be trained.
In order to obtain the target model to be trained, a development environment for developing model codes needs to be obtained. The development environment is created by adopting specific computing resources, and before the development environment is created, development environment creation information needs to be acquired.
The development environment creation information may include sub information for applying for a computing resource, and sub information such as software and a training data set required for creating an environment. Alternatively, the development environment creation information may be information that is acquired through the AI platform and uploaded by the developer, and the development environment creation information is transferred to the container cluster system.
And step S102, creating a target development environment by adopting target CPU resources according to the development environment creation information.
After the development environment creation information is obtained, the target CPU resource required by the development environment can be determined, and then the target development environment is created and obtained.
Optionally, the development environment creation information may include sub information for applying for computing resources, and sub information such as software and training data sets (data for performing preliminary training) required for creating the environment; therefore, the target CPU resource can be obtained based on the sub-information application for performing the computing resource application, and then the target CPU resource is used to perform the installation of the software required for environment creation, so as to obtain the target development environment.
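The following is a minimal sketch, in Python, of how the development environment creation information described above might be represented; the structure and field names are illustrative assumptions made for this example, not a format defined by this application.

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical structure of the development environment creation information;
    # the field names are assumptions used only for illustration.
    @dataclass
    class DevEnvCreationInfo:
        creator_id: str                 # ID of the algorithm engineer creating the environment
        cpu_cores: int                  # target CPU resource parameter: number of CPU cores
        memory_gb: int                  # target CPU resource parameter: amount of memory
        software: List[str] = field(default_factory=list)  # target software to install
        dataset_address: str = ""       # data address of the training data set

    info = DevEnvCreationInfo(
        creator_id="engineer-001",
        cpu_cores=8,
        memory_gb=32,
        software=["notebook", "anaconda", "tensorflow"],
        dataset_address="s3://datasets/demo",   # hypothetical address
    )

The sub-information for applying for computing resources corresponds to cpu_cores and memory_gb, while software and dataset_address correspond to the sub-information required for creating the environment.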
And step S103, acquiring a target model to be trained through a target development environment.
After the target development environment is obtained, a developer can develop the model in the target development environment, and after the model to be trained is developed, the purpose of obtaining the model to be trained through the target development environment is achieved.
And step S104, determining the GPU resource demand required by the target model to be trained.
After the target model to be trained is developed, the GPU resource demand which is actively uploaded by a developer and used for training the target model to be trained can be obtained, or the GPU resource demand can be automatically analyzed by a specified resource analysis tool based on the target model to be trained.
And step S105, under the condition that available GPU resources meeting GPU resource demand exist in the GPU resource pool, training the target model to be trained according to the available GPU resources, wherein the available GPU resources are GPU resources which do not execute the training task.
After the GPU resource demand is determined, the model to be trained can be trained under the condition that available GPU resources meeting the GPU resource demand exist in the GPU resource pool.
The available GPU resources may be GPU resources in an idle state in a GPU resource pool which do not execute training tasks.
For example, the GPU resource pool may include a plurality of GPU cards, and if the GPU resource demand is 4 GPU cards, the target model to be trained may be trained according to the available GPU resources if the number of GPU cards in the available GPU resources is greater than or equal to 4.
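As a simple illustration of this availability check, the sketch below counts the idle GPU cards in the resource pool and compares the count with the demand; the list-of-states representation of the pool is an assumption made only for this example.

    # A minimal sketch of the availability check, assuming the GPU resource pool
    # is represented as a list of per-card states; the representation is illustrative.
    gpu_pool = ["busy", "idle", "idle", "busy", "idle", "idle", "idle"]

    def has_available_gpus(pool, demand):
        # Available GPU resources are the cards not executing any training task.
        idle_cards = [i for i, state in enumerate(pool) if state == "idle"]
        return len(idle_cards) >= demand

    gpu_demand = 4
    if has_available_gpus(gpu_pool, gpu_demand):
        print("enough idle GPU cards: train the target model to be trained")
    else:
        print("queue the task until enough GPU cards become idle")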
According to the method, the model to be trained is developed using CPU resources, and GPU resources are used only for training the model to be trained, so that the development of the model to be trained is decoupled from GPU resources. Developing the model to be trained no longer occupies GPU resources, which improves the utilization rate of GPU resources; training efficiency can be effectively improved even when there are many models to be trained and the GPU resources they require exceed all the GPU resources in the GPU resource pool, thereby solving the problems in the related art that GPU resources are used inefficiently and tend to be insufficient.
As an alternative implementation manner, as in the foregoing method, the step S102, according to the development environment creation information, creating a target development environment by using a target CPU resource, includes the following steps:
step S201, responding to the creation instruction in the development environment creation information, and according to the target CPU resource parameter in the development environment creation information, acquiring the target CPU resource in the CPU resource pool.
After the CPU resource pool (e.g., CPU machine group) acquires the development environment creation information, it may be determined that creation of the development environment is required based on a creation instruction therein (e.g., information or sub-information of a specified field in the development environment creation information), and then the target CPU resource is acquired from the CPU resource pool based on the target CPU resource parameter in the development environment creation information.
The target CPU resource parameter may be information in the development environment creation information indicating the amount of resources required for the target development environment, and may include, but is not limited to: number of CPU cards, number of memories, etc. After the target CPU resource parameters are determined, the target CPU resources for allocation to the target development environment may be determined in the CPU resource pool.
The target CPU resource can be a CPU resource of which the number of the resources determined by the CPU resource pool according to the target CPU resource parameter is consistent with the target CPU resource parameter.
Step S202, target development environment is created by installing target software in the development environment creation information in the target CPU resource.
After the target CPU resource is determined, the target development environment can be created based on the target CPU resource, and the target development environment can be created by installing the target software in the development environment creation information in the target CPU resource.
When the present embodiment is implemented with a container cluster, for example using Docker technology, the target software may include, but is not limited to: a notebook (a piece of software that supports programming on the web side and can be connected to the CPU development environment container through a url), Python, anaconda (an open-source Python distribution), tensorflow (a symbolic mathematical system based on dataflow programming), and the like. A notebook is typically included so that developers obtain a real-time programming environment that interworks with the container cluster, which facilitates transferring the compiled model to the container cluster later.
Further, the development environment creation information may further include a creator ID, and the target development environment is bound to the creator ID, so that a user corresponding to the creator ID may have a right to access the target development environment in a later period.
By the method in the embodiment, the target development environment can be quickly created by adopting the target CPU resource in the CPU resource pool, so that the development of the model can be conveniently carried out by adopting the target development environment based on the CPU resource in the later period, and the aim of decoupling the model development and the GPU resource is fulfilled.
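As a rough illustration of this step, the sketch below starts a CPU-only development environment container with the Docker CLI, using only the amount of CPU and memory requested in the creation information. The image name, notebook port, and helper function are assumptions made for illustration; they are not the platform's actual interface.

    import subprocess

    def create_cpu_dev_environment(info):
        """Sketch: create a CPU-only development environment container.
        The image name and exposed notebook port are illustrative assumptions."""
        container_name = f"dev-env-{info.creator_id}"
        cmd = [
            "docker", "run", "-d",
            "--name", container_name,
            "--cpus", str(info.cpu_cores),        # bind only CPU resources, no GPU
            "--memory", f"{info.memory_gb}g",
            "-p", "8888:8888",                    # hypothetical notebook port
            "jupyter/tensorflow-notebook",        # hypothetical image with notebook + tensorflow
        ]
        subprocess.run(cmd, check=True)
        # Return the url through which the creator can log in to the container.
        return f"http://cluster.example.com:8888/?env={container_name}"

    # url = create_cpu_dev_environment(info)   # 'info' as in the earlier sketch

Because no GPU is attached to the container, the development environment only consumes resources from the CPU resource pool, consistent with the decoupling described above.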
As an alternative implementation, as shown in fig. 2, in the aforementioned method, before step S105 trains the target model to be trained according to the available GPU resources under the condition that available GPU resources meeting the GPU resource demand exist in the GPU resource pool, the method further includes the following steps:
step S301, all candidate models to be trained which are to be trained currently are obtained, wherein all candidate models to be trained comprise target models to be trained.
When the model is trained through the GPU resources, under the condition that the GPU resources required by all candidate models to be trained are greater than all resources of the GPU pool, each candidate model to be trained still needs to be queued for training.
The candidate models to be trained may be models to be trained currently, and the number of the candidate models to be trained may be one or more. Moreover, the candidate model to be trained includes the target model to be trained acquired in step S103 in the foregoing embodiment.
Step S302, all candidate models to be trained are ranked, and a training sequence corresponding to each candidate model to be trained is determined.
After all candidate models to be trained are determined, a training sequence corresponding to each candidate model to be trained can be determined according to the acquisition time of each candidate model to be trained.
For example, the earlier the acquisition time of the candidate model to be trained is, the earlier the corresponding training sequence is, that is, the candidate model to be trained with the earlier acquisition time is preferentially trained; in addition, the candidate models to be trained with less resource can be preferentially trained by sequencing according to the GPU resource quantity required by the candidate models to be trained.
Step S303, under the condition that the training sequence corresponding to the target model to be trained is first, jumping to the step of training the target model to be trained according to the available GPU resources under the condition that available GPU resources meeting the GPU resource demand of the target model to be trained exist in the GPU resource pool.
When the training sequence corresponding to the target model to be trained is determined to be first, it indicates that the target model to be trained is the next model to be trained. Therefore, the method jumps to the step of training the target model to be trained according to the available GPU resources under the condition that available GPU resources meeting its GPU resource demand exist in the GPU resource pool, so that the target model to be trained is trained as soon as available GPU resources exist.
By the method in the embodiment, the GPU resources can be used for training the candidate models to be trained in sequence, the utilization rate of the GPU resources can be effectively improved, and the model training efficiency is further improved.
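The queueing behaviour described above can be sketched as a first-in-first-out ordering by submission time; the task record and its field names are assumptions used only to illustrate the ordering step.

    from collections import namedtuple

    # Illustrative task record; the fields are assumptions, not the patent's data format.
    TrainingTask = namedtuple("TrainingTask", ["name", "submit_time", "gpu_demand"])

    pending_tasks = [
        TrainingTask("T2", submit_time=102, gpu_demand=2),
        TrainingTask("T1", submit_time=100, gpu_demand=4),
        TrainingTask("T3", submit_time=105, gpu_demand=1),
    ]

    # Sort all candidate models to be trained by acquisition time, so that the one
    # obtained earliest is trained first (alternatively, sort by gpu_demand so that
    # tasks needing fewer resources are trained first).
    training_order = sorted(pending_tasks, key=lambda t: t.submit_time)

    head = training_order[0]                   # the task whose training sequence is first
    print([t.name for t in training_order])    # ['T1', 'T2', 'T3']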
As shown in fig. 3, as an alternative implementation manner, as the foregoing method, in the step S105, when there are available GPU resources meeting the GPU resource demand in the GPU resource pool, the training of the target model to be trained according to the available GPU resources includes the following steps:
step S401, under the condition that available GPU resources meeting GPU resource demand of the target model to be trained exist in the GPU resource pool, acquiring data addresses in the development environment creating information, wherein the data addresses are addresses of training parameters used for training the target model to be trained.
In the case that there are available GPU resources in the GPU resource pool that meet the GPU resource demand of the target model to be trained, it is described that the target model to be trained can be trained, and therefore, a data address of a training parameter for training the target model to be trained needs to be acquired.
The data address may be an address where the training parameters are stored in the cloud.
And step S402, acquiring training parameters according to the data address.
After the data address is obtained, the training parameters can be obtained by accessing the data address.
In step S403, a target GPU resource that is consistent with the GPU resource demand is selected from the available GPU resources.
Meanwhile, under the condition that the available GPU resources meet GPU resource demand, target GPU resources consistent with the GPU resource demand can be selected from the available GPU resources, so that a target model to be trained is trained through the target GPU resources; alternatively, the target GPU resource may be the number of GPU cards.
And S404, training the target model to be trained by adopting the target GPU resources and training parameters.
After the target GPU resources used for training are determined and the training parameters are obtained, the target GPU resources can be adopted, and the model to be trained of the target is trained through the training parameters.
Further, after the training is completed, the trained model can be verified with preset verification parameters; when the trained model does not meet the preset precision requirement (for example, an accuracy of 99%), it can be trained again until the preset precision requirement is met.
By the method in the embodiment, the GPU resources are only used for training the target model to be trained, so that the utilization rate of the GPU resources can be effectively improved, and the aim of improving the training efficiency is fulfilled.
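Putting steps S401 to S404 together, a training launch might look like the sketch below. The parameter-fetching helper, the card-selection policy, and the binding of cards through CUDA_VISIBLE_DEVICES are assumptions introduced only to make the flow concrete.

    import os
    import subprocess
    import urllib.request

    def launch_training(gpu_pool, gpu_demand, data_address, model_file):
        """Sketch of steps S401-S404: fetch the training parameters from their data
        address, pick target GPU cards matching the demand, and start training.
        The card binding via CUDA_VISIBLE_DEVICES is an illustrative choice."""
        # S401/S402: acquire the training parameters according to the data address.
        params_path = "training_params.bin"
        urllib.request.urlretrieve(data_address, params_path)

        # S403: select target GPU resources consistent with the GPU resource demand.
        idle_cards = [i for i, state in enumerate(gpu_pool) if state == "idle"]
        target_cards = idle_cards[:gpu_demand]
        for i in target_cards:
            gpu_pool[i] = "busy"

        # S404: train the target model to be trained with the target GPU resources
        # and the training parameters.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(map(str, target_cards)))
        subprocess.Popen(["python", model_file, "--params", params_path], env=env)
        return target_cards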
After the target model to be trained starts being trained on the target GPU resources, the training progress changes as the training time increases. It is therefore necessary to determine when the target GPU resources have completed training the target model to be trained, so that the training result can be returned to the relevant personnel in time once training is complete.
In order to achieve the above object, as an alternative implementation manner, as aforementioned method, after the step S105 trains the target model to be trained according to available GPU resources in the case that there are available GPU resources in the GPU resource pool that meet the GPU resource demand, the method further includes the following steps:
step S501, target load information corresponding to the target GPU resource is obtained according to a fixed period, wherein the target load information is used for indicating the completion degree of the target GPU resource for training the target model to be trained.
After the target GPU resource trains the target model to be trained, in order to determine whether the target GPU resource completes training the target model to be trained, target load information corresponding to the target GPU resource may be acquired according to a fixed period.
The fixed period may be a period in which the target load information is acquired, for example, once every minute, once every 10 minutes, or the like.
The target load information may be the completion degree of the target GPU resource to train the target model to be trained, and may be characterized by the remaining untrained progress or the completed training progress, for example, 10% remaining, or 90% completed, and so on.
For example, when the target load information is characterized by the remaining untrained progress, the target load information of the target GPU resources for training the target model to be trained, for example a remaining 5%, may be obtained every 1 minute.
Step S502, under the condition that the target load information indicates that the training of the target model to be trained is completed, the training result of the target model to be trained is obtained.
After the target load information is obtained, since it indicates the degree to which the target GPU resources have completed training the target model to be trained, the training result of the target model to be trained can be obtained when the target load information indicates that the training has been completed.
The training result may include, but is not limited to, the trained target model to be trained, a log file during the training process, a test result, and information indicating that the training of the target model to be trained has been completed.
For example, when the target load information is represented by the training-completed progress, and the target load information of the target GPU resource for training the target model to be trained is determined to be 100%, it is determined that the training of the target model to be trained is completed, and then the training result of the target model to be trained can be obtained.
Step S503, returning the training result to the target development environment according to the address information corresponding to the target model to be trained, wherein the address information is an address used for acquiring the target model to be trained from the target development environment.
After the training result is obtained, the training result can be returned to the developer, that is, the training result needs to be returned to the target development environment, so that the training result can be returned to the target development environment according to the address information corresponding to the target model to be trained.
Generally, when a target GPU resource acquires a target model to be trained, address information indicating the source of the target model to be trained is recorded. After the training result is obtained, the training result can then be returned to the target development environment according to this address information, so that a developer can obtain the training result by accessing the target development environment.
By the method in the embodiment, the training result can be returned to the target development environment as soon as possible after the training of the target model to be trained is completed, so that developers can obtain the training result in time.
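A fixed-period polling loop corresponding to steps S501 to S503 could be sketched as follows; the progress-query function, the result-collection helper, and the return callback are assumptions introduced only to make the flow concrete.

    import time

    def poll_and_return_result(get_progress, collect_result, return_to_dev_env,
                               period_seconds=60):
        """Sketch of steps S501-S503: periodically query the target load information,
        and once training is complete, return the training result to the target
        development environment. All three callables are illustrative assumptions."""
        while True:
            # S501: obtain target load information according to a fixed period
            # (characterized here as the completed training progress, 0-100%).
            progress = get_progress()
            if progress >= 100:
                # S502: training of the target model to be trained is complete.
                result = collect_result()   # trained model, logs, test results
                # S503: return the training result according to the address information
                # recorded when the model was acquired from the development environment.
                return_to_dev_env(result)
                break
            time.sleep(period_seconds)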
After the target model to be trained is trained through the target GPU resource and the training result is returned to the target development environment, the target GPU resource finishes the training of the target model to be trained, so that the GPU resource needs to be used for training other candidate models to be trained in time to improve the use efficiency of the GPU resource.
In order to achieve the above object, as an alternative implementation manner, as aforementioned method, after returning the training result to the target development environment according to the target identifier corresponding to the target model to be trained, the method further includes the following steps:
step S601, delete all data in the target GPU resource, and determine the state of the target GPU resource as available.
After the training of the target model to be trained is completed through the target GPU resources and the training result is returned to the target development environment, all data in the target GPU resources can be deleted in order to reuse the target GPU resources for training other candidate models to be trained.
All data in the target GPU resources may include, but is not limited to: model files, log files during training, test results, and the like.
After all data in the target GPU resource are deleted, in order to enable the GPU cluster to determine that the target GPU resource can be used for training, the state of the target GPU resource can be determined to be available, so that the GPU cluster can identify that the target GPU resource is in an idle state, and then other training tasks are executed through the target GPU resource.
By the method in the embodiment, the trained target GPU resource can be released in time, and the target GPU resource is guaranteed to be a GPU resource without garbage data when returned to the GPU resource pool as an available resource, so that GPU resource pollution and fragmentation are prevented, and efficient operation of the GPU resource is guaranteed.
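The release step can be sketched as a small cleanup routine; the per-card working directory and the state flags are assumptions made for illustration.

    import shutil

    def release_gpu_resource(gpu_pool, target_cards, workdir_for_card):
        """Sketch of step S601: delete all data left on the target GPU resources
        (model files, training logs, test results) and mark them available again.
        The per-card working-directory layout is an illustrative assumption."""
        for card in target_cards:
            # Remove model files, log files and intermediate data produced by training.
            shutil.rmtree(workdir_for_card(card), ignore_errors=True)
            # Mark the card as available so the cluster can schedule the next task on it.
            gpu_pool[card] = "idle"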
As an alternative embodiment, in the foregoing method, the step S103 of obtaining the model to be trained through the target development environment includes the following steps:
in step S701, a compilation operation of a target object is received through a target development environment.
After the target development environment is created, a compiling operation of the target object for model creation can be received through the target development environment.
The target object may be a person, such as a developer, an algorithm engineer, or the like, for performing a compilation operation in the target development environment.
For example, in the case where a notebook is installed in the target development environment, the compiling operation by the algorithm engineer may be received through url.
Step S702, obtaining a target model to be trained obtained after the target object completes the compiling operation.
After the target object completes the compiling operation, the model compiling is finished, and the target model to be trained obtained from the compiling operation can be acquired. The target model to be trained can then be ranked together with the other candidate models to be trained, and the training sequence corresponding to the target model to be trained can be determined, so that it can be trained according to that sequence later.
By the method in the embodiment, the compiling of the model to be trained of the target can be completed in the target development environment, so that the decoupling between the compiling operation and the GPU resource can be achieved, the time of the GPU resource for training can be increased, and the training efficiency of the model can be effectively improved.
An application example to which any of the foregoing embodiments is applied is provided as follows:
1. creating a target development environment
Firstly, a target development environment based on CPUs is created. The AI platform transmits a CPU development environment creation instruction (namely, development environment creation information) to a container cluster system; the instruction includes, but is not limited to, the amount of CPU and memory to use (namely, target CPU resource parameters), the algorithm engineer ID, the software packages to install at startup (namely, target software), the data set to download (namely, training parameters), and the like. The container cluster system constructs a CPU development environment container (i.e. a target development environment) according to the CPU development environment creation instruction and deploys it to the platform cluster. The container cluster returns the ID of the created development environment and the url used for connecting to it to the AI platform, and the algorithm engineer can log in to the CPU development environment container according to the url to develop code later.
The whole AI platform is implemented on a container cluster. A container is a lightweight virtual operating environment with resource isolation, which ensures that the CPU development environment containers of different algorithm projects are independent of one another. There are many container technologies, such as Docker, that can create services on physical machines. In general, when creating a CPU development environment container, the container cluster is given the amount of CPU and memory to use, the algorithm engineer ID, the software packages to install, and the data and instructions to download; Docker then creates a service carrying the software packages, called a container, on the specified amount of CPU and memory, and downloads the data set into the container according to the specified instructions. The installed software packages generally include a notebook, a piece of web-side programmable software that can be connected to the CPU development environment container through a url, as well as model development packages such as anaconda, python and tensorflow.
Furthermore, by creating a CPU development environment container through the method, an algorithm engineer can obtain a real-time programming environment which is intercommunicated with the cluster.
2. Submission training
When the algorithm engineer has developed the model, the target model to be trained developed in the notebook is submitted to the GPU time division multiplexing service (i.e. the service implementing the methods of steps S301 to S303). The tasks submitted at a certain moment are shown in fig. 4:
When training tasks T1, T2 and T3 exist, the GPU time division multiplexing service sorts them. Assuming the sorting result is as shown in fig. 4, training task T1 is placed at the front of the training queue. When the number of available GPU resources in the GPU resource pool meets the resource quantity required for training task T1 (i.e. its GPU resource demand), the algorithm file of task T1, the required data address and the training parameters are submitted to the GPU resource pool, and the task is trained on GPU resources matching the GPU resource demand (i.e. the target GPU resources).
The time division multiplexing service binds GPUs in the GPU resource pool according to the number of GPUs requested for training, which makes GPU selection more flexible. In the related art, for example, a development environment with a 2-card GPU is created; after the model has been developed, the algorithm engineer finds, when ready to train, that 2 GPU cards are not enough hardware support for the training written for the model, so the development environment can only be rebuilt. The invention uses CPUs to build the development environment. When the algorithm engineer really needs to train the model after developing it, the number of GPUs required is calculated from the compiled model; the model file, the data address, the training parameters and the number of GPUs are submitted to the time division multiplexing service, which sorts the existing training tasks and then selects GPU nodes in the computing resource pool to train the task at the head of the queue. The binding can be two cards, four cards and so on, which is more flexible.
The time division multiplexing relationship of the GPUs is shown in the lower part of fig. 4: in each fixed time period, a GPU runs one fixed training task, and in different time periods different training tasks can multiplex the same GPU, thereby improving the resource utilization rate of the GPU.
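The following small scheduler sketch illustrates this time division multiplexing: the head task of the queue runs on the idle cards, and once it finishes, the same cards are reused by the next task in a later time period. The task format, the slot-based timing and the simulated run times are assumptions made only for this illustration.

    from collections import deque

    gpu_cards = 4                                        # total cards in the GPU resource pool
    queue = deque([("T1", 4), ("T2", 2), ("T3", 2)])     # (task name, GPU demand)

    time_slot = 0
    running = []                                         # (task, cards, finish_slot)
    while queue or running:
        # Release the cards of tasks that finished in this time slot.
        finished = [r for r in running if r[2] <= time_slot]
        for task, cards, _ in finished:
            gpu_cards += cards
            print(f"slot {time_slot}: {task} finished, {cards} cards returned to pool")
        running = [r for r in running if r[2] > time_slot]

        # Start the head task if the pool currently has enough idle cards.
        if queue and queue[0][1] <= gpu_cards:
            task, demand = queue.popleft()
            gpu_cards -= demand
            running.append((task, demand, time_slot + 2))   # assume each task takes 2 slots
            print(f"slot {time_slot}: {task} starts on {demand} cards")
        time_slot += 1

Running the sketch shows T1 occupying all four cards first, and T2 and T3 reusing the released cards in later time slots, which is the multiplexing relationship described above.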
3. Returning the training result and releasing the GPU
The time division multiplexing service can check the state of each GPU card in real time, further obtain the load information of each GPU node (each node can comprise a plurality of GPU cards), and determine the completion degree of the training task on the corresponding target GPU resource according to the load information so as to ensure that the training result can be quickly returned to the corresponding notebook development environment after the target GPU resource completes the training task. In addition, a non-busy computing resource node (i.e., a node that includes available GPU resources) is determined, and the queued next task is sent to the non-busy node of the GPU resource pool for training of the next training task.
The returned training results include the model file (i.e., the file of the target model to be trained after training), the log files generated during training, and the test results; the training results need to be returned to the notebook after each training task finishes. The algorithm engineer can access the notebook via the url to obtain the training results.
The time division multiplexing service also has the function of cleaning up the garbage data generated by a training task. Intermediate training data is inevitably produced during training, and if this data remained on the GPU nodes, its gradual accumulation would cause GPU node failures. Therefore, after each model training finishes, the time division multiplexing service cleans the data on each GPU node, deleting the model files and data of completed runs, ensuring that the GPU resources in a GPU node carry no garbage data when returned to the resource pool, thereby guaranteeing the stable running of the GPU resources; the state of the GPU resources is then determined to be available.
Further, in addition to creating a development environment in the CPU server cluster, a local PC of an algorithm engineer or other hardware devices may be used; in that case the training task is still sent to a GPU management service, which orders the training tasks and then multiplexes GPU resources across different time periods to perform model training.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, an optical disk) and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods according to the embodiments of the present application.
As shown in fig. 5, according to another aspect of the embodiment of the present application, there is also provided a GPU resource-based model training apparatus, including:
a first obtaining module 1, configured to obtain development environment creation information;
the creating module 2 is used for creating information according to the development environment and creating a target development environment by adopting target CPU resources;
the second acquisition module 3 is used for acquiring a target model to be trained through a target development environment;
the determining module 4 is used for determining the GPU resource demand required by the target model to be trained;
and the training module 5 is used for training the target model to be trained according to the available GPU resources under the condition that the available GPU resources meeting the GPU resource demand exist in the GPU resource pool, wherein the available GPU resources are GPU resources which do not execute the training task.
It should be noted that the first obtaining module 1 in this embodiment may be configured to execute the step S101, the creating module 2 in this embodiment may be configured to execute the step S102, the second obtaining module 3 in this embodiment may be configured to execute the step S103, the determining module 4 in this embodiment may be configured to execute the step S104, and the training module 5 in this embodiment may be configured to execute the step S105.
As an alternative embodiment, the apparatus as described above, creates a module for:
responding to a creation instruction in the development environment creation information, and acquiring a target CPU resource in a CPU resource pool according to a target CPU resource parameter in the development environment creation information;
and installing target software in the development environment creation information in the target CPU resource to create and obtain a target development environment.
As an optional implementation manner, the apparatus as aforementioned, further includes a time division multiplexing module, configured to:
obtaining all candidate models to be trained currently, wherein all candidate models to be trained comprise target models to be trained;
sequencing all candidate models to be trained, and determining a training sequence corresponding to each candidate model to be trained;
and under the condition that the training sequence corresponding to the target model to be trained is first, jumping to the step of training the target model to be trained according to the available GPU resources under the condition that available GPU resources meeting the GPU resource demand of the target model to be trained exist in the GPU resource pool.
As an alternative embodiment, the apparatus as described above, the training module is configured to:
under the condition that available GPU resources meeting the GPU resource demand of a target model to be trained exist in a GPU resource pool, acquiring data addresses in development environment creation information, wherein the data addresses are addresses of training parameters for training the target model to be trained;
acquiring training parameters according to the data address;
selecting a target GPU resource consistent with the GPU resource demand from the available GPU resources;
and training the target model to be trained by adopting the target GPU resources and training parameters.
As an alternative embodiment, the apparatus as described above, the training module, is further configured to:
acquiring target load information corresponding to a target GPU resource according to a fixed period, wherein the target load information is used for indicating the completion degree of the target GPU resource for training a target model to be trained;
under the condition that the target load information indicates that the target model to be trained is trained, acquiring a training result of the target model to be trained;
and returning the training result to the target development environment according to the address information corresponding to the target model to be trained, wherein the address information is an address used for acquiring the target model to be trained from the target development environment.
As an optional implementation manner, the apparatus as in the foregoing further includes a deleting module, configured to:
deleting all data in the target GPU resource, and determining the state of the target GPU resource as available.
As an alternative implementation, in the foregoing apparatus, the second obtaining module is configured to:
receiving a compilation operation of a target object through a target development environment;
and obtaining a target model to be trained after the target object finishes the compiling operation.
It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiments. It should be noted that the modules as a part of the apparatus may run in a corresponding hardware environment, may be implemented by software, and may also be implemented by hardware, where the hardware environment includes a network environment.
According to another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the above method for model training based on GPU resources, where the electronic device may be a server, a terminal, or a combination thereof.
According to another embodiment of the present application, there is also provided an electronic device. As shown in fig. 6, the electronic device may include: a processor 1501, a communication interface 1502, a memory 1503 and a communication bus 1504, wherein the processor 1501, the communication interface 1502 and the memory 1503 communicate with each other through the communication bus 1504.
A memory 1503 for storing a computer program;
the processor 1501, when executing the program stored in the memory 1503, implements the following steps:
step S101, acquiring development environment creation information;
step S102, creating a target development environment by adopting target CPU resources according to the development environment creation information;
step S103, acquiring a target model to be trained through the target development environment;
step S104, determining the GPU resource demand required by the target model to be trained;
step S105, under the condition that available GPU resources meeting the GPU resource demand exist in the GPU resource pool, training the target model to be trained according to the available GPU resources, wherein the available GPU resources are GPU resources which do not execute the training task.
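For orientation only, a compact Python sketch of steps S101 to S105 is given below; create_dev_env, get_model and estimate_gpu_demand are assumptions made for illustration, and the GPU check reuses the idea of idle resources from the sketches above rather than any implementation disclosed here.

# Hypothetical walk-through of steps S101-S105; not the claimed implementation.
def create_dev_env(cpu_resource: dict, create_info: dict) -> dict:
    # S102: build a development environment on CPU resources only.
    return {"cpu": cpu_resource, "software": create_info.get("software", [])}

def get_model(dev_env: dict) -> dict:
    # S103: the model written by the user inside the development environment.
    return {"name": "demo-model"}

def estimate_gpu_demand(model: dict) -> int:
    # S104: e.g. the number of GPUs declared when the model is submitted.
    return 2

def handle_training_request(create_info: dict, gpu_pool: list):
    cpu_resource = create_info.get("cpu", {"cores": 8})   # S101: creation information
    dev_env = create_dev_env(cpu_resource, create_info)
    model = get_model(dev_env)
    demand = estimate_gpu_demand(model)
    idle = [g for g in gpu_pool if g.idle]                 # S105: GPUs with no training task
    if len(idle) >= demand:
        return idle[:demand]                               # train on these available GPUs
    return None                                            # wait until resources are free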
Alternatively, in this embodiment, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including but not limited to a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The embodiment of the present application further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the method steps of the above method embodiment are executed.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program code, such as a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that contributes in substance to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the method described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, and may also be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution provided in the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application, and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered to fall within the protection scope of the present application.

Claims (10)

1. A model training method based on GPU resources is characterized by comprising the following steps:
acquiring development environment creation information;
creating a target development environment by adopting target CPU resources according to the development environment creation information;
acquiring a target model to be trained through the target development environment;
determining the GPU resource demand required by the target model to be trained;
and under the condition that available GPU resources meeting the GPU resource demand exist in a GPU resource pool, training the target model to be trained according to the available GPU resources, wherein the available GPU resources are GPU resources which do not execute training tasks.
2. The method according to claim 1, wherein the creating a target development environment using target CPU resources according to the development environment creation information comprises:
responding to a creating instruction in the development environment creating information, and acquiring a target CPU resource in a CPU resource pool according to a target CPU resource parameter in the development environment creating information;
and creating to obtain the target development environment by installing the target software in the development environment creating information in the target CPU resource.
3. The method of claim 1, wherein in the case that there are available GPU resources in the GPU resource pool that meet the GPU resource demand, before training the target model to be trained according to the available GPU resources, the method further comprises:
obtaining all candidate models to be trained currently, wherein the candidate models to be trained comprise the target models to be trained;
sequencing all the candidate models to be trained, and determining a training sequence corresponding to each candidate model to be trained;
and under the condition that the training sequence corresponding to the target model to be trained is first, jumping to the step of training the target model to be trained according to the available GPU resources under the condition that available GPU resources meeting the GPU resource demand of the target model to be trained exist in the GPU resource pool.
4. The method of claim 1, wherein the training the target model to be trained according to available GPU resources when available GPU resources meeting the GPU resource demand exist in a GPU resource pool comprises:
under the condition that available GPU resources meeting the GPU resource demand of the target model to be trained exist in a GPU resource pool, acquiring data addresses in the development environment creation information, wherein the data addresses are addresses of training parameters for training the target model to be trained;
acquiring the training parameters according to the data address;
selecting a target GPU resource consistent with the GPU resource demand from the available GPU resources;
and training the target model to be trained by adopting the target GPU resources and the training parameters.
5. The method of claim 4, wherein after training the target model to be trained according to available GPU resources in the case that the available GPU resources meeting the GPU resource demand exist in the GPU resource pool, the method further comprises:
acquiring target load information corresponding to the target GPU resource according to a fixed period, wherein the target load information is used for indicating the completion degree of the target GPU resource for training the target model to be trained;
under the condition that the target load information indicates that the target model to be trained is trained, acquiring a training result of the target model to be trained;
and returning the training result to the target development environment according to address information corresponding to the target model to be trained, wherein the address information is an address for acquiring the target model to be trained from the target development environment.
6. The method according to claim 5, wherein after returning the training result to the target development environment according to the address information corresponding to the target model to be trained, the method further comprises:
deleting all data in the target GPU resource, and determining the state of the target GPU resource to be available.
7. The method of any one of claims 1 to 6, wherein the acquiring a target model to be trained through the target development environment comprises:
receiving a compilation operation of a target object through the target development environment;
and obtaining the target model to be trained after the target object completes the compiling operation.
8. A GPU resource-based model training device, comprising:
the first acquisition module is used for acquiring development environment creation information;
the creating module is used for creating a target development environment by adopting target CPU resources according to the development environment creating information;
the second acquisition module is used for acquiring a target model to be trained through the target development environment;
the determining module is used for determining the GPU resource demand required by the target model to be trained;
and the training module is used for training the target model to be trained according to the available GPU resources under the condition that the available GPU resources meeting the GPU resource demand exist in the GPU resource pool, wherein the available GPU resources are GPU resources which do not execute the training task.
9. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein said processor, said communication interface and said memory communicate with each other via said communication bus,
the memory for storing a computer program;
the processor for performing the method steps of any one of claims 1 to 7 by running the computer program stored on the memory.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method steps of any one of claims 1 to 7 when executed.
CN202210199265.3A 2022-03-02 2022-03-02 Model training method and device based on GPU (graphics processing Unit) resources, electronic equipment and storage medium Pending CN114595058A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210199265.3A CN114595058A (en) 2022-03-02 2022-03-02 Model training method and device based on GPU (graphics processing Unit) resources, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210199265.3A CN114595058A (en) 2022-03-02 2022-03-02 Model training method and device based on GPU (graphics processing Unit) resources, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114595058A (en) 2022-06-07

Family

ID=81815089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210199265.3A Pending CN114595058A (en) 2022-03-02 2022-03-02 Model training method and device based on GPU (graphics processing Unit) resources, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114595058A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024067404A1 (en) * 2022-09-27 2024-04-04 华为技术有限公司 Model training management method, apparatus and system
CN115601221A (en) * 2022-11-28 2023-01-13 苏州浪潮智能科技有限公司(Cn) Resource allocation method and device and artificial intelligence training system
CN115601221B (en) * 2022-11-28 2023-05-23 苏州浪潮智能科技有限公司 Resource allocation method and device and artificial intelligent training system

Similar Documents

Publication Publication Date Title
CN111104222A (en) Task processing method and device, computer equipment and storage medium
CN114595058A (en) Model training method and device based on GPU (graphics processing Unit) resources, electronic equipment and storage medium
CN108170612B (en) Automatic testing method and device and server
US20120323550A1 (en) System and method for system integration test (sit) planning
CN107423823B (en) R language-based machine learning modeling platform architecture design method
CN109491916A (en) A kind of test method of operating system, device, equipment, system and medium
CN106155806A (en) A kind of multi-task scheduling method and server
CN113157379A (en) Cluster node resource scheduling method and device
KR20210105378A (en) How the programming platform's user code works and the platform, node, device, medium
EP3698253A1 (en) System and method for managing program memory on a storage device
Lin et al. A model-based scalability optimization methodology for cloud applications
CN108459906A (en) A kind of dispatching method and device of VCPU threads
CN116483546B (en) Distributed training task scheduling method, device, equipment and storage medium
CN112788112A (en) Automatic publishing method, device and platform for equipment health management micro-service
CN110958138B (en) Container expansion method and device
CN112559124A (en) Model management system and target operation instruction processing method and device
CN110007946B (en) Method, device, equipment and medium for updating algorithm model
US11126541B2 (en) Managing resources used during a development pipeline
CN106951288B (en) Development and application method and device of hot upgrade resource
Heydarnoori et al. Towards an automated deployment planner for composition of web services as software components
CN114490371A (en) Data testing method, device, testing equipment and medium based on artificial intelligence
CN114020414A (en) Symbiotic method and device of Android system and bottom layer Linux, electronic equipment and storage medium
CN113591279A (en) Method, device, equipment and storage medium for on-line modeling simulation
CN115878121A (en) Terminal code increment compiling method, system, device, server and storage medium
CN112988383A (en) Resource allocation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination