CN111126604B - Model training method, device, server and storage medium - Google Patents

Model training method, device, server and storage medium

Info

Publication number
CN111126604B
CN111126604B (Application CN201911416544.5A)
Authority
CN
China
Prior art keywords
model
training
client
distributed storage
trained
Prior art date
Legal status
Active
Application number
CN201911416544.5A
Other languages
Chinese (zh)
Other versions
CN111126604A (en)
Inventor
张俊钦
周海维
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201911416544.5A
Publication of CN111126604A
Application granted
Publication of CN111126604B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0629 Configuration or reconfiguration of storage systems
    • G06F 3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5077 Logical partitioning of resources; Management or configuration of virtualized resources

Abstract

Embodiments of the invention provide a model training method, a device, a server and a storage medium. The method comprises: receiving a training request sent by a client; allocating a target number of graphics processors and creating a container corresponding to the client; acquiring training reference information of a model from a distributed storage system through the distributed storage volume corresponding to the client, and loading the training reference information into the container corresponding to the client; generating training environment data based on the training reference information of the model; training the model with the generated training environment data and the target number of graphics processors, and storing the generated training environment data in the distributed storage system through the distributed storage volume corresponding to the client. In this way, the training environment data required for training the model is generated quickly, the efficiency of configuring the data required for model training is improved, the time spent on that configuration is reduced, and each training run of the model can start quickly.

Description

Model training method, device, server and storage medium
Technical Field
The present invention relates to the field of distributed technologies, and in particular, to a model training method, a device, a server, and a storage medium.
Background
When training a model such as a deep learning model, graphics processors must be allocated and the data required for the training must be configured. At present, this is generally done as follows: in each training run, every type of data required for training the model is configured manually, one item at a time, by an engineer participating in the training, for example by entering data or selecting data.
However, each training run of a model usually correlates strongly with the previous one: the operating system, the training framework and similar data used in the current training are typically the same as those used in the last training. Even so, the engineer participating in the training still has to reconfigure these items manually, one by one, for every training run. Configuring the data required for training is therefore tedious, the configuration efficiency is low, the configuration takes a long time, and every training run has to wait a long time before the training of the model can actually start.
Disclosure of Invention
Embodiments of the invention aim to provide a model training method, a device, a server and a storage medium, so as to improve the efficiency of configuring the data required for model training and reduce the time spent on that configuration. The specific technical solutions are as follows:
In a first aspect of the present invention, there is provided a model training method, including:
receiving a training request sent by a client, wherein the training request includes: a target number of graphics processors for training a model, and a name of a distributed storage volume corresponding to the client;
allocating the target number of graphics processors and creating a container corresponding to the client;
acquiring training reference information of the model stored in a distributed storage system through the distributed storage volume corresponding to the client, and loading the training reference information of the model into the container corresponding to the client, wherein the training reference information of the model is the training environment data used in the last training of the model, or preset training environment data;
generating, in the container corresponding to the client and based on the training reference information of the model, the training environment data to be used in the current training of the model;
training the model with the training environment data used in the current training and the target number of graphics processors, and storing that training environment data in the distributed storage system through the distributed storage volume corresponding to the client.
In a second aspect of the present invention, there is also provided a model training apparatus, including:
a receiving unit configured to receive a training request sent by a client, the training request including: a target number of graphics processors for training a model, and a name of a distributed storage volume corresponding to the client;
a first processing unit configured to allocate the target number of graphics processors and create a container corresponding to the client;
a second processing unit configured to acquire training reference information of the model stored in a distributed storage system through the distributed storage volume corresponding to the client, and load the training reference information of the model into the container corresponding to the client, wherein the training reference information of the model is the training environment data used in the last training of the model, or preset training environment data;
a generating unit configured to generate, in the container corresponding to the client and based on the training reference information of the model, the training environment data to be used in the current training of the model;
a training unit configured to train the model with the training environment data used in the current training and the target number of graphics processors, and to store that training environment data in the distributed storage system through the distributed storage volume corresponding to the client.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the invention there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
According to the model training method provided by the embodiments of the invention, a training request sent by a client is received, the training request including the target number of graphics processors for training the model and the name of the distributed storage volume corresponding to the client; the target number of graphics processors is allocated and a container corresponding to the client is created; the training reference information of the model stored in the distributed storage system is acquired through the distributed storage volume corresponding to the client and loaded into the container corresponding to the client; based on the training reference information of the model, the training environment data to be used in the current training is generated in the container corresponding to the client; the model is trained with that training environment data and the target number of graphics processors, and the training environment data is stored in the distributed storage system through the distributed storage volume corresponding to the client. In every training run, the training reference information of the model can be obtained from the distributed storage volume corresponding to the client and the training environment data required for the training can be generated from it quickly, so that the various items of information involved in training the model no longer have to be configured one by one for every run. This improves the efficiency of configuring the data required for model training, reduces the time spent on that configuration, and allows each training run of the model to start quickly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of one embodiment of a model training method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of training a model;
FIG. 3 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of a server suitable for implementing the model training method provided in the embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
Step 101, receiving a training request sent by a client.
In the present invention, for a given model, steps 101 to 105 may be performed by a server used for training the model each time the model is trained.
In the present invention, for a given model, a training request sent by a client may be received each time the model is trained. The training request sent by the client includes: the target number of graphics processors (Graphics Processing Unit, GPU) for training the model, and the name of the distributed storage volume corresponding to the client.
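For illustration only, the training request described above can be represented as a small record; the field names below are assumptions chosen for readability and are not prescribed by the patent. A minimal sketch in Python:

```python
from dataclasses import dataclass

@dataclass
class TrainingRequest:
    """Training request sent by a client (field names are illustrative assumptions)."""
    gpu_target_count: int        # target number of GPUs for training the model
    storage_volume_name: str     # name of the distributed storage volume corresponding to the client

# Example: a client asks for 4 GPUs and names its pre-allocated volume.
request = TrainingRequest(gpu_target_count=4, storage_volume_name="client-42-vol")
```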
In the present invention, for a client, the distributed storage volume corresponding to the client is a distributed storage volume in a distributed storage system that is allocated in advance by the distributed storage system for the client.
In some embodiments, for a client, before first receiving a training request sent by the client, the method further includes: receiving a request for applying for a distributed storage volume sent by the client, wherein the request for applying for the distributed storage volume comprises: the names of the distributed storage volumes corresponding to the clients to be distributed; sending a distributed storage volume allocation request including the name to the distributed storage system; receiving allocation indication information sent by a distributed storage system, wherein the allocation indication information indicates that the distributed storage system has allocated a distributed storage volume corresponding to the client with the name; and sending the allocation indication information to the client.
In the present invention, for a given client, a distributed storage volume application request may be generated on the client before the client sends a training request for the first time. The application request includes the name of the distributed storage volume to be allocated to the client. The server used for training the model receives this application request and then sends the distributed storage system an allocation request containing that name, whereupon the distributed storage system allocates a distributed storage volume with that name to the client. The server then receives allocation indication information from the distributed storage system, indicating that the distributed storage volume with that name has been allocated, and forwards the allocation indication information to the client. From the allocation indication information the client can determine that the distributed storage system has allocated it a distributed storage volume with that name, and in every subsequent training of the model the client can send the server a training request containing the name of its distributed storage volume.
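A minimal sketch of the pre-allocation handshake described above, assuming hypothetical `request`, `storage_system` and `client` objects that stand in for the client's application request, the distributed storage system API and the client connection; none of these names or calls come from the patent text:

```python
def handle_volume_application(request: dict, storage_system, client) -> dict:
    """Sketch of the distributed storage volume application flow (illustrative only)."""
    volume_name = request["volume_name"]              # name of the volume to be allocated
    # Forward an allocation request containing the name to the distributed storage system.
    storage_system.allocate_volume(volume_name)
    # Build the allocation indication and relay it to the client.
    allocation_indication = {"volume_name": volume_name, "allocated": True}
    client.send(allocation_indication)
    return allocation_indication
```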
Step 102, a target number of graphics processors are allocated, and a container corresponding to the client is created.
In the invention, for a client and a model, after receiving a training request sent by the client, a target number of graphics processors for training the model can be allocated according to the target number in the training request. At the same time, a container corresponding to the client may be created.
In the present invention, the container corresponding to the client may be a container provided by Docker, an open-source application container engine.
In the present invention, the container corresponding to the client can provide an operating environment for the model, the deep learning framework, the driver of the deep learning framework, and so on. When the model is trained, the deep learning framework and the driver of the deep learning framework can both run in the container corresponding to the client.
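By way of example, the container creation step could be realized with the Docker command line; the sketch below is an assumption rather than the patented implementation, requires Docker with the NVIDIA Container Toolkit for the `--gpus` flag, and uses placeholder image names and mount paths:

```python
import subprocess

def create_client_container(client_id: str, gpu_ids: list[int], volume_mount: str) -> str:
    """Create a container for the client with the allocated GPUs attached (sketch).

    Image name, container name, and mount paths are illustrative placeholders;
    the patent does not prescribe them.
    """
    devices = ",".join(str(i) for i in gpu_ids)
    cmd = [
        "docker", "run", "-d",
        "--name", f"train-{client_id}",
        "--gpus", f'"device={devices}"',            # attach only the allocated GPUs
        "-v", f"{volume_mount}:/mnt/train-volume",  # expose the client's distributed storage volume
        "training-base:latest",                     # placeholder base image
        "sleep", "infinity",                        # keep the container alive until training starts
    ]
    return subprocess.check_output(cmd, text=True).strip()  # returns the container ID
```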
Step 103, acquiring the training reference information of the model stored in the distributed storage system by using the distributed storage volume corresponding to the client, and loading the training reference information of the model into the container corresponding to the client.
In the invention, for a client and a model, the distributed storage volume corresponding to the client can be determined according to the name in the training request. After a target number of graphics processors for training the model are allocated and a container corresponding to the client is created, the training reference information of the model stored in the distributed storage system can be acquired by using the distributed storage volume corresponding to the client, and the training reference information of the model is loaded into the container corresponding to the client.
In the present invention, the training reference information for the model may be stored in one or more servers in a distributed storage cluster in a distributed storage system.
In the invention, for a client and a model, when the client is used for training the model each time, after training reference information of the model is acquired from a distributed storage system by using a distributed storage volume corresponding to the client, the acquired training reference information of the model can be loaded into a container corresponding to the client.
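As a sketch of this loading step, one could keep the training reference information as files under the mounted distributed storage volume and copy them into the container's working directory. The JSON manifest, directory layout and paths below are assumptions made purely for illustration; the patent does not specify an on-disk format:

```python
import json
import shutil
from pathlib import Path

def load_training_reference(volume_mount: str, container_root: str) -> dict:
    """Load the model's training reference information into the client's container (sketch)."""
    manifest_path = Path(volume_mount) / "training_reference.json"
    reference = json.loads(manifest_path.read_text())   # e.g. framework, driver, parameters, samples
    target = Path(container_root) / "training_reference"
    # Make the reference data visible inside the container's filesystem.
    shutil.copytree(Path(volume_mount), target, dirs_exist_ok=True)
    return reference
```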
In the present invention, for a given client and model, each time the client is used to train the model, the training reference information of the model is either the training environment data used in the last training of the model or preset training environment data.
In the present invention, for a given client and model, when the client is used to train the model for the first time there is no previous training run, so the training reference information of the model obtained through the distributed storage volume corresponding to the client is the preset training environment data.
In the invention, the preset training environment data is preset information for training the model. For example, for a client and a model, the preset training environment data of the model acquired by the distributed storage volume corresponding to the client includes: an operating system, a deep learning framework, a driver of the deep learning framework, and the like.
In the invention, for a client and a model, when the model is trained for any time after the model is trained by the client for the first time, training reference information of the model, which is acquired by a distributed storage volume corresponding to the client, is training environment data adopted when the model is trained for the last time. For example, training environment data employed in the last training of the model may include: an operating system used when the model is trained last time, a deep learning frame used when the model is trained last time, a driver of the deep learning frame used when the model is trained last time, and the like.
In the present invention, for a given client and model, after each training of the model performed with the client, the training environment data used in that training can serve, in the next training, as the training reference information of the model obtained through the distributed storage volume corresponding to the client.
In the present invention, the preset training environment data may include, but is not limited to: a preset operating system, a preset training framework, parameters of the preset training framework, a driver of the preset training framework, and preset training samples.
In the present invention, the training environment data used in the last training of the model may include, but is not limited to: the operating system used in the last training of the model, the training framework used in the last training, the parameters of that training framework, the driver of that training framework, and the training samples used in the last training.
In the present invention, the model may be a deep learning model. The training framework may be a deep learning framework such as TensorFlow or Caffe (Convolutional Architecture for Fast Feature Embedding). The driver of the training framework may be a driver of the deep learning framework, the parameters of the training framework may be parameters of the deep learning framework, and the training samples may be training samples of the deep learning framework.
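The items enumerated above can be grouped into a single record of training environment data. The sketch below uses assumed field names and example values; it is not a definition taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class TrainingEnvironmentData:
    """Items making up the training environment data (field names are assumptions)."""
    operating_system: str                                  # e.g. an OS image identifier
    framework: str                                         # deep learning framework, e.g. "tensorflow" or "caffe"
    framework_driver: str                                  # driver of the training framework
    framework_params: dict = field(default_factory=dict)   # parameters of the training framework
    training_samples: list = field(default_factory=list)   # paths or identifiers of training samples

# Example of preset training environment data used for the very first training run.
preset_env = TrainingEnvironmentData(
    operating_system="ubuntu-18.04",
    framework="tensorflow",
    framework_driver="cuda-10.1",
    framework_params={"batch_size": 32},
)
```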
Step 104, generating, in the container corresponding to the client and based on the training reference information of the model, the training environment data to be used in the current training of the model.
In the present invention, for a given client and model, when the training environment data to be used in the current training is generated in the container corresponding to the client based on the training reference information of the model, the following applies to each item in the training reference information already loaded in the container: if the item of the same type to be used in the current training differs from that item, the item already loaded in the container is replaced with the item of the same type to be used in the current training. The training environment data used in the current training is obtained in this way and consists of the items of the training reference information that were not replaced, together with the items that replaced them. The training environment data used in the current training is likewise loaded in the container corresponding to the client.
For example, for a given client and model, when the client is used to train the model at any time after its first training, the training reference information of the model acquired from the distributed storage system through the distributed storage volume corresponding to the client is the training environment data used in the last training. That training environment data includes, among other things, the deep learning framework used in the last training and the parameters of that framework.
When the training environment data to be used in the current training is generated in the container corresponding to the client based on the training reference information of the model, configuration information sent by the client can be received first. The configuration information may include, for example, the parameters of the deep learning framework that the user of the client has selected for the current training. Then, according to the configuration information, some items of the training reference information already loaded in the container corresponding to the client are replaced with items from the configuration information, yielding the training environment data used in the current training.
If the deep learning framework to be used in the current training differs from the one used in the last training, the deep learning framework used in the last training, which is already loaded in the container corresponding to the client, can be deleted, and the deep learning framework to be used in the current training is loaded into the container instead, thereby replacing the old framework with the new one. The generated training environment data for the current training then includes the deep learning framework used in the current training.
If the same deep learning framework as in the last training is used in the current training, but with different parameters, the parameters of the deep learning framework used in the last training, which are already loaded in the container corresponding to the client, can be deleted, and the parameters for the current training are loaded into the container instead, thereby replacing the old parameters with the new ones. The generated training environment data for the current training then includes the parameters of the deep learning framework used in the current training.
In the present invention, for a given client and model, when the client is used to train the model for the first time, the training reference information of the model acquired from the distributed storage system through the distributed storage volume corresponding to the client is the preset training environment data. When the training environment data to be used in the current training is generated in the container corresponding to the client based on this training reference information, each item of the preset training environment data already loaded in the container is replaced with the item of the same type to be used in the current training whenever the two differ, thereby obtaining the training environment data used in the current training.
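The replacement logic described in the preceding paragraphs amounts to overlaying the items chosen for the current training onto the loaded reference information. A minimal sketch, using plain dictionaries purely for illustration:

```python
def generate_current_environment(reference: dict, current_items: dict) -> dict:
    """Overlay the items chosen for the current training onto the loaded reference info.

    `reference` is the training reference information already loaded in the client's
    container; `current_items` holds only the items selected for this training run.
    Both are plain dicts here as an assumption, not a format defined by the patent.
    """
    environment = dict(reference)          # start from the unreplaced items
    for key, value in current_items.items():
        if reference.get(key) != value:    # replace only items of the same type that differ
            environment[key] = value
    return environment

# Example: keep last time's OS and framework, but use new framework parameters.
last_run = {"os": "ubuntu-18.04", "framework": "tensorflow", "params": {"lr": 0.01}}
this_run = generate_current_environment(last_run, {"params": {"lr": 0.001}})
```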
In some embodiments, generating the training environment data used in the current training of the model based on the training reference information of the model includes: receiving a configuration request sent by the client, the configuration request indicating that a configuration operation is to be performed for at least one item of the training reference information of the model; and performing the configuration operation for the at least one item of the training reference information, thereby generating, in the container corresponding to the client, the training environment data used in the current training of the model.
In the present invention, for a given client and model, when the training environment data used in the current training is generated based on the training reference information of the model, each item of the training reference information can be displayed in a training environment data configuration interface on the client. The user of the client can perform configuration indication operations on the client based on the displayed items, and the client generates a configuration request based on those operations. The configuration request sent by the client can then be received, and the configuration operation for at least one item of the training reference information of the model is performed, so as to generate the training environment data used in the current training.
In the present invention, after receiving the configuration request that the client generates on the basis of the user's configuration indication operations, the operating system, the training framework, the driver of the training framework, the parameters of the training framework, the training samples and other items to be used in the current training of the model can all be configured flexibly, so that many different training requirements can be met.
For example, when the client is used to train the model at any time after its first training, the training reference information obtained through the distributed storage volume corresponding to the client is the training environment data used in the last training of the model, which includes the operating system, the training framework, the driver of the training framework, the parameters of the training framework and the training samples used in that training.
Each item of the training environment data used in the last training can be presented in a training environment data configuration interface on the client. At the same time, identifiers of operating systems other than the one used in the last training, identifiers of training frameworks other than the one used in the last training, and identifiers of drivers other than the driver of the training framework used in the last training can also be displayed in the configuration interface. The user of the client then performs configuration indication operations based on the displayed information.
If the model is trained by an operating system different from the operating system used in the previous training of the model, the user of the client can select the configuration indication operation of the identifier of the operating system different from the operating system used in the previous training of the model. The configuration request generated at the client indicates to perform a configuration operation for an operating system used when the model was last trained, and the configuration operation for the operating system used when the model was last trained includes using an operating system different from the operating system used when the model was last trained as the operating system used when the model was currently trained. After the configuration operation is executed, the generated training environment data adopted when the model is trained at this time comprises the following steps: the operation system adopted when the user of the client side trains the model at this time.
If the model is trained by adopting a training frame different from the training frame adopted in the previous training of the model, the user of the client can select the configuration indication operation of the identification of the training frame different from the training frame adopted in the previous training of the model. The configuration request generated at the client indicates to perform a configuration operation for a training frame used when the model is trained last time, and the configuration operation for the training frame used when the model is trained last time includes taking a training frame different from the training frame used when the model is trained last time as the training frame used when the model is trained this time. After the configuration operation is executed, the generated training environment data adopted when the model is trained at this time comprises the following steps: the training framework adopted when the user of the client selects the model to train at this time.
If a training framework different from the one used in the last training of the model is used in the current training, the user of the client can perform a configuration indication operation of entering the parameters of the training framework to be used in the current training. The configuration request generated at the client then indicates that a configuration operation is to be performed for the parameters of the training framework used in the last training, the operation consisting of taking the parameters entered by the user as the parameters of the training framework used in the current training. After this configuration operation is performed, the generated training environment data for the current training includes the parameters, entered by the user of the client, of the training framework used in the current training.
If the training framework used in the last training is kept for the current training, but with parameters different from those used last time, the user of the client can perform a configuration indication operation of entering parameters different from those of the training framework used in the last training. The configuration request generated at the client then indicates that a configuration operation is to be performed for the parameters of the training framework used in the last training, the operation consisting of taking the parameters entered by the user as the parameters of the training framework used in the current training. After this configuration operation is performed, the generated training environment data for the current training includes the training framework used in the last training together with the parameters entered by the user, which differ from the parameters used last time.
If a training framework different from the one used in the last training is used in the current training, the user of the client can perform a configuration indication operation of selecting the identifier of the driver of the training framework to be used in the current training. The configuration request generated at the client then indicates that a configuration operation is to be performed for the driver of the training framework used in the last training, the operation consisting of taking the driver selected by the user as the driver of the training framework used in the current training. After this configuration operation is performed, the generated training environment data for the current training includes the driver, selected by the user of the client, of the training framework used in the current training.
If the training framework adopted when the model is trained last time is adopted when the model is trained, but a different driving program is adopted from the driving program of the training framework adopted when the model is trained last time, the user of the client can perform configuration indication operation of selecting the identifier of the driving program different from the driving program of the training framework adopted when the model is trained last time. The configuration request generated at the client indicates to perform configuration operations for drivers of the training framework employed when training the model last time, the configuration operations for drivers of the training framework employed when training the model last time including: and the driver of the training framework selected by the user is used for training the model this time. After the configuration operation is executed, the generated training environment data adopted when the model is trained at this time comprises the following steps: the driver of the training framework adopted when the user of the client side trains the model at this time.
If a new training sample needs to be added when the model is trained at this time, a user of the client can perform configuration indication operation of indicating to add the training sample. The configuration request generated at the client indicates to perform a configuration operation for a training sample employed when the model was last trained, the configuration operation for the training sample employed when the model was last trained comprising: the configuration operation of adding new training samples in addition to the training samples in the training reference information. After the configuration operation is executed, the generated training environment data adopted when the model is trained at this time comprises the following steps: training samples used in the last training of the model, and new training samples added to the training samples used in the last training of the model.
In some embodiments, the configuration request sent by the client is sent over a secure shell protocol connection, and before receiving the configuration request sent by the client, the method further comprises: receiving a connection establishment request sent by the client, wherein the connection establishment request includes a secure shell protocol key; judging whether the secure shell protocol key in the connection establishment request matches the preset secure shell protocol key of the client loaded in the container corresponding to the client; and if so, establishing a secure shell protocol connection with the client.
In the present invention, the client may send configuration requests to the server used for training the model over an ssh (Secure Shell) connection. For a given client, before receiving any configuration request from the client, the server may allocate a preset ssh key to the client in advance and load this preset key into the container corresponding to the client. Before sending a configuration request, the client sends the server a connection establishment request that includes an ssh key. The server judges whether the ssh key in the connection establishment request matches the preset ssh key of the client loaded in the container corresponding to the client. If they match, the server establishes the ssh connection with the client, and the client can then send configuration requests to the server over that connection.
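A minimal sketch of the key check described above, assuming (purely for illustration) that the preset secure shell protocol key was loaded into the container as a file at a fixed path; the path and key encoding are not specified by the patent:

```python
import hmac
from pathlib import Path

def accept_ssh_connection(request_key: bytes, container_root: str) -> bool:
    """Compare the key in the connection establishment request with the preset key.

    Returns True when the keys match, i.e. when the ssh connection may be established.
    The storage location of the preset key is an illustrative assumption.
    """
    preset_key = Path(container_root, "ssh", "preset_key").read_bytes()
    # Constant-time comparison; establish the connection only on a match.
    return hmac.compare_digest(request_key, preset_key)
```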
Step 105, training the model by using the training environment data used in the current training and the target number of graphics processors, and storing that training environment data in the distributed storage system by using the distributed storage volume corresponding to the client.
In the present invention, when the model is trained using the training environment data for the current training loaded in the container corresponding to the client and the target number of graphics processors, the target number of graphics processors can read the data required for the training task from the container corresponding to the client, execute the training task of the model, and complete the training.
In the present invention, for a given client and model, after the training environment data for the current training has been generated and loaded in the container corresponding to the client, it can be written into the distributed storage volume corresponding to the client, so that it is stored on a server that stores training environment data in a distributed storage cluster of the distributed storage system. In this way, the distributed storage volume corresponding to the client is used to store, in the distributed storage system, the training environment data used in the current training and loaded in the container corresponding to the client. In the next training of the model, this training environment data serves as the training reference information loaded into the container corresponding to the client.
In some embodiments, the method further comprises: switching the path of the root directory of the container corresponding to the client to the path of a target directory in the distributed storage volume corresponding to the client, wherein the target directory is a directory used for writing and reading training environment data.
In the present invention, when the container corresponding to the client is started, the chroot (Change Root) command can be used to switch the path of the container's root directory to the path of the target directory, used for writing and reading training environment data, in the distributed storage volume corresponding to the client. As a result, whenever the training environment data for the current training is generated in the container, it is automatically synchronized to a server that stores training environment data in a distributed storage cluster of the distributed storage system; in other words, the distributed storage volume corresponding to the client is used to store that training environment data in the distributed storage system. By switching the path of the container's root directory in this way, the training environment data generated for each training run is stored in the distributed storage system automatically and in a timely manner.
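A sketch of this root-directory switch at container startup, assuming the volume's read/write target directory is mounted at an illustrative path; the call must run as root, and the path is an assumption:

```python
import os

def switch_root_to_volume(target_dir: str) -> None:
    """Switch the container's root directory to the target directory of the
    client's distributed storage volume (the effect of the chroot command)."""
    os.chdir(target_dir)   # make the working directory the new root first
    os.chroot(target_dir)  # from now on, all paths resolve inside the volume
    os.chdir("/")          # reset the working directory relative to the new root

# Example (illustrative mount point): switch_root_to_volume("/mnt/train-volume")
```

After this switch, anything the training process writes under "/" lands in the volume's target directory and is therefore persisted by the distributed storage system.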
Referring to fig. 2, a schematic diagram of training a model is shown.
Before the client sends a training request to a server in the GPU and container scheduling system for the first time, the server in the GPU and container scheduling system may send a distributed storage volume allocation request to the distributed storage system, including the name of the distributed storage volume corresponding to the client. After receiving the distributed storage volume allocation request, the distributed storage system allocates a distributed storage volume corresponding to the client with the name to the client.
Each time the client is used to train a model, the client sends the GPU and container scheduling system a training request that includes the target number of graphics processors for training the model and the name of the distributed storage volume corresponding to the client.
A server used for training models in the GPU and container scheduling system may allocate the target number of GPUs and, at the same time, create a container corresponding to the client. When the container corresponding to the client is started, the GPU and container scheduling system uses the chroot command to switch the path of the container's root directory to the path of the directory in the distributed storage volume that is used for writing and reading the training environment data for training the model.
The server for training the model may determine the distributed storage volume corresponding to the client based on the name of the distributed storage volume corresponding to the client. The server for training the model acquires training reference information of the model stored in the distributed storage system by using the distributed storage volume corresponding to the client, and loads the training reference information of the model into a container corresponding to the client. When the model is trained for the first time by using the client, training reference information of the model obtained from the distributed storage system by using the distributed storage volume corresponding to the client is preset training environment data. When the model is trained by the client at any time after the model is trained for the first time, training reference information of the model, which is acquired from the distributed storage system by the distributed storage volume corresponding to the client, is training environment data adopted when the model is trained for the last time.
Each time the client is used to train the model, the client may send configuration requests to the server used for training the model over the ssh connection with that server, so that the training environment data for the current training is generated. Once this training environment data has been generated, the model can be trained using it together with the target number of graphics processors.
Because the path of the root directory of the container corresponding to the client has been switched to the path of the target directory in the distributed storage volume used for writing and reading training environment data, the training environment data used in the container for each training of the model can be automatically synchronized to a server that stores training environment data in a distributed storage cluster of the distributed storage system, where it serves as the training reference information to be loaded into the container corresponding to the client in the next training of the model.
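Pulling the steps of FIG. 2 together, one possible server-side orchestration of a single training run is sketched below. Every object and method used here (`request`, `storage_system`, `scheduler` and their calls) is a hypothetical stand-in introduced only for illustration, not an API defined by the patent:

```python
def run_training(request: dict, storage_system, scheduler) -> None:
    """End-to-end sketch of one training run, following the flow around FIG. 2."""
    gpu_ids = scheduler.allocate_gpus(request["gpu_target_count"])        # step 102: allocate GPUs
    volume = storage_system.mount(request["storage_volume_name"])         # locate the client's volume
    container = scheduler.create_container(request["client_id"], gpu_ids, volume)  # step 102: create container
    container.chroot(volume.target_dir)                                   # switch root to the volume's target dir
    reference = container.load(volume.read_training_reference())          # step 103: load reference info
    config = container.receive_configuration()                            # configuration request over ssh
    environment = {**reference, **config}                                 # step 104: generate current env data
    container.train(environment, gpu_ids)                                 # step 105: train the model
    # Writes under the chroot'ed root land in the volume, so the environment is
    # persisted automatically as the next run's training reference information.
```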
FIG. 3 is a schematic structural diagram of a model training apparatus according to an embodiment of the invention. For the specific implementation of the operations performed by the units or sub-units of the model training apparatus, reference may be made to the corresponding operations described in the method embodiments above.
As shown in fig. 3, the model training apparatus provided in the embodiment of the present invention includes: a receiving unit 301, a first processing unit 302, a second processing unit 303, a generating unit 304, and a training unit 305.
The receiving unit 301 is configured to receive a training request sent by a client, the training request including: a target number of graphics processors for training a model, and a name of a distributed storage volume corresponding to the client;
the first processing unit 302 is configured to allocate the target number of graphics processors and create a container corresponding to the client;
the second processing unit 303 is configured to acquire training reference information of the model stored in the distributed storage system through the distributed storage volume corresponding to the client, and load the training reference information of the model into the container corresponding to the client, wherein the training reference information of the model is the training environment data used in the last training of the model, or preset training environment data;
the generating unit 304 is configured to generate, in the container corresponding to the client and based on the training reference information of the model, the training environment data to be used in the current training of the model;
the training unit 305 is configured to train the model with the training environment data used in the current training and the target number of graphics processors, and to store that training environment data in the distributed storage system through the distributed storage volume corresponding to the client.
In some embodiments, the model training apparatus further comprises: the allocation unit is configured to receive a distributed storage volume application request sent by the client before receiving the training request sent by the client, where the distributed storage volume application request includes: the names of the distributed storage volumes corresponding to the clients to be distributed; sending a distributed storage volume allocation request including the name to a distributed storage system; receiving allocation indication information sent by a distributed storage system, wherein the allocation indication information indicates that the distributed storage system has allocated a distributed storage volume corresponding to the client with the name; and sending the allocation indication information to the client.
In some embodiments, the generation unit 304 includes:
a configuration operation execution module configured to receive a configuration request sent by the client, the configuration request indicating to execute a configuration operation for at least one item of training reference information of the model; and executing configuration operation of at least one item of training reference information aiming at the model to generate training environment data adopted when the model is trained at the time when the training environment data is loaded in a container corresponding to the client.
In some embodiments, the model training apparatus further comprises: a connection establishment unit configured to receive a connection establishment request sent by the client before receiving a configuration request sent by the client, where the connection establishment request includes: secure shell protocol keys; judging whether the secure shell protocol key is matched with a preset secure shell protocol key of the client loaded in a container corresponding to the client; if yes, establishing a secure shell protocol connection with the client so that the client can send a configuration request by using the secure shell protocol connection.
In some embodiments, the model training apparatus further comprises:
And the directory switching unit is configured to switch the path of the root directory of the container corresponding to the client to the path of the target directory in the distributed storage volume corresponding to the client, wherein the target directory is a directory for writing and reading training environment data.
The embodiment of the present invention further provides a server, as shown in FIG. 4, including a processor 401, a communication interface 402, a memory 403 and a communication bus 404, where the processor 401, the communication interface 402 and the memory 403 communicate with one another over the communication bus 404;
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
receiving a training request sent by a client, wherein the training request includes: a target number of graphics processors for training a model, and a name of a distributed storage volume corresponding to the client;
allocating the target number of graphics processors and creating a container corresponding to the client;
acquiring training reference information of the model stored in a distributed storage system through the distributed storage volume corresponding to the client, and loading the training reference information of the model into the container corresponding to the client, wherein the training reference information of the model is the training environment data used in the last training of the model, or preset training environment data;
generating, in the container corresponding to the client and based on the training reference information of the model, the training environment data to be used in the current training of the model;
training the model with the training environment data used in the current training and the target number of graphics processors, and storing that training environment data in the distributed storage system through the distributed storage volume corresponding to the client.
The communication bus mentioned for the server may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the server and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP); it may also be a digital signal processor (Digital Signal Processing, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the method of any of the above embodiments is also provided.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, reference may be made between the embodiments, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A method of model training, the method comprising:
receiving a training request sent by a client, where the training request includes: a target number of graphics processors for training the model and a name of the distributed storage volume corresponding to the client;
allocating the target number of graphics processors and creating a container corresponding to the client;
acquiring training reference information of the model stored in a distributed storage system by using the distributed storage volume corresponding to the client, and loading the training reference information into the container corresponding to the client, where the training reference information is either the training environment data used the last time the model was trained or preset training environment data; the training environment data includes: the operating system used the last time the model was trained, the deep learning framework used the last time the model was trained, and the driver of that deep learning framework; the preset training environment data includes: a preset operating system, a preset deep learning framework, and a driver of the preset deep learning framework;
generating, based on the training reference information of the model, the training environment data to be used for the current training of the model and loading it into the container corresponding to the client;
training the model by using the training environment data for the current training and the target number of graphics processors, and storing that training environment data in the distributed storage system by using the distributed storage volume corresponding to the client.
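For illustration only, the following sketch shows the two forms of training reference information named in claim 1 as simple dictionaries; the field names and values are hypothetical placeholders, not values prescribed by the claim.

```python
# Hypothetical illustration of the two forms of training reference information
# in claim 1; the concrete values are placeholders, not required by the claim.
last_used_environment = {
    "operating_system": "ubuntu-18.04",   # OS used the last time the model was trained
    "framework": "pytorch-1.3",           # deep learning framework used last time
    "framework_driver": "cuda-10.1",      # driver of that framework
}

preset_environment = {
    "operating_system": "centos-7",       # preset operating system
    "framework": "tensorflow-1.15",       # preset deep learning framework
    "framework_driver": "cuda-10.0",      # driver of the preset framework
}


def select_reference_info(previous_env):
    """Use the previous environment if it exists, otherwise fall back to the preset."""
    return previous_env if previous_env else preset_environment


print(select_reference_info(None))          # no previous run -> preset environment
print(select_reference_info(last_used_environment))
```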
2. The method of claim 1, wherein prior to receiving the training request sent by the client, the method further comprises:
receiving a distributed storage volume application request sent by the client, where the application request includes: the name of the distributed storage volume to be allocated to the client;
sending a distributed storage volume allocation request including the name to the distributed storage system;
receiving allocation indication information sent by the distributed storage system, where the allocation indication information indicates that the distributed storage system has allocated the distributed storage volume with that name to the client;
and sending the allocation indication information to the client.
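For illustration only, the following sketch shows the volume-application exchange of claim 2, under the assumption that the distributed storage system can be asked to allocate a volume by name; the stub class and its allocate() method are assumptions made for this sketch.

```python
# Sketch of the volume-application exchange in claim 2; the storage system
# interface shown here is an assumed stand-in.
class StorageSystemStub:
    def allocate(self, volume_name: str) -> dict:
        # Pretend the storage system provisions the volume and confirms it.
        return {"volume": volume_name, "allocated": True}


def handle_volume_application(volume_name: str, storage: StorageSystemStub) -> dict:
    # Forward the allocation request carrying the requested volume name,
    # then return the allocation indication so it can be sent to the client.
    indication = storage.allocate(volume_name)
    return indication


print(handle_volume_application("client-a-vol", StorageSystemStub()))
```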
3. The method of claim 2, wherein generating, based on the training reference information of the model, the training environment data to be used for the current training of the model and loading it into the container corresponding to the client comprises:
receiving a configuration request sent by the client, where the configuration request indicates that a configuration operation is to be performed on at least one item of the training reference information of the model;
and performing the configuration operation on the at least one item of training reference information of the model to generate the training environment data, loaded in the container corresponding to the client, to be used for the current training of the model.
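For illustration only, the following sketch shows one way the configuration operation of claim 3 could be applied: items named in the client's configuration request override the corresponding items of the loaded training reference information. The request format is an assumption made for this sketch.

```python
# Sketch of the configuration step in claim 3; the dictionary-based request
# format is an illustrative assumption.
def apply_configuration(reference_info: dict, config_request: dict) -> dict:
    """Apply the requested changes to produce this run's training environment."""
    environment = dict(reference_info)   # start from the loaded reference information
    environment.update(config_request)   # override the configured item(s)
    return environment


reference = {"operating_system": "ubuntu-18.04", "framework": "tensorflow-1.15"}
print(apply_configuration(reference, {"framework": "pytorch-1.3"}))
```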
4. A method according to claim 3, wherein prior to receiving the configuration request sent by the client, the method further comprises:
receiving a connection establishment request sent by the client, where the connection establishment request includes: a secure shell protocol key;
determining whether the secure shell protocol key matches a preset secure shell protocol key of the client loaded in the container corresponding to the client;
and if so, establishing a secure shell protocol connection with the client, so that the client sends the configuration request over the secure shell protocol connection.
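For illustration only, the following sketch shows the key check of claim 4 as a constant-time comparison of the submitted secure shell protocol key against the preset key loaded in the client's container; how the preset key is provisioned and how the connection is then established are outside this sketch, and the key strings are placeholders.

```python
# Sketch of the key check in claim 4; keys shown are placeholder strings.
import hmac


def verify_client_key(submitted_key: bytes, preset_key: bytes) -> bool:
    # Constant-time comparison of the submitted key against the preset key
    # loaded in the client's container.
    return hmac.compare_digest(submitted_key, preset_key)


if verify_client_key(b"ssh-rsa AAAA-example-client", b"ssh-rsa AAAA-example-client"):
    print("keys match: establish the secure shell connection with the client")
else:
    print("keys do not match: reject the connection request")
```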
5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
and switching the root directory path of the container corresponding to the client to the path of a target directory in the distributed storage volume corresponding to the client, where the target directory is a directory used for writing and reading training environment data.
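For illustration only, the following sketch shows one way the directory switch of claim 5 could be realized inside the container, by pointing the container's root at a target directory under the mounted volume. The mount path, directory name, and use of os.chroot (Unix-only and requiring sufficient privileges) are assumptions made for this sketch.

```python
# Sketch of the directory switch in claim 5; paths and names are assumed examples.
import os


def switch_container_root(volume_mount_path: str, target_dir: str = "train_env") -> str:
    target_path = os.path.join(volume_mount_path, target_dir)
    os.makedirs(target_path, exist_ok=True)   # ensure the read/write directory exists
    os.chroot(target_path)                    # container now reads/writes under the volume
    os.chdir("/")
    return target_path


# Example (would need sufficient privileges inside the container):
# switch_container_root("/mnt/volumes/client-a-vol")
```

Switching the root in this way means that anything the training process writes to what it sees as its local filesystem actually lands in the client's distributed storage volume, so the environment survives the container.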
6. A model training apparatus, the apparatus comprising:
a receiving unit configured to receive a training request sent by a client, where the training request includes: a target number of graphics processors for training a model and a name of the distributed storage volume corresponding to the client;
a first processing unit configured to allocate the target number of graphics processors and create a container corresponding to the client;
a second processing unit configured to acquire training reference information of the model stored in a distributed storage system by using the distributed storage volume corresponding to the client, and load the training reference information into the container corresponding to the client, where the training reference information is either the training environment data used the last time the model was trained or preset training environment data; the training environment data includes: the operating system used the last time the model was trained, the deep learning framework used the last time the model was trained, and the driver of that deep learning framework; the preset training environment data includes: a preset operating system, a preset deep learning framework, and a driver of the preset deep learning framework;
a generating unit configured to generate, based on the training reference information of the model, the training environment data to be used for the current training of the model and load it into the container corresponding to the client;
a training unit configured to train the model by using the training environment data for the current training and the target number of graphics processors, and store that training environment data in the distributed storage system by using the distributed storage volume corresponding to the client.
7. The apparatus of claim 6, wherein the apparatus further comprises:
an allocation unit configured to, before the training request sent by the client is received, receive a distributed storage volume application request sent by the client, where the application request includes: the name of the distributed storage volume to be allocated to the client; send a distributed storage volume allocation request including the name to the distributed storage system; receive allocation indication information sent by the distributed storage system, where the allocation indication information indicates that the distributed storage system has allocated the distributed storage volume with that name to the client; and send the allocation indication information to the client.
8. The apparatus of claim 7, wherein the generating unit comprises:
a configuration operation execution module configured to receive a configuration request sent by the client, where the configuration request indicates that a configuration operation is to be performed on at least one item of the training reference information of the model; and perform the configuration operation on the at least one item of training reference information of the model to generate the training environment data, loaded in the container corresponding to the client, to be used for the current training of the model.
9. A server, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus;
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-5 when executing the program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-5.
CN201911416544.5A 2019-12-31 2019-12-31 Model training method, device, server and storage medium Active CN111126604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911416544.5A CN111126604B (en) 2019-12-31 2019-12-31 Model training method, device, server and storage medium


Publications (2)

Publication Number Publication Date
CN111126604A CN111126604A (en) 2020-05-08
CN111126604B true CN111126604B (en) 2024-02-02

Family

ID=70506906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911416544.5A Active CN111126604B (en) 2019-12-31 2019-12-31 Model training method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN111126604B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753997B (en) * 2020-06-28 2021-08-27 北京百度网讯科技有限公司 Distributed training method, system, device and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11144828B2 (en) * 2017-06-09 2021-10-12 Htc Corporation Training task optimization system, training task optimization method and non-transitory computer readable medium for operating the same

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104394382A (en) * 2014-12-09 2015-03-04 浙江省公众信息产业有限公司 Storage method, device and system of video monitoring record
CN107870734A (en) * 2016-09-27 2018-04-03 苏宁云商集团股份有限公司 The exchange method and device of a kind of distributed file system
CN110543946A (en) * 2018-05-29 2019-12-06 百度在线网络技术(北京)有限公司 method and apparatus for training a model
CN109034394A (en) * 2018-07-02 2018-12-18 第四范式(北京)技术有限公司 A kind of update method and device of machine learning model
CN109508485A (en) * 2018-10-30 2019-03-22 平安医疗健康管理股份有限公司 A kind of data processing model dissemination method, device, server and storage medium
CN109815715A (en) * 2019-01-04 2019-05-28 平安科技(深圳)有限公司 A kind of data ciphering method and relevant apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Reconstruction of UG parameterized model in a distributed resource environment; Zhang Zhinan et al.; Computer Engineering and Applications, (29); full text *

Also Published As

Publication number Publication date
CN111126604A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
US10949158B2 (en) Screenshot method and apparatus
CN107391629B (en) Method, system, server and computer storage medium for data migration between clusters
CN109582433B (en) Resource scheduling method and device, cloud computing system and storage medium
CN108052384B (en) Task processing method, service platform and electronic equipment
CN112445579B (en) Zero terminal data processing system and file copying method and device thereof
CN114244717B (en) Configuration method and device of virtual network card resources, computer equipment and medium
CN112860479A (en) Data storage method and cloud data center
CN113010224B (en) Front-end micro-servitization method, front-end micro-servitization device, computer equipment and storage medium
CN109104368B (en) Connection request method, device, server and computer readable storage medium
CN112148468A (en) Resource scheduling method and device, electronic equipment and storage medium
CN113676501A (en) Application deployment method and device based on Kubernetes cluster and electronic equipment
CN112333289A (en) Reverse proxy access method, device, electronic equipment and storage medium
CN111126604B (en) Model training method, device, server and storage medium
CN108520401B (en) User list management method, device, platform and storage medium
CN111158807A (en) Data access method and device based on cloud virtual machine
CN114465937A (en) Network card testing method, device, server, medium, and computer program product
CN107045452B (en) Virtual machine scheduling method and device
CN111147585B (en) Equipment upgrading method, device, storage medium and system
CN108833532B (en) Service processing method, device and system based on Internet of things
CN103118248A (en) Monitoring method, monitoring agency, monitoring server and monitoring system
CN109922120B (en) Method and terminal for improving DNS availability
CN114070889B (en) Configuration method, traffic forwarding device, storage medium, and program product
CN114328130B (en) Server monitoring method, system, equipment and computer readable storage medium
US20180131756A1 (en) Method and system for affinity load balancing
CN112764897B (en) Task request processing method, device and system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant