CN113780554B - Processing method and device of deep reinforcement learning model, medium and electronic equipment - Google Patents

Processing method and device of deep reinforcement learning model, medium and electronic equipment

Info

Publication number
CN113780554B
Authority
CN
China
Prior art keywords
model
reinforcement learning
deep reinforcement
fragments
learning model
Prior art date
Legal status
Active
Application number
CN202111061787.9A
Other languages
Chinese (zh)
Other versions
CN113780554A (en)
Inventor
洪伟峻
申瑞珉
林悦
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202111061787.9A
Publication of CN113780554A
Application granted
Publication of CN113780554B
Status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure relates to a processing method and device of a deep reinforcement learning model, a medium and electronic equipment, and relates to the technical field of artificial intelligence, wherein the method comprises the following steps: dividing the deep reinforcement learning model through a model training machine to obtain a plurality of model fragments, and sending each model fragment to an intermediate node through a model distribution process; splicing the model fragments through the intermediate nodes to obtain a complete serialization model, and sending the complete serialization model to the interactive machine; performing deserialization processing on the complete serialization model through the interactive machine to obtain the deep reinforcement learning model, and interacting with a preset virtual environment through the deep reinforcement learning model to obtain training data; and sending the training data to the model training machine through the interactive machine, and training the deep reinforcement learning model with the training data through the model training machine. The present disclosure improves the distribution efficiency of the model.

Description

Processing method and device of deep reinforcement learning model, medium and electronic equipment
Technical Field
The embodiment of the disclosure relates to the technical field of artificial intelligence, in particular to a processing method of a deep reinforcement learning model, a processing device of the deep reinforcement learning model, a computer readable storage medium and electronic equipment.
Background
Deep Reinforcement Learning (DRL) is a technique that has emerged in recent years; it combines reinforcement learning with deep learning and belongs to a sub-field of machine learning.
In existing distributed deep reinforcement learning, the model needs to be distributed by a training machine to a plurality of interactive machines; each interaction process on each machine acquires the new model and then continues to use the model to interact with the environment, and finally the training machine collects the new interaction data for training. The common model distribution scheme is direct distribution, that is, the training machine directly sends the complete model to each interactive machine in sequence.
However, with direct distribution the model distribution efficiency is low, because the model is large and a single transmission easily reaches the bandwidth upper limit of the training machine.
Therefore, it is desirable to provide a new method and apparatus for processing a deep reinforcement learning model.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure, and therefore may include information that does not constitute prior art already known to those of ordinary skill in the art.
Disclosure of Invention
An object of the present disclosure is to provide a method of processing a deep reinforcement learning model, a processing apparatus of the deep reinforcement learning model, a computer-readable storage medium, and an electronic device, which further overcome, at least to some extent, the problem of low distribution efficiency of the model due to limitations and drawbacks of the related art.
According to one aspect of the present disclosure, there is provided a method of processing a deep reinforcement learning model, configured in a model training system having a model training machine and an interactive machine, the method comprising:
dividing a deep reinforcement learning model through a model training machine to obtain a plurality of model fragments, and sending each model fragment to an intermediate node through a model distribution process;
splicing the model fragments through the intermediate node to obtain a complete serialization model, and sending the complete serialization model to the interactive machine;
performing deserialization processing on the complete serialization model through the interaction machine to obtain the deep reinforcement learning model, and performing interaction with a preset virtual environment through the deep reinforcement learning model to obtain training data;
and sending the training data to the model training machine through the interactive machine, and training the deep reinforcement learning model with the training data through the model training machine.
In one exemplary embodiment of the present disclosure, partitioning a deep reinforcement learning model into a plurality of model fragments includes:
calculating the node number of the intermediate nodes, and determining the number of fragments which can be divided by the deep reinforcement learning model according to the node number;
and dividing the deep reinforcement learning model in equal parts according to the number of the fragments to obtain a plurality of model fragments.
In one exemplary embodiment of the present disclosure, sending each of the model fragments to an intermediate node through a model distribution process includes:
starting the model distribution process through a preset distributed execution engine;
and encoding the model fragments, and transmitting the model fragments to the intermediate node one by one through a model distribution process based on the fragment encoding sequence of the model fragments.
In an exemplary embodiment of the present disclosure, the splicing the model fragments by the intermediate node, to obtain a complete serialization model, includes:
Forwarding the current model fragments received by the intermediate node to other nodes except the intermediate node, and receiving all the model fragments except the current model fragments sent by the other nodes;
sorting the current model fragments and all other model fragments according to the fragment codes of the current model fragments and all other model fragments;
and splicing the sequenced current model fragments and all other model fragments to obtain a complete serialization model.
In one exemplary embodiment of the present disclosure, sending the complete serialization model into the interactive machine includes:
and sending the complete serialization model to an interaction process included in the interaction machine in a manner of interprocess communication of the intermediate node.
In an exemplary embodiment of the present disclosure, the interaction with a preset virtual environment through the deep reinforcement learning model, to obtain training data, includes:
the method comprises the steps that interaction is carried out with a preset virtual environment through the deep reinforcement learning model to obtain a plurality of interaction sequences, wherein the interaction sequences comprise a plurality of sampling data, and each sampling data comprises a first state of the preset virtual environment, a decision action and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state;
And generating the training data according to each interaction sequence.
In one exemplary embodiment of the present disclosure, training the deep reinforcement learning model with the training data includes:
determining, for each piece of sample data in the training data, an advantage function value of the advantage function of the deep reinforcement learning model corresponding to an environmental state in the sample data, and an advantage expectation of the advantage function value under the decision strategy corresponding to the sample data;
determining, for each piece of sample data in the training data, an action value corresponding to the sample data according to the sample data, the advantage function value corresponding to the sample data, the advantage expectation, and the state value function of the deep reinforcement learning model;
and determining update gradient information of the action value function of the deep reinforcement learning model based on the action value, and updating the deep reinforcement learning model according to the update gradient information.
According to one aspect of the present disclosure, there is provided a processing apparatus of a deep reinforcement learning model, configured in a model training system having a model training machine and an interactive machine, the apparatus comprising:
The model division module is used for dividing the deep reinforcement learning model through a model training machine to obtain a plurality of model fragments, and sending each model fragment to the intermediate node through a model distribution process;
the fragment splicing module is used for splicing the model fragments through the intermediate node to obtain a complete serialization model, and sending the complete serialization model to the interactive machine;
the training data generation module is used for performing deserialization processing on the complete serialization model through an interactive machine to obtain the deep reinforcement learning model, and performing interaction with a preset virtual environment through the deep reinforcement learning model to obtain training data;
and the model training module is used for transmitting the training data to the model training machine through an interactive machine and training the deep reinforcement learning model through the training data through the model training machine.
According to one aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of processing a deep reinforcement learning model of any one of the above.
According to one aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of processing a deep reinforcement learning model of any one of the above via execution of the executable instructions.
According to the processing method of the deep reinforcement learning model, on one hand, in the model distribution process, the deep reinforcement learning model can be divided by a model training machine to obtain a plurality of model fragments, and each model fragment is sent to an intermediate node through a model distribution process; then splicing the model fragments through the intermediate nodes to obtain a complete serialization model, and sending the complete serialization model to an interactive machine; finally, performing deserialization processing on the complete serialization model through an interactive machine to obtain a deep reinforcement learning model, so that the problem of lower distribution efficiency caused by the fact that single transmission easily reaches the upper limit of the bandwidth of a training machine due to the fact that the model needs to be distributed integrally is avoided, and the distribution efficiency of the model is improved; on the other hand, training data can be obtained as the interaction between the interactive machine and the preset virtual environment can be performed through the deep reinforcement learning model; and then training the deep reinforcement learning model through training data in a model training machine, so that the burden of the model training machine is reduced, and the model training efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 schematically illustrates an example diagram of direct distribution of a deep reinforcement learning model.
FIG. 2 schematically illustrates an example diagram of tree distribution of a deep reinforcement learning model.
FIG. 3 schematically illustrates a flow chart of a method of processing a deep reinforcement learning model according to an example embodiment of the present disclosure.
FIG. 4 schematically illustrates a schematic diagram of processing a deep reinforcement learning model according to an example embodiment of the present disclosure.
Fig. 5 schematically illustrates a method flowchart for stitching the model fragments by the intermediate node to obtain a complete serialized model, according to an example embodiment of the disclosure.
FIG. 6 schematically illustrates a method flow diagram for training the deep reinforcement learning model with the training data, according to an example embodiment of the present disclosure.
FIG. 7 schematically illustrates a flow chart of another method of processing a deep reinforcement learning model according to an example embodiment of the present disclosure.
Fig. 8 schematically illustrates a block diagram of a processing apparatus of a deep reinforcement learning model according to an example embodiment of the present disclosure.
Fig. 9 schematically illustrates an electronic device for implementing a processing method of the deep reinforcement learning model described above, according to an example embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Deep reinforcement learning studies how an agent continuously tries and accumulates experience to obtain the maximum reward during its interaction with the environment. Deep reinforcement learning combines reinforcement learning with deep learning technology and uses a deep neural network model as a fitter of the policy function in reinforcement learning, which greatly widens the capability boundary of reinforcement learning, so that reinforcement learning can also exhibit a degree of intelligence close to or even exceeding the human level in complex environments; the best-known DRL system is AlphaGo.
However, deep reinforcement learning is accompanied by low data utilization: training a strong agent currently requires a great deal of interaction between the model and the environment to continuously generate new training data, so distributed training has become the mainstream choice for deep reinforcement learning frameworks. In Distributed Deep Reinforcement Learning (DDRL), the model needs to be distributed by a training machine to a plurality of interactive machines; each interaction process on each machine acquires the new model and then continues to use the model to interact with the environment, and finally the training machine collects the new interaction data for training. The higher the model distribution efficiency, the faster the training converges.
In a DDRL system, the model is typically distributed in a direct distribution manner or a tree distribution manner. Specifically, referring to fig. 1, direct distribution means that the training machine directly sends the whole model, through a distribution process, to the forwarding process on each interactive machine, and the forwarding process then sends the whole model to the corresponding interaction processes; further, referring to fig. 2, tree distribution means that the training machine sends the complete model, through a distribution process, to some intermediate nodes (forwarding processes), which then forward the complete model to the interaction processes of each interactive machine.
In the tree distribution scheme, model forwarding within each interactive machine does not involve network transmission, that is, the interaction processes of each machine can directly acquire the model from the forwarding process of that machine through inter-process communication. Tree distribution effectively reduces the transmission pressure on the training machine; however, when the system scale is further enlarged or a larger model needs to be transmitted, and the number of intermediate nodes therefore cannot be increased any further (otherwise the bandwidth would again reach its upper limit), the only remaining option is to increase the depth of the tree for multi-layer forwarding, which leads to long forwarding times.
Based on this, in the present exemplary embodiment, a method for processing a deep reinforcement learning model is provided, where the method may operate on a server, a server cluster, or a cloud server where a model training system having a model training machine and an interactive machine is located; of course, those skilled in the art may also operate the methods of the present disclosure on other platforms as desired, which is not particularly limited in the present exemplary embodiment. Referring to fig. 3, the method for processing the deep reinforcement learning model may include the steps of:
s310, dividing a deep reinforcement learning model through a model training machine to obtain a plurality of model fragments, and sending each model fragment to an intermediate node through a model distribution process;
s320, splicing the model fragments through the intermediate node to obtain a complete serialization model, and sending the complete serialization model to the interactive machine;
s330, performing deserialization processing on the complete serialization model through the interaction machine to obtain the deep reinforcement learning model, and performing interaction with a preset virtual environment through the deep reinforcement learning model to obtain training data;
And S340, transmitting the training data to the model training machine through the interactive machine, and training the deep reinforcement learning model through the training data by the model training machine.
In the processing method of the deep reinforcement learning model, on one hand, in the process of model distribution, the deep reinforcement learning model can be divided by a model training machine to obtain a plurality of model fragments, and each model fragment is sent to the intermediate node by a model distribution process; then splicing the model fragments through the intermediate nodes to obtain a complete serialization model, and sending the complete serialization model to an interactive machine; finally, performing deserialization processing on the complete serialization model through an interactive machine to obtain a deep reinforcement learning model, so that the problem of lower distribution efficiency caused by the fact that single transmission easily reaches the upper limit of the bandwidth of a training machine due to the fact that the model needs to be distributed integrally is avoided, and the distribution efficiency of the model is improved; on the other hand, training data can be obtained as the interaction between the interactive machine and the preset virtual environment can be performed through the deep reinforcement learning model; and then training the deep reinforcement learning model through training data in a model training machine, so that the burden of the model training machine is reduced, and the model training efficiency is improved.
Hereinafter, a processing method of the deep reinforcement learning model according to an exemplary embodiment of the present disclosure will be explained and described in detail with reference to the accompanying drawings.
First, the objects of the exemplary embodiments of the present disclosure will be explained and described.
Specifically, the exemplary embodiments of the present disclosure provide a novel model processing method for a distributed reinforcement learning system, which aims to solve the problems of the existing distribution methods that, when the system scale becomes too large, forwarding takes a long time and a single machine easily reaches its bandwidth upper limit. Meanwhile, in the processing method of the deep reinforcement learning model described in the exemplary embodiments of the present disclosure, unlike the tree distribution scheme described above, the data an interactive machine receives from the training machine is not one complete model but fragments of the model.
For example, assuming that there are N interactive machines in total in the DDRL system, the exemplary embodiments of the present disclosure may select, before the system is started, a positive integer M satisfying M <= N. After startup, when distributing the model, the training machine divides the model into M fragments and distributes them respectively to M interactive machines acting as intermediate nodes; the M fragments are then broadcast by these M machines to all interactive machines, so that finally all N interactive machines receive all the fragments and restore the complete model from them. A specific distribution schematic diagram may be seen in fig. 4.
Next, an application scenario of the processing method of the deep reinforcement learning model according to the exemplary embodiments of the present disclosure will be explained and described.
Specifically, the processing method of the deep reinforcement learning model described in the exemplary embodiments of the present disclosure may be applied to a distributed deep reinforcement learning system for training game AI. The system may include one GPU (Graphics Processing Unit) machine with 80 cores and 8 cards as the training machine, which sends the latest model during training, and 298 CPU (Central Processing Unit) machines with 36 cores each as interactive machines, each of which starts 36 interaction processes to receive the model and generate training data by using the model to interact with the environment.
The model distribution part of the system is built by combining Ray and ZeroMQ. Ray is a high-performance distributed execution engine and open-source artificial intelligence framework, used here for convenient in-cluster resource allocation and process scheduling; ZeroMQ is a lightweight messaging library, used here for efficient network communication to make up for Ray's shortcomings in transmission efficiency. Before the system is started, the code run by the system is packed into a Docker image and automatically deployed to all interactive machines through Kubernetes (k8s). When the system is started, the model distribution process is started through Ray on the training machine, and the forwarding processes and interaction processes are then started through Ray on the interactive machines.
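For illustration, a minimal sketch of how such a launch could be wired together with Ray actors and ZeroMQ sockets is given below; the actor names, ports and addresses are hypothetical and are not taken from the patent.

    # Illustrative sketch only: assumes Ray and pyzmq; names and ports are hypothetical.
    import ray
    import zmq

    ray.init()  # in the real cluster this would attach to the existing Ray cluster

    @ray.remote
    class ForwarderProcess:
        """Forwarding process on an interactive machine; receives model fragments."""
        def __init__(self, port):
            ctx = zmq.Context()
            self.pull = ctx.socket(zmq.PULL)
            self.pull.bind(f"tcp://*:{port}")

    @ray.remote
    class DistributionProcess:
        """Model distribution process on the training machine; pushes fragments out."""
        def __init__(self, forwarder_addrs):
            ctx = zmq.Context()
            self.push = [ctx.socket(zmq.PUSH) for _ in forwarder_addrs]
            for sock, addr in zip(self.push, forwarder_addrs):
                sock.connect(addr)

    # One forwarder per interactive machine and a single distributor on the trainer;
    # only a handful are started here to keep the sketch small.
    forwarders = [ForwarderProcess.remote(port=5555 + i) for i in range(4)]
    distributor = DistributionProcess.remote(
        [f"tcp://127.0.0.1:{5555 + i}" for i in range(4)]
    )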
Next, the method of processing the deep reinforcement learning model described in fig. 3 will be explained and described with reference to fig. 4.
In step S310, the deep reinforcement learning model is divided by a model training machine to obtain a plurality of model fragments, and each model fragment is sent to an intermediate node through a model distribution process.
In the present exemplary embodiment, first, the deep reinforcement learning model is divided by the model training machine to obtain a plurality of model fragments. Specifically, this may include: first, calculating the node number of the intermediate nodes, and determining, according to the node number, the number of fragments into which the deep reinforcement learning model can be divided; and second, dividing the deep reinforcement learning model into equal parts according to the number of fragments to obtain the plurality of model fragments. More specifically, after the distributed deep reinforcement learning system is started, the model distribution process prepares the latest deep reinforcement learning model to be transmitted; the number of fragments is then determined according to the number of intermediate nodes, and the model is serialized and uniformly divided into as many model fragments as there are intermediate nodes.
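As a sketch of this division step, the snippet below serializes the model parameters with pickle and cuts the byte string into M near-equal fragments; the helper name split_model is illustrative only.

    import pickle

    def split_model(state_dict, num_fragments):
        """Serialize the model parameters and cut the bytes into near-equal fragments."""
        blob = pickle.dumps(state_dict)              # serialized model
        frag_size = -(-len(blob) // num_fragments)   # ceiling division for equal parts
        return [blob[i * frag_size:(i + 1) * frag_size] for i in range(num_fragments)]

    # Example: divide the serialized model into M fragments, one per intermediate node.
    params = {"layer1.weight": [0.1, 0.2], "layer1.bias": [0.0]}
    fragments = split_model(params, num_fragments=4)
    assert b"".join(fragments) == pickle.dumps(params)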
Next, after the model fragments are obtained, each model fragment can be sent to an intermediate node through the model distribution process. Specifically, this may include: starting the model distribution process through a preset distributed execution engine; and encoding the model fragments, and sending the model fragments one by one to the intermediate nodes through the model distribution process based on the fragment encoding order of the model fragments. That is, the model distribution process is started through Ray, and the model fragments are then numbered in sequence so that each fragment is marked; then, based on the order of the fragment codes, each model fragment is sent by the model distribution process, via ZeroMQ network transmission, to the corresponding forwarding process serving as an intermediate node.
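One possible way to tag each fragment with its fragment code and push it to its intermediate node over ZeroMQ is sketched below; the multipart framing and the function name are assumptions.

    import zmq

    def distribute_fragments(fragments, forwarder_addresses):
        """Send fragment i to intermediate node i, tagged with its fragment code."""
        ctx = zmq.Context()
        for index, (frag, addr) in enumerate(zip(fragments, forwarder_addresses)):
            sock = ctx.socket(zmq.PUSH)
            sock.connect(addr)
            # Multipart message: [4-byte fragment code, fragment bytes]
            sock.send_multipart([index.to_bytes(4, "big"), frag])
            sock.close()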
In step S320, the model fragments are spliced by the intermediate node, so as to obtain a complete serialization model, and the complete serialization model is sent to the interactive machine.
In this exemplary embodiment, first, the model fragments are spliced by the intermediate node to obtain a complete serialization model. Specifically, referring to fig. 5, the following steps may be included:
Step S510, forwarding, by the intermediate node, the current model fragments received by the intermediate node to other nodes except the intermediate node, and receiving all the model fragments except the current model fragments sent by the other nodes;
step S520, sorting the current model fragments and all other model fragments according to the fragment codes of the current model fragments and all other model fragments;
and step S530, splicing the sequenced current model fragments and all other model fragments to obtain a complete serialization model.
Hereinafter, step S510 to step S530 will be explained and described. Specifically, upon receiving its model fragment, each intermediate node (i.e. forwarding process) immediately forwards it, via ZeroMQ network transmission, to all forwarding processes other than itself; in this way every forwarding process receives all fragments of the serialized model, and on each forwarding process the fragments can then be spliced back together according to their sequence marks to obtain the complete serialization model.
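A sketch of the logic on an intermediate node (forwarding process) might look as follows; the socket wiring and the helper name are assumptions, and a forwarding process that is not an intermediate node would only run the collection loop.

    import zmq

    def forward_and_reassemble(pull_sock, peer_push_socks, num_fragments):
        """Broadcast the fragment received from the distribution process to all peer
        forwarding processes, collect the remaining fragments, then splice them in
        fragment-code order to recover the complete serialized model."""
        received = {}
        idx_bytes, frag = pull_sock.recv_multipart()      # own fragment from the distributor
        received[int.from_bytes(idx_bytes, "big")] = frag
        for peer in peer_push_socks:                      # forward it immediately
            peer.send_multipart([idx_bytes, frag])
        while len(received) < num_fragments:              # fragments broadcast by other nodes
            idx_bytes, frag = pull_sock.recv_multipart()
            received[int.from_bytes(idx_bytes, "big")] = frag
        return b"".join(received[i] for i in range(num_fragments))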
Next, after the complete serialization model is obtained, it can be sent to the interactive machine. The specific sending process may include: sending the complete serialization model to the interaction processes included in the interactive machine by means of inter-process communication from the intermediate node. Specifically, the complete serialization model may be sent by inter-process communication to the 36 interaction processes located on the same interactive machine. It should be added that 1 forwarding process and 36 interaction processes are started on each interactive machine in the distributed deep reinforcement learning system. As mentioned before, besides distributing the complete model to the local interaction processes, the role of the forwarding process is to serve as an intermediate node for the collection and forwarding of model fragments. For the case of 100 interactive machines, 10 forwarding processes may be designated as intermediate nodes, while for the case of 298 interactive machines, 30 forwarding processes may be designated as intermediate nodes.
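One way the inter-process hand-off could look, assuming a ZeroMQ PUB socket bound to an ipc:// endpoint (the endpoint path is hypothetical), is sketched below. In practice the interaction processes would need to subscribe before the forwarding process publishes, since a PUB socket does not retain messages for late subscribers.

    import zmq

    def publish_to_local_interaction_processes(serialized_model,
                                               endpoint="ipc:///tmp/model_fanout"):
        """Hand the reassembled serialized model to the local interaction processes."""
        ctx = zmq.Context()
        pub = ctx.socket(zmq.PUB)
        pub.bind(endpoint)   # the 36 local interaction processes connect SUB sockets here
        pub.send(serialized_model)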
In step S330, the complete serialization model is subjected to deserialization processing by the interactive machine, so as to obtain the deep reinforcement learning model, and interaction is performed with a preset virtual environment by the deep reinforcement learning model, so as to obtain training data.
In this example embodiment, after the interaction process receives the complete serialization model, the original model that can interact with the preset virtual environment, that is, the deep reinforcement learning model, can be obtained by performing deserialization; and then, interacting with a preset virtual environment through the deep reinforcement learning model to obtain training data. Specifically, it may include: the method comprises the steps that interaction is carried out with a preset virtual environment through the deep reinforcement learning model to obtain a plurality of interaction sequences, wherein the interaction sequences comprise a plurality of sampling data, and each sampling data comprises a first state of the preset virtual environment, a decision action and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state; and generating the training data according to each interaction sequence.
Specifically, since the deep reinforcement learning model combines the perception capability of deep learning with the decision capability of reinforcement learning, the agent obtains a high-dimensional observation by interacting with the environment at each moment and perceives this observation using deep learning methods to obtain a specific state feature representation of it; the sampled data represent the samples taken at each moment of the interaction process, together with the specific state representation corresponding to the perceived observation. The value function of each state (state value function) and the value function of each state-action pair (action value function) can then be evaluated based on the expected return, and the decision strategy, which maps the current state to the corresponding decision action, is improved based on these two value functions; the environment reacts to this decision action and produces the next observation. Therefore, a virtual object can be controlled through the deep reinforcement learning model and made to interact with the preset virtual environment to obtain a plurality of interaction sequences.
Wherein the virtual environment may be a virtual game scene generated by a computer. The game scene may be a scene in which a virtual object perceives an environment in which it is located and acts according to the perceived environmental state. The virtual scene may include a virtual object and a plurality of environmental objects included in an environment where the virtual object is located, under the scene, the virtual object may fuse environmental states of the environment where the virtual object is located, and input the fused environmental states into the deep reinforcement learning model to obtain a decision action to be executed by the virtual object. The virtual object may be any kind of agent that can interact with the environment and act according to the environmental state of the environment.
It is further contemplated herein that the deep reinforcement learning model may be used to train game artificial intelligence. Taking a shooting game as an example, the virtual object may be a game combat AI, and the corresponding decision actions may be to control the game combat AI character to attack, move, stop, and so on.
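As an illustration of how an interaction process might turn the received model into an interaction sequence, the following gym-style rollout loop is a sketch; the policy interface (load_state_dict, act) and the environment API are assumptions rather than details from the patent.

    import pickle

    def collect_interaction_sequence(serialized_model, env, policy, max_steps=128):
        """Deserialize the model, then roll it out against the environment to build
        one interaction sequence of (state, action, reward) samples."""
        policy.load_state_dict(pickle.loads(serialized_model))  # deserialization step
        sequence = []
        state = env.reset()
        for _ in range(max_steps):
            action = policy.act(state)                  # decision action for current state
            next_state, reward, done, _ = env.step(action)
            sequence.append((state, action, reward))    # state, decision action, return value
            state = next_state
            if done:
                break
        return sequence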
In step S340, the training data is sent to the model training machine by the interactive machine, and the deep reinforcement learning model is trained by the model training machine through the training data.
In this example embodiment, after the training data is obtained, the training data may be sent to the model training machine, and after the model training machine receives the training data, the deep reinforcement learning model may be trained by the training data. Specifically, referring to fig. 6, training the deep reinforcement learning model by training data may specifically include the following steps:
step S610, for each piece of sample data in the training data, determining an advantage function value of the advantage function of the deep reinforcement learning model corresponding to the environmental state in the sample data, and an advantage expectation of the advantage function value under the decision strategy corresponding to the sample data;
step S620, for each piece of sample data in the training data, determining an action value corresponding to the sample data according to the sample data, the advantage function value corresponding to the sample data, the advantage expectation, and the state value function of the deep reinforcement learning model;
step S630, determining update gradient information of the action value function of the deep reinforcement learning model based on the action value, and updating the deep reinforcement learning model according to the update gradient information.
Hereinafter, step S610 to step S630 will be explained and described. Specifically, first, the decision strategy is determined based on a family of policy functions formed from the advantage function and a plurality of associated policy parameters in the deep reinforcement learning model, these policy parameters being hyperparameters of the deep reinforcement learning model; meanwhile, the advantage function may be computed through a neural network, which may include a convolutional neural network or a recurrent neural network, and this example is not particularly limited in this respect; the advantage expectation of the advantage function values can be determined by a mean approximation over the samples of the corresponding interaction sequence. Second, the action value corresponding to the sample data is determined according to the sample data, the advantage function value corresponding to the sample data, the advantage expectation, and the state value function of the deep reinforcement learning model. Finally, the parameters of the deep reinforcement learning model are differentiated based on the loss function corresponding to the action value function to obtain update gradient information, and the deep reinforcement learning model is updated based on the update gradient information; when determining the loss function, the mean square error between the action value and its corresponding target value can be calculated to obtain the loss function.
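Because the update rule is only described at this level of abstraction, the following is merely one plausible reading, assuming PyTorch and hypothetical advantage and value heads on the model: the advantage expectation is approximated by the sample mean, action values are formed from the state value plus the centered advantage, and the action value function is fitted with a mean-squared-error loss against target values.

    import torch

    def training_step(model, optimizer, states, actions, targets):
        """One gradient update on the action value function (illustrative only)."""
        advantages = model.advantage(states, actions)            # A(s, a)
        adv_expectation = advantages.mean()                      # sample-mean approximation of E[A]
        state_values = model.value(states)                       # V(s)
        action_values = state_values + advantages - adv_expectation  # Q(s, a)
        loss = torch.nn.functional.mse_loss(action_values, targets)
        optimizer.zero_grad()
        loss.backward()                                          # update gradient information
        optimizer.step()
        return loss.item()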
At this point, the overall training of the deep reinforcement learning model has been completed.
The processing method of the deep reinforcement learning model according to the exemplary embodiment of the present disclosure will be further explained and explained below with reference to fig. 7. Specifically, referring to fig. 7, the method for processing the deep reinforcement learning model may include the following steps:
step S701, preparing the latest deep reinforcement learning model to be transmitted by a model distribution process;
step S702, after being serialized by the serialization module, the latest deep reinforcement learning model is uniformly divided into as many model fragments as there are intermediate nodes, and the model fragments are encoded;
step S703, the distribution process sends each model fragment, through the fragment distribution module and via ZeroMQ network transmission, one-to-one to the corresponding forwarding process serving as an intermediate node;
step S704, the forwarding process serving as an intermediate node continues to forward its fragment, via ZeroMQ network transmission, to all forwarding processes other than itself, and the model fragments are spliced according to their fragment codes to obtain the complete serialization model;
step S705, the complete serialization model is sent, by means of inter-process communication, to the 36 interaction processes on the same machine, and each interaction process performs deserialization to obtain the original model capable of interacting with the virtual environment;
step S706, interacting with the virtual environment through the original model to obtain training data, and sending the training data to the training machine;
step S707, the model training machine trains the deep reinforcement learning model with the training data.
In the method provided by the exemplary embodiments of the present disclosure, for each model distribution the network traffic of the whole cluster is only the number of intermediate nodes multiplied by the size of the model, and the network bandwidth pressure on any single machine can be greatly reduced by flexibly increasing the number of intermediate nodes, which gives this method an advantage over the tree distribution scheme. In addition, this embodiment was compared with the tree distribution method, with the tree distribution structure implemented in the same way.
Specifically, in practical application, the two methods were tested for the number of 20MB models that can be transmitted per second with different numbers of machines (100 and 298). Each model transmission is started only after the distribution process confirms that the previous complete model has been received by all interaction processes. The test results are shown in Table 1 below:
In each experiment, the number of intermediate nodes M was adjusted to its optimum.
Test results show that the method disclosed by the example embodiment of the disclosure has great advantages in the transmission efficiency of a 20MB large model, and the advantages are more obvious as the interactive machine scale is larger. When the machine scale reaches 298, the transmission efficiency is improved by about 46% compared with tree distribution.
According to the data in Table 1, when the number of interactive machines is 298, the bandwidth occupied by model forwarding on each interactive machine, including the intermediate nodes, is shown in Table 2 below:
it can be seen that the transmission scheme of the present invention has lower requirements for the configuration of the network bandwidth.
From the above, it can be seen that: with the processing method of the deep reinforcement learning model described in the exemplary embodiments of the present disclosure, no matter how large the system scale is, the amount of data transmitted between the training machine and all interactive machines in each model transmission is consistent with the size of the model, so that a stronger agent can be obtained as DRL models become more and more complex, without easily reaching the network bandwidth upper limit; in addition, transmitting a model fragment takes less time than transmitting the whole model, and an intermediate node that has received a fragment can forward it immediately, so the overall time from the start of model transmission until all nodes have received the whole model is reduced; meanwhile, the number M of intermediate nodes is adjustable, and M can be tuned according to the model size and the total number N of interactive machines to obtain better transmission efficiency.
The exemplary embodiments of the present disclosure also provide a processing device for a deep reinforcement learning model, configured in a model training system having a model training machine and an interactive machine. Referring to fig. 8, the processing apparatus of the deep reinforcement learning model may include a model partitioning module 810, a patch stitching module 820, a training data generating module 830, and a model training module 840. Wherein:
The model partitioning module 810 may be configured to partition the deep reinforcement learning model through a model training machine to obtain a plurality of model fragments, and send each model fragment to an intermediate node through a model distribution process;
the fragment splicing module 820 may be configured to splice the model fragments through the intermediate node to obtain a complete serialization model, and send the complete serialization model to the interaction machine;
the training data generating module 830 may be configured to perform deserialization processing on the complete serialization model through an interaction machine to obtain the deep reinforcement learning model, and interact with a preset virtual environment through the deep reinforcement learning model to obtain training data;
model training module 840 may be configured to send the training data to the model training machine via an interactive machine and to train the deep reinforcement learning model via the training data via the model training machine.
In one exemplary embodiment of the present disclosure, partitioning a deep reinforcement learning model into a plurality of model fragments includes:
calculating the node number of the intermediate nodes, and determining the number of fragments which can be divided by the deep reinforcement learning model according to the node number;
And dividing the deep reinforcement learning model in equal parts according to the number of the fragments to obtain a plurality of model fragments.
In one exemplary embodiment of the present disclosure, sending each of the model fragments to an intermediate node through a model distribution process includes:
starting the model distribution process through a preset distributed execution engine;
and encoding the model fragments, and transmitting the model fragments to the intermediate node one by one through a model distribution process based on the fragment encoding sequence of the model fragments.
In an exemplary embodiment of the present disclosure, the splicing the model fragments by the intermediate node, to obtain a complete serialization model, includes:
forwarding the current model fragments received by the intermediate node to other nodes except the intermediate node, and receiving all the model fragments except the current model fragments sent by the other nodes;
sorting the current model fragments and all other model fragments according to the fragment codes of the current model fragments and all other model fragments;
and splicing the sequenced current model fragments and all other model fragments to obtain a complete serialization model.
In one exemplary embodiment of the present disclosure, sending the complete serialization model into the interactive machine includes:
and sending the complete serialization model to an interaction process included in the interaction machine in a manner of interprocess communication of the intermediate node.
In an exemplary embodiment of the present disclosure, the interaction with a preset virtual environment through the deep reinforcement learning model, to obtain training data, includes:
the method comprises the steps that interaction is carried out with a preset virtual environment through the deep reinforcement learning model to obtain a plurality of interaction sequences, wherein the interaction sequences comprise a plurality of sampling data, and each sampling data comprises a first state of the preset virtual environment, a decision action and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state;
and generating the training data according to each interaction sequence.
In one exemplary embodiment of the present disclosure, training the deep reinforcement learning model with the training data includes:
determining, for each piece of sample data in the training data, an advantage function value of the advantage function of the deep reinforcement learning model corresponding to an environmental state in the sample data, and an advantage expectation of the advantage function value under the decision strategy corresponding to the sample data;
determining, for each piece of sample data in the training data, an action value corresponding to the sample data according to the sample data, the advantage function value corresponding to the sample data, the advantage expectation, and the state value function of the deep reinforcement learning model;
and determining update gradient information of the action value function of the deep reinforcement learning model based on the action value, and updating the deep reinforcement learning model according to the update gradient information.
The specific details of each module in the processing device of the deep reinforcement learning model are described in detail in the processing method of the corresponding deep reinforcement learning model, so that the details are not repeated here.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may generally be referred to herein as a "circuit," "module" or "system."
An electronic device 900 according to such an embodiment of the present disclosure is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, a bus 930 connecting the different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
Wherein the storage unit stores program code that is executable by the processing unit 910 such that the processing unit 910 performs steps according to various exemplary embodiments of the present disclosure described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 910 may perform step S310 as shown in fig. 3: dividing a deep reinforcement learning model through a model training machine to obtain a plurality of model fragments, and sending each model fragment to an intermediate node through a model distribution process; step S320: splicing the model fragments through the intermediate node to obtain a complete serialization model, and sending the complete serialization model to the interactive machine; step S330: performing deserialization processing on the complete serialization model through the interaction machine to obtain the deep reinforcement learning model, and performing interaction with a preset virtual environment through the deep reinforcement learning model to obtain training data; step S340: and transmitting the training data to the model training machine through the interactive machine, and training the deep reinforcement learning model through the training data through the model training machine.
The storage unit 920 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 9201 and/or cache memory 9202, and may further include Read Only Memory (ROM) 9203.
The storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus 930 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 950. Also, electronic device 900 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 960. As shown, the network adapter 960 communicates with other modules of the electronic device 900 over the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
A program product for implementing the above-described method according to an embodiment of the present disclosure may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
Furthermore, the above-described figures are only schematic illustrations of processes included in the method according to the exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (9)

1. A method of processing a deep reinforcement learning model, configured in a model training system having a model training machine and an interactive machine, the method comprising:
dividing a deep reinforcement learning model through a model training machine to obtain a plurality of model fragments, and sending each model fragment to an intermediate node through a model distribution process;
forwarding the current model fragments received by the intermediate node to the other nodes except the intermediate node, and receiving all the model fragments other than the current model fragments sent by the other nodes; sorting the current model fragments and all the other model fragments according to their fragment codes; splicing the sorted current model fragments and all the other model fragments to obtain a complete serialization model, and sending the complete serialization model to the interactive machine;
performing deserialization processing on the complete serialization model through the interactive machine to obtain the deep reinforcement learning model, and interacting with a preset virtual environment through the deep reinforcement learning model to obtain training data;
and sending the training data to the model training machine through the interactive machine, and training the deep reinforcement learning model with the training data through the model training machine.
2. The method for processing the deep reinforcement learning model according to claim 1, wherein dividing the deep reinforcement learning model to obtain a plurality of model fragments comprises:
calculating the number of the intermediate nodes, and determining, according to the number of nodes, the number of fragments into which the deep reinforcement learning model can be divided;
and dividing the deep reinforcement learning model into equal parts according to the number of fragments to obtain the plurality of model fragments.
3. The method of processing a deep reinforcement learning model of claim 1, wherein sending each of the model fragments to an intermediate node by a model distribution process comprises:
starting the model distribution process through a preset distributed execution engine;
and encoding the model fragments, and transmitting the model fragments one by one to the intermediate node through the model distribution process in the order of their fragment codes.
4. The method of processing a deep reinforcement learning model of claim 1, wherein sending the complete serialization model to the interactive machine comprises:
and sending the complete serialization model to an interaction process included in the interactive machine through inter-process communication of the intermediate node.
5. The method for processing a deep reinforcement learning model according to claim 1, wherein interacting with a preset virtual environment through the deep reinforcement learning model to obtain training data comprises:
interacting with the preset virtual environment through the deep reinforcement learning model to obtain a plurality of interaction sequences, wherein each interaction sequence comprises a plurality of pieces of sampled data, and each piece of sampled data comprises a first state of the preset virtual environment, a decision action, and a return value obtained by executing the decision action when the virtual environment is in the state corresponding to the first state;
and generating the training data according to each interaction sequence.
6. The method of processing a deep reinforcement learning model according to claim 5, wherein training the deep reinforcement learning model with the training data comprises:
determining, for each piece of sampled data in the training data, an advantage function value of the advantage function of the deep reinforcement learning model corresponding to the environment state in the sampled data, and an advantage expectation of the advantage function value under the decision policy corresponding to the sampled data;
determining, for each piece of sampled data in the training data, an action value corresponding to the sampled data according to the sampled data, the advantage function value corresponding to the sampled data, the advantage expectation, and a state value function of the deep reinforcement learning model;
and determining update gradient information of an action value function of the deep reinforcement learning model based on the action value, and updating the deep reinforcement learning model according to the update gradient information.
7. A processing apparatus for a deep reinforcement learning model, configured in a model training system having a model training machine and an interactive machine, the apparatus comprising:
the model division module is used for dividing the deep reinforcement learning model through a model training machine to obtain a plurality of model fragments, and sending each model fragment to the intermediate node through a model distribution process;
the fragment splicing module is used for forwarding the current model fragments received by the intermediate node to the other nodes except the intermediate node, and receiving all the model fragments other than the current model fragments sent by the other nodes; sorting the current model fragments and all the other model fragments according to their fragment codes; splicing the sorted current model fragments and all the other model fragments to obtain a complete serialization model, and sending the complete serialization model to the interactive machine;
the training data generation module is used for performing deserialization processing on the complete serialization model through the interactive machine to obtain the deep reinforcement learning model, and interacting with a preset virtual environment through the deep reinforcement learning model to obtain training data;
and the model training module is used for sending the training data to the model training machine through the interactive machine and training the deep reinforcement learning model with the training data through the model training machine.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of processing a deep reinforcement learning model according to any one of claims 1-6.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of processing the deep reinforcement learning model of any one of claims 1-6 via execution of the executable instructions.
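Read alongside claims 5 and 6, the action-value recombination resembles a dueling-style decomposition, Q(s, a) = V(s) + A(s, a) - E[A(s, .)] under the decision policy. The following PyTorch sketch is only an illustrative reading under that assumption, with hypothetical names (action_value, update_step), assumed model outputs, and a placeholder regression target on the return values; it is not the claimed training procedure.

import torch
import torch.nn.functional as F

def action_value(state_value, advantage, policy_probs):
    # Q(s, a) = V(s) + A(s, a) - E_{a'~pi}[A(s, a')]: recombine the state value
    # function, the advantage function value, and the advantage expectation.
    advantage_expectation = (policy_probs * advantage).sum(dim=-1, keepdim=True)
    return state_value + advantage - advantage_expectation

def update_step(model, optimizer, states, actions, returns):
    # One hypothetical gradient update on (first state, decision action, return value)
    # samples drawn from the interaction sequences.
    state_value, advantage, policy_probs = model(states)   # assumed model outputs
    q = action_value(state_value, advantage, policy_probs)
    q_taken = q.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    loss = F.mse_loss(q_taken, returns)                    # placeholder target
    optimizer.zero_grad()
    loss.backward()                                        # update gradient information
    optimizer.step()

In practice the regression target and the handling of the advantage expectation depend on the specific algorithm; the sketch only shows how the quantities named in claim 6 can fit together.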
CN202111061787.9A 2021-09-10 2021-09-10 Processing method and device of deep reinforcement learning model, medium and electronic equipment Active CN113780554B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111061787.9A CN113780554B (en) 2021-09-10 2021-09-10 Processing method and device of deep reinforcement learning model, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111061787.9A CN113780554B (en) 2021-09-10 2021-09-10 Processing method and device of deep reinforcement learning model, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113780554A CN113780554A (en) 2021-12-10
CN113780554B true CN113780554B (en) 2023-10-24

Family

ID=78842369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111061787.9A Active CN113780554B (en) 2021-09-10 2021-09-10 Processing method and device of deep reinforcement learning model, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113780554B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754221B1 (en) * 2017-03-09 2017-09-05 Alphaics Corporation Processor for implementing reinforcement learning operations
WO2018236674A1 (en) * 2017-06-23 2018-12-27 Bonsai Al, Inc. For hiearchical decomposition deep reinforcement learning for an artificial intelligence model
CN112766497A (en) * 2021-01-29 2021-05-07 北京字节跳动网络技术有限公司 Deep reinforcement learning model training method, device, medium and equipment
CN112784445A (en) * 2021-03-11 2021-05-11 四川大学 Parallel distributed computing system and method for flight control agent
WO2021135449A1 (en) * 2020-06-30 2021-07-08 平安科技(深圳)有限公司 Deep reinforcement learning-based data classification method, apparatus, device, and medium
CN113222118A (en) * 2021-05-19 2021-08-06 北京百度网讯科技有限公司 Neural network training method, apparatus, electronic device, medium, and program product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10445641B2 (en) * 2015-02-06 2019-10-15 Deepmind Technologies Limited Distributed training of reinforcement learning systems
US11416743B2 (en) * 2019-04-25 2022-08-16 International Business Machines Corporation Swarm fair deep reinforcement learning
WO2020227419A1 (en) * 2019-05-06 2020-11-12 Openlattice, Inc. Record matching model using deep learning for improved scalability and adaptability

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754221B1 (en) * 2017-03-09 2017-09-05 Alphaics Corporation Processor for implementing reinforcement learning operations
WO2018236674A1 (en) * 2017-06-23 2018-12-27 Bonsai Al, Inc. For hiearchical decomposition deep reinforcement learning for an artificial intelligence model
WO2021135449A1 (en) * 2020-06-30 2021-07-08 平安科技(深圳)有限公司 Deep reinforcement learning-based data classification method, apparatus, device, and medium
CN112766497A (en) * 2021-01-29 2021-05-07 北京字节跳动网络技术有限公司 Deep reinforcement learning model training method, device, medium and equipment
CN112784445A (en) * 2021-03-11 2021-05-11 四川大学 Parallel distributed computing system and method for flight control agent
CN113222118A (en) * 2021-05-19 2021-08-06 北京百度网讯科技有限公司 Neural network training method, apparatus, electronic device, medium, and program product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Deep Reinforcement Learning Research; Zhao Xingyu; Ding Shifei; Computer Science (Issue 07); 7-12 *

Also Published As

Publication number Publication date
CN113780554A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN110288688B (en) Virtual vegetation rendering method and device, storage medium and electronic equipment
JP7454529B2 (en) Distributed model training device and method, electronic device, storage medium, and computer program
EP3882777A2 (en) Method and apparatus for acquiring sample deviation data, and electronic device
CN111552550A (en) Task scheduling method, device and medium based on GPU (graphics processing Unit) resources
EP4287074A1 (en) Mixture-of-experts model implementation method and system, electronic device, and storage medium
CN112622912A (en) Test device and method for automatic driving vehicle
CN112256653B (en) Data sampling method and device
CN113204425B (en) Method, device, electronic equipment and storage medium for process management internal thread
CN113780554B (en) Processing method and device of deep reinforcement learning model, medium and electronic equipment
CN113766504A (en) Communication connection method, device, server, terminal device, system and medium
CN112199154A (en) Distributed collaborative sampling central optimization-based reinforcement learning training system and method
CN110047030B (en) Periodic special effect generation method and device, electronic equipment and storage medium
CN115292044A (en) Data processing method and device, electronic equipment and storage medium
EP4020327A2 (en) Method and apparatus for training data processing model, electronic device and storage medium
US11546417B2 (en) Method for managing artificial intelligence application, device, and program product
CN114501084A (en) Play starting method, device, equipment and medium of player
CN110879730B (en) Method and device for automatically adjusting game configuration, electronic equipment and storage medium
CN115687233A (en) Communication method, device, equipment and computer readable storage medium
CN115080492B (en) Server carrier plate, data communication method, server main board, system and medium
CN116860130A (en) Data visualization method, device, electronic equipment and medium
CN116882489A (en) Federal training method, data analysis method and device, medium and electronic equipment
CN116910568B (en) Training method and device of graph neural network model, storage medium and electronic device
CN114202947B (en) Internet of vehicles data transmission method and device and automatic driving vehicle
CN117311975A (en) Large model parallel training method, system and readable storage medium
CN117408350A (en) Federal learning method, electronic device, medium, and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant