CN113052312A - Deep reinforcement learning model training method and device, medium and electronic equipment - Google Patents


Info

Publication number
CN113052312A
CN113052312A
Authority
CN
China
Prior art keywords
value
parameter
strategy
function
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110351941.XA
Other languages
Chinese (zh)
Inventor
范嘉骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202110351941.XA priority Critical patent/CN113052312A/en
Publication of CN113052312A publication Critical patent/CN113052312A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks


Abstract

The present disclosure relates to a training method, apparatus, medium, and electronic device for a deep reinforcement learning model. The method includes: acquiring an interaction sequence generated by interaction between the deep reinforcement learning model and a virtual environment, the interaction sequence including a plurality of pieces of sampling data; for each piece of sampling data, determining an advantage function value of the advantage function of the deep reinforcement learning model corresponding to the environment state in the sampling data, and an advantage expectation of the advantage function value under the decision strategy corresponding to the sampling data, the decision strategy being determined based on a strategy family function formed from the advantage function and a plurality of strategy parameters of the deep reinforcement learning model that have an association relationship; for each piece of sampling data, determining an action value according to the sampling data, the corresponding advantage function value, the advantage expectation, and a state value function of the deep reinforcement learning model; determining update gradient information of the action value function based on the action values; and updating the deep reinforcement learning model according to the update gradient information.

Description

Deep reinforcement learning model training method and device, medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method, an apparatus, a medium, and an electronic device for a deep reinforcement learning model.
Background
With the development of computer technology, large-scale and complex machine learning models are gradually being applied. Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning, can perform control directly from input images, and is closer to the way humans think. In the training process of a deep reinforcement learning model, the decision action strategy in a given state generally needs to be evaluated based on an action value function, so as to facilitate strategy improvement of the deep reinforcement learning model.
In the related art, errors are introduced when action values are calculated based on the action value function, and strategy exploration is determined based on a single strategy, so the richness of the strategy is very low, strategy updates are often unstable, and the training efficiency and accuracy of the deep reinforcement learning model are difficult to guarantee.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a training method for a deep reinforcement learning model, the method including:
acquiring an interaction sequence generated by interaction of a deep reinforcement learning model and a virtual environment, wherein the interaction sequence comprises a plurality of sampling data, and each sampling data comprises a first state of the virtual environment, a decision action and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state;
for each sampling data, determining an advantage function value of the deep reinforcement learning model corresponding to an environment state in the sampling data and an advantage expectation of the advantage function value under a decision strategy corresponding to the sampling data, wherein the decision strategy is determined based on a strategy family function formed by a plurality of strategy parameters having an association relationship in the advantage function and the deep reinforcement learning model;
for each sampling data, determining an action value corresponding to the sampling data according to the sampling data, an advantage function value corresponding to the sampling data, the advantage expectation and a state value function of the deep reinforcement learning model;
determining updated gradient information of an action value function of the deep reinforcement learning model based on the action value;
and updating the deep reinforcement learning model according to the updating gradient information.
In a second aspect, the present disclosure further provides a training apparatus for deep reinforcement learning model, the apparatus comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring an interaction sequence generated by interaction of a deep reinforcement learning model and a virtual environment, the interaction sequence comprises a plurality of sampling data, and each sampling data comprises a first state of the virtual environment, a decision action and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state;
a first determining module, configured to determine, for each sample data, an advantage function value corresponding to an advantage function of the deep reinforcement learning model and an environment state in the sample data, and an advantage expectation of the advantage function value under a decision policy corresponding to the sample data, where the decision policy is determined based on a policy family function formed by multiple policy parameters having an association relationship in the advantage function and the deep reinforcement learning model;
a second determining module, configured to determine, for each piece of the sample data, an action value corresponding to the sample data according to the sample data, an advantage function value corresponding to the sample data, the advantage expectation, and a state value function of the deep reinforcement learning model;
a third determination module, configured to determine update gradient information of an action value function of the deep reinforcement learning model based on the action value;
and the updating module is used for updating the deep reinforcement learning model according to the updating gradient information.
In a third aspect, a computer-readable medium is provided, on which a computer program is stored which, when being executed by a processing device, carries out the steps of the method of the first aspect.
In a fourth aspect, an electronic device is provided, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method of the first aspect.
In the above technical solution, the decision strategy is determined based on a strategy family function formed from the advantage function and a plurality of strategy parameters of the deep reinforcement learning model that have an association relationship, so the strategy family function corresponding to the decision strategy can be represented by the advantage function. The advantage expectation of the advantage function value under the decision strategy corresponding to the sampling data can therefore be calculated directly from the advantage function, yielding an accurate advantage expectation. Compared with the related art, which estimates this expectation by the average of the advantage function values, this effectively reduces the error in calculating the advantage expectation; it not only allows the advantage value of each action in the environment state to be evaluated accurately, but also improves the evaluation accuracy of the strategy in the deep reinforcement learning model and provides accurate data support for the training process of the deep reinforcement learning model. Furthermore, a strategy family function can be formed from a plurality of strategy parameters having an association relationship, and determining the decision strategy based on this strategy family function effectively enlarges the exploration space of the decision strategy and improves the diversity of strategy optimization. This promotes effective updating of the strategy of the deep reinforcement learning model to a certain extent, improves the training efficiency and robustness of the deep reinforcement learning model, and effectively reduces the heavy demands that training a deep reinforcement learning model places on device resources.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart of a training method of a deep reinforcement learning model provided according to an embodiment of the present disclosure;
FIG. 2 is a schematic illustration of a target hyperspace provided in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a target hyperspace and its corresponding reference hyperspace provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a block diagram of a training apparatus for a deep reinforcement learning model provided according to an embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart illustrating a training method of a deep reinforcement learning model according to an embodiment of the present disclosure, and as shown in fig. 1, the method may include:
in step 11, an interaction sequence generated by interaction between the deep reinforcement learning model and the virtual environment is obtained, where the interaction sequence includes a plurality of sample data, and each sample data includes an environment state of the virtual environment, a decision action, and a report value obtained by executing the decision action when the virtual environment is in a state corresponding to the environment state.
The deep reinforcement learning model combines the perception capability of deep learning with the decision-making capability of reinforcement learning. At each moment, the agent obtains a high-dimensional observation through interaction with the environment and perceives the observation with a deep learning method to obtain a specific state feature representation; a piece of sampling data represents the specific state representation (together with the associated action and return) obtained by sampling at some moment of this interaction process. A value function of each state (the state value function) and a value function of each state-action pair (the action value function) can then be evaluated based on the expected return, and the decision strategy, which maps the current state to a corresponding decision action, is improved based on these two value functions; the environment reacts to the decision action and yields the next observation. The above process is cycled continuously to obtain the optimal strategy for achieving the goal; illustratively, the goal is to maximize the accumulated return.
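As an illustration of this interaction loop, the following is a minimal Python sketch rather than the patented implementation: `env`, `model.decide`, and the `Sample` fields are hypothetical stand-ins for the virtual environment, the decision strategy of the deep reinforcement learning model, and the pieces of sampling data described above.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Sample:
    first_state: Any    # environment state observed before acting
    action: Any         # decision action chosen by the model
    reward: float       # return value obtained after executing the action
    second_state: Any   # environment state after the environment reacts

def collect_interaction_sequence(env, model, max_steps: int = 512) -> List[Sample]:
    sequence, state = [], env.reset()
    for _ in range(max_steps):
        action = model.decide(state)                  # map the current state to a decision action
        next_state, reward, done = env.step(action)   # the environment reacts and returns a reward
        sequence.append(Sample(state, action, reward, next_state))
        if done:                                      # the interaction episode has ended
            break
        state = next_state
    return sequence
```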
In a possible embodiment, the interaction sequence is obtained by sampling during the interaction of a virtual object with the virtual environment, wherein the virtual object is controlled based on the deep reinforcement learning model, the deep reinforcement learning model is used for determining each decision action performed by the virtual object, and the virtual environment is the environment where the virtual object is located.
The virtual environment may be a virtual scene environment generated by a computer; for example, the virtual environment may be a game scene. Illustratively, multimedia data for interacting with a user is rendered and displayed as the game scene, so that the virtual environment provides a multimedia virtual world. The user can control the actions of a virtual object through controls on an operation interface, or directly control a virtual object operable in the virtual environment, observe objects, characters, scenery, and the like in the virtual environment from the perspective of the virtual object, and interact through the virtual object with other virtual objects in the virtual environment. The virtual environment may also include other virtual objects in the scene. The virtual object may be an avatar in the virtual environment used to simulate the user, which may be a human or other animal avatar, or the like.
The application scene may be a scene in which the virtual object senses the environment in which it is located and acts according to the sensed environment state. The application scene can include the virtual object and a plurality of environment objects contained in the environment where the virtual object is located. In such a scene, the virtual object can fuse the environment state of its environment and input the fused environment state into the deep reinforcement learning model, so as to obtain the decision action to be executed by the virtual object. The virtual object may be any agent that can interact with the environment and act according to the environment state.
Illustratively, the deep reinforcement learning model is used for training a game artificial intelligence (game AI), the sampled interaction sequence is obtained by sampling the game AI during a game play of a target game, and the virtual environment is the training environment in which the game AI is located in the target game.
As an example, the target game is a gun-battle game, the virtual object may be a game match AI, and the corresponding decision action may be controlling the game match AI character to attack, move, stop, and the like. As another example, the target game is a driving game, the virtual object may be an automatic driving game vehicle AI, and the corresponding decision action may be controlling the vehicle to turn, go straight, brake, and the like. As another example, the target game may be an assembly game, the virtual object may be a robot AI, and the corresponding decision action may be controlling the robot AI to move, grab, and put down an object to be assembled, and so on.
For example, when the game artificial intelligence is sampled in the game of the target game to obtain the interaction sequence, the environment where the game artificial intelligence is located can be sensed by the game artificial intelligence to obtain the multi-modal environment state of the training environment where the game artificial intelligence is located. The environment state may include an environment image and object information of each environment object in the environment image, where the object information includes specific parameters corresponding to the environment object. For example, when the virtual object is a game match AI in a gun-battle game, the virtual environment may be a training environment in which the game match AI is located in the gun-battle game, the environment image may be a game map in which the game match AI is located, the environment object may be an enemy unit, a road, a building, or the like in the game map, and the object information may include information such as a numerical parameter (for example, blood volume, offensive power, skill) of the enemy unit, a name, a location, and the like. When the virtual object is a game vehicle AI of a driving game, the virtual environment may be a training environment in which the game vehicle AI is located in the driving game, the environment image may be a captured image of the surroundings of the vehicle, the environment object may be another vehicle, an obstacle, a road, and the like around the vehicle, and the object information may include information such as a vehicle speed, a traveling direction, and a size of the other vehicle. When the virtual object is a robot AI in the assembly game, the virtual environment may be a training environment in which the robot AI is located in the assembly game, the environment image may be a photographed image of an area where the object to be assembled is located, the environment object may be the object to be assembled, and the object information may include information such as the size, shape, and position of the object to be assembled. The object information for each environmental object may then be preprocessed to obtain an object feature vector for each environmental object. For example, the object information of each environmental object may be input into a pre-trained deep learning network to convert the object information of each environmental object into an object feature vector of the environmental object.
As an example, the virtual object may perform a decision action in a first state of the virtual environment, and after the virtual object performs the decision action, the virtual environment reacts to it, yielding a second state of the virtual environment and a return value corresponding to the decision action. When sampling is performed during the interaction between the virtual object and the virtual environment, the first state, the decision action, the second state, and the return value may be used as the sampling data corresponding to the sampling time; unless otherwise stated, the environment state in the embodiments of the present disclosure is the first state. In one complete interaction process, the pieces of sampling data, ordered by sampling time, form an interaction sequence. Illustratively, the target model may be the deep reinforcement learning model, the target game may be a maze-type game, the virtual object may be a game AI, and the virtual environment may be a virtual maze environment in which virtual rewards appear at random locations; the deep reinforcement learning model may be trained to determine a strategy for the game AI to move from the maze entrance E1 to the exit E2 so as to maximize the virtual rewards gained along the way. Illustratively, starting from the sample at entrance E1, the actions available to the game AI in the first state of the virtual maze environment at the initial time are going straight or turning right; the decision action in that state is then determined according to the policy, illustratively going straight, the environment reacts to the decision action to yield the return value and the second state, and one piece of sampling data is obtained. At the next sampling moment, the first state of the game AI in the virtual maze environment is obtained, the actions available in that first state are again going straight or turning right, and the corresponding decision action is determined according to the policy, illustratively turning right; similarly, the return value and the second state are obtained from the environment's reaction to the decision action, yielding the next piece of sampling data. By sampling in this manner during the game AI's movement to exit E2, the interaction sequence containing a plurality of pieces of sampling data is obtained.
When sampling is performed, an image of the virtual environment corresponding to the sampling time can be acquired, so that feature extraction can be performed on the image to obtain the first state. After the virtual object performs the decision-making action, an image of the virtual environment is acquired and feature extraction is performed on the image to obtain a second state. The reward value may be a change of a score value corresponding to the virtual object after the decision action is executed, or a change of a virtual life bar, and the reward value may be set according to an actual usage scenario, which is not limited by the present disclosure.
In step 12, for each piece of sampling data, an advantage function value of the deep reinforcement learning model corresponding to the environment state in the sampling data and an advantage expectation of the advantage function value under the decision strategy corresponding to the sampling data are determined, where the decision strategy is determined based on a strategy family function formed from the advantage function and a plurality of strategy parameters of the deep reinforcement learning model that have an association relationship.
The plurality of strategy parameters are hyper-parameters in the deep reinforcement learning model that characterize the diversity of strategies. Their values can be set based on human experience, or dynamically adjusted based on the corresponding interaction sequences during the updating of the deep reinforcement learning model, so as to further improve the accuracy and efficiency of updating the deep reinforcement learning model. In this embodiment, a strategy family function is formed from a plurality of strategy parameters having an association relationship; on the one hand, the strategy function can be represented based on the advantage function so that the decision strategy is determined automatically, and on the other hand, the exploration space corresponding to the decision strategy can be effectively enlarged, improving the accuracy and diversity of the decision strategy and, to a certain extent, the exploration efficiency of the deep reinforcement learning model.
In the deep reinforcement learning model, the calculation of the advantage function may be implemented by a neural network; the advantage function network may be implemented based on, for example, a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network). The environment state in the sampling data can therefore be input into the advantage function network to obtain its output value, that is, the advantage function value corresponding to the environment state.
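For concreteness, a minimal sketch of such an advantage function network is given below, assuming PyTorch and a simple fully connected architecture; the text only requires that some neural network (for example a CNN or RNN) map the environment state to one advantage value per action, so the architecture and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdvantageNetwork(nn.Module):
    """Maps an environment state feature vector to one advantage value per action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> advantage values A(s, .): (batch, num_actions)
        return self.net(state)
```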
In the art, the advantage function is typically used to evaluate the advantage value of selecting action a in state s; therefore, when determining the advantage value, it is usually necessary to determine the advantage expectation over the actions selectable in state s. In the related art, this expectation is estimated from the average of the advantage function values, which inevitably introduces errors, leading to calculation errors in the action values, so that the optimal strategy cannot be found in the training process of the deep reinforcement learning model.
In the deep reinforcement learning model, the current state is mapped to the corresponding decision action based on the decision strategy. Therefore, in the embodiment of the present disclosure, the strategy family function of the decision strategy may be solved based on the advantage function and the plurality of strategy parameters, so that the strategy family function has an explicit expression. Under the decision strategy corresponding to the sampling data, the advantage expectation of the advantage function value can then be calculated directly based on the advantage function and the corresponding decision strategy, without approximating it by the mean of the samples in the interaction sequence. When the action value is calculated, the problem of errors being introduced in calculating the advantage expectation is therefore avoided, accurate advantage expectations and action values are obtained, the decision strategy used to select decision actions is evaluated accurately, and the strategy improvement efficiency of the deep reinforcement learning model is increased.
In step 13, for each piece of sampling data, the action value corresponding to the sampling data is determined according to the sampling data, the advantage function value corresponding to the sampling data, the advantage expectation, and the state value function of the deep reinforcement learning model.
In the deep reinforcement learning model, a value function is usually used to evaluate the value of a certain state or state-action pair, that is, the value for the agent of being in a certain state or of executing a certain action in that state. The value of a state is usually evaluated with the state value function; it can be expressed through the values of all actions in that state, i.e., the expectation of the accumulated return obtained starting from state s. Under a given strategy the accumulated return follows a distribution, and its expectation at the state is defined as the state value function V(s):
V_π(s) = E_π[G_t | S_t = s]
where V_π(s) denotes the expected value of the accumulated return G_t when the state S_t at time t equals s under strategy π.
As an example, the accumulated return may be the sum of the return values corresponding to each decision action included in the interaction sequence. As another example, because a decision action that is further away in time has less influence on the current decision, the accumulated return may be the sum of the return value of each decision action in the interaction sequence multiplied by an attenuation coefficient corresponding to that decision action, where the attenuation coefficients decrease along the order of the decision actions, for example:
G_t = R_{t+1} + γ·R_{t+2} + γ^2·R_{t+3} + … + γ^(n-1)·R_{t+n}
    = R_{t+1} + γ·(R_{t+2} + γ·R_{t+3} + … + γ^(n-2)·R_{t+n})
    = R_{t+1} + γ·G_{t+1}
where R_i is the return value of the decision action at time i, γ is the attenuation coefficient, and n is the number of samples in the interaction sequence from time t to the end of the interaction.
Thus, in another embodiment, the accumulated return may be computed backwards from the last decision action of the interaction sequence: its return value is multiplied by the attenuation coefficient and the return value of the previous decision action is added, and this is repeated until the return value of the first decision action in the interaction sequence has been added. The attenuation coefficient can be set according to the actual usage scenario.
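The backward recursion G_t = R_{t+1} + γ·G_{t+1} described above can be sketched as follows; the function name and the plain list representation of the return values are assumptions for illustration.

```python
from typing import List

def accumulated_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    # G_t = R_{t+1} + gamma * G_{t+1}, computed backwards from the last decision action
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# e.g. accumulated_returns([1.0, 0.0, 2.0], gamma=0.9) -> [2.62, 1.8, 2.0]
```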
Similarly, in the deep reinforcement learning model, the calculation of the state value function can be realized through a neural network. Therefore, the environmental state in the sample data can be input into the state value function network, so that the output value of the state value function network, that is, the state value of the state value function corresponding to the environmental state can be obtained.
In the deep reinforcement learning model, the action value function is generally used to evaluate the value of executing a certain action in a certain state, that is, the expectation of the accumulated return obtained after selecting action a in state s:
Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
That is, it represents the expected value of the accumulated return G_t when, under strategy π, the state S_t at time t is s and the selected action A_t is a; the action value function can therefore be used to evaluate the strategy π.
Based on the definitions of the advantage function A(s, a), the state value function V(s), and the action value function Q(s, a), the following relationship holds in deep reinforcement learning:
Q(s,a)=A(s,a)+V(s)
In this embodiment, the action value of the action value function corresponding to the environment state and the decision action may therefore be determined according to the sampling data, the advantage function value corresponding to the sampling data, the advantage expectation, and the state value function of the deep reinforcement learning model.
In step 14, update gradient information of the action value function of the deep reinforcement learning model is determined based on the action value.
The update gradient of the action value function can be determined by differentiating a loss function corresponding to the action value function with respect to the parameters of the deep reinforcement learning model. For example, the loss function may be the mean square error between the action value and its corresponding target value:
Q(θ) = E_π[(Q_π(s_t, a_t) − Q_θ(s_t, a_t))^2]
where Q(θ) denotes the mean square error, Q_θ(s_t, a_t) denotes the action value, Q_π(s_t, a_t) denotes the target value corresponding to Q_θ(s_t, a_t), and θ denotes the model parameters to be updated in the deep reinforcement learning model.
Then, by differentiating the loss function and simplifying (for example, dropping the constant factor produced by the differentiation), the update gradient of the action value function is obtained in the form:
∇_θ Q(θ) = E_π[(Q_π(s_t, a_t) − Q_θ(s_t, a_t)) · ∇_θ Q_θ(s_t, a_t)]
thus, the corresponding update gradient information can be determined based on the above formula and the environmental state and decision action in the sampled data.
In step 15, the deep reinforcement learning model is updated according to the update gradient information.
For example, a PPO (Proximal Policy Optimization) algorithm may be used to update the parameters of the deep reinforcement learning model based on the update gradient information, so as to implement policy optimization of the deep reinforcement learning model.
In the above technical solution, the decision strategy is determined based on a strategy family function formed from the advantage function and a plurality of strategy parameters of the deep reinforcement learning model that have an association relationship, so the strategy family function corresponding to the decision strategy can be represented by the advantage function. The advantage expectation of the advantage function value under the decision strategy corresponding to the sampling data can therefore be calculated directly from the advantage function, yielding an accurate advantage expectation. Compared with the related art, which estimates this expectation by the average of the advantage function values, this effectively reduces the error in calculating the advantage expectation; it not only allows the advantage value of each action in the environment state to be evaluated accurately, but also improves the evaluation accuracy of the strategy in the deep reinforcement learning model and provides accurate data support for the training process of the deep reinforcement learning model. Furthermore, a strategy family function can be formed from a plurality of strategy parameters having an association relationship, and determining the decision strategy based on this strategy family function effectively enlarges the exploration space of the decision strategy and improves the diversity of strategy optimization. This promotes effective updating of the strategy of the deep reinforcement learning model to a certain extent, improves the training efficiency and robustness of the deep reinforcement learning model, and effectively reduces the heavy demands that training a deep reinforcement learning model places on device resources.
In a possible embodiment, the strategy parameters include strategy entropy parameters and a weight parameter corresponding to each strategy entropy parameter, where the weight parameters corresponding to the strategy entropy parameters sum to 1. Entropy is a measure of uncertainty: the greater the uncertainty, the larger the entropy. In the present disclosure, the strategy entropy parameters of the deep reinforcement learning model are hyper-parameters used to characterize the diversity of strategies.
Accordingly, the policy family function is determined by:
and determining sub-functions of the sub-strategies corresponding to the strategy entropy parameters according to the dominance function values and the strategy entropy parameters.
As described above, the advantage function value represents the value of the actions in a given state and may be a vector, where each dimension represents the value of the action corresponding to that dimension. For example, the ratio of the advantage function value to each strategy entropy parameter may be determined as the sub-function of the sub-strategy corresponding to that strategy entropy parameter.
Then, the strategy family function is determined as the sum, over the sub-functions, of the product of the probability distribution obtained by applying softmax to each sub-function and the weight parameter corresponding to that sub-function.
By way of example, the strategy family function μ may be expressed as:
μ = Σ_{i=1..m} ε_i · softmax(A / τ_i)
where m denotes the number of strategy entropy parameters, τ_i denotes the i-th strategy entropy parameter among the strategy parameters, and ε_i denotes the weight parameter corresponding to the i-th strategy entropy parameter.
Therefore, in the embodiment of the present disclosure, a sub-policy may be determined based on each policy entropy parameter, and then each sub-policy is used as a base policy to form a multi-dimensional policy space, each base policy may represent an entropy group, and a plurality of entropy groups are combined to obtain richer policies in the multi-dimensional policy space, so that the diversity of explorable policies may be greatly increased.
To facilitate the calculation of the strategy family function, when there are two strategy entropy parameters, the strategy family function can be expressed as:
μ = ε_1 · softmax(A / τ_1) + (1 − ε_1) · softmax(A / τ_2)
That is, when the weight parameter corresponding to the strategy entropy parameter τ_1 is ε_1, 1 − ε_1 can be used directly as the weight parameter of the strategy entropy parameter τ_2, which simplifies the parameter representation in the strategy family function and reduces the amount of data computation to a certain extent.
The manner of performing softmax processing to obtain a probability distribution is a conventional operation in the art and is not described here again. In this embodiment, by converting the sub-functions into probability distributions, the probability of each action under the strategy in the given state is determined based on the advantage function value; the probability distribution of each sub-function is thus expressed in terms of the advantage function value, the decision action is subsequently determined from it, and the training efficiency and accuracy of the deep reinforcement learning model are improved.
In one possible embodiment, in step 13, an exemplary implementation manner of determining an action value corresponding to the sample data according to the sample data, the advantage function value corresponding to the sample data, the advantage expectation, and the state value function of the deep reinforcement learning model is as follows, and the step may include:
the state value corresponding to the state value function is determined according to the environmental state in the sampled data, and the environmental state may be input to the state value function network, for example, as described above, so as to obtain the state value.
The difference between the advantage function value and the advantage expectation is determined as the processing advantage function value, so that the processing advantage function value represents how much the value of each action in the environment state exceeds the advantage expectation. If the value of a certain action in the environment state is better than the expectation, the processing advantage function value is positive, meaning that selecting the corresponding action is beneficial and can bring more return. Through this conversion, the processing advantage function values satisfy the constraint that their expectation is 0, which improves the stability of the output and the learning efficiency and makes the learning process of the deep reinforcement learning model more stable.
The sum of the processing advantage function value and the state value is then determined as the action value. As shown above, Q(s, a) = A(s, a) + V(s), so the action value can be determined once the processing advantage function value and the state value are determined.
Illustratively, as indicated above, the state value V may be determined based on the state value function network, namely:
V = V(s_t)
and the advantage value A based on the advantage function network, namely:
A = A(s_t)
The strategy family function μ can be expressed as:
μ = Σ_{i=1..m} ε_i · softmax(A(s_t) / τ_i)
The processing advantage function value Â may then be determined as the difference between the advantage function value and its expectation under the decision strategy:
Â = A(s_t) − E_{a~π}[A(s_t, a)]
where π denotes the decision strategy.
In the embodiment of the present disclosure, the deep reinforcement learning model may be trained in an off-policy manner, that is, the policy of the agent being learned differs from the policy the agent used when sampling during its interaction with the environment. To improve training efficiency, a strategy μ' may be used to interact with the environment and sample the interaction process to obtain the interaction sequence, where μ' is a strategy for data sampling determined based on the strategy family function; the model under strategy π is then updated based on the interaction sequence corresponding to μ', so that data collection and model learning are performed separately. Because the interaction sequence and the strategy of the deep reinforcement learning model differ in this process, importance sampling needs to be applied between them; this can be done in the usual way known in the field and is not described here again.
Then the action value Q is determined as the sum of the processing advantage function value and the state value:
Q = Â + V
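A minimal sketch of this computation, assuming a discrete action space and the relationships stated above (advantage expectation taken under the decision strategy π, Â = A − E_π[A], Q = Â + V), is given below; importance-sampling corrections for the off-policy case are omitted.

```python
import torch

def action_values(advantages: torch.Tensor,     # A(s_t, a) for each action, shape (num_actions,)
                  policy_probs: torch.Tensor,   # pi(a | s_t), same shape, sums to 1
                  state_value: torch.Tensor     # V(s_t), scalar tensor
                  ) -> torch.Tensor:
    advantage_expectation = (policy_probs * advantages).sum(dim=-1, keepdim=True)  # E_{a~pi}[A]
    processed_advantage = advantages - advantage_expectation                        # A_hat = A - E[A]
    return processed_advantage + state_value                                        # Q = A_hat + V
```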
Through the above technical solution, in determining the action value corresponding to the sampling data, the state value and the advantage value are obtained by evaluating the state and the state-action pair separately, which widens the range of application of the action value determination method. At the same time, processing the advantage function value with the advantage expectation improves the stability of the resulting processing advantage function value; since the advantage expectation is determined by direct mathematical calculation, no additional error is introduced in its determination, so the accuracy of the action value is guaranteed, the strategy of selecting actions in the environment state is evaluated accurately, and the strategy of the deep reinforcement learning model is updated effectively. This avoids strategy update errors in the deep reinforcement learning model caused by errors in the action value, and guarantees the accuracy of the decision actions used to control the virtual object based on the deep reinforcement learning model. In addition, learning efficiency is improved, and the amount of computation and the number of samples required for training the deep reinforcement learning model can be reduced to a certain extent.
In a possible embodiment, in step 14, an exemplary implementation of determining the update gradient information of the action value function of the deep reinforcement learning model based on the action value is as follows; this step may include:
and determining the updating gradient information of the action value function according to the updating gradient information of the decision strategy and the expected value of the difference value of the action value and the state value under the decision strategy in the component of the target direction.
The target direction may be an update gradient direction of the decision policy, and the determination manner of the action value and the state value is described in detail above and is not described herein again. In this embodiment, the deviation of the policy evaluation may be represented by an expected value of the difference between the action value and the state value in the target direction of the difference value under the decision policy. Thus, the difference between the updated gradient information of the decision strategy and the expected value may be used to construct updated gradient information of an action value function, such as:
[formula image not reproduced: the update gradient of the action value function expressed in terms of the update gradient of the decision strategy and the expectation, under π, of the component of (Q_t − V_t) in the direction g]
where Q_t denotes the action value at time t, V_t denotes the state value corresponding to time t, g denotes the update gradient direction, and π denotes the decision strategy.
The update gradient information of the action value function can therefore be calculated from this relationship. Moreover, when the deep reinforcement learning model is updated based on this update gradient information, strategy improvement is guaranteed while the error of strategy evaluation is reduced, so the strategy determined by the updated deep reinforcement learning model is better and its evaluation is more accurate, which improves the optimization efficiency of the deep reinforcement learning model and, to a certain extent, guarantees its convergence.
In a possible embodiment, a plurality of strategy parameters having an association relationship in the deep reinforcement learning model form a strategy parameter combination, and the values of the strategy parameter combination are updated based on a parameter determination model corresponding to the strategy parameter combination and an interaction sample generated from the interaction sequence. The interaction sample includes the sampling value combination of the strategy parameter combination corresponding to the interaction sequence and an optimization feature parameter corresponding to the deep reinforcement learning model, where the sampling value combination includes a sampling value for each strategy parameter.
For convenience of illustration, a strategy parameter combination α containing 2 dimensions may be represented as α = (λ_1, λ_2), and the sampling value combination contains a sampling value for each of the strategy parameters λ_1 and λ_2. Therefore, in the embodiment of the present disclosure, when the values of the plurality of strategy parameters having an association relationship in the deep reinforcement learning model are determined in a unified manner, the accuracy of the value of each strategy parameter can be ensured, the match between the values of the plurality of strategy parameters and the deep reinforcement learning model as a whole can be ensured, and the problem of getting stuck at a saddle point when each strategy parameter is optimized separately is avoided.
In a possible embodiment, an exemplary implementation manner of updating the value of the policy parameter combination based on the parameter determination model corresponding to the policy parameter combination and the interaction sample generated by the interaction sequence is as follows, and the step may include:
and under the condition that the number of the parameter determination models is one, updating the state value corresponding to the strategy parameter combination in the parameter determination model according to the interaction sample, wherein the strategy parameters correspond to the dimensions in the target hyperspace corresponding to the strategy parameter combination in a one-to-one manner, and the parameter space of each strategy parameter is discretized into a plurality of value intervals under the dimension corresponding to the strategy parameter, so that the target hyperspace is discretized into a plurality of value spaces.
Continuing the above example, the strategy parameter combination includes two strategy parameters, λ_1 and λ_2, so the target hyperspace corresponding to the strategy parameter combination is a two-dimensional space, as shown in FIG. 2, where the X-axis dimension corresponds to strategy parameter λ_1 and the Y-axis dimension corresponds to strategy parameter λ_2. The parameter spaces corresponding to λ_1 and λ_2 can each be discretized along their corresponding dimensions; the discretization intervals of different dimensions may be the same or different, and the user can set them based on the actual usage scenario, which is not limited by the present disclosure.
As shown in FIG. 2, the parameter space of strategy parameter λ_1 is discretized in the X-axis dimension at discretization interval H1, and the parameter space of strategy parameter λ_2 is discretized in the Y-axis dimension at discretization interval H2, so the target hyperspace is discretized into the 12 value spaces (C00-C23) shown in FIG. 2. The state values corresponding to the strategy parameter combination can then be represented by a vector, that is, the state values corresponding to the 12 value spaces are each a dimension of that vector.
As an example, the parameter space of each strategy parameter in the strategy parameter combination is discretized into a plurality of value intervals, so that the target hyperspace corresponding to the strategy parameter combination is discretized into a plurality of value spaces. The state value corresponding to the strategy parameter combination can then represent the accumulated return obtained by selecting the strategy parameter values of the combination from each value space, in the state where the strategy parameters take the corresponding sampling values in the sampling value combination. For example, in the present disclosure, the state value corresponding to the strategy parameter combination may be determined in an iteratively updated manner, that is, it is iteratively updated according to the sampling value combination corresponding to each interaction sample.
As an example, when discretizing the parameter space in each dimension of the target hyperspace, the number of value intervals for that dimension may be predetermined, and the parameter space of the strategy parameter corresponding to the dimension may then be divided uniformly to obtain the value intervals in that dimension. For example, if the parameter space of a strategy parameter is [0, 9] and is divided into 9 value intervals, the value range of value interval a1 is [0, 1), the value range of value interval a2 is [1, 2), and so on, which is not described further here.
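A sketch of this discretization, assuming a simple uniform grid per dimension and an array of state values and hit counts per value space (the names and array layout are illustrative assumptions), might look like:

```python
import numpy as np

def build_value_space_table(param_ranges, num_intervals):
    # param_ranges: [(left, right), ...], one parameter space per strategy parameter (dimension)
    # num_intervals: [n_1, n_2, ...], number of value intervals per dimension
    widths = [(r - l) / n for (l, r), n in zip(param_ranges, num_intervals)]
    state_values = np.zeros(num_intervals)            # one state value per value space
    hit_counts = np.zeros(num_intervals, dtype=int)   # K(s): hits per value space
    return widths, state_values, hit_counts

# e.g. a 3 x 4 grid of 12 value spaces like C00..C23 in FIG. 2:
# widths, V, K = build_value_space_table([(0.0, 3.0), (0.0, 4.0)], [3, 4])
```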
And then, determining a target space from the plurality of value spaces according to the updated state value corresponding to the strategy parameter combination.
In this step, the accumulated return of the value of the policy parameter combination from each value space can be accurately evaluated by the state value corresponding to the policy parameter combination determined according to the interaction sequence corresponding to the deep reinforcement learning model, so that the target space for determining the value corresponding to the policy parameter combination can be selected according to the evaluation result to ensure the accuracy of the value corresponding to the policy parameter combination and the consistency of the value corresponding to the policy parameter combination and the actual application process of the deep reinforcement learning model.
And determining a target value combination corresponding to the strategy parameter combination according to the target space, and determining the value of each strategy parameter according to the target value combination.
As an embodiment, uniformly distributed sampling may be performed within the value range corresponding to the target space, and the coordinate of the sampled point in each dimension is determined as the target value of the strategy parameter corresponding to that dimension. As shown in FIG. 2, if the determined target space is C13 and the point sampled from the target space is P1, then the value Px of P1 in the X-axis dimension may be determined as the target value of strategy parameter λ_1, and the value Py of P1 in the Y-axis dimension as the target value of strategy parameter λ_2.
When a plurality of weight parameters among the strategy parameters are determined, the weight parameters can be normalized, so that the weight proportions among the sub-functions corresponding to the strategy entropy parameters are preserved and the accuracy of the determined strategy family function is ensured.
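A sketch of these two steps, assuming the target space is identified by its per-dimension interval subscripts and the weight parameters are simply rescaled to sum to 1 (both assumptions for illustration), is given below.

```python
import random

def sample_target_values(space_index, lefts, widths):
    # space_index: per-dimension subscripts of the selected target space
    # lefts/widths: left boundary of the parameter space and value-interval width per dimension
    return [l + (i + random.random()) * w
            for i, l, w in zip(space_index, lefts, widths)]

def normalize_weights(weight_values):
    # rescale the sampled weight parameters so they sum to 1
    total = sum(weight_values)
    return [w / total for w in weight_values]
```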
Therefore, for a plurality of strategy parameters having an association relationship, the parameter space of each strategy parameter can be represented simultaneously through the target hyperspace, so that the parameter space under each dimension can be discretized into a plurality of value spaces, and the values of the strategy parameter combination can be determined based on the interaction sequence corresponding to the deep reinforcement learning model that uses the strategy parameter combination and on the optimization feature parameter of the deep reinforcement learning model. On the one hand, the values of the strategy parameters of the deep reinforcement learning model can thus be set accurately, avoiding the situation where, limited by human experience, inappropriate strategy parameter values prevent the deep reinforcement learning model from converging or make convergence too slow. On the other hand, the accuracy of the value of each strategy parameter is ensured, which further improves the training efficiency of the deep reinforcement learning model.
In a possible embodiment, as described above, the virtual environment may be a game environment. Sampling may then be performed while the virtual object interacts with the virtual environment to obtain interaction data, and the value of the policy parameter in the deep reinforcement learning model may be determined in the above manner during training. In this way, the deep reinforcement learning model can obtain a greater reward when determining the decision-making action of the virtual object, the accuracy of the decision-making action and of virtual character control is improved, and the amount of data and manpower required in the training process is reduced.
In a possible embodiment, in the case that there is one parameter determination model, an exemplary implementation of updating the state value corresponding to the policy parameter combination in the parameter determination model according to the interaction sample is as follows, and this step may include:
and determining the value space to which the sampling value combination belongs as the value space to be updated according to the sampling value combination.
For example, for each sampling value in the sampling value combination, the identifier of the value interval to which the sampling value belongs under the corresponding dimension may be determined based on the range length of the value intervals in that dimension, and the subscript i of the value interval to which the sampling value belongs may be determined by the following formula:
i=(min(max(x,l),r)-l)//acc
wherein x is used to represent the sampling value; l is used to represent the left boundary of the parameter space; r is used to represent the right boundary of the parameter space; // is the integer division symbol; and acc is used to indicate the range length of the value interval.
Therefore, after the subscript of the value interval corresponding to each sampling value is determined in the above manner, the value space to be updated is determined from the subscripts over all dimensions. For example, if the subscript determined in the X-axis dimension is 2 and the subscript determined in the Y-axis dimension is 1, the value space to which the sampling value combination belongs is C21.
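The subscript formula and the per-dimension combination of subscripts can be transcribed directly into Python (where // is the integer-division operator); the helper names and the example values are illustrative.

```python
def interval_index(x, l, r, acc):
    # i = (min(max(x, l), r) - l) // acc
    # x: sampling value, [l, r]: parameter space, acc: range length of one interval.
    return int((min(max(x, l), r) - l) // acc)

def space_to_update(sample_combination, bounds, accs):
    # One subscript per dimension; e.g. subscripts (2, 1) identify value space C21.
    return tuple(interval_index(x, l, r, acc)
                 for x, (l, r), acc in zip(sample_combination, bounds, accs))

idx = space_to_update([2.3, 1.7], [(0.0, 9.0), (0.0, 9.0)], [1.0, 1.0])
# idx == (2, 1)  -> value space C21
```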
And then, updating the state value of the value space to be updated according to the optimization characteristic parameters.
In this embodiment, the value space to be updated corresponding to the sampling value can be determined from the relationship between the sampling value and the parameter space, so that only the state value of the value space to be updated needs to be updated, while the state values of the other value spaces remain unchanged. This ensures the accuracy of the state values corresponding to the policy parameters and provides data support for subsequently selecting the target space accurately.
In this embodiment, the value space to which the sampling value combination belongs is determined as the value space to be updated, and the state value of the value space to be updated can be updated according to the optimized characteristic parameter by the following formula:
V'(s) = V(s) + (T - V(s)) / K(s)
wherein T is used to represent the optimization characteristic parameter: if the optimization characteristic parameter is the cumulative reward, that is, the cumulative reward is optimized in the increasing direction, T may be Gt; if the optimization characteristic parameter is the error rate of the deep reinforcement learning model, that is, the error rate is optimized in the decreasing direction, T may be -error. K(s) is used to represent the number of hits of the value space s to be updated, that is, the number of times that the value of the policy parameter combination corresponding to an interaction sequence belongs to the value space s to be updated; V(s) is used to represent the current state value of the value space s to be updated, and V'(s) is used to represent its updated state value.
As another example, the state value of the value space to be updated can be updated according to the optimization characteristic parameter by the following formula:
V'(s)=V(s)+lr*(T-V(s))
here, lr represents a learning rate for updating the state value.
Therefore, by the technical scheme, the state value corresponding to the strategy parameter space can be updated based on the interactive sample, so that the state value corresponds to the actual optimization characteristic parameter in the deep reinforcement learning model, the accuracy of the subsequently determined target value combination can be ensured, and the training efficiency of the deep reinforcement learning model is optimized.
In a possible embodiment, the target hyperspace corresponds to a plurality of reference hyperspaces, a discrete interval corresponding to a value space in each reference hyperspace is greater than a discrete interval corresponding to a value space in the target hyperspace, and an origin corresponding to each reference hyperspace is different.
As shown in fig. 3, the A space is the target hyperspace shown in fig. 2, and the B1 space and the B2 space are reference hyperspaces corresponding to that target hyperspace. For convenience of representation, the B1 space and the B2 space are drawn as in fig. 3: the origins of the B1 space and the B2 space are offset from the origin of the target hyperspace, but positions corresponding to the A space exist in both the B1 space and the B2 space.
As shown in fig. 3, the discrete interval corresponding to the reference hyperspace B1 and the discrete interval corresponding to the reference hyperspace B2 are both greater than the discrete interval corresponding to the target hyperspace. For example, the discrete interval of each reference hyperspace may be set according to the discrete interval of each dimension in the target hyperspace; the discrete intervals of different reference hyperspaces may be the same or different, and may be set according to actual use requirements.
Accordingly, according to the interaction sample, an exemplary implementation manner of updating the state value corresponding to the policy parameter combination in the parameter determination model is as follows, and this step may include:
and determining a reference value space to which the sampling value combination belongs in each reference hyperspace and a value space to be updated to which the sampling value combination belongs in the target hyperspace according to the sampling value combination.
And updating the state value of each reference value space according to the optimization characteristic parameters.
The method for determining the value space to which the sampling value combination belongs in each hyperspace and updating the state value of the value space is described in detail above, and is not described herein again.
And then, updating the state value of the value space to be updated according to the updated state value corresponding to each reference hyperspace.
Through this technical scheme, the state value corresponding to the target hyperspace can be updated based on a plurality of reference hyperspaces. Since the discrete interval of a value space in each reference hyperspace is larger than that of a value space in the target hyperspace, the state value of a reference value space can be computed more efficiently. On one hand, this improves, to a certain degree, the update efficiency of the state value corresponding to the target hyperspace; on the other hand, the fused representation of the target hyperspace by the plurality of reference hyperspaces also improves, to a certain degree, the accuracy of the state value corresponding to the determined policy parameter combination, which provides accurate data support for the subsequent determination of the target value combination and thus improves the efficiency of policy optimization in the deep reinforcement learning model.
In a possible embodiment, an exemplary implementation manner of updating the state value of the value space to be updated according to the updated state value corresponding to each reference hyperspace is as follows, and the step may include:
and determining a mapping value space corresponding to the value space to be updated in each reference hyperspace.
The mapping value space may be determined based on the offset between the origin of each reference hyperspace and the origin of the target hyperspace, that is, by converting between the coordinate systems of the target hyperspace and the reference hyperspace; any coordinate-system conversion manner in the art may be adopted, which is not described herein again.
For example, as shown in fig. 3, if the value space to be updated in the target hyperspace A is C01, the mapping value space corresponding to C01 in the reference hyperspace B1 is M1, and the mapping value spaces corresponding to C01 in the reference hyperspace B2 are M2 and M3.
And determining the state value of the value space to be updated according to the state value of each mapping value space.
For example, the average of the state values of the mapping value spaces may be determined as the state value of the value space to be updated.
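One possible way to realize this fusion is sketched below; the shift-and-rediscretize conversion is only one interpretation of the coordinate-system conversion mentioned above, and the data layout (a state-value table, an origin offset and an interval length per reference hyperspace) is an assumption.

```python
import numpy as np
from itertools import product

def mapping_cells(cell_low, cell_high, origin_offset, ref_acc):
    # Map a value space of the target hyperspace (given by its per-dimension
    # bounds) into a reference hyperspace by shifting by the origin offset and
    # re-discretizing with the coarser interval length; several reference
    # cells may be covered (e.g. M2 and M3 in reference hyperspace B2).
    lo = np.floor((np.asarray(cell_low) - origin_offset) / ref_acc).astype(int)
    hi = np.floor((np.asarray(cell_high) - origin_offset - 1e-9) / ref_acc).astype(int)
    return list(product(*[range(a, b + 1) for a, b in zip(lo, hi)]))

def fused_state_value(ref_spaces, cell_low, cell_high):
    # Average the state values of all mapping value spaces across the
    # reference hyperspaces to obtain the state value of the space to update.
    values = []
    for ref in ref_spaces:  # each ref: {"V": dict, "offset": array, "acc": float}
        for cell in mapping_cells(cell_low, cell_high, ref["offset"], ref["acc"]):
            values.append(ref["V"].get(cell, 0.0))
    return float(np.mean(values)) if values else 0.0
```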
Therefore, according to the technical scheme, the state values corresponding to the strategy parameter combinations do not need to be directly updated based on the interactive samples, but the state values corresponding to the multiple reference hyperspaces are updated based on the interactive samples, so that the target hyperspaces can be subjected to fusion representation through the multiple reference hyperspaces based on the mapping relation between the target hyperspace and the reference hyperspace, the accuracy of the state values corresponding to the determined strategy parameter combinations can be improved, and the accuracy of the subsequently determined strategy parameter values is further improved.
In order to improve the efficiency of determining the value of the policy parameter, when the number of interaction samples reaches a preset threshold, the step of updating the state value corresponding to the policy parameter combination in the parameter determination model according to the interaction sample may be performed for each interaction sample, where the interaction samples correspond to values of the policy parameters in different value spaces. In other words, the state value corresponding to the policy parameter combination may be updated based on a plurality of interaction samples at the same time, and the update based on each interaction sample is performed in the same manner as described above, which is not repeated here.
In this embodiment, after the state value corresponding to the policy parameter combination is updated, the score corresponding to each value space may be recalculated, so that the accuracy of the score of each value space may be ensured, and accurate data support may be provided for determining the target space.
In a possible embodiment, an exemplary implementation manner of determining a target space from the multiple value spaces according to the updated state value corresponding to the policy parameter combination is as follows, and the step may include:
firstly, determining a target score of each value space according to the updated state value corresponding to the strategy parameter combination, wherein the target score is used for representing the reliability degree of selecting the value space.
As an example, an exemplary implementation manner of determining the target score of each value space according to the updated state value corresponding to the policy parameter combination is as follows, and the step may include:
And for each value space, determining the result obtained by normalizing the updated state value of the value space, among the updated state values corresponding to the policy parameter combination, as the value score of the value space.
For example, the value score Si of the value space can be determined by the following formula:
Si = (Vi' - μ(V')) / σ(V')
wherein Vi' is used to represent the latest state value corresponding to the current value space, and μ(V') and σ(V') respectively represent the mean and standard deviation of the updated state values of all value spaces; that is, the value score of each value space is normalized by this formula.
Then, for each value space, the target score of the value space is determined according to the value score of the value space and the number of hits of the value space. The formula for determining the target score Scorei of the i-th value space is as follows:
Scorei = Si + c * sqrt(ln(∑j Mj) / Mi)
wherein c is a preset constant used for adjusting the influence of the number of hits on the target score, Mi is the number of hits of the value space i, and j is used to index over all value spaces.
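The two scores can be sketched together as follows; both the normalization and the bonus term that decays with the hit count are reconstructions of the formulas above (the bonus is written in a UCB-like form), and the +1 terms exist only to keep the sketch defined when a value space has never been hit.

```python
import numpy as np

def target_scores(V, M, c=1.0):
    # V: updated state value of each value space; M: hit count of each space.
    v = np.asarray(V, dtype=float)
    m = np.asarray(M, dtype=float)
    # Value score: normalize the state values by their mean and standard deviation.
    s = (v - v.mean()) / (v.std() + 1e-8)
    # Target score: add an exploration bonus that shrinks as the hit count grows,
    # so rarely hit value spaces keep being explored early in training.
    total_hits = m.sum() + 1.0
    return s + c * np.sqrt(np.log(total_hits) / (m + 1.0))

scores = target_scores(V=[0.2, 1.5, -0.3], M=[3, 1, 0], c=1.0)
```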
As described above, in the embodiment of the present disclosure, the state value of each value space may be determined in an iterative update manner: the state value of each value space is initially 0, and after the value space to be updated is determined according to the sampling value combination of the policy parameter combination, the state value of that value space is updated while the state values of the other value spaces remain unchanged. Therefore, in this embodiment, in order to improve the diversity of target space selection in the initial training stage, the number of hits of a value space is taken into account when determining its score, which reduces the degree to which historically hit values dominate the selection of the target space. In this process, as the number of interaction samples increases, the state value corresponding to the policy parameter combination becomes more accurate, and as the number of hits increases, the influence of the number of hits on the target score gradually decreases. The diversity and exploration of target space selection are thus improved in the initial learning stage, which improves, to a certain extent, the accuracy of the determined target value combination and avoids an excessive influence of random samples in the initial state; once the state value is accurate, the influence of the number of hits on target space selection is reduced, ensuring that target space selection adjusts the optimization characteristic parameter in the positive direction.
Then, a target space can be determined from the plurality of value spaces according to the target score of each value space.
In a possible embodiment, the step of determining the target space from the plurality of value spaces according to the target score of each value space may include:
and determining the value space with the maximum target score as the target space.
In the embodiment of the disclosure, the value space with the largest target score can be directly selected as the target space, which ensures that the target value combination determined from the target space effectively drives the optimization of the deep reinforcement learning model and improves the efficiency of that optimization.
In another possible embodiment, the step of determining the target space from the plurality of value spaces according to the target score of each value space may include:
and performing softmax processing on the target scores corresponding to the plurality of value spaces to obtain probability distribution formed by probability information of the plurality of value spaces, sampling the plurality of value spaces according to the probability distribution, and determining the value spaces obtained by sampling as the target spaces.
In this embodiment, in order to further improve the diversity of exploration of policy parameter combination values, the state values of the value spaces may be mapped by the softmax function into values in the range of 0 to 1, which are used as the probability information of the value spaces to obtain a probability distribution over the value spaces. When sampling is performed based on this probability distribution, a value space with smaller probability information still has a chance of being sampled, which guarantees to a certain extent that multiple value intervals can be explored, mitigates the problem that the determined target space drives the optimization characteristic parameter into a local optimum, and prevents training of the deep reinforcement learning model from stalling at a locally optimal parameter, thereby ensuring the accuracy and robustness of the training.
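A sketch of the softmax-based selection, using the target scores of the value spaces; the names are illustrative.

```python
import numpy as np

def sample_target_space(scores, rng=np.random.default_rng()):
    # Map the scores to a probability distribution with softmax and sample one
    # value space; spaces with small probability keep a chance of being picked,
    # which preserves exploration and helps avoid local optima.
    z = np.asarray(scores, dtype=float)
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

target = sample_target_space([1.2, 0.4, -0.7])
```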
In a possible embodiment, an exemplary implementation manner of updating the value of the policy parameter combination based on the parameter determination model corresponding to the policy parameter combination and the interaction sample generated by the interaction sequence is as follows, and the step may include:
and under the condition that the number of the parameter determination models is multiple, updating the state value corresponding to the strategy parameter combination in each parameter determination model according to the interaction sample, wherein the learning rate of each parameter determination model is different, the strategy parameters in each parameter determination model correspond to the dimensions in the target hyperspace corresponding to the strategy parameter combination one by one, and the parameter space of each strategy parameter is discretized into a plurality of value intervals under the dimension corresponding to the strategy parameter, so that the target hyperspace is discretized into a plurality of value spaces, and the plurality of value spaces corresponding to the parameter determination models are divided into the same.
In this embodiment, a plurality of parameter determination models may be used to determine the value of each of the policy parameters in the policy parameter combination. Illustratively, 5 parameter determination models may be initialized at random, and the learning rates of the 5 parameter determination models are set in advance. In the disclosure, the learning rate of each parameter determination model is different, and when different parameter determination models learn based on the same interaction sample, the respective parameters can be adjusted in multiple learning step lengths, so that each parameter determination model can perform personalized learning, and the diversity of the parameter determination models when determining the corresponding value of the policy parameter combination based on each parameter determination model is increased, thereby ensuring that the parameter determination models consider the comprehensiveness of the features when determining the target value combination, and improving the accuracy of the target value combination.
When the number of the parameter determination models is multiple, the state value corresponding to the policy parameter combination under each parameter determination model may be updated in the same manner as described above. When the state value corresponding to the combination of the policy parameters is updated in the above manner, each of the parameter determination models is updated by using the learning rate corresponding to the parameter determination model.
And for each parameter determination model, determining a candidate space from the multiple value spaces according to the updated state value corresponding to the policy parameter combination in that parameter determination model.
When the number of the parameter determination models is multiple, the specific implementation manner of determining the candidate space by each parameter determination model is similar to the determination manner of determining the target space from multiple value spaces by the parameter determination model when the number of the parameter determination models is one, and details are not repeated here.
And determining a target space according to the candidate space determined by each parameter determination model.
Illustratively, this step may include:
and acquiring the number of parameter determination models for determining the candidate spaces for each candidate space.
And determining the candidate space with the maximum number as the target space.
As described above, suppose the value spaces are C00 to C23 and there are 5 parameter determination models, M1 to M5: the candidate spaces determined by M1 are C00 and C01, by M2 are C00 and C10, by M3 are C10 and C01, by M4 are C00 and C01, and by M5 are C11 and C01. The number of parameter determination models that determined each candidate space may then be obtained; for example, for the candidate space C00, the corresponding parameter determination models are M1, M2 and M4, that is, the number corresponding to the candidate space C00 is 3. The other candidate spaces are handled in the same manner. Then, the candidate space with the largest corresponding number, that is, the value space selected by the largest number of parameter determination models, can be determined as the target space. In the above example, the target space is the value space C01.
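The counting described in this example can be sketched as follows; the dictionary layout and the tie-breaking behavior (an arbitrary winner when counts are equal) are illustrative choices.

```python
from collections import Counter

def vote_target_space(candidates_per_model):
    # candidates_per_model: candidate spaces selected by each parameter
    # determination model, keyed by model identifier.
    counts = Counter()
    for spaces in candidates_per_model.values():
        counts.update(set(spaces))       # one vote per model per candidate space
    return counts.most_common(1)[0][0]   # space chosen by the most models

target = vote_target_space({
    "M1": ["C00", "C01"], "M2": ["C00", "C10"], "M3": ["C10", "C01"],
    "M4": ["C00", "C01"], "M5": ["C11", "C01"],
})
# C01 is chosen by M1, M3, M4 and M5 (4 models), so target == "C01"
```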
Therefore, in the technical scheme, the target space can be determined based on the candidate space selected by the multiple parameter determination models, on one hand, the deviation of a single model for selecting the target space can be avoided, on the other hand, the learning rate of the multiple parameter determination models is different, and the modes for determining the candidate space are possibly different, so that the diversity of the multiple parameter determination models can be ensured, the comprehensiveness and diversity of the considered features during the determination of the target space are improved, the accuracy of the target space is ensured, and meanwhile, the exploration space of the values corresponding to the strategy parameter combination is widened.
And then, determining a target value combination corresponding to the strategy parameter combination according to the target space, and determining the value of each strategy parameter according to the target value combination. The specific implementation of this step has been described in detail above, and is not described herein again.
Therefore, the target space can be determined by combining the candidate space determined by the multiple parameter determination models, and the value corresponding to the strategy parameter combination is further determined, so that the accuracy of the value corresponding to the strategy parameter combination is ensured, and meanwhile, the matching degree of the target value and the actual application scene of the deep reinforcement learning model is improved. In addition, the determined target value can enable the optimized characteristic parameters of the deep reinforcement learning model to be better, so that the training efficiency of the deep reinforcement learning model can be effectively improved, the iteration times of the deep reinforcement learning model training are reduced to a certain extent, and the convergence efficiency of the deep reinforcement learning model is improved.
In one possible embodiment, the method may further comprise:
obtaining a new parameter determination model for determining the policy parameter combination in case the number of interaction samples reaches a number threshold. The number threshold may be set according to an actual usage scenario, which is not limited by this disclosure. For example, the new parameter determination model may be a newly initialized parameter determination model, and the model parameters in the new parameter determination model are randomly initialized values.
And then, replacing the parameter determination model with the longest use time in the parameter determination models for determining the strategy parameter combination with the new parameter determination model.
As an example, the initial usage time of each parameter determination model may be recorded, and the model with the earliest initial usage time may be determined as the parameter determination model with the longest usage time. As another example, the identification information of the parameter determination models may be stored in a FIFO (First In, First Out) queue; when a parameter determination model is replaced, the parameter determination model whose identification information is at the head of the queue is deleted, and the identification information of the new parameter determination model is added to the tail of the queue.
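A minimal sketch of the FIFO replacement, using Python's deque as the queue; the class and method names are illustrative.

```python
from collections import deque

class ParameterModelPool:
    # Keep the parameter determination models in a FIFO queue; when the number
    # of interaction samples reaches the threshold, the model used longest
    # (the head of the queue) is replaced by a newly initialized model.
    def __init__(self, models):
        self.queue = deque(models)

    def replace_oldest(self, new_model):
        oldest = self.queue.popleft()   # drop the longest-used model
        self.queue.append(new_model)    # append the new model at the tail
        return oldest
```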
In this embodiment, when the number of interaction samples reaches the number threshold, the plurality of parameter determination models have already been trained on those interaction samples and their parameters have been optimized. At this point, when the candidate spaces are determined based on the plurality of parameter determination models, each parameter determination model is strongly influenced by the historical interaction samples. Therefore, in the embodiment of the present disclosure, when the historical interaction samples reach a certain number, the model with the longest usage time among the plurality of parameter determination models, that is, the parameter determination model most influenced by the historical interaction samples, may be replaced. This reduces, to a certain extent, the excessive influence of historical interaction samples and preserves the diversity of exploration of the policy parameter combination values, while the remaining parameter-optimized models among the plurality of parameter determination models ensure the accuracy of the finally determined target space.
The present disclosure further provides a training apparatus for a deep reinforcement learning model, as shown in fig. 4, the apparatus 10 includes:
an obtaining module 100, configured to obtain an interaction sequence generated by interaction between a deep reinforcement learning model and a virtual environment, where the interaction sequence includes a plurality of sample data, and each sample data includes a first state of the virtual environment, a decision action, and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state;
a first determining module 200, configured to determine, for each sample data, an advantage function value corresponding to an advantage function of the deep reinforcement learning model and an environmental status in the sample data, and an advantage expectation of the advantage function value under a decision policy corresponding to the sample data, where the decision policy is determined based on a policy family function formed by multiple policy parameters having an association relationship in the advantage function and the deep reinforcement learning model;
a second determining module 300, configured to determine, for each of the sample data, an action value corresponding to the sample data according to the sample data, an advantage function value corresponding to the sample data, the advantage expectation, and a state value function of the deep reinforcement learning model;
a third determining module 400, configured to determine updated gradient information of an action value function of the deep reinforcement learning model based on the action value;
an updating module 500, configured to update the deep reinforcement learning model according to the updated gradient information.
Optionally, the policy parameters include policy entropy parameters and weight parameters corresponding to each of the policy entropy parameters;
the policy family function is determined by:
determining subfunctions of the sub-strategies corresponding to the strategy entropy parameters according to the dominance function values and the strategy entropy parameters;
and determining the sum of the products of the probability distribution obtained after the softmax processing of each sub-function and the weight parameter corresponding to the sub-function as the strategy family function.
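For illustration, the construction of the policy family function can be sketched as follows. The disclosure does not spell out here how each sub-function depends on the advantage values and its policy entropy parameter, so the temperature-style scaling of the advantage values by the entropy parameter used below is an assumption; the weighted sum of softmax distributions follows the description above.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def policy_family(advantages, entropy_params, weights):
    # Assumed sub-function: advantage values scaled by the policy entropy
    # parameter; the policy family is the sum over sub-functions of
    # softmax(sub-function) weighted by the corresponding weight parameter.
    a = np.asarray(advantages, dtype=float)
    mix = np.zeros_like(a)
    for lam, w in zip(entropy_params, weights):
        mix += w * softmax(a / lam)
    return mix

pi = policy_family(advantages=[0.5, 1.2, -0.3],
                   entropy_params=[0.1, 1.0],
                   weights=[0.7, 0.3])  # weights assumed normalized to sum to 1
```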
Optionally, the second determining module includes:
the first determining submodule is used for determining a state value corresponding to the state value function according to the environment state in the sampling data;
a second determining sub-module for determining a difference between the merit function value and the merit expectation as a processing merit function value;
a third determining submodule configured to determine a sum of the processing merit function value and the state value as the action value.
Optionally, the third determining module includes:
and determining the updating gradient information of the action value function according to the updating gradient information of the decision strategy and the expected value of the difference value of the action value and the state value under the decision strategy in the component of the target direction.
Optionally, a plurality of policy parameters having an association relationship in the deep reinforcement learning model form a policy parameter combination, and the value of the policy parameter combination is determined by a hyper-parameter determination module, wherein the hyper-parameter determination module updates the value of the policy parameter combination based on a parameter determination model corresponding to the policy parameter combination and an interaction sample generated by the interaction sequence, the interaction sample includes a sampling value combination of the policy parameter combination corresponding to the interaction sequence and an optimization characteristic parameter corresponding to the deep reinforcement learning model, and the sampling value combination includes a sampling value corresponding to each policy parameter.
Optionally, the hyper-parameter determination module comprises:
a first updating sub-module, configured to update, according to the interaction sample, a state value corresponding to the policy parameter combination in the parameter determination model if the parameter determination model is one, where the policy parameters correspond to dimensions in a target hyperspace corresponding to the policy parameter combination one to one, and a parameter space of each policy parameter is discretized into a plurality of value intervals in the dimension corresponding to the policy parameter, so that the target hyperspace is discretized into a plurality of value spaces;
a fourth determining submodule, configured to determine a target space from the multiple value spaces according to the updated state value corresponding to the policy parameter combination;
and the fifth determining submodule is used for determining a target value combination corresponding to the strategy parameter combination according to the target space and determining the value of each strategy parameter according to the target value combination.
Optionally, the hyper-parameter determination module comprises:
a second updating sub-module, configured to, in the case that there are a plurality of parameter determination models, update, for each parameter determination model, the state value corresponding to the policy parameter combination in that parameter determination model according to the interaction sample, wherein the learning rates of the parameter determination models are different; in each parameter determination model, the policy parameters correspond one to one to the dimensions of the target hyperspace corresponding to the policy parameter combination, and the parameter space of each policy parameter is discretized into a plurality of value intervals under its corresponding dimension, so that the target hyperspace is discretized into a plurality of value spaces, and the division into value spaces is the same for all of the parameter determination models;
a sixth determining submodule, configured to, for each parameter determination model, determine a candidate space from the multiple value spaces according to the updated state value corresponding to the policy parameter combination in that parameter determination model;
a seventh determining submodule, configured to determine a target space according to the candidate space determined by each parameter determination model;
and the eighth determining submodule is used for determining a target value combination corresponding to the strategy parameter combination according to the target space and determining the value of each strategy parameter according to the target value combination.
Optionally, the deep reinforcement learning model is configured to train game artificial intelligence, the interaction sequence is a sequence obtained by sampling the game artificial intelligence in a game play of a target game, and the virtual environment is a training environment in which the game artificial intelligence is located in the target game.
Referring now to FIG. 5, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring an interaction sequence generated by interaction of a deep reinforcement learning model and a virtual environment, wherein the interaction sequence comprises a plurality of sampling data, and each sampling data comprises a first state of the virtual environment, a decision action and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state; for each sampling data, determining an advantage function value of the deep reinforcement learning model corresponding to an environment state in the sampling data and an advantage expectation of the advantage function value under a decision strategy corresponding to the sampling data, wherein the decision strategy is determined based on a strategy family function formed by a plurality of strategy parameters having an association relationship in the advantage function and the deep reinforcement learning model; for each sampling data, determining an action value corresponding to the sampling data according to the sampling data, an advantage function value corresponding to the sampling data, the advantage expectation and a state value function of the deep reinforcement learning model; determining updated gradient information of an action value function of the deep reinforcement learning model based on the action value; and updating the deep reinforcement learning model according to the updating gradient information.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not constitute a limitation to the module itself in some cases, for example, the obtaining module may also be described as a module for obtaining an interaction sequence generated by the deep reinforcement learning model interacting with the virtual environment.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a training method of a deep reinforcement learning model, according to one or more embodiments of the present disclosure, wherein the method includes:
acquiring an interaction sequence generated by interaction of a deep reinforcement learning model and a virtual environment, wherein the interaction sequence comprises a plurality of sampling data, and each sampling data comprises a first state of the virtual environment, a decision action and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state;
for each sampling data, determining an advantage function value of the deep reinforcement learning model corresponding to an environment state in the sampling data and an advantage expectation of the advantage function value under a decision strategy corresponding to the sampling data, wherein the decision strategy is determined based on a strategy family function formed by a plurality of strategy parameters having an association relationship in the advantage function and the deep reinforcement learning model;
for each sampling data, determining an action value corresponding to the sampling data according to the sampling data, an advantage function value corresponding to the sampling data, the advantage expectation and a state value function of the deep reinforcement learning model;
determining updated gradient information of an action value function of the deep reinforcement learning model based on the action value;
and updating the deep reinforcement learning model according to the updating gradient information.
Example 2 provides the method of example 1, wherein the policy parameters include policy entropy parameters and a weight parameter corresponding to each of the policy entropy parameters;
the policy family function is determined by:
determining subfunctions of the sub-strategies corresponding to the strategy entropy parameters according to the dominance function values and the strategy entropy parameters;
and determining the sum of the products of the probability distribution obtained after the softmax processing of each sub-function and the weight parameter corresponding to the sub-function as the strategy family function.
Example 3 provides the method of example 1, wherein the determining the action value corresponding to the sample data according to the sample data, the merit function value corresponding to the sample data, the merit expectation, and the state value function of the deep reinforcement learning model comprises:
determining a state value corresponding to the state value function according to the environment state in the sampling data;
determining a difference between the merit function value and the merit expectation as a treatment merit function value;
determining a sum of the processing merit function value and the state value as the action value.
Example 4 provides the method of example 1, wherein the determining updated gradient information for the action value function of the deep reinforcement learning model based on the action value includes:
and determining the updating gradient information of the action value function according to the updating gradient information of the decision strategy and the expected value of the difference value of the action value and the state value under the decision strategy in the component of the target direction.
According to one or more embodiments of the present disclosure, example 5 provides the method of example 1, where a plurality of policy parameters having an association relationship in the deep reinforcement learning model form a policy parameter combination, values of the policy parameter combination are updated based on a parameter determination model corresponding to the policy parameter combination and an interaction sample generated by the interaction sequence, the interaction sample includes a sampling value combination of the policy parameter combination corresponding to the interaction sequence and an optimization feature parameter corresponding to the deep reinforcement learning model, and the sampling value combination includes a sampling value corresponding to each policy parameter.
Example 6 provides the method of example 5, wherein updating, based on the parameter determination model corresponding to the policy parameter combination and the interaction sample generated by the interaction sequence, the value of the policy parameter combination includes:
under the condition that the number of the parameter determination models is one, updating the state value corresponding to the strategy parameter combination in the parameter determination model according to the interaction sample, wherein the strategy parameters correspond to the dimensions in the target hyperspace corresponding to the strategy parameter combination in a one-to-one manner, and the parameter space of each strategy parameter is discretized into a plurality of value intervals under the dimension corresponding to the strategy parameter, so that the target hyperspace is discretized into a plurality of value spaces;
determining a target space from the plurality of value spaces according to the updated state value corresponding to the strategy parameter combination;
and determining a target value combination corresponding to the strategy parameter combination according to the target space, and determining the value of each strategy parameter according to the target value combination.
Example 7 provides the method of example 5, wherein updating, based on the parameter determination model corresponding to the policy parameter combination and the interaction sample generated by the interaction sequence, the value of the policy parameter combination includes:
under the condition that there are a plurality of parameter determination models, updating, for each parameter determination model, the state value corresponding to the policy parameter combination in the parameter determination model according to the interaction sample, wherein the learning rate of each parameter determination model is different; in each parameter determination model, the policy parameters correspond one to one to the dimensions in the target hyperspace corresponding to the policy parameter combination, and the parameter space of each policy parameter is discretized into a plurality of value intervals under the dimension corresponding to the policy parameter, so that the target hyperspace is discretized into a plurality of value spaces, and the division into value spaces is the same for each of the parameter determination models;
for each parameter determination model, determining candidate spaces from the multiple value spaces according to the updated state values corresponding to the strategy parameter combinations in the parameter determination model;
determining a target space according to the candidate space determined by each parameter determination model;
and determining a target value combination corresponding to the strategy parameter combination according to the target space, and determining the value of each strategy parameter according to the target value combination.
Example 8 provides the method of any one of examples 1-7, wherein the deep reinforcement learning model is used to train a game artificial intelligence, the interaction sequence is a sequence obtained by sampling the game artificial intelligence in a game play of a target game, and the virtual environment is a training environment in which the game artificial intelligence is located in the target game.
Example 9 provides an apparatus for training a deep reinforcement learning model, the apparatus including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring an interaction sequence generated by interaction of a deep reinforcement learning model and a virtual environment, the interaction sequence comprises a plurality of sampling data, and each sampling data comprises a first state of the virtual environment, a decision action and a return value obtained by executing the decision action when the virtual environment is in a state corresponding to the first state;
a first determining module, configured to determine, for each sample data, an advantage function value corresponding to an advantage function of the deep reinforcement learning model and an environment state in the sample data, and an advantage expectation of the advantage function value under a decision policy corresponding to the sample data, where the decision policy is determined based on a policy family function formed by multiple policy parameters having an association relationship in the advantage function and the deep reinforcement learning model;
a second determining module, configured to determine, for each piece of the sample data, an action value corresponding to the sample data according to the sample data, an advantage function value corresponding to the sample data, the advantage expectation, and a state value function of the deep reinforcement learning model;
a third determination module, configured to determine update gradient information of an action value function of the deep reinforcement learning model based on the action value;
and the updating module is used for updating the deep reinforcement learning model according to the updating gradient information.
Example 10 provides a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processing device, implements the steps of the method of any of examples 1-8, in accordance with one or more embodiments of the present disclosure.
Example 11 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of any of examples 1-8.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (11)

1. A method for training a deep reinforcement learning model, characterized by comprising the following steps:
acquiring an interaction sequence generated by interaction between a deep reinforcement learning model and a virtual environment, wherein the interaction sequence comprises a plurality of pieces of sampling data, and each piece of sampling data comprises a first state of the virtual environment, a decision action, and a return value obtained by executing the decision action when the virtual environment is in the state corresponding to the first state;
for each piece of the sampling data, determining an advantage function value of an advantage function of the deep reinforcement learning model corresponding to an environment state in the sampling data, and an advantage expectation of the advantage function value under a decision policy corresponding to the sampling data, wherein the decision policy is determined based on a policy family function formed by the advantage function and a plurality of policy parameters having an association relationship in the deep reinforcement learning model;
for each piece of the sampling data, determining an action value corresponding to the sampling data according to the sampling data, the advantage function value corresponding to the sampling data, the advantage expectation, and a state value function of the deep reinforcement learning model;
determining update gradient information of an action value function of the deep reinforcement learning model based on the action value;
and updating the deep reinforcement learning model according to the update gradient information.
2. The method according to claim 1, wherein the policy parameters comprise policy entropy parameters and a weight parameter corresponding to each of the policy entropy parameters;
the policy family function is determined by:
determining, according to the advantage function value and each policy entropy parameter, a sub-function of the sub-policy corresponding to that policy entropy parameter;
and determining, as the policy family function, the sum of the products of the probability distribution obtained after softmax processing of each sub-function and the weight parameter corresponding to that sub-function.
3. The method according to claim 1, wherein determining the action value corresponding to the sampling data according to the sampling data, the advantage function value corresponding to the sampling data, the advantage expectation, and the state value function of the deep reinforcement learning model comprises:
determining a state value corresponding to the state value function according to the environment state in the sampling data;
determining the difference between the advantage function value and the advantage expectation as a processed advantage function value;
and determining the sum of the processed advantage function value and the state value as the action value.
4. The method according to claim 1, wherein determining the update gradient information of the action value function of the deep reinforcement learning model based on the action value comprises:
determining the update gradient information of the action value function according to update gradient information of the decision policy and an expected value, under the decision policy, of the component of the difference between the action value and the state value in a target direction.
5. The method according to claim 1, wherein a plurality of policy parameters having an association relationship in the deep reinforcement learning model form a policy parameter combination, and values of the policy parameter combination are updated based on a parameter determination model corresponding to the policy parameter combination and an interaction sample generated from the interaction sequence, wherein the interaction sample comprises a sampling value combination of the policy parameter combination corresponding to the interaction sequence and an optimized feature parameter of the deep reinforcement learning model, and the sampling value combination comprises a sampling value of each policy parameter.
6. The method according to claim 5, wherein updating the values of the policy parameter combination based on the parameter determination model corresponding to the policy parameter combination and the interaction sample generated from the interaction sequence comprises:
in a case where the number of parameter determination models is one, updating, according to the interaction sample, the state values corresponding to the policy parameter combination in the parameter determination model, wherein the policy parameters correspond one-to-one to the dimensions of a target hyperspace corresponding to the policy parameter combination, and the parameter space of each policy parameter is discretized, in the dimension corresponding to that policy parameter, into a plurality of value intervals, so that the target hyperspace is discretized into a plurality of value spaces;
determining a target space from the plurality of value spaces according to the updated state values corresponding to the policy parameter combination;
and determining a target value combination corresponding to the policy parameter combination according to the target space, and determining the value of each policy parameter according to the target value combination.
7. The method according to claim 5, wherein updating the values of the policy parameter combination based on the parameter determination models corresponding to the policy parameter combination and the interaction sample generated from the interaction sequence comprises:
in a case where the number of parameter determination models is more than one, updating, according to the interaction sample, the state values corresponding to the policy parameter combination in each parameter determination model, wherein the learning rates of the parameter determination models differ from one another, the policy parameters in each parameter determination model correspond one-to-one to the dimensions of the target hyperspace corresponding to the policy parameter combination, the parameter space of each policy parameter is discretized, in the dimension corresponding to that policy parameter, into a plurality of value intervals, so that the target hyperspace is discretized into a plurality of value spaces, and the plurality of value spaces are divided identically for all of the parameter determination models;
for each parameter determination model, determining candidate spaces from the plurality of value spaces according to the updated state values corresponding to the policy parameter combination in that parameter determination model;
determining a target space according to the candidate spaces determined by the respective parameter determination models;
and determining a target value combination corresponding to the policy parameter combination according to the target space, and determining the value of each policy parameter according to the target value combination.
8. The method according to any one of claims 1 to 7, wherein the deep reinforcement learning model is used for training game artificial intelligence, the interaction sequence is a sequence obtained by sampling the game artificial intelligence in a game play of a target game, and the virtual environment is a training environment in which the game artificial intelligence is located in the target game.
9. An apparatus for training a deep reinforcement learning model, the apparatus comprising:
an acquisition module configured to acquire an interaction sequence generated by interaction between the deep reinforcement learning model and a virtual environment, wherein the interaction sequence comprises a plurality of pieces of sampling data, and each piece of sampling data comprises a first state of the virtual environment, a decision action, and a return value obtained by executing the decision action when the virtual environment is in the state corresponding to the first state;
a first determining module configured to determine, for each piece of the sampling data, an advantage function value of an advantage function of the deep reinforcement learning model corresponding to an environment state in the sampling data, and an advantage expectation of the advantage function value under a decision policy corresponding to the sampling data, wherein the decision policy is determined based on a policy family function formed by the advantage function and a plurality of policy parameters having an association relationship in the deep reinforcement learning model;
a second determining module configured to determine, for each piece of the sampling data, an action value corresponding to the sampling data according to the sampling data, the advantage function value corresponding to the sampling data, the advantage expectation, and a state value function of the deep reinforcement learning model;
a third determining module configured to determine update gradient information of an action value function of the deep reinforcement learning model based on the action value;
and an updating module configured to update the deep reinforcement learning model according to the update gradient information.
10. A computer-readable medium, on which a computer program is stored, characterized in that the program, when executed by a processing device, carries out the steps of the method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a storage device having a computer program stored thereon;
and a processing device configured to execute the computer program in the storage device to carry out the steps of the method according to any one of claims 1 to 8.
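The policy family function of claim 2 can be read as a weighted mixture of softmax sub-policies, one per policy entropy parameter. The sketch below, in Python/NumPy, is only one possible reading: the exact form of the sub-function (here the advantage divided by the entropy parameter, a common temperature-scaling choice) and the final renormalisation are assumptions, not specified by the claims.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of sub-function values.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def policy_family(advantages, entropy_params, weights):
    """Weighted mixture of sub-policies, one per policy entropy parameter."""
    dist = np.zeros_like(advantages, dtype=float)
    for tau, w in zip(entropy_params, weights):
        sub_fn = advantages / tau      # sub-function from the advantage values and tau (assumed form)
        dist += w * softmax(sub_fn)    # product of the softmax distribution and its weight
    return dist / dist.sum()           # renormalise in case the weights do not sum to one

# Example: three entropy parameters blended with fixed weights.
adv = np.array([1.0, 0.2, -0.5, 0.3])
pi = policy_family(adv, entropy_params=[0.1, 0.5, 2.0], weights=[0.5, 0.3, 0.2])
print(pi, pi.sum())
```

Small entropy parameters make the corresponding sub-policy nearly greedy with respect to the advantage, while large ones make it nearly uniform, so the weights trade off exploitation against exploration within a single family.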
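Claims 5-7 describe searching for values of the policy parameter combination over a target hyperspace that is discretized, per parameter, into value intervals, with one or more parameter determination models keeping a state value per value space. The following toy sketch illustrates that idea under explicit assumptions: the exponential-moving-average update rule, the top-k candidate selection, and the intersection-based choice of the target space are illustrative stand-ins, since the claims do not fix these details.

```python
import itertools
import random

def make_value_spaces(param_grids):
    # The Cartesian product of per-parameter value intervals discretises the
    # target hyperspace into value spaces (one cell per combination).
    return list(itertools.product(*param_grids))

class ParameterDeterminationModel:
    """Toy stand-in for the 'parameter determination model' of claims 5-7."""

    def __init__(self, value_spaces, learning_rate):
        self.values = {space: 0.0 for space in value_spaces}  # one state value per value space
        self.lr = learning_rate

    def update(self, sampled_space, score):
        # Move the state value of the sampled value space towards the score
        # derived from the interaction sample (assumed update rule).
        old = self.values[sampled_space]
        self.values[sampled_space] = old + self.lr * (score - old)

    def candidates(self, top_k=3):
        # Candidate spaces: the top-k value spaces by state value.
        ranked = sorted(self.values, key=self.values.get, reverse=True)
        return set(ranked[:top_k])

# Two policy parameters, each discretised into a few intervals (midpoints shown);
# several models with different learning rates, all sharing the same discretisation.
grids = [[0.05, 0.1, 0.5], [0.1, 0.3, 1.0]]
spaces = make_value_spaces(grids)
models = [ParameterDeterminationModel(spaces, lr) for lr in (0.1, 0.5)]

# Pretend one interaction sample scored the sampled combination (0.1, 0.3) highly.
for m in models:
    m.update((0.1, 0.3), score=1.0)

# With several models, the target space is chosen from the candidate spaces each
# model proposes; intersection with a random fallback is just one possible rule.
common = set.intersection(*(m.candidates() for m in models))
target_space = random.choice(sorted(common)) if common else random.choice(spaces)
print("target value combination:", target_space)
```

A single-model variant, as in claim 6, would simply take the best-valued space as the target space directly.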
CN202110351941.XA 2021-03-31 2021-03-31 Deep reinforcement learning model training method and device, medium and electronic equipment Pending CN113052312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110351941.XA CN113052312A (en) 2021-03-31 2021-03-31 Deep reinforcement learning model training method and device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110351941.XA CN113052312A (en) 2021-03-31 2021-03-31 Deep reinforcement learning model training method and device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113052312A true CN113052312A (en) 2021-06-29

Family

ID=76516728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110351941.XA Pending CN113052312A (en) 2021-03-31 2021-03-31 Deep reinforcement learning model training method and device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113052312A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110861634A * 2018-08-14 2020-03-06 Honda Motor Co., Ltd. Interaction aware decision making
CN111008449A * 2019-04-26 2020-04-14 Chengdu Rongao Technology Co., Ltd. Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN110399920A * 2019-07-25 2019-11-01 Harbin Institute of Technology (Shenzhen) Imperfect-information game method, apparatus, system and storage medium based on deep reinforcement learning
WO2021044365A1 * 2019-09-05 2021-03-11 10736406 Canada Inc. Method and system for generating synthetically accessible molecules with chemical reaction trajectories using reinforcement learning
CN110989577A * 2019-11-15 2020-04-10 Shenzhen Institute of Advanced Technology Automatic driving decision method and automatic driving device of vehicle
CN112134916A * 2020-07-21 2020-12-25 Nanjing University of Posts and Telecommunications Cloud edge collaborative computing migration method based on deep reinforcement learning
CN112216124A * 2020-09-17 2021-01-12 Zhejiang University of Technology Traffic signal control method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENJIE SHI et al.: "Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning", arXiv, pages 1-7 *
YUAN Yinlong: "Research on Deep Reinforcement Learning Algorithms and Applications", China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 11-57 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114104005A * 2022-01-26 2022-03-01 Suzhou Inspur Intelligent Technology Co., Ltd. Decision-making method, device and equipment of automatic driving equipment and readable storage medium
CN114104005B * 2022-01-26 2022-04-19 Suzhou Inspur Intelligent Technology Co., Ltd. Decision-making method, device and equipment of automatic driving equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111091200B (en) Updating method and system of training model, intelligent device, server and storage medium
CN109990790B (en) Unmanned aerial vehicle path planning method and device
CN113052253A (en) Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN112291793B (en) Resource allocation method and device of network access equipment
CN112766497A (en) Deep reinforcement learning model training method, device, medium and equipment
Gao et al. Large-scale computation offloading using a multi-agent reinforcement learning in heterogeneous multi-access edge computing
CN113064671A (en) Multi-agent-based edge cloud extensible task unloading method
CN113568727A (en) Mobile edge calculation task allocation method based on deep reinforcement learning
CN111695698B (en) Method, apparatus, electronic device, and readable storage medium for model distillation
CN112422644A (en) Method and system for unloading computing tasks, electronic device and storage medium
CN111695699B (en) Method, apparatus, electronic device, and readable storage medium for model distillation
CN112256037B (en) Control method and device applied to automatic driving, electronic equipment and medium
CN116164770B (en) Path planning method, path planning device, electronic equipment and computer readable medium
CN112926628A (en) Action value determination method, device, learning framework, medium and equipment
CN114972591A (en) Animation generation model training method, animation generation method and device
CN115534939A (en) Vehicle control method, device, electronic equipment and computer readable medium
CN115546293A (en) Obstacle information fusion method and device, electronic equipment and computer readable medium
CN113052312A (en) Deep reinforcement learning model training method and device, medium and electronic equipment
CN115648204A (en) Training method, device, equipment and storage medium of intelligent decision model
CN113239472B (en) Missile guidance method and device based on reinforcement learning
CN113052252B (en) Super-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN111158881B (en) Data processing method, device, electronic equipment and computer readable storage medium
CN112949850B (en) Super-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN116301022A (en) Unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning
CN116339349A (en) Path planning method, path planning device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination