CN113962362A - Reinforcement learning model training method, decision-making method, device, equipment and medium - Google Patents

Reinforcement learning model training method, decision-making method, device, equipment and medium

Info

Publication number
CN113962362A
CN113962362A (application CN202111211093.9A)
Authority
CN
China
Prior art keywords
target
training
agent
reinforcement learning
learning model
Prior art date
Legal status
Pending
Application number
CN202111211093.9A
Other languages
Chinese (zh)
Inventor
刘建林
解鑫
袁晓敏
许铭
刘颖
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111211093.9A
Publication of CN113962362A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The present disclosure provides a reinforcement learning model training method, a decision-making method, an apparatus, a device and a medium, relating to the field of computer technology and, in particular, to artificial intelligence technologies such as deep learning and reinforcement learning. The model training method includes the following steps: acquiring a first agent parameter of a first agent obtained by performing reinforcement learning training according to a first scene; determining, according to the first agent parameter, a second agent parameter of a second agent of a target reinforcement learning model in a second scene; training an auxiliary learning network in the second scene according to the second agent parameter; and training the target reinforcement learning model according to the target auxiliary learning network obtained by the training. The embodiments of the present disclosure can improve the training efficiency and adaptability of the reinforcement learning model and the accuracy of the model, and further improve the accuracy of decisions made on the basis of the reinforcement learning model.

Description

Reinforcement learning model training method, decision-making method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to artificial intelligence technologies such as deep learning and reinforcement learning, and further relates to a reinforcement learning model training method, a decision-making method, an apparatus, a device and a medium.
Background
Reinforcement learning is a mathematical framework for policy learning through experience. It is one of the paradigms and methodologies of machine learning and a technical branch of artificial intelligence, used to describe and solve the problem of an agent learning a policy that maximizes return or achieves a specific goal while interacting with an environment. Reinforcement learning is a learning method that does not rely on prior knowledge or pre-collected data: its main working mode is that a policy model continuously attempts actions in the environment, obtains learning information by receiving the environment's reward for each action, updates the model parameters accordingly, and finally converges.
Disclosure of Invention
The embodiments of the present disclosure provide a reinforcement learning model training method, a decision-making method, an apparatus, a device and a medium, which can improve the training efficiency and adaptability of a reinforcement learning model and the accuracy of the model, thereby further improving the accuracy of decisions made on the basis of the reinforcement learning model.
In a first aspect, an embodiment of the present disclosure provides a reinforcement learning model training method, including:
acquiring a first agent parameter of a first agent obtained by performing reinforcement learning training according to a first scene;
determining second agent parameters of a second agent of the target reinforcement learning model in a second scene according to the first agent parameters;
training an auxiliary learning network in the second scene according to the second agent parameters;
and training the target reinforcement learning model according to the target auxiliary learning network obtained by training.
In a second aspect, an embodiment of the present disclosure provides a decision method, including:
acquiring state data of a target scene;
inputting the state data into a target reinforcement learning model to obtain decision data aiming at the target scene;
the target reinforcement learning model is obtained by training through the reinforcement learning model training method of the first aspect.
In a third aspect, an embodiment of the present disclosure provides a reinforcement learning model training apparatus, including:
the first agent parameter acquisition module is used for acquiring first agent parameters of a first agent obtained by performing reinforcement learning training according to a first scene;
the second agent parameter determining module is used for determining second agent parameters of a second agent of the target reinforcement learning model in a second scene according to the first agent parameters;
the auxiliary learning network training module is used for training an auxiliary learning network in the second scene according to the second agent parameters;
and the target reinforcement learning model training module is used for training the target reinforcement learning model according to the target auxiliary learning network obtained by training.
In a fourth aspect, an embodiment of the present disclosure provides a decision apparatus, including:
the state data acquisition module is used for acquiring state data of a target scene;
the decision data acquisition module is used for inputting the state data into a target reinforcement learning model to obtain decision data for the target scene;
the target reinforcement learning model is obtained by training through the reinforcement learning model training device of the third aspect.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a reinforcement learning model training method as provided in the embodiments of the first aspect or a decision-making method as provided in the embodiments of the second aspect.
In a sixth aspect, the disclosed embodiments also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the reinforcement learning model training method provided in the first aspect or the decision-making method provided in the second aspect.
In a seventh aspect, this disclosure also provides a computer program product, which includes a computer program that, when being executed by a processor, implements the reinforcement learning model training method provided in the first aspect or the decision-making method provided in the second aspect.
According to the embodiments of the present disclosure, a second agent parameter of a second agent of a target reinforcement learning model in a second scene is determined from a first agent parameter of a first agent obtained by reinforcement learning training in a first scene, an auxiliary learning network is trained in the second scene according to the second agent parameter, and the target reinforcement learning model is trained according to the trained auxiliary learning network. After the training of the target reinforcement learning model is completed, state data of a target scene can be input into the target reinforcement learning model to obtain decision data for the target scene. This solves problems of existing reinforcement learning model training methods such as low training efficiency, poor adaptability and low model accuracy, improves the training efficiency and adaptability of the reinforcement learning model and the accuracy of the model, and further improves the accuracy of decisions made on the basis of the reinforcement learning model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a reinforcement learning model training method provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of a reinforcement learning model training method provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart of a decision method provided by an embodiment of the present disclosure;
FIG. 4 is a block diagram of a reinforcement learning model training apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a decision-making device provided by an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device for implementing a reinforcement learning model training method or a decision-making method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Traditional reinforcement learning suffers from a huge solution space, training that is difficult to converge, and unstable output. As a result, it can currently be applied only to simple scenes with low computational requirements and low trial-and-error cost, and is difficult to apply to complex scenes (for example, scenes with a large number of agents, excessively high dimensionality of the environment information, or a large number of executable actions). The training process of reinforcement learning also depends on a relatively stable environment; when the environment information changes, an agent trained in the previous environment is often unable to adapt to the new environment, so an existing reinforcement learning model can hardly make reasonable and correct decisions for the new environment.
Specifically, a reinforcement learning agent trained in a simple scene cannot directly cope with a complex scene, while training an agent directly in a complex scene converges with great difficulty, and the resulting agent still struggles to adapt to a changing environment. Training of reinforcement learning agents in complex environments usually adopts the same method as traditional reinforcement learning and pursues convergence by continuously increasing computing power and training time. When the environment changes, training often has to be restarted, so the training cost is high and the training efficiency is low.
In one example, fig. 1 is a flowchart of a reinforcement learning model training method provided by an embodiment of the present disclosure, which may be applied to a case of efficiently training a reinforcement learning model by using an auxiliary learning network, where the method may be performed by a reinforcement learning model training apparatus, which may be implemented by software and/or hardware, and may be generally integrated in an electronic device. The electronic device may be a terminal device or a server device, and the specific device type of the electronic device is not limited in the embodiments of the present disclosure. Accordingly, as shown in fig. 1, the method comprises the following operations:
s110, obtaining a first agent parameter of a first agent obtained by performing reinforcement learning training according to a first scene.
Wherein the first scene may be a scene type of an environment. Alternatively, the first scene may be a simple scene corresponding to a simple environment. The first agent may be an agent in a reinforcement learning model trained through the first scenario, and the agent may be a trained mature agent. The first agent parameter may be a model parameter of the first agent.
In the embodiment of the present disclosure, a reinforcement learning model suitable for a second scenario may be trained based on a first agent in a reinforcement learning model obtained by reinforcement learning in a first scenario. Therefore, after the first agent is obtained by performing reinforcement learning training based on the first scene, the first agent parameters of the first agent can be obtained.
And S120, determining second agent parameters of a second agent of the target reinforcement learning model in a second scene according to the first agent parameters.
Wherein the second scene may be a scene type associated with the first scene. For example, the second scenario may be a similar complex scenario derived on the basis of the first scenario. The target reinforcement learning model may be a reinforcement learning model obtained by reinforcement learning based on the second scenario. The second agent may be an agent of the target reinforcement learning model, which may be an untrained agent. The second agent parameter may be a model parameter of the second agent.
Correspondingly, after the first agent parameters of the first agent obtained by reinforcement learning training according to the first scene are obtained, the second agent parameters of the second agent of the target reinforcement learning model in the second scene can be determined according to the first agent parameters. For example, the initialization operation for the second agent may be implemented directly using part or all of the first agent parameters as the second agent parameters. Or, the parameters of part or all of the first agent can be adaptively adjusted according to the requirements of the second scene, and the adjusted parameters are used as the parameters of the second agent to realize the initialization operation of the second agent. The embodiments of the present disclosure do not limit the specific implementation of determining the second agent parameter according to the first agent parameter. Therefore, the second agent parameters of the second agent are determined through the first agent parameters, and the transfer learning mode of the agent is realized.
In a specific example, assume that there is already an agent X in the reinforcement learning model trained in a simple scenario, which can be an initial agent. Correspondingly, the agent X can be utilized for optimization and improvement to obtain the agent A, so that the agent A has better performance in a complex scene and can adapt to a dynamically changing environment.
And S130, training an auxiliary learning network in the second scene according to the second agent parameters.
The auxiliary learning network can be used to assist in training the second agent independently, free from the influence of the environmental state. Optionally, the auxiliary learning network may be any type of learning network, such as a neural network or a convolutional neural network, and the embodiments of the present disclosure do not limit the specific network type of the auxiliary learning network.
It can be appreciated that in a conventional reinforcement learning training process, when an agent completes an initial configuration, the agent often needs to interact with the environment iteratively. Specifically, the agent selects an action for the environment, the state of the environment changes after receiving the action, and simultaneously a reinforcement signal (reward or punishment) is generated and fed back to the agent, the agent selects the next action according to the reinforcement signal and the current state of the environment, the selection principle is to increase the probability of being reinforced, and the iterative interaction process is repeated until the reinforcement learning training is finished. This reinforcement learning training process requires a large amount of computing power and training time, and is costly and inefficient.
In the embodiment of the disclosure, in order to improve the training efficiency of the target reinforcement learning model and the adaptability to the environmental scenario, after determining the parameters of the second agent of the target reinforcement learning model, an auxiliary learning network may be introduced to train the second agent independently in the second scenario. The second agent is trained in an auxiliary mode through the auxiliary learning network, and the training process of the second agent can be separated from a complex iterative interaction process. It should be noted that the learning-assisted network needs to be trained in the second scenario. Therefore, although the assisted learning network does not need to perform iterative interaction with the environment frequently, the assisted learning network can also realize a fast training process of the second agent by considering environmental factors in the second scene, such as sampling sample data by taking the second scene as a reference. Therefore, the iterative interaction process of the second agent and the complex environment can be reduced by assisting the independent training of the second agent through the auxiliary learning network, so that the calculation power and the time cost required by the training process are reduced, the training efficiency of the second agent is improved, and the training efficiency of the target reinforcement learning model is improved.
And S140, training the target reinforcement learning model according to the target auxiliary learning network obtained through training.
The target auxiliary learning network is the auxiliary learning network obtained by training in the second scene according to the second agent parameters.
Correspondingly, after the training of the auxiliary learning network is completed, the preliminarily trained second agent can be determined from the trained auxiliary learning network. The reinforcement learning training of the target reinforcement learning model is then continued with this preliminarily trained second agent until the training of the target reinforcement learning model is completed.
In summary, the agent of a reinforcement learning model trained in a simple scene is migrated to the agent of the target reinforcement learning model in a complex scene, and the auxiliary learning network, freed from the strong constraints of the environmental scene, assists in independently training the agent of the target reinforcement learning model. By training with a known agent in a complex scene, the agent can effectively adapt to a changing environment, which improves the decision accuracy of the agent and therefore the accuracy of the target reinforcement learning model; at the same time, the training cost of the agent in the complex scene is reduced and its training efficiency increased, thereby improving the training efficiency, adaptability and accuracy of the reinforcement learning model in complex scenes.
According to the embodiment of the disclosure, the second agent parameter of the second agent of the target reinforcement learning model in the second scene is determined according to the first agent parameter of the first agent obtained by reinforcement learning training in the first scene, so that the auxiliary learning network is trained in the second scene according to the second agent parameter, and the target reinforcement learning model is trained according to the target auxiliary learning network obtained by training, thereby solving the problems of low training efficiency, adaptability and accuracy of the model in the existing reinforcement learning model training method, and improving the training efficiency, adaptability and accuracy of the reinforcement learning model.
In an example, fig. 2 is a flowchart of a reinforcement learning model training method provided in the embodiment of the present disclosure, and the embodiment of the present disclosure performs optimization and improvement on the basis of the technical solutions of the above embodiments, and provides various specific optional implementation manners for determining the second agent parameter, training the auxiliary learning network, and the target reinforcement learning model.
A reinforcement learning model training method as shown in fig. 2 includes:
s210, obtaining a first agent parameter of a first agent obtained by reinforcement learning training according to a first scene.
In an optional embodiment of the present disclosure, the first agent parameter comprises a control strategy parameter of the first agent.
In an embodiment of the present disclosure, the model of the first agent may include two neural networks: one responsible for outputting the control strategy, i.e., the actor network, and the other responsible for evaluating how well the current control strategy performs in the current environmental state, i.e., the critic network. Accordingly, the control policy parameters may be the network parameters of the actor network, and the network parameters of the critic network may be referred to as evaluation policy parameters. Optionally, the control strategy parameters of the first agent may be used for agent transfer learning.
And S220, determining second agent parameters of a second agent of the target reinforcement learning model in a second scene according to the first agent parameters.
Correspondingly, step S220 may specifically include the following operations:
s221, initializing the second agent parameters according to the first agent parameters.
In the disclosed embodiment, the model of the second agent may also include two neural networks. The input of the neural network model responsible for outputting the control strategy may be information of a corresponding moment in the second scenario, and the output is an action that the second agent should perform at the corresponding moment.
In the transfer learning of the agent, the control policy parameters of the second agent may be initialized with the control policy parameters of the first agent; for example, the control policy parameters of the first agent may be used directly as the control policy parameters of the second agent. The evaluation policy parameters of the second agent do not need to be initialized from the evaluation policy parameters of the first agent; the critic network parameters can be initialized in a conventional manner, thereby initializing the evaluation policy parameters.
S222, determining a target second agent sub-parameter in the second agent parameters.
The target second agent sub-parameter may be a relevant parameter capable of ensuring the decision performance of the agent in the first scenario. It is understood that the type of the first scenario is different, and the corresponding target second agent sub-parameters may also be different.
And S223, freezing the target second agent sub-parameter.
Correspondingly, after the control strategy parameters of the second agent are initialized with the control strategy parameters of the first agent, the part of the second agent parameters that maintains basic decision performance for the first scene can be determined as the target second agent sub-parameters and frozen, so that these sub-parameters remain constant while the second agent is trained. The benefit of this arrangement is that the trained second agent retains its basic performance in the first scene and can be applied simultaneously to the decision processes of the first and second scenes, which further improves the adaptability of the second agent to different environmental scenes.
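As an illustrative sketch of the parameter transfer and freezing described above (not the patent's own code), the following Python/PyTorch fragment copies a trained actor's parameters into the second agent and freezes a subset of them; the network architecture, dimensions, and the choice of which layer to freeze are assumptions made only for illustration.

```python
import torch.nn as nn

def build_actor(state_dim, action_dim, hidden=128):
    # Simple MLP actor; the architecture is an assumption for illustration.
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, action_dim),
    )

# First agent: actor trained by reinforcement learning in the simple (first) scene.
first_actor = build_actor(state_dim=16, action_dim=4)
# ... assume first_actor has already been trained ...

# Second agent: actor with the same structure, used in the second (complex) scene.
second_actor = build_actor(state_dim=16, action_dim=4)

# S221: initialize the second agent's control-strategy parameters from the first agent.
second_actor.load_state_dict(first_actor.state_dict())

# S222/S223: freeze the sub-parameters that preserve performance in the first scene.
# Which layers these are depends on the scenes; freezing the first layer is only an example.
for name, param in second_actor.named_parameters():
    if name.startswith("0."):          # parameters of the first Linear layer
        param.requires_grad = False    # kept constant during subsequent training
```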
And S230, initializing the network parameters of the auxiliary learning network according to the second agent parameters.
Accordingly, after the second agent parameters of the second agent are determined from the first agent parameters, the network parameters of the auxiliary learning network may be further initialized with the second agent parameters. In the disclosed embodiment, the auxiliary learning network may be a classification network, i.e., given an input, the network outputs a unique value representing the class of the input. In the second scene, the input of the auxiliary learning network may be defined as the information at the current moment, and the output is the probability of each action that should be executed at the current moment. That is, the input of the auxiliary learning network is an information state matrix formed from the current state of the environment in the second scene, and the output is a one-dimensional vector whose length equals the number of actions, each element of which represents the probability of executing the corresponding action. Optionally, the inputs and outputs of the auxiliary learning network and of the second agent model are kept consistent.
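Continuing the sketch above, an auxiliary learning network with the classification-style interface described here might look as follows; reusing the actor architecture so that its parameters can be copied directly is an illustrative assumption, not the patent's literal design.

```python
import torch
import torch.nn as nn

class AuxiliaryLearningNetwork(nn.Module):
    """Classification network: state information in, per-action probabilities out."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        # One-dimensional vector whose length equals the number of actions.
        return torch.softmax(self.body(state), dim=-1)

aux_net = AuxiliaryLearningNetwork(state_dim=16, action_dim=4)
# S230: initialize the auxiliary network from the second agent's parameters.
# 'second_actor' is the actor sketched earlier; copying works because the body
# mirrors its architecture (an illustrative assumption, not the patent's code).
aux_net.body.load_state_dict(second_actor.state_dict())
```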
And S240, acquiring local sample data through a local greedy strategy in the second scene.
Wherein the local sample data may be key sample data in the second scene. The key sample data may be the useful data that has the greatest effect on model training among all sample data. Alternatively, the local sample data may include both input and output types of data.
And S250, training the auxiliary learning network after the network parameters are initialized according to the local sample data.
In order to train the auxiliary learning network, data sampling can be performed in the second scene using a local greedy strategy such as a greedy algorithm, so that useful state information is screened out of the second scene in advance and the optimal control strategy under the collected state information in the second scene is obtained. Local sample data is then formed from the data collected in the second scene together with the optimal control strategy under that data, and the auxiliary learning network whose network parameters have been initialized is trained with this local sample data in a supervised-learning manner. Optionally, the labels of the local sample data may be in one-hot form, with the cross-entropy loss Lcross(Y, P) used as the training objective function. When the parameters are updated, Lcross(Y, P) can be minimized by stochastic gradient descent (SGD) to update the parameters of the auxiliary learning network. The condition for terminating the training of the auxiliary learning network may be that a set number of iterations is reached or that the model loss falls below a threshold.
The objective function for training the auxiliary learning network is the cross-entropy loss Lcross(Y, P):

Lcross(Y, P) = −(1/n) · Σ_{i=1}^{n} Σ_{k} y_{i,k} · log(p_{i,k})

where Lcross(Y, P) represents the model loss, Y represents the label data of the local sample data, P represents the model output, n represents the number of local samples, k the number of categories of the local sample data, y_{i,k} the ground-truth label of sample i for category k, and p_{i,k} the corresponding model output value.
In the embodiment of the present disclosure, once the auxiliary learning network has been trained, the preliminary training of the second agent is effectively complete: the auxiliary learning network can already perform preliminary agent control in the second scene, but its control performance in the second scene is not yet ideal, and the second agent still needs further reinforcement learning training in combination with the second scene.
According to the above technical solution, the auxiliary learning network is used as an intermediate network to assist in training the second agent, so that the second agent is trained independently with the aid of useful information pre-screened from the second scene and preliminary agent parameters of the second agent are obtained, which can improve the training efficiency and convergence speed of the target reinforcement learning model.
In one specific example, the auxiliary learning network is trained according to the supervised procedure described above; the original publication gives the detailed training algorithm only as an image.
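A minimal sketch of that supervised procedure, continuing the previous sketches, is given below; the greedy sampling routine, the batch handling, and the stopping thresholds are placeholders assumed only for illustration.

```python
import torch
import torch.nn.functional as F

def collect_local_samples(num_samples):
    # S240: placeholder for the local greedy strategy that pre-screens useful
    # states in the second scene and records the best action found for each.
    states, actions = [], []
    for _ in range(num_samples):
        s = torch.randn(16)             # hypothetical environment state
        a = torch.randint(0, 4, (1,))   # hypothetical greedy-chosen action label
        states.append(s)
        actions.append(a)
    return torch.stack(states), torch.cat(actions)

optimizer = torch.optim.SGD(aux_net.parameters(), lr=1e-2)
states, labels = collect_local_samples(1024)

max_iters, loss_threshold = 500, 1e-3
for it in range(max_iters):
    probs = aux_net(states)
    # Cross-entropy loss Lcross(Y, P); log applied to the predicted probabilities.
    loss = F.nll_loss(torch.log(probs + 1e-8), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < loss_threshold:    # second termination condition
        break
```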
It should be noted that fig. 2 is only a schematic illustration of one implementation, and steps S230 and S240 are not restricted to the order shown: step S230 may be executed before step S240, step S240 may be executed before step S230, or the two may be executed simultaneously.
In an optional embodiment of the present disclosure, the training of the supplementary learning network in the second scenario according to the second agent parameter may include: acquiring local sample data through a local greedy strategy in the second scene; acquiring target correction sample data; determining sample data weights of the local sample data and the target correction sample data; and training the auxiliary learning network according to the local sample data, the target correction sample data and the sample data weight.
The target correction sample data may be input/output sample pairs in which, for part of the input sample data in the second scene, the output is determined by expert experience; it may be used to correct part of the local sample data. The sample data weight may be the proportion assigned to each type of sample data and may reflect the importance of the sample data.
It can be understood that when data sampling is performed in the second scene, the collected local sample data may not be comprehensive and may be insufficiently representative. Therefore, in order to enrich the sample data, improve its accuracy and improve the precision of the auxiliary learning network, in addition to the local sample data acquired by the local greedy strategy in the second scene, certain states in the second scene can be selected, and experts can determine from experience the output that should be produced in those states, which is used as the final output, thereby obtaining the target correction sample data. Since the target correction sample data carries a certain authority and is strongly related to the second scene, the sample data weight assigned to it may be higher than that of the local sample data. Correspondingly, when the auxiliary learning network is trained, the target correction sample data can be added to the local sample data, and sample data weights can be set for the local sample data and the target correction sample data respectively, so that the auxiliary learning network is trained on the sample data pool formed by the weighted local sample data and target correction sample data.
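One possible way to realize this weighted sample pool, continuing the previous sketches, is to apply per-sample weights inside the cross-entropy loss; the specific weights 1.0 and 2.0 and the expert-data stand-in below are assumptions.

```python
import torch
import torch.nn.functional as F

# Local samples from the greedy strategy and expert-corrected samples.
local_states, local_labels = collect_local_samples(1024)
expert_states, expert_labels = collect_local_samples(64)   # stand-in for expert-chosen outputs

states = torch.cat([local_states, expert_states])
labels = torch.cat([local_labels, expert_labels])
# Expert-corrected data is strongly tied to the second scene, so it gets a larger weight.
weights = torch.cat([torch.full((len(local_labels),), 1.0),
                     torch.full((len(expert_labels),), 2.0)])

probs = aux_net(states)
per_sample = F.nll_loss(torch.log(probs + 1e-8), labels, reduction="none")
loss = (weights * per_sample).sum() / weights.sum()   # weighted cross-entropy over the pool
```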
And S260, initializing model parameters in the target reinforcement learning model according to the trained network parameters of the target auxiliary learning network.
S270, training the target reinforcement learning model in the second scene.
In order to further improve the model accuracy of the target reinforcement learning model, after the training of the auxiliary learning network is completed, the trained auxiliary learning network may be used to initialize the model parameters in the target reinforcement learning model, and the initialized target reinforcement learning model is continuously trained in the second scene, so as to continuously update the model parameters of the target reinforcement learning model.
In the above technical solution, the agent parameters of the second agent in the target reinforcement learning model are first trained independently with the aid of the auxiliary learning network, and the preliminarily trained second agent is then placed in the second scene for further reinforcement learning training, which can improve the training efficiency and convergence speed of the target reinforcement learning model.
In an optional embodiment of the present disclosure, the model parameters in the target reinforcement learning model may include control strategy parameters; the initializing the model parameters in the target reinforcement learning model according to the trained network parameters of the target auxiliary learning network may include: initializing control strategy parameters in the target reinforcement learning model according to the network parameters of the target auxiliary learning network obtained by training; and/or the model parameters in the target reinforcement learning model comprise evaluation strategy parameters, and the method further comprises the following steps: and initializing the evaluation strategy parameters in the target reinforcement learning model according to preset configuration parameters.
The preset configuration parameters may be parameters for initializing configuration of evaluation strategy parameters in the reinforcement learning model, and may be set according to application scene types of the reinforcement learning model, and specific parameter contents of the preset configuration parameters are not limited in the embodiment of the present disclosure.
Similarly, after the training of the auxiliary learning network is completed, the control strategy parameters in the target reinforcement learning model, that is, the actor(x) in the target reinforcement learning model, may be initialized from the network parameters of the trained target auxiliary learning network. The actor network may be configured to output, according to the input current state, the probability of the action to be performed in that state. The evaluation strategy parameters in the target reinforcement learning model may be initialized in a conventional manner, that is, the critic(x) in the target reinforcement learning model may be initialized according to preset configuration parameters. The critic network may be used to fit a state value function whose input is the current state and whose output is the value of the current state. The actor network and the critic network may be two types of neural networks. After the initialization of the model parameters of the target reinforcement learning model is completed, the reinforcement learning training process can be continued in combination with the second scene, and the parameters of the actor network and the critic network are continuously updated during training. In the training process, the second agent continuously interacts with the environment through the output of the actor network to obtain training samples; when enough training samples have been collected, the parameters of the actor network and the critic network are updated.
The critic network can be trained with the following loss function Lcritic:

Lcritic = E[(G_t − V_π(s))²]

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … + γ^k·R_{t+k+1}

where V_π(s) represents the output computed by the critic network for the current input s, s represents the current state, π represents the current policy, E[·] denotes the mathematical expectation, G_t represents the cumulative discounted return used as the target value, t represents the current time, R_{t+1}, R_{t+2}, R_{t+3}, …, R_{t+k+1} represent the reward values obtained by executing actions at times t+1, t+2, t+3, …, t+k+1, γ represents a discount factor, which may be a decimal between 0 and 1, and k represents the number of steps considered from the current time t. The training goal of the entire critic network is to minimize the difference between the network's estimate and the true value.
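As a sketch of the quantities just defined, the k-step return G_t and the critic loss could be computed as follows; the critic architecture, the state dimension, and the reward values are illustrative assumptions.

```python
import torch

def k_step_return(rewards, gamma=0.99):
    # G_t = R_{t+1} + γ·R_{t+2} + ... + γ^k·R_{t+k+1}
    g, discount = 0.0, 1.0
    for r in rewards:            # rewards observed from time t+1 to t+k+1
        g += discount * r
        discount *= gamma
    return g

# critic(s) is assumed to be a small network mapping a state to a scalar value.
critic = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

states = torch.randn(32, 16)                                    # sampled states
returns = torch.tensor([k_step_return([1.0, 0.5, 0.2])] * 32)   # hypothetical targets G_t
values = critic(states).squeeze(-1)
critic_loss = ((returns - values) ** 2).mean()                  # Lcritic = E[(G_t - V_pi(s))^2]
```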
The following loss function Lactor can be adopted in the course of actor network training:
Lactor=E[log(πθ(a|s))A(s,a)]
A(s,a)=E[Gt|St=s,At=a]-E[Gt|St=s]
where π_θ(a|s) represents the output computed by the actor network for the current input s (i.e., the probability of performing action a), A(s, a) represents the advantage function, and S_t represents the state at time t. The actor network may comprise two output layers, corresponding respectively to the mean and the variance of the action probability distribution; the dimension of each output layer is consistent with the dimension of the action, and actions are generated by sampling from the action probability distribution.
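A sketch of the actor loss with a Gaussian policy head having two output layers (mean and variance), as described above; the network shapes and the placeholder advantages are assumptions.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, action_dim)      # mean of the action distribution
        self.logvar_head = nn.Linear(hidden, action_dim)    # (log-)variance of the distribution

    def forward(self, state):
        h = self.body(state)
        mean = self.mean_head(h)
        std = torch.exp(0.5 * self.logvar_head(h))
        return torch.distributions.Normal(mean, std)

actor = GaussianActor(state_dim=16, action_dim=4)
dist = actor(torch.randn(32, 16))
actions = dist.sample()                                   # action generation by sampling
advantages = torch.randn(32, 1)                           # placeholder for A(s, a)
# Lactor = E[log(pi_theta(a|s)) * A(s, a)]; sign conventions depend on the optimizer setup.
actor_loss = (dist.log_prob(actions) * advantages).mean()
```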
In the training process of the target reinforcement learning model, a sample buffer (experience pool) can be set up to store the sampled data; considering that the environment may change dynamically and the actor network parameters are updated continuously, the data in the experience pool may come from different actor network parameters. At the same time, to ensure that each learning step uses samples from the latest environment information, the collected samples are used only once during training and the sample buffer is emptied after use; otherwise training would be strongly influenced by stale environment information. However, this operation can make training unstable, so the KL divergence can be used to measure the difference between the old and the new parameters, and a threshold can be set for truncation, ensuring that the parameters are updated within a limited range and that training remains stable.
At this time, the loss function of the actor network is updated as:

Lactor = E[log(π_θnew(a|s)) · A(s, a)] − β · KL(π_θold(·|s), π_θnew(·|s))

where β represents a weight factor, π_θnew(a|s) represents the output computed by the new actor network for the current input s, π_θold(a|s) represents the output computed by the previous actor network for the current input s, and KL(P, Q) represents the KL-divergence computation, defined as:

KL(P, Q) = Σ_{i=1}^{n} P_i · log(P_i / Q_i)

where P and Q denote two different probability distributions, P_i and Q_i represent the data sampled from P and Q respectively, and n represents the number of sampled data.
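The KL-regularized update can be sketched as follows for a categorical action distribution; the value of β, the distribution type, and the sign convention of the penalized objective are assumptions for illustration.

```python
import torch
from torch.distributions import Categorical, kl_divergence

beta = 0.5                                             # weight factor for the KL term (assumed value)
logits_new = torch.randn(32, 4, requires_grad=True)    # stand-in for the new actor's outputs
logits_old = logits_new.detach() + 0.1 * torch.randn(32, 4)   # previous actor's outputs

pi_new, pi_old = Categorical(logits=logits_new), Categorical(logits=logits_old)
kl = kl_divergence(pi_old, pi_new).mean()              # KL(P, Q) between old and new policies

actions = pi_new.sample()
advantages = torch.randn(32)
policy_term = (pi_new.log_prob(actions) * advantages).mean()
# Penalized objective: keep the new parameters within a bounded distance of the old ones.
actor_objective = policy_term - beta * kl
```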
Although the target reinforcement learning model still needs to be trained in combination with the second scene, its control strategy parameters have already been pre-trained by the auxiliary learning network and therefore already provide basic agent control in the second scene, so the computing power and training time required for training the target reinforcement learning model in the second scene can be greatly reduced and the training efficiency improved.
In an optional embodiment of the present disclosure, the training the target reinforcement learning model in the second scenario may include: acquiring training result data of a second agent in the target reinforcement learning model in the current training round; performing data screening on the training result data according to the target intervention data to obtain target training result data; classifying the target training result data according to an intervention discriminator to obtain intervention result data; and training the target reinforcement learning model according to the target training result data and the intervention result data.
The training result data may be a result output by the second agent in the target reinforcement learning model in the current training turn, and may include, for example, information of the current time, an action that the second agent should perform at the current time, and the like. The target intervention data can be used for screening partial data in the training result data, and can be intervention data determined according to expert experience. The target training result data may be training result data that requires an intervention correction. The intervention arbiter may be configured to perform classification determination on the data filtered according to the target intervention data. The intervention result data is the result obtained by classifying and judging the data screened according to the target intervention data by the intervention discriminator.
Considering that the solution space of the target reinforcement learning model is more complex than that of the first scene in the training process, the output of the model has many useless actions, and the useless actions cause the model to be difficult to converge. Therefore, in order to improve the convergence speed of the target reinforcement learning model, expert intervention can be added in the training process of the target reinforcement learning model. Specifically, in each training turn, training result data of the second agent in the target reinforcement learning model in the current training turn may be obtained, and data screening may be performed on the obtained training result data according to the target intervention data, specifically, training result data including useful actions may be screened out as the target training result data. For example, the target intervention data may be that the action a1 output at the current time t1 is a useful action, and if the training result data is the action a1 output at the current time t1, the training result data may be screened as the target training result data.
Correspondingly, after the target training result data are obtained by screening, the intervention discriminator can be used to classify the target training result data to obtain the intervention result data. The intervention discriminator may be a binary classification network whose input is the information at the corresponding moment in the second scene and the action the agent should perform at that moment (i.e., the state-action pair formed by the environment information and the agent action), and whose output is whether intervention is required (yes or no, which may be represented by 1 and 0, respectively). The intervention discriminator may first be trained before it assists in training the target reinforcement learning model. Optionally, the cross-entropy loss Lcross(Y, P) may be used as the loss function when training the intervention discriminator, and Lcross(Y, P) can be minimized by stochastic gradient descent when updating the parameters.
Correspondingly, in the process of training the target reinforcement learning model, after the target training result data and the intervention result data are obtained according to the intervention measures, the target reinforcement learning model can be reinforced and trained according to the target training result data and the intervention result data, so that the training efficiency and the convergence speed of the target reinforcement learning model are increased.
In the embodiment of the disclosure, besides useful actions can be screened according to the target intervention data, obviously erroneous and/or useless actions can be screened according to the target intervention data, and a high penalty value is given to the obviously erroneous actions, so that the target reinforcement learning model executes the obviously erroneous actions as few as possible, and directly deletes the useless and/or obviously erroneous actions in the action space.
For example, the target intervention data may be the action that should be output at the current time t1 as a1, and the training result data may be the action output at the current time t1 as a2, and the training result data may be filtered out as an obviously erroneous action.
In one specific example, the intervention discriminator is trained as a binary classifier on state-action pairs in the manner described above; the original publication gives the detailed training algorithm only as an image.
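A minimal sketch of the intervention discriminator as a binary classifier over state-action pairs trained with cross-entropy; the data source and the labels below are placeholders, not the patent's data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterventionDiscriminator(nn.Module):
    """Binary network: (state, action) pair in, 'intervention needed?' out."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))   # logits for {0: no, 1: yes}

dn = InterventionDiscriminator(state_dim=16, action_dim=4)
opt = torch.optim.SGD(dn.parameters(), lr=1e-2)

states = torch.randn(256, 16)
actions = torch.randn(256, 4)
needs_intervention = torch.randint(0, 2, (256,))   # placeholder expert-derived labels

for _ in range(200):
    logits = dn(states, actions)
    loss = F.cross_entropy(logits, needs_intervention)   # Lcross(Y, P)
    opt.zero_grad()
    loss.backward()
    opt.step()
```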
in an optional embodiment of the present disclosure, the training the target reinforcement learning model according to the target training result data and the intervention result data may include: determining a target loss function according to the initial loss function and the intervention result data; taking the target training result data and the intervention result data as agent input data of a second agent in the target reinforcement learning model, and inputting the agent input data into the second agent to obtain agent output data of the second agent; updating agent parameters of the second agent based on the objective loss function, the agent input data, and the agent output data.
Wherein the initial loss function may be a loss function employed before the target reinforcement learning model adds no intervention. The target loss function may be a loss function employed after the target reinforcement learning model adds intervention.
Specifically, in the training process of the target reinforcement learning model, the intervention result data output by the intervention discriminator can be used to add an intervention penalty term to the original initial loss function of the target reinforcement learning model; during training, continuously reducing the intervention penalty allows the model to converge faster. Correspondingly, the target training result data and the intervention result data obtained through intervention can be used as the agent input data of the second agent in the target reinforcement learning model and input into the second agent, thereby obtaining the agent output data of the second agent. The target reinforcement learning model may then update the agent parameters of the second agent based on the target loss function, the agent input data and the agent output data. After the current training round is finished, the training round is advanced and the operation of using the target training result data and the intervention result data as the agent input data of the second agent is executed again, until the training termination condition of the target reinforcement learning model is reached. Optionally, the training termination condition of the target reinforcement learning model may be that the current training round reaches the total number of training rounds.
In an optional embodiment of the present disclosure, the determining a target loss function according to the initial loss function and the intervention result data may include: determining the target loss function based on the following formula:
Ltotal = 0.5*Lactor + 0.5*Lcritic + α(t)*DN(s, action)

where Ltotal represents the target loss function, α(t) represents a penalty term, DN(s, action) represents the output of the intervention discriminator, α(t)*DN(s, action) represents the intervention result data, epochs2 represents the total number of training iterations of the target reinforcement learning model, t represents the current iteration number, and C represents a constant term, which may be a relatively large constant; α(t) is expressed in terms of t, epochs2 and C (its exact expression is given only as an image in the original publication). The initial loss function Ltotal0 may be: Ltotal0 = 0.5*Lactor + 0.5*Lcritic.
From the above objective loss function, when the output of the intervention arbiter is 1, the action performed by the model in the current state s is considered to be unreasonable, and intervention by α (t) is required. In the early stage of training, the information of intervention can play a leading role, and the intervention effect is weaker and weaker as the training is carried out.
In one example, the target reinforcement learning model is trained with the intervention discriminator in the manner described above; the original publication gives the detailed training procedure only as an image.
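A sketch of one update step on the combined loss Ltotal is given below; since the patent gives α(t) only as an image, the linearly decaying schedule used here is an assumption chosen so that intervention dominates early and weakens later, as described above.

```python
import torch

epochs2 = 1000          # total number of training iterations
C = 10.0                # large constant term (assumed value)

def alpha(t):
    # Assumed decay schedule: intervention dominates early and weakens later.
    return C * (1.0 - t / epochs2)

def training_step(t, actor_loss, critic_loss, dn_output):
    # dn_output: intervention discriminator output for the sampled (s, action) pairs,
    # 1 where the executed action is judged unreasonable.
    return 0.5 * actor_loss + 0.5 * critic_loss + alpha(t) * dn_output.float().mean()

# Example usage with placeholder loss values for one iteration.
l_total = training_step(t=10,
                        actor_loss=torch.tensor(0.3),
                        critic_loss=torch.tensor(0.8),
                        dn_output=torch.randint(0, 2, (32,)))
```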
according to the technical scheme, the reinforcement learning intelligent body trained in a simple scene can be enhanced and improved by assisting the training reinforcement learning process through the auxiliary learning network, the transfer learning mode and the intervention strategy, so that the reinforcement learning intelligent body can deal with complex problems in similar scenes, and the training efficiency, the adaptability and the accuracy of the reinforcement learning model are improved.
It should be noted that any permutation and combination between the technical features in the above embodiments also belong to the scope of the present disclosure.
In one example, fig. 3 is a flowchart of a decision method provided by an embodiment of the present disclosure, which may be applied to a decision-making situation using a target reinforcement learning model obtained by efficient training of an auxiliary learning network, and the method may be performed by a decision-making apparatus, which may be implemented by software and/or hardware, and may be generally integrated in an electronic device. The electronic device may be a terminal device or a server device, and the specific device type of the electronic device is not limited in the embodiments of the present disclosure. Accordingly, as shown in fig. 3, the method includes the following operations:
and S310, acquiring state data of the target scene.
The target scene may be a scene associated with the environment used to train the target reinforcement learning model. For example, the target scene may be the scene formed by the training environment itself, or a scene highly correlated with that environment, where "highly correlated" can be understood to mean that the difference between the environment of the target scene and the training environment is negligible, or at least does not interfere with the decision behavior of the target reinforcement learning model. The embodiments of the present disclosure do not limit the specific scene type of the target scene. Illustratively, the target scene may be a complex scene. The state data may be the values of the current environmental state in the target scene.
And S320, inputting the state data into a target reinforcement learning model to obtain decision data aiming at the target scene.
The target reinforcement learning model is obtained by training through any one of the reinforcement learning model training methods. The decision data may be result data obtained by the objective reinforcement learning model making a decision based on the state data.
In the embodiment of the present disclosure, after the training of the target reinforcement learning model is completed, state data of a target scene to which the target reinforcement learning model is applicable can be acquired and input into the target reinforcement learning model. The target reinforcement learning model then makes an intelligent decision according to the state data, determines the action that the agent should execute under that state data, and executes the action, thereby obtaining decision data for the target scene and achieving the goal of maximizing return or reaching a specific target.
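Once the target reinforcement learning model is trained, decision-making reduces to a forward pass over the state data, for example (the model handle and dimensions below are hypothetical):

```python
import torch

def decide(target_model, state_data):
    """Feed target-scene state data to the trained model and return decision data."""
    with torch.no_grad():
        action_probs = target_model(state_data)        # S320: forward pass through the model
    return torch.argmax(action_probs, dim=-1)          # action to execute in the target scene

# Hypothetical usage; 'trained_model' stands for the trained target reinforcement learning model.
trained_model = aux_net                                # placeholder from the earlier sketches
state_data = torch.randn(1, 16)                        # S310: current environmental state
decision_data = decide(trained_model, state_data)
```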
According to the embodiments of the present disclosure, a second agent parameter of a second agent of the target reinforcement learning model in a second scene is determined from a first agent parameter of a first agent obtained by reinforcement learning training in a first scene, an auxiliary learning network is trained in the second scene according to the second agent parameter, and the target reinforcement learning model is trained according to the trained auxiliary learning network. After the training of the target reinforcement learning model is completed, state data of the target scene can be input into the target reinforcement learning model to obtain decision data for the target scene. This solves problems of existing reinforcement learning model training methods such as low training efficiency, poor adaptability and low model accuracy, improves the training efficiency and adaptability of the reinforcement learning model and the accuracy of the model, and further improves the accuracy of decisions made on the basis of the reinforcement learning model.
In an example, fig. 4 is a structural diagram of a reinforcement learning model training apparatus provided in an embodiment of the present disclosure, and the embodiment of the present disclosure is applicable to a case of efficiently training a reinforcement learning model by using an auxiliary learning network, and the apparatus is implemented by software and/or hardware and is specifically configured in an electronic device. The electronic device may be a terminal device or a server device, and the specific device type of the electronic device is not limited in the embodiments of the present disclosure.
Fig. 4 shows a reinforcement learning model training apparatus 400, which includes: a first agent parameter acquisition module 410, a second agent parameter determination module 420, an auxiliary learning network training module 430 and a target reinforcement learning model training module 440. Wherein:
a first agent parameter obtaining module 410, configured to obtain a first agent parameter of a first agent obtained through reinforcement learning training according to a first scenario;
a second agent parameter determining module 420, configured to determine, according to the first agent parameter, a second agent parameter of a second agent of the target reinforcement learning model in a second scene;
an auxiliary learning network training module 430, configured to train an auxiliary learning network in the second scenario according to the second agent parameter;
and the target reinforcement learning model training module 440 is configured to train the target reinforcement learning model according to the trained target auxiliary learning network.
Optionally, the first agent parameter includes a control strategy parameter of the first agent; the second agent parameter determination module 420 is further configured to: initializing the second agent parameters according to the first agent parameters; determining a target second agent sub-parameter of the second agent parameters; freezing the target second agent sub-parameter.
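A minimal sketch of this optional embodiment, assuming PyTorch modules for both agents and assuming (purely for illustration) that the frozen target sub-parameters are identified by a name prefix, could look as follows; which sub-parameters are actually selected depends on the concrete scenes and is not prescribed by this sketch.

```python
import torch.nn as nn


def init_second_agent(first_agent: nn.Module, second_agent: nn.Module,
                      frozen_prefixes=("shared_encoder.",)):
    """Initialize the second agent's parameters from the first agent's control
    strategy parameters, then freeze the selected target second agent sub-parameters."""
    # Initialize the second agent parameters according to the first agent parameters.
    second_agent.load_state_dict(first_agent.state_dict(), strict=False)
    # Freeze the target second agent sub-parameters so they are not updated in training.
    for name, param in second_agent.named_parameters():
        if name.startswith(frozen_prefixes):
            param.requires_grad = False
```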
Optionally, the auxiliary learning network training module 430 is further configured to: initializing network parameters of the auxiliary learning network according to the second agent parameters; acquiring local sample data through a local greedy strategy in the second scene; and training the auxiliary learning network after the network parameters are initialized according to the local sample data.
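For illustration, a sketch of such local sample collection is given below. An ε-greedy rule is assumed as the local greedy strategy, and the environment interface (`reset`, `step`, `sample_random_action`) is an assumed naming scheme rather than part of this disclosure.

```python
import random

import torch


def collect_local_samples(env, aux_net, num_steps, epsilon=0.1):
    """Collect local sample data in the second scene with a local greedy strategy:
    mostly follow the greedy action of the auxiliary learning network, occasionally explore."""
    samples = []
    state = env.reset()
    for _ in range(num_steps):
        if random.random() < epsilon:
            action = env.sample_random_action()  # exploratory action
        else:
            with torch.no_grad():
                q_values = aux_net(torch.as_tensor(state, dtype=torch.float32))
            action = int(q_values.argmax())  # greedy action of the auxiliary network
        next_state, reward, done = env.step(action)
        samples.append((state, action, reward, next_state))
        state = env.reset() if done else next_state
    return samples
```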
Optionally, the auxiliary learning network training module 430 is further configured to: acquiring local sample data through a local greedy strategy in the second scene; acquiring target correction sample data; determining sample data weights of the local sample data and the target correction sample data; and training the auxiliary learning network according to the local sample data, the target correction sample data and the sample data weight.
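A sketch of the weighted training step is shown below; the 0.7/0.3 sample data weights and the mean-squared-error loss are assumptions made only for the example.

```python
import torch.nn.functional as F


def train_aux_net(aux_net, optimizer, local_batch, correction_batch,
                  local_weight=0.7, correction_weight=0.3):
    """Train the auxiliary learning network on local sample data and target
    correction sample data, each weighted by its sample data weight."""
    for weight, (inputs, targets) in ((local_weight, local_batch),
                                      (correction_weight, correction_batch)):
        predictions = aux_net(inputs)
        loss = weight * F.mse_loss(predictions, targets)  # weighted per-source loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```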
Optionally, the target reinforcement learning model training module 440 is further configured to: initializing model parameters in the target reinforcement learning model according to the network parameters of the target auxiliary learning network obtained by training; training the target reinforcement learning model in the second scenario.
Optionally, the model parameters in the target reinforcement learning model include control strategy parameters; the target reinforcement learning model training module 440 is further configured to: initializing control strategy parameters in the target reinforcement learning model according to the network parameters of the target auxiliary learning network obtained by training; and/or the model parameters in the target reinforcement learning model comprise evaluation strategy parameters, and the device further comprises an evaluation strategy parameter initialization module used for: and initializing the evaluation strategy parameters in the target reinforcement learning model according to preset configuration parameters.
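Assuming the target reinforcement learning model exposes `actor` (control strategy) and `critic` (evaluation strategy) sub-modules — names chosen only for this sketch — and interpreting the preset configuration parameters as each layer's default initialization scheme, the initialization could look like:

```python
def init_target_model(target_model, aux_net):
    """Initialize the control strategy parameters from the trained target auxiliary
    learning network and reset the evaluation strategy from preset configuration."""
    # Control strategy (actor) parameters are copied from the auxiliary network.
    target_model.actor.load_state_dict(aux_net.state_dict(), strict=False)
    # Evaluation strategy (critic) parameters are re-initialized from presets,
    # interpreted here as the default reset of each layer.
    for layer in target_model.critic.modules():
        if hasattr(layer, "reset_parameters"):
            layer.reset_parameters()
```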
Optionally, the target reinforcement learning model training module 440 is further configured to: acquiring training result data of a second agent in the target reinforcement learning model in the current training round; performing data screening on the training result data according to the target intervention data to obtain target training result data; classifying the target training result data according to an intervention discriminator to obtain intervention result data; and training the target reinforcement learning model according to the target training result data and the intervention result data.
Optionally, the target reinforcement learning model training module 440 is further configured to: determining a target loss function according to the initial loss function and the intervention result data; taking the target training result data and the intervention result data as agent input data of a second agent in the target reinforcement learning model, and inputting the agent input data into the second agent to obtain agent output data of the second agent; updating agent parameters of the second agent based on the objective loss function, the agent input data, and the agent output data.
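The training round described above might be sketched as follows; the screening rule (membership of the state in the target intervention data) and the sample data structures are assumptions made only for illustration.

```python
def training_round(target_model, optimizer, discriminator, rollout,
                   intervention_data, loss_fn):
    """One training round: screen the training result data with the target
    intervention data, classify the retained samples with the intervention
    discriminator, and update the second agent with the resulting target loss."""
    # Data screening: keep only training results consistent with the target intervention data.
    target_results = [s for s in rollout if s["state"] in intervention_data]
    # Classification: the intervention discriminator scores each (state, action) pair.
    intervention_results = [discriminator(s["state"], s["action"]) for s in target_results]
    # Both kinds of data are fed to the second agent; the target loss drives the update.
    loss = loss_fn(target_model, target_results, intervention_results)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```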
Optionally, the target reinforcement learning model training module 440 is further configured to: determining the target loss function based on the following formula:
L_total = 0.5*L_actor + 0.5*L_critic + α(t)*DN(s, action)
[Figure BDA0003308963430000201: image-only formula defining the penalty coefficient α(t) in terms of the total number of training rounds epochs2, the current round t, and the constant term C.]
wherein L_total represents the target loss function, L_actor represents the loss function of the control strategy model, L_critic represents the loss function of the evaluation strategy model, α(t) represents a penalty term, DN(s, action) represents the output of the intervention discriminator, α(t)*DN(s, action) represents the intervention result data, epochs2 represents the total number of training iterations of the target reinforcement learning model, t represents the current iteration round, and C represents a constant term.
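A direct transcription of this combined loss is sketched below. Because the α(t) schedule is published only as an image, a simple linear decay over the training rounds is assumed here purely for illustration.

```python
def target_loss(l_actor, l_critic, dn_output, t, epochs2, c=1.0):
    """Target loss: equally weighted actor/critic losses plus a penalty term applied
    to the intervention discriminator output DN(s, action)."""
    alpha_t = c * (1.0 - t / epochs2)  # assumed decay; the published α(t) is not reproduced
    return 0.5 * l_actor + 0.5 * l_critic + alpha_t * dn_output
```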
The reinforcement learning model training device can execute the reinforcement learning model training method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to a reinforcement learning model training method provided in any embodiment of the present disclosure.
In an example, fig. 5 is a structural diagram of a decision device provided in an embodiment of the present disclosure, where the embodiment of the present disclosure is applicable to a situation where a target reinforcement learning model obtained by efficient training using an auxiliary learning network is used for making a decision, and the decision device is implemented by software and/or hardware and is specifically configured in an electronic device. The electronic device may be a terminal device or a server device, and the specific device type of the electronic device is not limited in the embodiments of the present disclosure.
A decision-making apparatus 500 as shown in fig. 5 comprises: a status data acquisition module 510 and a decision data acquisition module 520. Wherein:
a status data obtaining module 510, configured to obtain status data of a target scene;
a decision data obtaining module 520, configured to input the state data into a target reinforcement learning model, so as to obtain decision data for the target scene;
the target reinforcement learning model is obtained through training by any one of the reinforcement learning model training devices.
The decision device can execute the decision method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to a decision method provided in any embodiment of the present disclosure.
In one example, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as a reinforcement learning model training method or a decision method. For example, in some embodiments, the reinforcement learning model training method or the decision-making method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When loaded into RAM 603 and executed by the computing unit 601, a computer program may perform one or more steps of the reinforcement learning model training method or the decision-making method described above. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the reinforcement learning model training method or the decision method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service scalability of traditional physical hosts and VPS services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to the embodiments of the present disclosure, the second agent parameters of the second agent of the target reinforcement learning model in the second scene are determined from the first agent parameters of the first agent obtained by reinforcement learning training in the first scene; the auxiliary learning network is then trained in the second scene according to the second agent parameters, and the target reinforcement learning model is trained according to the trained auxiliary learning network. After training of the target reinforcement learning model is completed, state data of the target scene can be input into the model to obtain decision data for the target scene. This solves the problems of low training efficiency, poor adaptability and limited accuracy in existing reinforcement learning model training methods, improves the training efficiency and adaptability of the reinforcement learning model as well as its accuracy, and thereby improves the accuracy of decisions made based on the reinforcement learning model.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. A reinforcement learning model training method comprises the following steps:
acquiring a first agent parameter of a first agent obtained by performing reinforcement learning training according to a first scene;
determining second agent parameters of a second agent of the target reinforcement learning model in a second scene according to the first agent parameters;
training an auxiliary learning network in the second scene according to the second agent parameters;
and training the target reinforcement learning model according to the target auxiliary learning network obtained by training.
2. The method of claim 1, wherein the first agent parameter comprises a control strategy parameter of the first agent;
determining second agent parameters of a second agent of the target reinforcement learning model in a second scene according to the first agent parameters comprises:
initializing the second agent parameters according to the first agent parameters;
determining a target second agent sub-parameter of the second agent parameters;
freezing the target second agent sub-parameter.
3. The method of claim 1, wherein the training of an auxiliary learning network in the second scenario according to the second agent parameters comprises:
initializing network parameters of the auxiliary learning network according to the second agent parameters;
acquiring local sample data through a local greedy strategy in the second scene;
and training the auxiliary learning network after the network parameters are initialized according to the local sample data.
4. The method of claim 1, wherein the training of an auxiliary learning network in the second scenario according to the second agent parameters comprises:
acquiring local sample data through a local greedy strategy in the second scene;
acquiring target correction sample data;
determining sample data weights of the local sample data and the target correction sample data;
and training the auxiliary learning network according to the local sample data, the target correction sample data and the sample data weight.
5. The method of claim 1, wherein the training of the target reinforcement learning model according to the trained target-assisted learning network comprises:
initializing model parameters in the target reinforcement learning model according to the network parameters of the target auxiliary learning network obtained by training;
training the target reinforcement learning model in the second scenario.
6. The method of claim 5, wherein the model parameters in the target reinforcement learning model comprise control strategy parameters;
initializing model parameters in the target reinforcement learning model according to the trained network parameters of the target auxiliary learning network, wherein the model parameters comprise:
initializing control strategy parameters in the target reinforcement learning model according to the network parameters of the target auxiliary learning network obtained by training; and/or
The model parameters in the target reinforcement learning model comprise evaluation strategy parameters, and the method further comprises:
and initializing the evaluation strategy parameters in the target reinforcement learning model according to preset configuration parameters.
7. The method of claim 5 or 6, wherein the training of the target reinforcement learning model in the second scenario comprises:
acquiring training result data of a second agent in the target reinforcement learning model in the current training round;
performing data screening on the training result data according to the target intervention data to obtain target training result data;
classifying the target training result data according to an intervention discriminator to obtain intervention result data;
and training the target reinforcement learning model according to the target training result data and the intervention result data.
8. The method of claim 7, wherein said training the target reinforcement learning model based on the target training result data and the intervention result data comprises:
determining a target loss function according to the initial loss function and the intervention result data;
taking the target training result data and the intervention result data as agent input data of a second agent in the target reinforcement learning model, and inputting the agent input data into the second agent to obtain agent output data of the second agent;
updating agent parameters of the second agent based on the objective loss function, the agent input data, and the agent output data.
9. The method of claim 8, wherein said determining a target loss function from an initial loss function and the intervention result data comprises:
determining the target loss function based on the following formula:
L_total = 0.5*L_actor + 0.5*L_critic + α(t)*DN(s, action)
[Figure FDA0003308963420000031: image-only formula defining the penalty coefficient α(t) in terms of epochs2, t and C.]
wherein L_total represents the target loss function, L_actor represents the loss function of the control strategy model, L_critic represents the loss function of the evaluation strategy model, α(t) represents a penalty term, DN(s, action) represents the output of the intervention discriminator, α(t)*DN(s, action) represents the intervention result data, epochs2 represents the total number of training iterations of the target reinforcement learning model, t represents the current iteration round, and C represents a constant term.
10. A method of decision making, comprising:
acquiring state data of a target scene;
inputting the state data into a target reinforcement learning model to obtain decision data aiming at the target scene;
wherein the target reinforcement learning model is obtained by training through the reinforcement learning model training method of any one of claims 1-9.
11. A reinforcement learning model training apparatus comprising:
the first agent parameter acquisition module is used for acquiring first agent parameters of a first agent obtained by performing reinforcement learning training according to a first scene;
the second agent parameter determining module is used for determining second agent parameters of a second agent of the target reinforcement learning model in a second scene according to the first agent parameters;
the auxiliary learning network training module is used for training an auxiliary learning network in the second scene according to the second agent parameters;
and the target reinforcement learning model training module is used for training the target reinforcement learning model according to the target auxiliary learning network obtained by training.
12. The apparatus of claim 11, wherein the first agent parameter comprises a control strategy parameter of the first agent; the second agent parameter determination module is further configured to:
initializing the second agent parameters according to the first agent parameters;
determining a target second agent sub-parameter of the second agent parameters;
freezing the target second agent sub-parameter.
13. The apparatus of claim 11, wherein the auxiliary learning network training module is further configured to:
initializing network parameters of the auxiliary learning network according to the second agent parameters;
acquiring local sample data through a local greedy strategy in the second scene;
and training the auxiliary learning network after the network parameters are initialized according to the local sample data.
14. The apparatus of claim 13, wherein the auxiliary learning network training module is further configured to:
acquiring local sample data through a local greedy strategy in the second scene;
acquiring target correction sample data;
determining sample data weights of the local sample data and the target correction sample data;
and training the auxiliary learning network according to the local sample data, the target correction sample data and the sample data weight.
15. The apparatus of claim 11, wherein the target reinforcement learning model training module is further to:
initializing model parameters in the target reinforcement learning model according to the network parameters of the target auxiliary learning network obtained by training;
training the target reinforcement learning model in the second scenario.
16. The apparatus of claim 15, wherein model parameters in the target reinforcement learning model comprise control strategy parameters; the target reinforcement learning model training module is further configured to:
initializing control strategy parameters in the target reinforcement learning model according to the network parameters of the target auxiliary learning network obtained by training; and/or
The model parameters in the target reinforcement learning model comprise evaluation strategy parameters, and the device further comprises an evaluation strategy parameter initialization module for:
and initializing the evaluation strategy parameters in the target reinforcement learning model according to preset configuration parameters.
17. The apparatus of claim 15 or 16, wherein the target reinforcement learning model training module is further to:
acquiring training result data of a second agent in the target reinforcement learning model in the current training round;
performing data screening on the training result data according to the target intervention data to obtain target training result data;
classifying the target training result data according to an intervention discriminator to obtain intervention result data;
and training the target reinforcement learning model according to the target training result data and the intervention result data.
18. The apparatus of claim 17, wherein the target reinforcement learning model training module is further configured to:
determining a target loss function according to the initial loss function and the intervention result data;
taking the target training result data and the intervention result data as agent input data of a second agent in the target reinforcement learning model, and inputting the agent input data into the second agent to obtain agent output data of the second agent;
updating agent parameters of the second agent based on the objective loss function, the agent input data, and the agent output data.
19. The apparatus of claim 18, wherein the target reinforcement learning model training module is further configured to:
determining the target loss function based on the following formula:
L_total = 0.5*L_actor + 0.5*L_critic + α(t)*DN(s, action)
[Figure FDA0003308963420000051: image-only formula defining the penalty coefficient α(t) in terms of epochs2, t and C.]
wherein L_total represents the target loss function, L_actor represents the loss function of the control strategy model, L_critic represents the loss function of the evaluation strategy model, α(t) represents a penalty term, DN(s, action) represents the output of the intervention discriminator, α(t)*DN(s, action) represents the intervention result data, epochs2 represents the total number of training iterations of the target reinforcement learning model, t represents the current iteration round, and C represents a constant term.
20. A decision-making device, comprising:
the state data acquisition module is used for acquiring state data of a target scene;
the decision data acquisition module is used for inputting the state data into a target reinforcement learning model to obtain decision data for the target scene;
wherein the target reinforcement learning model is obtained by training through the reinforcement learning model training device of any one of claims 11-19.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the reinforcement learning model training method of any one of claims 1-9 or the decision-making method of claim 10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the reinforcement learning model training method of any one of claims 1-9 or the decision method of claim 10.
23. A computer program product comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the reinforcement learning model training method of any one of claims 1-9 or the decision method of claim 10.
CN202111211093.9A 2021-10-18 2021-10-18 Reinforced learning model training method, decision-making method, device, equipment and medium Pending CN113962362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111211093.9A CN113962362A (en) 2021-10-18 2021-10-18 Reinforced learning model training method, decision-making method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111211093.9A CN113962362A (en) 2021-10-18 2021-10-18 Reinforced learning model training method, decision-making method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113962362A true CN113962362A (en) 2022-01-21

Family

ID=79464321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111211093.9A Pending CN113962362A (en) 2021-10-18 2021-10-18 Reinforced learning model training method, decision-making method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113962362A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841338A (en) * 2022-04-06 2022-08-02 北京百度网讯科技有限公司 Method for training model parameters, decision determination method and device and electronic equipment
CN114841338B (en) * 2022-04-06 2023-08-18 北京百度网讯科技有限公司 Model parameter training method, decision determining device and electronic equipment
CN114785731A (en) * 2022-04-25 2022-07-22 中国电信股份有限公司 Model determination system and method, computer-readable storage medium
CN115456168B (en) * 2022-09-05 2023-08-25 北京百度网讯科技有限公司 Training method of reinforcement learning model, energy consumption determining method and device
CN117195705A (en) * 2023-08-30 2023-12-08 西安科技大学 Device automatic design method and device based on reinforcement learning and storage medium

Similar Documents

Publication Publication Date Title
CN113962362A (en) Reinforced learning model training method, decision-making method, device, equipment and medium
US10467528B2 (en) Accelerated TR-L-BFGS algorithm for neural network
CN112560996B (en) User portrait identification model training method, device, readable storage medium and product
CN113516248B (en) Quantum gate testing method and device and electronic equipment
CN110992432B (en) Depth neural network-based minimum variance gradient quantization compression and image processing method
CN113642711B (en) Processing method, device, equipment and storage medium of network model
CN115860128B (en) Quantum circuit operation method and device and electronic equipment
WO2023124342A1 (en) Low-cost automatic neural architecture search method for image classification
CN116307215A (en) Load prediction method, device, equipment and storage medium of power system
WO2023020456A1 (en) Network model quantification method and apparatus, device, and storage medium
CN107240100B (en) Image segmentation method and system based on genetic algorithm
CN115759209B (en) Quantification method and device of neural network model, electronic equipment and medium
CN115392441A (en) Method, apparatus, device and medium for on-chip adaptation of quantized neural network model
CN114444606A (en) Model training and data classification method and device
CN115034388B (en) Determination method and device for quantization parameters of ranking model and electronic equipment
CN116187473B (en) Federal learning method, apparatus, electronic device, and computer-readable storage medium
CN115293329A (en) Parameter updating method, device, equipment and storage medium
CN112001476A (en) Data processing method, device and computer readable storage medium
CN112528123A (en) Model searching method, model searching apparatus, electronic device, storage medium, and program product
CN114612784A (en) Target detection network training method, device, equipment and storage medium
CN117953339A (en) Visual model training and image processing method, device, equipment and storage medium
CN117873999A (en) Adaptive database optimization method based on deep reinforcement learning
CN117077757A (en) Tool image classification model compression method, device, computer equipment and storage medium
CN117875603A (en) Flywheel array power matching collaborative optimization method and device based on power loss
CN116263762A (en) Data compression method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination