CN116702872A - Reinforcement learning method and device based on an offline pre-trained state transition Transformer model - Google Patents

Reinforcement learning method and device based on an offline pre-trained state transition Transformer model

Info

Publication number
CN116702872A
CN116702872A (application CN202310737435.3A)
Authority
CN
China
Prior art keywords
state transition
state
reinforcement learning
training
Transformer model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310737435.3A
Other languages
Chinese (zh)
Inventor
卢宗青
周伯涵
李可
姜杰川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Peking University
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Peking University
Priority to CN202310737435.3A priority Critical patent/CN116702872A/en
Publication of CN116702872A publication Critical patent/CN116702872A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/7753Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60Methods for processing data by generating or executing the game program
    • A63F2300/6027Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a reinforcement learning method based on an offline pre-trained state transition Transformer model, belonging to the technical field of artificial intelligence. The method comprises: obtaining a state transition Transformer model through offline pre-training on video observation data, so that the model predicts the next state from the input current state and produces a discrimination score for the state transition from the current state to the next state; and using the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy. Compared with baseline algorithms, the proposed method has higher robustness, sample efficiency and performance, and has high research value in fields such as robot control and autonomous driving.

Description

Reinforcement learning method and device based on an offline pre-trained state transition Transformer model
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a reinforcement learning method and device based on an offline pre-trained state transition Transformer model.
Background
Training reinforcement learning policies from visual observations is a challenging problem. The main difficulties are high-dimensional inputs that demand large amounts of computational resources, the lack of explicit action information, the complexity of visual data, which requires powerful feature-extraction techniques, and temporal dependencies.
Some existing training methods adopt online reinforcement learning from scratch. Such methods have low sample efficiency and struggle with effective or hard exploration, and adversarial methods that learn a discriminator online are easily affected by misclassification caused by noise or local changes in the visual observations. Other methods train reinforcement learning policies only for specific tasks; their generalization ability is weak and they are not suited to open-ended tasks. The application range of existing learning-from-observation methods is therefore limited: many are only suitable for vector-based observation environments and do not work well when applied to high-dimensional visual observations or video games.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme.
The first aspect of the present invention provides a reinforcement learning method based on an offline pre-trained state transition Transformer model, comprising:
performing offline pre-training on a Transformer model based on video observation data to obtain a state transition Transformer model, so that the state transition Transformer model predicts the next state from the input current state and obtains a discrimination score for the state transition from the current state to the next state;
and using the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy.
Preferably, the current state is extracted from the video as follows: in the Atari environment, the current state is obtained by stacking four adjacent frames of observation data from the video; in the Minecraft environment, the current state is the current observation frame of the video.
Preferably, performing offline pre-training on the Transformer model based on the video observation data to obtain the state transition Transformer model comprises:
inputting the states of two adjacent time steps into a feature encoder to obtain the corresponding state representations e_t and e_{t+1};
inputting the state representation e_t into the Transformer model to predict the next state representation ê_{t+1};
inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into a state transition discriminator to obtain the discrimination score of the true state transition from e_t to e_{t+1} and the discrimination score of the false state transition from e_t to ê_{t+1}, respectively;
performing iterative training so that the discrimination score of the true state transition increases and the discrimination score of the false state transition decreases, until the training objective is reached.
Preferably, the reinforcement learning agent performing policy learning and iteration according to the intrinsic rewards comprises: the reinforcement learning agent interacts with the environment and, driven by the intrinsic rewards computed by the state transition Transformer model, iteratively updates the policy by maximizing the following objective J, finally obtaining the optimal policy:

J(π) = E_{s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_t γ^t r(s_t, s_{t+1}) ]

where π denotes the policy, ρ_0 denotes the initial state distribution, a_t denotes the action executed in the current state s_t according to the policy distribution π(·|s_t), (s_t, s_{t+1}) denotes the transition from the state at the current time step to the state at the next time step, P denotes the state transition function, γ is the discount factor, r(s_t, s_{t+1}) denotes the intrinsic reward given by the state transition Transformer model for (s_t, s_{t+1}), E denotes the expectation, and J is the objective to be maximized, i.e., the expected sum of discounted rewards.
Preferably, inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination scores of the true state transition from e_t to e_{t+1} and of the false state transition from e_t to ê_{t+1} further comprises: calculating the difference between the discrimination score of the false state transition and the discrimination score of the true state transition to obtain the gap between the true state transition and the false state transition;
and the step of using the state transition Transformer model to obtain the discrimination score of a state transition in reinforcement learning as the intrinsic reward of reinforcement learning is then: using the gap between the true state transition and the false state transition as the intrinsic reward for reinforcement learning.
Preferably, a self-supervised temporal distance prediction method is used to learn temporally consistent feature representations of the state observations, and an adversarial learning method is adopted so that, guided by the discriminator's scores, the single-step transition law is accurately predicted in the space of the feature representations.
Preferably, the state transitions in reinforcement learning are obtained as follows: in reinforcement learning, the agent obtains the current state of the environment, selects an action to execute based on the current state according to the policy, and interacts with the environment by executing the selected action, thereby generating a state transition.
The second aspect of the present invention provides a reinforcement learning device based on an offline pre-trained state transition Transformer model, comprising:
the state transition converter model offline pre-training module is used for carrying out offline pre-training on the converter model based on the observation data of the video to obtain a state transition converter model, so that the state transition converter model predicts the next state according to the input current state and obtains a discrimination score of state transition from the current state to the next state;
and a reinforcement learning policy training module, configured to use the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy.
A third aspect of the present invention provides a memory storing a plurality of instructions for implementing the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the first aspect.
A fourth aspect of the present invention provides an electronic device comprising a processor and a memory connected to the processor, the memory storing a plurality of instructions that can be loaded and executed by the processor to enable the processor to perform the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the first aspect.
The beneficial effects of the invention are as follows: the invention provides a two-stage reinforcement learning method based on an offline pre-trained state transition Transformer model, offering an innovative way for an agent to learn effectively from visual observations. The state transition Transformer model can be pre-trained offline from visual observations alone and then guide the training of an online reinforcement learning policy without any environment reward. In addition, self-attention is integrated into each module to capture temporal variation, and latent transitions are jointly predicted by the state transition discriminator and self-supervised temporal regression, thereby improving performance on downstream reinforcement learning tasks. Experiments with policies trained in various Atari and Minecraft environments show that the proposed method is more robust, more sample-efficient and better-performing than baseline algorithms. Moreover, on some tasks it even achieves performance comparable to policies learned from explicit environment rewards. For reinforcement learning from visual observation, the proposed method has great potential in situations where video demonstrations are available but environment interaction is limited and labeling actions is expensive or dangerous; for example, it has high research value in fields such as robot control and autonomous driving.
Drawings
FIG. 1 is a flow chart of the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the present invention;
FIG. 2 is a schematic diagram of the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the present invention;
Fig. 3 is a schematic functional block diagram of the reinforcement learning device based on an offline pre-trained state transition Transformer model according to the present invention.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The method provided by the invention can be implemented in a terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory and a display screen. The memory stores at least one instruction that is loaded and executed by the processor to implement the method described in the embodiments below.
The processor may include one or more processing cores. The processor connects the various parts of the terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory and by invoking the data stored in the memory.
The memory may include random access memory (RAM) or read-only memory (ROM). The memory may be used to store instructions, programs, code, code sets or instruction sets.
The display screen is used for displaying a user interface of each application program.
In addition, it will be appreciated by those skilled in the art that the structure of the terminal described above is not limiting and that the terminal may include more or fewer components, or may combine certain components, or a different arrangement of components. For example, the terminal further includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and the like, which are not described herein.
Example 1
As shown in fig. 1 and 2, an embodiment of the present invention provides a reinforcement learning method based on an offline pre-trained state transition Transformer model, comprising:
s101, performing offline pre-training on a transducer model based on video observation data to obtain a state transition transducer model, so that the state transition transducer model predicts the next state according to the input current state and obtains a discrimination score of state transition from the current state to the next state;
s102, using the state transition transducer model to obtain a discrimination score of state transition in reinforcement learning as an intrinsic reward, so that an agent for reinforcement learning performs policy learning and iteration according to the intrinsic reward to obtain an optimal policy.
The reinforcement learning method based on the offline pre-trained state transition Transformer model comprises two stages. In the first stage (stage one, offline pre-training), a state transition Transformer model is obtained by offline pre-training on video observation data; it can effectively capture the information in the demonstration videos to predict latent (hidden-layer) transitions of the observed states.
In the second stage (stage two, online reinforcement learning), the state transition Transformer model obtained in the first stage is used to provide intrinsic rewards for the downstream reinforcement learning task, and the agent can learn and iterate its policy from the intrinsic rewards alone, without the guidance of environment rewards.
In step S101, the current state may be extracted from the video as follows: in the Atari environment, the current state is obtained by stacking four adjacent frames of observation data from the video; in the Minecraft environment, the current state is the current observation frame of the video. As shown in FIG. 2, the current states s_t and s_{t+1} of two adjacent time steps are each obtained by stacking four adjacent frames of observation data.
The Atari environment is a classic arcade game environment; since each task can be modeled as a Markov decision process, it is a popular test bed for applying reinforcement learning algorithms to visual control tasks. In the Atari environment, to ensure that the state reflects the dynamic information of the game, the current state is obtained by stacking the grayscale game frames observed at four adjacent time steps.
The Minecraft environment is a 3D game environment that has grown rapidly in recent years; MineDojo provides a simulator interface with thousands of open-ended exploration tasks. The performance of an agent completing tasks in complex Minecraft scenarios reflects the capability of an algorithm more comprehensively. Since the MineDojo simulator only supports single-frame observation state transitions, to align with the three-dimensional state representation in Atari, the state in the Minecraft environment is defined as the three-channel first-person-view image currently observed by the agent.
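As a brief, non-authoritative illustration of the state construction just described, the sketch below (the class and function names, and the (84, 84) frame shape taken from the preprocessing described next, are our own assumptions, not the patented implementation) stacks the four most recent Atari frames into one state and passes a single Minecraft frame through as a channel-first tensor:

```python
from collections import deque

import numpy as np


class FrameStacker:
    """Builds the Atari-style state by stacking the four most recent grayscale frames."""

    def __init__(self, num_frames: int = 4, frame_shape=(84, 84)):
        # Pre-fill with blank frames so a full (4, 84, 84) state exists from the first step.
        self.frames = deque(
            [np.zeros(frame_shape, dtype=np.float32)] * num_frames, maxlen=num_frames
        )

    def push(self, frame: np.ndarray) -> np.ndarray:
        """Append the newest observation frame and return the stacked state of shape (4, 84, 84)."""
        self.frames.append(frame.astype(np.float32))
        return np.stack(self.frames, axis=0)


def minecraft_state(frame: np.ndarray) -> np.ndarray:
    """In the Minecraft setting the state is the current first-person RGB frame,
    transposed to channel-first layout (3, H, W) for a convolutional encoder."""
    return np.transpose(frame.astype(np.float32), (2, 0, 1))
```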
In the embodiment of the invention, the video observation data can be obtained according to the following method:
The observation data for the Atari environment comes from Google Dopamine (an open-source reinforcement learning framework from Google). For each Atari task, the observation data set consists of the last 100,000 grayscale game frames stored in the experience replay pool, resized to (84, 84), after 50 rounds of training with DQN (the Deep Q-Network algorithm).
The observation data for the Minecraft environment comes from the related work Plan4MC (a planning-based method for solving open-ended Minecraft tasks). First, a Plan4MC agent is trained; the learned expert policy is then used to collect 50,000 first-person game frames of size (160, 256, 3), forming an expert observation data set.
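The frames collected in this way can then be packaged into pairs of adjacent observations for the offline pre-training stage described next. The following sketch is illustrative only (the class name is hypothetical) and assumes the frames have already been preprocessed into a channel-first array:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class AdjacentPairDataset(Dataset):
    """Turns a stored sequence of observation frames of shape (N, C, H, W)
    into (o_t, o_{t+1}) pairs for offline pre-training."""

    def __init__(self, frames: np.ndarray):
        # Normalize pixel values to [0, 1].
        self.frames = torch.as_tensor(frames, dtype=torch.float32) / 255.0

    def __len__(self) -> int:
        return len(self.frames) - 1

    def __getitem__(self, i):
        return self.frames[i], self.frames[i + 1]
```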
Step S101 is performed. As shown in fig. 2, performing offline pre-training on the Transformer model based on the video observation data to obtain the state transition Transformer model may include:
the state of two adjacent time stepsAnd->Respectively inputting the state representations into a feature encoder to obtain corresponding state representations e t And e t+1
Characterizing the state e t Inputting the obtained data into a transducer model, and predicting to obtain the next state representation
Will e t 、e t+1 Andrespectively input into a state transition discriminator to obtain a slave e t To e t+1 Discrimination score of true state transition of (c) and e) t To->Discrimination scores for false state transitions of (a);
iterative training is performed, so that the discrimination score of the real state transition is as high as possible, and the discrimination score of the false state transition is as low as possible until the training target is reached.
Further, inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination scores of the true and false state transitions further comprises: calculating the difference between the discrimination score of the false state transition and the discrimination score of the true state transition to obtain the gap between the true state transition and the false state transition;
and the step of using the state transition Transformer model to obtain the discrimination score of a state transition in reinforcement learning as the intrinsic reward of reinforcement learning is then: using the gap between the true state transition and the false state transition as the intrinsic reward for reinforcement learning.
In addition, in the embodiment of the invention, a self-supervised temporal distance prediction method is used to learn temporally consistent feature representations of the state observations, while an adversarial learning method is adopted so that, guided by the discriminator's scores, the single-step transition law is accurately predicted in the space of the feature representations. In this way, the pre-trained state transition Transformer and discriminator provide intrinsic rewards for the observation sequences collected online during reinforcement learning, thereby improving performance on the downstream reinforcement learning task.
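A minimal PyTorch-style sketch of this offline pre-training stage is given below. It is an illustrative reading of the description rather than the patented implementation: the module names, network sizes, and the use of a GAN-style binary cross-entropy loss are assumptions, the predictor/encoder update and the self-supervised temporal-distance head are only indicated in comments, and a length-one sequence is fed to the Transformer for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Convolutional feature encoder mapping a stacked observation to a state representation e_t."""

    def __init__(self, in_channels: int = 4, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(embed_dim)

    def forward(self, obs):
        return self.fc(self.conv(obs))


class TransitionTransformer(nn.Module):
    """Transformer that predicts the next state representation ê_{t+1} from e_t."""

    def __init__(self, embed_dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(embed_dim, embed_dim)

    def forward(self, e_t):
        h = self.transformer(e_t.unsqueeze(1))  # length-1 sequence in this sketch
        return self.head(h.squeeze(1))


class TransitionDiscriminator(nn.Module):
    """Scores a transition (e_t, e_next); a higher score means 'more like a real transition'."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * embed_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, e_t, e_next):
        return self.net(torch.cat([e_t, e_next], dim=-1)).squeeze(-1)


def discriminator_step(encoder, predictor, discriminator, opt_d, obs_t, obs_t1):
    """One adversarial update of the discriminator on a batch of adjacent video observations.
    The predictor (and encoder) would be updated in a separate step so that its predicted
    transitions score higher; the self-supervised temporal-distance loss on the encoder is omitted."""
    e_t, e_t1 = encoder(obs_t), encoder(obs_t1)
    e_t1_hat = predictor(e_t).detach()          # predicted ("false") next representation
    real_score = discriminator(e_t, e_t1)       # score of the true transition (e_t, e_{t+1})
    fake_score = discriminator(e_t, e_t1_hat)   # score of the false transition (e_t, ê_{t+1})
    # Push true-transition scores up and false-transition scores down.
    loss = F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score)) + \
           F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()
```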
Step S102 is executed. In reinforcement learning, the agent obtains the current state of the environment, selects an action to execute based on the current state according to the policy, and interacts with the environment by executing the selected action, thereby generating a state transition. The state transition Transformer model obtained by the training in step S101 is used to obtain the discrimination score of this state transition as the intrinsic reward of reinforcement learning, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic reward to obtain the optimal policy.
Specifically, as shown in fig. 2, during reinforcement learning policy training the agent obtains the current state of the environment from visual observation; the agent's policy π_θ then selects an action a_t to execute based on the current state, and the agent interacts with the environment by executing the selected action a_t, producing a state transition. The state transition Transformer model obtained by offline pre-training is then used to compute an intrinsic reward r_t for the state transition generated by the agent-environment interaction. Finally, the agent updates the policy π_θ according to the reward r_t. This iterative training continues until the optimal policy is obtained.
In the embodiment of the present invention, the reinforcement learning agent performing policy learning and iteration according to the intrinsic rewards may include: the reinforcement learning agent interacts with the environment and, driven by the intrinsic rewards computed by the state transition Transformer model, iteratively updates the policy by maximizing the following objective J, finally obtaining the optimal policy:

J(π) = E_{s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_t γ^t r(s_t, s_{t+1}) ]

where π denotes the policy, ρ_0 denotes the initial state distribution, a_t denotes the action executed in the current state s_t according to the policy distribution π(·|s_t), (s_t, s_{t+1}) denotes the transition from the state at the current time step to the state at the next time step, P denotes the state transition function, γ is the discount factor, r(s_t, s_{t+1}) denotes the intrinsic reward given by the state transition Transformer model for (s_t, s_{t+1}), E denotes the expectation, and J is the objective to be maximized, i.e., the expected sum of discounted rewards.
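To make the training loop of step S102 concrete, the sketch below (building on the hypothetical modules above) derives an intrinsic reward from the frozen pre-trained model and plugs it into a generic REINFORCE-style update. The exact sign and scaling of the reward gap and the choice of policy-optimization algorithm are assumptions made for illustration only; the patent does not tie the method to a particular RL algorithm, and a classic Gym-style environment API is assumed.

```python
import torch


@torch.no_grad()
def intrinsic_reward(encoder, predictor, discriminator, obs_t, obs_t1):
    """Intrinsic reward r(s_t, s_{t+1}) from the frozen pre-trained model:
    the gap between the score of the agent's actual transition and the score
    of the model-predicted ("false") transition for the same starting state."""
    e_t, e_t1 = encoder(obs_t), encoder(obs_t1)
    real_score = discriminator(e_t, e_t1)
    fake_score = discriminator(e_t, predictor(e_t))
    return real_score - fake_score


def reinforce_episode(env, policy, encoder, predictor, discriminator, opt_pi, gamma=0.99):
    """One episode driven purely by the intrinsic reward; the environment reward is ignored.
    `policy(obs)` is assumed to return a torch.distributions.Distribution over actions."""
    obs = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        dist = policy(obs_t)
        action = dist.sample()
        next_obs, _, done, _ = env.step(action.item())
        obs_t1 = torch.as_tensor(next_obs, dtype=torch.float32).unsqueeze(0)
        rewards.append(intrinsic_reward(encoder, predictor, discriminator, obs_t, obs_t1).item())
        log_probs.append(dist.log_prob(action))
        obs = next_obs
    # Discounted returns G_t = sum_k gamma^k * r_{t+k}, matching the objective J above.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)
    loss = -(torch.cat(log_probs) * returns).sum()  # maximize the expected discounted intrinsic reward
    opt_pi.zero_grad()
    loss.backward()
    opt_pi.step()
    return sum(rewards)
```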
Example two
As shown in fig. 3, another aspect of the present invention provides a functional module architecture fully consistent with the foregoing method flow; that is, the embodiment of the present invention further provides a reinforcement learning device based on an offline pre-trained state transition Transformer model, comprising:
the state transition converter model offline pre-training module 201 is configured to perform offline pre-training on the converter model based on the video observation data to obtain a state transition converter model, so that the state transition converter model predicts according to the input current state to obtain a next state, and obtains a discrimination score of state transition from the current state to the next state;
the reinforcement learning strategy training module 202 is configured to obtain a discrimination score of a state transition in reinforcement learning as an intrinsic reward by using the state transition transducer model, so that an agent for reinforcement learning performs strategy learning and iteration according to the intrinsic reward to obtain an optimal strategy.
In the state transition Transformer model offline pre-training module 201, the current state is extracted from the video as follows: in the Atari environment, the current state is obtained by stacking four adjacent frames of observation data from the video; in the Minecraft environment, the current state is the current observation frame of the video.
Further, performing offline pre-training on the Transformer model based on the video observation data to obtain the state transition Transformer model comprises:
inputting the states of two adjacent time steps into the feature encoder to obtain the corresponding state representations e_t and e_{t+1};
inputting the state representation e_t into the Transformer model to predict the next state representation ê_{t+1};
inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination score of the true state transition from e_t to e_{t+1} and the discrimination score of the false state transition from e_t to ê_{t+1}, respectively;
performing iterative training so that the discrimination score of the true state transition becomes as high as possible and the discrimination score of the false state transition becomes as low as possible, until the training objective is reached.
Inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination scores of the true and false state transitions further comprises: calculating the difference between the discrimination score of the false state transition and the discrimination score of the true state transition to obtain the gap between the true state transition and the false state transition;
and the step of using the state transition Transformer model to obtain the discrimination score of a state transition in reinforcement learning as the intrinsic reward of reinforcement learning is then: using the gap between the true state transition and the false state transition as the intrinsic reward for reinforcement learning.
A self-supervised temporal distance prediction method is used to learn temporally consistent feature representations of the state observations, and an adversarial learning method is adopted so that, guided by the discriminator's scores, the single-step transition law is accurately predicted in the space of the feature representations.
In the reinforcement learning policy training module 202, the state transitions in reinforcement learning are obtained as follows: in reinforcement learning, the agent obtains the current state of the environment, selects an action to execute based on the current state according to the policy, and interacts with the environment by executing the selected action, thereby generating a state transition.
Further, the reinforcement learning agent performing policy learning and iteration according to the intrinsic rewards to obtain the optimal policy comprises: the reinforcement learning agent interacts with the environment and, driven by the intrinsic rewards computed by the state transition Transformer model, iteratively updates the policy by maximizing the following objective J, finally obtaining the optimal policy:

J(π) = E_{s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_t γ^t r(s_t, s_{t+1}) ]

where π denotes the policy, ρ_0 denotes the initial state distribution, a_t denotes the action executed in the current state s_t according to the policy distribution π(·|s_t), (s_t, s_{t+1}) denotes the transition from the state at the current time step to the state at the next time step, P denotes the state transition function, γ is the discount factor, r(s_t, s_{t+1}) denotes the intrinsic reward given by the state transition Transformer model for (s_t, s_{t+1}), E denotes the expectation, and J is the objective to be maximized, i.e., the expected sum of discounted rewards.
The device may be implemented according to the reinforcement learning method based on the offline pre-trained state transition Transformer model provided in the first embodiment; the specific implementation is described in the first embodiment and is not repeated here.
The invention also provides a memory storing a plurality of instructions for implementing the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the first embodiment.
The invention also provides an electronic device comprising a processor and a memory connected to the processor, the memory storing a plurality of instructions that can be loaded and executed by the processor to enable the processor to perform the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the first embodiment.
With the technical scheme provided by the invention, reinforcement learning is performed from visual observations. The method has great potential in situations where video demonstrations are available but environment interaction is limited and labeling actions is expensive or dangerous; for example, it has high research value in fields such as robot control and autonomous driving.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A reinforcement learning method based on an offline pre-trained state transition Transformer model, characterized by comprising:
performing offline pre-training on a Transformer model based on video observation data to obtain a state transition Transformer model, so that the state transition Transformer model predicts the next state from the input current state and obtains a discrimination score for the state transition from the current state to the next state;
and using the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy.
2. The method of claim 1, wherein the current state is extracted from the video as follows: in the Atari environment, the current state is obtained by stacking four adjacent frames of observation data from the video; in the Minecraft environment, the current state is the current observation frame of the video.
3. The reinforcement learning method based on an offline pre-trained state transition Transformer model according to claim 1, wherein performing offline pre-training on the Transformer model based on the video observation data to obtain the state transition Transformer model comprises:
inputting the states of two adjacent time steps into a feature encoder to obtain the corresponding state representations e_t and e_{t+1};
inputting the state representation e_t into the Transformer model to predict the next state representation ê_{t+1};
inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into a state transition discriminator to obtain the discrimination score of the true state transition from e_t to e_{t+1} and the discrimination score of the false state transition from e_t to ê_{t+1}, respectively;
performing iterative training so that the discrimination score of the true state transition increases and the discrimination score of the false state transition decreases, until the training objective is reached.
4. The reinforcement learning method based on an offline pre-trained state transition Transformer model according to claim 3, wherein inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination scores of the true and false state transitions further comprises: calculating the difference between the discrimination score of the false state transition and the discrimination score of the true state transition to obtain the gap between the true state transition and the false state transition;
and the step of using the state transition Transformer model to obtain the discrimination score of a state transition in reinforcement learning as the intrinsic reward of reinforcement learning is: using the gap between the true state transition and the false state transition as the intrinsic reward for reinforcement learning.
5. The reinforcement learning method based on an offline pre-trained state transition Transformer model according to claim 3, wherein a self-supervised temporal distance prediction method is used to learn temporally consistent feature representations of the state observations, and an adversarial learning method is adopted so that, guided by the discriminator's scores, the single-step transition law is accurately predicted in the space of the feature representations.
6. The reinforcement learning method based on an offline pre-trained state transition Transformer model according to claim 1, wherein the state transitions in reinforcement learning are obtained as follows: in reinforcement learning, the agent obtains the current state of the environment, selects an action to execute based on the current state according to the policy, and interacts with the environment by executing the selected action, thereby generating a state transition.
7. The method of claim 1, wherein the reinforcement learning agent performing policy learning and iteration according to the intrinsic rewards to obtain the optimal policy comprises: the reinforcement learning agent interacts with the environment and, driven by the intrinsic rewards computed by the state transition Transformer model, iteratively updates the policy by maximizing the following objective J, finally obtaining the optimal policy:

J(π) = E_{s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_t γ^t r(s_t, s_{t+1}) ]

where π denotes the policy, ρ_0 denotes the initial state distribution, a_t denotes the action executed in the current state s_t according to the policy distribution π(·|s_t), (s_t, s_{t+1}) denotes the transition from the state at the current time step to the state at the next time step, P denotes the state transition function, γ is the discount factor, r(s_t, s_{t+1}) denotes the intrinsic reward given by the state transition Transformer model for (s_t, s_{t+1}), E denotes the expectation, and J is the objective to be maximized, i.e., the expected sum of discounted rewards.
8. A reinforcement learning device based on an offline pre-trained state transition Transformer model, characterized by comprising:
a state transition Transformer model offline pre-training module, configured to perform offline pre-training on a Transformer model based on video observation data to obtain a state transition Transformer model, so that the state transition Transformer model predicts the next state from the input current state and obtains a discrimination score for the state transition from the current state to the next state;
and a reinforcement learning policy training module, configured to use the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy.
9. A memory having stored thereon a plurality of instructions for implementing the reinforcement learning method based on an offline pre-trained state transition Transformer model according to any one of claims 1-7.
10. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the reinforcement learning method based on an offline pre-trained state transition Transformer model according to any one of claims 1-7.
CN202310737435.3A 2023-06-20 2023-06-20 Reinforcement learning method and device based on an offline pre-trained state transition Transformer model Pending CN116702872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310737435.3A CN116702872A (en) 2023-06-20 2023-06-20 Reinforced learning method and device based on offline pre-training state transition transducer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310737435.3A CN116702872A (en) 2023-06-20 2023-06-20 Reinforced learning method and device based on offline pre-training state transition transducer model

Publications (1)

Publication Number Publication Date
CN116702872A true CN116702872A (en) 2023-09-05

Family

ID=87825438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310737435.3A Pending CN116702872A (en) 2023-06-20 2023-06-20 Reinforced learning method and device based on offline pre-training state transition transducer model

Country Status (1)

Country Link
CN (1) CN116702872A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933346A (en) * 2024-03-25 2024-04-26 之江实验室 Instant rewarding learning method based on self-supervision reinforcement learning
CN117953351A (en) * 2024-03-27 2024-04-30 之江实验室 Decision method based on model reinforcement learning


Similar Documents

Publication Publication Date Title
Greydanus et al. Visualizing and understanding atari agents
Mousavi et al. Deep reinforcement learning: an overview
Lei et al. Dynamic path planning of unknown environment based on deep reinforcement learning
CN107403426B (en) Target object detection method and device
CN105637540B (en) Method and apparatus for reinforcement learning
US20180268292A1 (en) Learning efficient object detection models with knowledge distillation
CN116702872A (en) Reinforcement learning method and device based on an offline pre-trained state transition Transformer model
CN111144580B (en) Hierarchical reinforcement learning training method and device based on imitation learning
CN110770759B (en) Neural network system
de la Cruz et al. Pre-training with non-expert human demonstration for deep reinforcement learning
CN111507378A (en) Method and apparatus for training image processing model
US11580378B2 (en) Reinforcement learning for concurrent actions
CN111602144A (en) Generating neural network system for generating instruction sequences to control agents performing tasks
Zieliński et al. 3D robotic navigation using a vision-based deep reinforcement learning model
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
EP2363251A1 (en) Robot with Behavioral Sequences on the basis of learned Petri Net Representations
CN111902812A (en) Electronic device and control method thereof
Bertoin et al. Local feature swapping for generalization in reinforcement learning
CN113407820B (en) Method for processing data by using model, related system and storage medium
Chen et al. Toward a brain-inspired system: Deep recurrent reinforcement learning for a simulated self-driving agent
Ji et al. Improving decision-making efficiency of image game based on deep Q-learning
Shao et al. Visual navigation with actor-critic deep reinforcement learning
CN115797517B (en) Data processing method, device, equipment and medium of virtual model
CN112121419A (en) Virtual object control method, device, electronic equipment and storage medium
Saito et al. Python reinforcement learning projects: eight hands-on projects exploring reinforcement learning algorithms using TensorFlow

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination