CN116702872A - Reinforcement learning method and device based on an offline pre-trained state transition Transformer model - Google Patents

Reinforcement learning method and device based on an offline pre-trained state transition Transformer model

Info

Publication number
CN116702872A
CN116702872A (application CN202310737435.3A)
Authority
CN
China
Prior art keywords
state transition
state
reinforcement learning
training
Transformer model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310737435.3A
Other languages
Chinese (zh)
Inventor
卢宗青
周伯涵
李可
姜杰川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Peking University
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Peking University
Priority to CN202310737435.3A priority Critical patent/CN116702872A/en
Publication of CN116702872A publication Critical patent/CN116702872A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/7753Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60Methods for processing data by generating or executing the game program
    • A63F2300/6027Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a reinforcement learning method based on an offline pre-trained state transition Transformer model, belonging to the technical field of artificial intelligence. The method comprises: obtaining a state transition Transformer model through offline pre-training on video observation data, so that the model predicts the next state from the input current state and produces a discrimination score for the state transition from the current state to the next state; and using the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy. Compared with baseline algorithms, the proposed method has higher robustness, sample efficiency and performance, and has high research value in fields such as robot control and autonomous driving.

Description

Reinforcement learning method and device based on an offline pre-trained state transition Transformer model
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a reinforcement learning method and device based on an offline pre-trained state transition Transformer model.
Background
Training reinforcement learning policies from visual observations is a challenging problem. The main difficulties are high-dimensional inputs that demand large amounts of computational resources, the lack of explicit action information, the complexity of visual data, which requires powerful feature-extraction techniques, and temporal dependencies.
Some existing training methods adopt online reinforcement learning from scratch. Such methods have low sample efficiency and struggle with effective or hard exploration, and adversarial methods that learn a discriminator online are easily affected by misclassification caused by noise or local changes in the visual observations. Other methods train reinforcement learning policies only for specific tasks; their generalization ability is weak and they are not suited to open-ended tasks. The application range of existing learning-from-observation methods is therefore limited: many are only suitable for vector-based observation environments and do not work well when applied to high-dimensional visual observations or video games.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme.
The first aspect of the present invention provides a reinforcement learning method based on an offline pre-trained state transition Transformer model, comprising:
performing offline pre-training on a Transformer model based on video observation data to obtain a state transition Transformer model, so that the state transition Transformer model predicts the next state from the input current state and obtains a discrimination score for the state transition from the current state to the next state;
and using the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy.
Preferably, the current state is extracted from the video as follows: in the Atari environment, the current state is obtained by stacking four adjacent frames of observation data from the video; in the Minecraft environment, the current state is the current observation frame of the video.
Preferably, performing offline pre-training on the Transformer model based on the video observation data to obtain the state transition Transformer model comprises:
inputting the states of two adjacent time steps into a feature encoder to obtain the corresponding state representations e_t and e_{t+1};
inputting the state representation e_t into the Transformer model to predict the next state representation ê_{t+1};
inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into a state transition discriminator to obtain the discrimination score of the true state transition from e_t to e_{t+1} and the discrimination score of the false state transition from e_t to ê_{t+1}, respectively;
performing iterative training so that the discrimination score of the true state transition increases and the discrimination score of the false state transition decreases, until the training objective is reached.
Preferably, the reinforcement learning agent performing policy learning and iteration according to the intrinsic rewards comprises: the reinforcement learning agent interacts with the environment and, driven by the intrinsic rewards computed by the state transition Transformer model, iteratively updates the policy by maximizing the following objective J, finally obtaining the optimal policy:

J(π) = E_{s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_t γ^t r(s_t, s_{t+1}) ]

where π denotes the policy, ρ_0 denotes the initial state distribution, a_t denotes the action executed in the current state s_t according to the policy distribution π(·|s_t), (s_t, s_{t+1}) denotes the transition from the state at the current time step to the state at the next time step, P denotes the state transition function, γ is the discount factor, r(s_t, s_{t+1}) denotes the intrinsic reward given by the state transition Transformer model for (s_t, s_{t+1}), E denotes the expectation, and J is the objective to be maximized, i.e., the expected sum of discounted rewards.
Preferably, inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination scores of the true state transition from e_t to e_{t+1} and of the false state transition from e_t to ê_{t+1} further comprises: calculating the difference between the discrimination score of the false state transition and the discrimination score of the true state transition to obtain the gap between the true state transition and the false state transition;
and the step of using the state transition Transformer model to obtain the discrimination score of a state transition in reinforcement learning as the intrinsic reward of reinforcement learning is then: using the gap between the true state transition and the false state transition as the intrinsic reward for reinforcement learning.
Preferably, a self-supervised temporal distance prediction method is used to learn temporally consistent feature representations of the state observations, and an adversarial learning method is adopted so that, guided by the discriminator's scores, the single-step transition law is accurately predicted in the space of the feature representations.
Preferably, the state transitions in reinforcement learning are obtained as follows: in reinforcement learning, the agent obtains the current state of the environment, selects an action to execute based on the current state according to the policy, and interacts with the environment by executing the selected action, thereby generating a state transition.
The second aspect of the present invention provides a reinforcement learning device based on an offline pre-trained state transition Transformer model, comprising:
the state transition converter model offline pre-training module is used for carrying out offline pre-training on the converter model based on the observation data of the video to obtain a state transition converter model, so that the state transition converter model predicts the next state according to the input current state and obtains a discrimination score of state transition from the current state to the next state;
and a reinforcement learning policy training module, configured to use the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy.
A third aspect of the present invention provides a memory storing a plurality of instructions for implementing the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the first aspect.
A fourth aspect of the present invention provides an electronic device comprising a processor and a memory connected to the processor, the memory storing a plurality of instructions that can be loaded and executed by the processor to enable the processor to perform the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the first aspect.
The beneficial effects of the invention are as follows: the invention provides a two-stage reinforcement learning method based on an offline pre-trained state transition Transformer model, offering an innovative way for an agent to learn effectively from visual observations. The state transition Transformer model can be pre-trained offline from visual observations alone and then guide the training of an online reinforcement learning policy without any environment reward. In addition, self-attention is integrated into each module to capture temporal variation, and latent transitions are jointly predicted by the state transition discriminator and self-supervised temporal regression, thereby improving performance on downstream reinforcement learning tasks. Experiments with policies trained in various Atari and Minecraft environments show that the proposed method is more robust, more sample-efficient and better-performing than baseline algorithms. Moreover, on some tasks it even achieves performance comparable to policies learned from explicit environment rewards. For reinforcement learning from visual observation, the proposed method has great potential in situations where video demonstrations are available but environment interaction is limited and labeling actions is expensive or dangerous; for example, it has high research value in fields such as robot control and autonomous driving.
Drawings
FIG. 1 is a flow chart of the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the present invention;
FIG. 2 is a schematic diagram of the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the present invention;
Fig. 3 is a schematic functional block diagram of the reinforcement learning device based on an offline pre-trained state transition Transformer model according to the present invention.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The method provided by the invention can be implemented in a terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory and a display screen. The memory stores at least one instruction that is loaded and executed by the processor to implement the method described in the embodiments below.
The processor may include one or more processing cores. The processor connects the various parts of the terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory and by invoking the data stored in the memory.
The memory may include random access memory (RAM) or read-only memory (ROM). The memory may be used to store instructions, programs, code, code sets or instruction sets.
The display screen is used for displaying a user interface of each application program.
In addition, it will be appreciated by those skilled in the art that the structure of the terminal described above is not limiting and that the terminal may include more or fewer components, or may combine certain components, or a different arrangement of components. For example, the terminal further includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and the like, which are not described herein.
Example 1
As shown in fig. 1 and 2, an embodiment of the present invention provides a reinforcement learning method based on an offline pre-trained state transition Transformer model, comprising:
s101, performing offline pre-training on a transducer model based on video observation data to obtain a state transition transducer model, so that the state transition transducer model predicts the next state according to the input current state and obtains a discrimination score of state transition from the current state to the next state;
s102, using the state transition transducer model to obtain a discrimination score of state transition in reinforcement learning as an intrinsic reward, so that an agent for reinforcement learning performs policy learning and iteration according to the intrinsic reward to obtain an optimal policy.
The reinforcement learning method based on the offline pre-trained state transition Transformer model comprises two stages. In the first stage (stage one, offline pre-training), a state transition Transformer model is obtained by offline pre-training on video observation data; it can effectively capture the information in the demonstration videos to predict latent (hidden-layer) transitions of the observed states.
In the second stage (stage two, online reinforcement learning), the state transition Transformer model obtained in the first stage is used to provide intrinsic rewards for the downstream reinforcement learning task, and the agent can learn and iterate its policy from the intrinsic rewards alone, without the guidance of environment rewards.
In step S101, the current state may be extracted from the video as follows: in the Atari environment, the current state is obtained by stacking four adjacent frames of observation data from the video; in the Minecraft environment, the current state is the current observation frame of the video. As shown in FIG. 2, the current states s_t and s_{t+1} of two adjacent time steps are each obtained by stacking four adjacent frames of observation data.
The Atari environment is a classic arcade game environment; since each task can be modeled as a Markov decision process, it is a popular test bed for applying reinforcement learning algorithms to visual control tasks. In the Atari environment, to ensure that the state reflects the dynamic information of the game, the current state is obtained by stacking the grayscale game frames observed at four adjacent time steps.
The Minecraft environment is a 3D game environment that has grown rapidly in recent years; MineDojo provides a simulator interface with thousands of open-ended exploration tasks. The performance of an agent completing tasks in complex Minecraft scenarios reflects the capability of an algorithm more comprehensively. Since the MineDojo simulator only supports single-frame observation state transitions, to align with the three-dimensional state representation in Atari, the state in the Minecraft environment is defined as the three-channel first-person-view image currently observed by the agent.
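As a brief, non-authoritative illustration of the state construction just described, the sketch below (the class and function names, and the (84, 84) frame shape taken from the preprocessing described next, are our own assumptions, not the patented implementation) stacks the four most recent Atari frames into one state and passes a single Minecraft frame through as a channel-first tensor:

```python
from collections import deque

import numpy as np


class FrameStacker:
    """Builds the Atari-style state by stacking the four most recent grayscale frames."""

    def __init__(self, num_frames: int = 4, frame_shape=(84, 84)):
        # Pre-fill with blank frames so a full (4, 84, 84) state exists from the first step.
        self.frames = deque(
            [np.zeros(frame_shape, dtype=np.float32)] * num_frames, maxlen=num_frames
        )

    def push(self, frame: np.ndarray) -> np.ndarray:
        """Append the newest observation frame and return the stacked state of shape (4, 84, 84)."""
        self.frames.append(frame.astype(np.float32))
        return np.stack(self.frames, axis=0)


def minecraft_state(frame: np.ndarray) -> np.ndarray:
    """In the Minecraft setting the state is the current first-person RGB frame,
    transposed to channel-first layout (3, H, W) for a convolutional encoder."""
    return np.transpose(frame.astype(np.float32), (2, 0, 1))
```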
In the embodiment of the invention, the video observation data can be obtained according to the following method:
The observation data for the Atari environment comes from Google Dopamine (an open-source reinforcement learning framework from Google). For each Atari task, the observation data set consists of the last 100,000 grayscale game frames stored in the experience replay pool, resized to (84, 84), after 50 rounds of training with DQN (the Deep Q-Network algorithm).
The observation data for the Minecraft environment comes from the related work Plan4MC (a planning-based method for solving open-ended Minecraft tasks). First, a Plan4MC agent is trained; the learned expert policy is then used to collect 50,000 first-person game frames of size (160, 256, 3), forming an expert observation data set.
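The frames collected in this way can then be packaged into pairs of adjacent observations for the offline pre-training stage described next. The following sketch is illustrative only (the class name is hypothetical) and assumes the frames have already been preprocessed into a channel-first array:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class AdjacentPairDataset(Dataset):
    """Turns a stored sequence of observation frames of shape (N, C, H, W)
    into (o_t, o_{t+1}) pairs for offline pre-training."""

    def __init__(self, frames: np.ndarray):
        # Normalize pixel values to [0, 1].
        self.frames = torch.as_tensor(frames, dtype=torch.float32) / 255.0

    def __len__(self) -> int:
        return len(self.frames) - 1

    def __getitem__(self, i):
        return self.frames[i], self.frames[i + 1]
```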
Step S101 is performed. As shown in fig. 2, performing offline pre-training on the Transformer model based on the video observation data to obtain the state transition Transformer model may include:
the state of two adjacent time stepsAnd->Respectively inputting the state representations into a feature encoder to obtain corresponding state representations e t And e t+1
Characterizing the state e t Inputting the obtained data into a transducer model, and predicting to obtain the next state representation
Will e t 、e t+1 Andrespectively input into a state transition discriminator to obtain a slave e t To e t+1 Discrimination score of true state transition of (c) and e) t To->Discrimination scores for false state transitions of (a);
iterative training is performed, so that the discrimination score of the real state transition is as high as possible, and the discrimination score of the false state transition is as low as possible until the training target is reached.
Further, inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination scores of the true and false state transitions further comprises: calculating the difference between the discrimination score of the false state transition and the discrimination score of the true state transition to obtain the gap between the true state transition and the false state transition;
and the step of using the state transition Transformer model to obtain the discrimination score of a state transition in reinforcement learning as the intrinsic reward of reinforcement learning is then: using the gap between the true state transition and the false state transition as the intrinsic reward for reinforcement learning.
In addition, in the embodiment of the invention, a self-supervised temporal distance prediction method is used to learn temporally consistent feature representations of the state observations, while an adversarial learning method is adopted so that, guided by the discriminator's scores, the single-step transition law is accurately predicted in the space of the feature representations. In this way, the pre-trained state transition Transformer and discriminator provide intrinsic rewards for the observation sequences collected online during reinforcement learning, thereby improving performance on the downstream reinforcement learning task.
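A minimal PyTorch-style sketch of this offline pre-training stage is given below. It is an illustrative reading of the description rather than the patented implementation: the module names, network sizes, and the use of a GAN-style binary cross-entropy loss are assumptions, the predictor/encoder update and the self-supervised temporal-distance head are only indicated in comments, and a length-one sequence is fed to the Transformer for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Convolutional feature encoder mapping a stacked observation to a state representation e_t."""

    def __init__(self, in_channels: int = 4, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(embed_dim)

    def forward(self, obs):
        return self.fc(self.conv(obs))


class TransitionTransformer(nn.Module):
    """Transformer that predicts the next state representation ê_{t+1} from e_t."""

    def __init__(self, embed_dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(embed_dim, embed_dim)

    def forward(self, e_t):
        h = self.transformer(e_t.unsqueeze(1))  # length-1 sequence in this sketch
        return self.head(h.squeeze(1))


class TransitionDiscriminator(nn.Module):
    """Scores a transition (e_t, e_next); a higher score means 'more like a real transition'."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * embed_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, e_t, e_next):
        return self.net(torch.cat([e_t, e_next], dim=-1)).squeeze(-1)


def discriminator_step(encoder, predictor, discriminator, opt_d, obs_t, obs_t1):
    """One adversarial update of the discriminator on a batch of adjacent video observations.
    The predictor (and encoder) would be updated in a separate step so that its predicted
    transitions score higher; the self-supervised temporal-distance loss on the encoder is omitted."""
    e_t, e_t1 = encoder(obs_t), encoder(obs_t1)
    e_t1_hat = predictor(e_t).detach()          # predicted ("false") next representation
    real_score = discriminator(e_t, e_t1)       # score of the true transition (e_t, e_{t+1})
    fake_score = discriminator(e_t, e_t1_hat)   # score of the false transition (e_t, ê_{t+1})
    # Push true-transition scores up and false-transition scores down.
    loss = F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score)) + \
           F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()
```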
Step S102 is executed. In reinforcement learning, the agent obtains the current state of the environment, selects an action to execute based on the current state according to the policy, and interacts with the environment by executing the selected action, thereby generating a state transition. The state transition Transformer model obtained by the training in step S101 is used to obtain the discrimination score of this state transition as the intrinsic reward of reinforcement learning, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic reward to obtain the optimal policy.
Specifically, as shown in fig. 2, during reinforcement learning policy training the agent obtains the current state of the environment from visual observation; the agent's policy π_θ then selects an action a_t to execute based on the current state, and the agent interacts with the environment by executing the selected action a_t, producing a state transition. The state transition Transformer model obtained by offline pre-training is then used to compute an intrinsic reward r_t for the state transition generated by the agent-environment interaction. Finally, the agent updates the policy π_θ according to the reward r_t. This iterative training continues until the optimal policy is obtained.
In the embodiment of the present invention, the reinforcement learning agent performing policy learning and iteration according to the intrinsic rewards may include: the reinforcement learning agent interacts with the environment and, driven by the intrinsic rewards computed by the state transition Transformer model, iteratively updates the policy by maximizing the following objective J, finally obtaining the optimal policy:

J(π) = E_{s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_t γ^t r(s_t, s_{t+1}) ]

where π denotes the policy, ρ_0 denotes the initial state distribution, a_t denotes the action executed in the current state s_t according to the policy distribution π(·|s_t), (s_t, s_{t+1}) denotes the transition from the state at the current time step to the state at the next time step, P denotes the state transition function, γ is the discount factor, r(s_t, s_{t+1}) denotes the intrinsic reward given by the state transition Transformer model for (s_t, s_{t+1}), E denotes the expectation, and J is the objective to be maximized, i.e., the expected sum of discounted rewards.
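To make the training loop of step S102 concrete, the sketch below (building on the hypothetical modules above) derives an intrinsic reward from the frozen pre-trained model and plugs it into a generic REINFORCE-style update. The exact sign and scaling of the reward gap and the choice of policy-optimization algorithm are assumptions made for illustration only; the patent does not tie the method to a particular RL algorithm, and a classic Gym-style environment API is assumed.

```python
import torch


@torch.no_grad()
def intrinsic_reward(encoder, predictor, discriminator, obs_t, obs_t1):
    """Intrinsic reward r(s_t, s_{t+1}) from the frozen pre-trained model:
    the gap between the score of the agent's actual transition and the score
    of the model-predicted ("false") transition for the same starting state."""
    e_t, e_t1 = encoder(obs_t), encoder(obs_t1)
    real_score = discriminator(e_t, e_t1)
    fake_score = discriminator(e_t, predictor(e_t))
    return real_score - fake_score


def reinforce_episode(env, policy, encoder, predictor, discriminator, opt_pi, gamma=0.99):
    """One episode driven purely by the intrinsic reward; the environment reward is ignored.
    `policy(obs)` is assumed to return a torch.distributions.Distribution over actions."""
    obs = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        dist = policy(obs_t)
        action = dist.sample()
        next_obs, _, done, _ = env.step(action.item())
        obs_t1 = torch.as_tensor(next_obs, dtype=torch.float32).unsqueeze(0)
        rewards.append(intrinsic_reward(encoder, predictor, discriminator, obs_t, obs_t1).item())
        log_probs.append(dist.log_prob(action))
        obs = next_obs
    # Discounted returns G_t = sum_k gamma^k * r_{t+k}, matching the objective J above.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)
    loss = -(torch.cat(log_probs) * returns).sum()  # maximize the expected discounted intrinsic reward
    opt_pi.zero_grad()
    loss.backward()
    opt_pi.step()
    return sum(rewards)
```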
Example two
As shown in fig. 3, another aspect of the present invention provides a functional module architecture fully consistent with the foregoing method flow; that is, the embodiment of the present invention further provides a reinforcement learning device based on an offline pre-trained state transition Transformer model, comprising:
the state transition converter model offline pre-training module 201 is configured to perform offline pre-training on the converter model based on the video observation data to obtain a state transition converter model, so that the state transition converter model predicts according to the input current state to obtain a next state, and obtains a discrimination score of state transition from the current state to the next state;
the reinforcement learning strategy training module 202 is configured to obtain a discrimination score of a state transition in reinforcement learning as an intrinsic reward by using the state transition transducer model, so that an agent for reinforcement learning performs strategy learning and iteration according to the intrinsic reward to obtain an optimal strategy.
In the state transition Transformer model offline pre-training module 201, the current state is extracted from the video as follows: in the Atari environment, the current state is obtained by stacking four adjacent frames of observation data from the video; in the Minecraft environment, the current state is the current observation frame of the video.
Further, performing offline pre-training on the Transformer model based on the video observation data to obtain the state transition Transformer model comprises:
inputting the states of two adjacent time steps into the feature encoder to obtain the corresponding state representations e_t and e_{t+1};
inputting the state representation e_t into the Transformer model to predict the next state representation ê_{t+1};
inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination score of the true state transition from e_t to e_{t+1} and the discrimination score of the false state transition from e_t to ê_{t+1}, respectively;
performing iterative training so that the discrimination score of the true state transition becomes as high as possible and the discrimination score of the false state transition becomes as low as possible, until the training objective is reached.
Inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination scores of the true and false state transitions further comprises: calculating the difference between the discrimination score of the false state transition and the discrimination score of the true state transition to obtain the gap between the true state transition and the false state transition;
and the step of using the state transition Transformer model to obtain the discrimination score of a state transition in reinforcement learning as the intrinsic reward of reinforcement learning is then: using the gap between the true state transition and the false state transition as the intrinsic reward for reinforcement learning.
A self-supervised temporal distance prediction method is used to learn temporally consistent feature representations of the state observations, and an adversarial learning method is adopted so that, guided by the discriminator's scores, the single-step transition law is accurately predicted in the space of the feature representations.
In the reinforcement learning policy training module 202, the state transitions in reinforcement learning are obtained as follows: in reinforcement learning, the agent obtains the current state of the environment, selects an action to execute based on the current state according to the policy, and interacts with the environment by executing the selected action, thereby generating a state transition.
Further, the reinforcement learning agent performing policy learning and iteration according to the intrinsic rewards to obtain the optimal policy comprises: the reinforcement learning agent interacts with the environment and, driven by the intrinsic rewards computed by the state transition Transformer model, iteratively updates the policy by maximizing the following objective J, finally obtaining the optimal policy:

J(π) = E_{s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_t γ^t r(s_t, s_{t+1}) ]

where π denotes the policy, ρ_0 denotes the initial state distribution, a_t denotes the action executed in the current state s_t according to the policy distribution π(·|s_t), (s_t, s_{t+1}) denotes the transition from the state at the current time step to the state at the next time step, P denotes the state transition function, γ is the discount factor, r(s_t, s_{t+1}) denotes the intrinsic reward given by the state transition Transformer model for (s_t, s_{t+1}), E denotes the expectation, and J is the objective to be maximized, i.e., the expected sum of discounted rewards.
The device may be implemented according to the reinforcement learning method based on the offline pre-trained state transition Transformer model provided in the first embodiment; the specific implementation is described in the first embodiment and is not repeated here.
The invention also provides a memory storing a plurality of instructions for implementing the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the first embodiment.
The invention also provides an electronic device comprising a processor and a memory connected to the processor, the memory storing a plurality of instructions that can be loaded and executed by the processor to enable the processor to perform the reinforcement learning method based on an offline pre-trained state transition Transformer model according to the first embodiment.
With the technical scheme provided by the invention, reinforcement learning is performed from visual observations. The method has great potential in situations where video demonstrations are available but environment interaction is limited and labeling actions is expensive or dangerous; for example, it has high research value in fields such as robot control and autonomous driving.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A reinforcement learning method based on an offline pre-trained state transition Transformer model, characterized by comprising:
performing offline pre-training on a Transformer model based on video observation data to obtain a state transition Transformer model, so that the state transition Transformer model predicts the next state from the input current state and obtains a discrimination score for the state transition from the current state to the next state;
and using the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy.
2. The method of claim 1, wherein the current state is extracted from the video as follows: in the Atari environment, the current state is obtained by stacking four adjacent frames of observation data from the video; in the Minecraft environment, the current state is the current observation frame of the video.
3. The reinforcement learning method based on an offline pre-trained state transition Transformer model according to claim 1, wherein performing offline pre-training on the Transformer model based on the video observation data to obtain the state transition Transformer model comprises:
inputting the states of two adjacent time steps into a feature encoder to obtain the corresponding state representations e_t and e_{t+1};
inputting the state representation e_t into the Transformer model to predict the next state representation ê_{t+1};
inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into a state transition discriminator to obtain the discrimination score of the true state transition from e_t to e_{t+1} and the discrimination score of the false state transition from e_t to ê_{t+1}, respectively;
performing iterative training so that the discrimination score of the true state transition increases and the discrimination score of the false state transition decreases, until the training objective is reached.
4. The reinforcement learning method based on an offline pre-trained state transition Transformer model according to claim 3, wherein inputting (e_t, e_{t+1}) and (e_t, ê_{t+1}) into the state transition discriminator to obtain the discrimination scores of the true and false state transitions further comprises: calculating the difference between the discrimination score of the false state transition and the discrimination score of the true state transition to obtain the gap between the true state transition and the false state transition;
and the step of using the state transition Transformer model to obtain the discrimination score of a state transition in reinforcement learning as the intrinsic reward of reinforcement learning is: using the gap between the true state transition and the false state transition as the intrinsic reward for reinforcement learning.
5. The reinforcement learning method based on an offline pre-trained state transition Transformer model according to claim 3, wherein a self-supervised temporal distance prediction method is used to learn temporally consistent feature representations of the state observations, and an adversarial learning method is adopted so that, guided by the discriminator's scores, the single-step transition law is accurately predicted in the space of the feature representations.
6. The reinforcement learning method based on an offline pre-trained state transition Transformer model according to claim 1, wherein the state transitions in reinforcement learning are obtained as follows: in reinforcement learning, the agent obtains the current state of the environment, selects an action to execute based on the current state according to the policy, and interacts with the environment by executing the selected action, thereby generating a state transition.
7. The method of claim 1, wherein the reinforcement learning agent performing policy learning and iteration according to the intrinsic rewards to obtain the optimal policy comprises: the reinforcement learning agent interacts with the environment and, driven by the intrinsic rewards computed by the state transition Transformer model, iteratively updates the policy by maximizing the following objective J, finally obtaining the optimal policy:

J(π) = E_{s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_t γ^t r(s_t, s_{t+1}) ]

where π denotes the policy, ρ_0 denotes the initial state distribution, a_t denotes the action executed in the current state s_t according to the policy distribution π(·|s_t), (s_t, s_{t+1}) denotes the transition from the state at the current time step to the state at the next time step, P denotes the state transition function, γ is the discount factor, r(s_t, s_{t+1}) denotes the intrinsic reward given by the state transition Transformer model for (s_t, s_{t+1}), E denotes the expectation, and J is the objective to be maximized, i.e., the expected sum of discounted rewards.
8. A reinforcement learning device based on an offline pre-trained state transition Transformer model, characterized by comprising:
a state transition Transformer model offline pre-training module, configured to perform offline pre-training on a Transformer model based on video observation data to obtain a state transition Transformer model, so that the state transition Transformer model predicts the next state from the input current state and obtains a discrimination score for the state transition from the current state to the next state;
and a reinforcement learning policy training module, configured to use the state transition Transformer model to obtain discrimination scores for state transitions in reinforcement learning as intrinsic rewards, so that the reinforcement learning agent performs policy learning and iteration according to the intrinsic rewards to obtain an optimal policy.
9. A memory having stored thereon a plurality of instructions for implementing the reinforcement learning method based on an offline pre-trained state transition Transformer model according to any one of claims 1-7.
10. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the reinforcement learning method based on an offline pre-trained state transition Transformer model according to any one of claims 1-7.
CN202310737435.3A 2023-06-20 2023-06-20 Reinforcement learning method and device based on an offline pre-trained state transition Transformer model Pending CN116702872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310737435.3A CN116702872A (en) 2023-06-20 2023-06-20 Reinforced learning method and device based on offline pre-training state transition transducer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310737435.3A CN116702872A (en) 2023-06-20 2023-06-20 Reinforced learning method and device based on offline pre-training state transition transducer model

Publications (1)

Publication Number Publication Date
CN116702872A true CN116702872A (en) 2023-09-05

Family

ID=87825438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310737435.3A Pending CN116702872A (en) 2023-06-20 2023-06-20 Reinforced learning method and device based on offline pre-training state transition transducer model

Country Status (1)

Country Link
CN (1) CN116702872A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933346A (en) * 2024-03-25 2024-04-26 之江实验室 Instant rewarding learning method based on self-supervision reinforcement learning
CN117953351A (en) * 2024-03-27 2024-04-30 之江实验室 Decision method based on model reinforcement learning


Similar Documents

Publication Publication Date Title
Greydanus et al. Visualizing and understanding atari agents
Mousavi et al. Deep reinforcement learning: an overview
Lei et al. Dynamic path planning of unknown environment based on deep reinforcement learning
CN107403426B (en) Target object detection method and device
CN105637540B (en) Method and apparatus for reinforcement learning
US20180268292A1 (en) Learning efficient object detection models with knowledge distillation
CN116702872A (en) Reinforcement learning method and device based on an offline pre-trained state transition Transformer model
CN111144580B (en) Hierarchical reinforcement learning training method and device based on imitation learning
CN110770759B (en) Neural network system
de la Cruz et al. Pre-training with non-expert human demonstration for deep reinforcement learning
CN111507378A (en) Method and apparatus for training image processing model
US11580378B2 (en) Reinforcement learning for concurrent actions
CN111602144A (en) Generating neural network system for generating instruction sequences to control agents performing tasks
Zieliński et al. 3D robotic navigation using a vision-based deep reinforcement learning model
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
EP2363251A1 (en) Robot with Behavioral Sequences on the basis of learned Petri Net Representations
CN111902812A (en) Electronic device and control method thereof
Bertoin et al. Local feature swapping for generalization in reinforcement learning
CN113407820B (en) Method for processing data by using model, related system and storage medium
Chen et al. Toward a brain-inspired system: Deep recurrent reinforcement learning for a simulated self-driving agent
Ji et al. Improving decision-making efficiency of image game based on deep Q-learning
Shao et al. Visual navigation with actor-critic deep reinforcement learning
CN115797517B (en) Data processing method, device, equipment and medium of virtual model
CN112121419A (en) Virtual object control method, device, electronic equipment and storage medium
Saito et al. Python reinforcement learning projects: eight hands-on projects exploring reinforcement learning algorithms using TensorFlow

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination