CN117953351A - Decision method based on model reinforcement learning - Google Patents

Decision method based on model reinforcement learning

Info

Publication number
CN117953351A
Authority
CN
China
Prior art keywords
information
dimensional
learning
world model
dimensional image
Prior art date
Legal status
Pending
Application number
CN202410355666.2A
Other languages
Chinese (zh)
Inventor
李帅龙
林峰
张晴
严笑然
薛均晓
马萧
陆亚飞
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202410355666.2A
Publication of CN117953351A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a decision method based on model reinforcement learning, which comprises the following steps: acquiring a high-dimensional image dataset; learning corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method; constructing a reinforcement learning world model with a Transformer architecture in the low-dimensional feature space; and imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy. Compared with random decision-making, the method reduces the randomness of decisions and can improve decision efficiency; it makes decisions according to the agent's existing decision-making capability, overcomes the drawback of low sample efficiency, increases the capability of handling uncertainty in a dynamic environment, and thereby achieves a better and more robust strategy.

Description

Decision method based on model reinforcement learning
Technical Field
The invention belongs to the technical field of deep reinforcement learning, and particularly relates to a decision method based on model reinforcement learning.
Background
Deep reinforcement learning has been applied in many fields and is a very popular research direction in both academia and industry. The widespread attention paid to deep reinforcement learning stems from its strong adaptability and learning capability. Strong adaptability means that a deep reinforcement learning agent can automatically adjust its network structure and parameters according to different environments and tasks. Strong learning capability means that a deep reinforcement learning agent can quickly learn useful knowledge from a large amount of experience data and improve the accuracy and efficiency of decision making. Meanwhile, deep reinforcement learning also has a number of drawbacks, such as unstable training, poor interpretability, and high data requirements.
Model-based reinforcement learning has many advantages over model-free reinforcement learning. It can make decisions using a learned world model rather than relying solely on trial and error; it can effectively exploit past experience to learn the world model; it helps the agent better predict future reward signals and thus formulate better strategies; and it helps the agent better handle uncertainties in a dynamic environment, thereby developing a more robust strategy. However, model errors in the learned world model can cause cascading errors, and an agent that relies only on forward search with the learned world model may end up with a poor strategy. If a model-free reinforcement learning method is combined in, the model-based reinforcement learning agent incurs additional training, which reduces training efficiency.
Therefore, how to improve the training efficiency of model-based reinforcement learning and reduce the training time of the model is an urgent problem to be solved.
Disclosure of Invention
Aiming at the problems existing in the prior art, the embodiment of the application aims to provide a decision method based on model reinforcement learning.
According to a first aspect of an embodiment of the present application, there is provided a decision method based on model reinforcement learning, including:
acquiring a high-dimensional image dataset;
learning corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method;
constructing a reinforcement learning world model with a Transformer architecture in the low-dimensional feature space;
and imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy.
Further, learning the corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method comprises:
preprocessing all high-dimensional image data in the high-dimensional image dataset;
inputting the preprocessed high-dimensional image data into an encoder and a decoder to obtain corresponding low-dimensional feature information;
comparing the low-dimensional feature information with a vocabulary of size N, and calculating the nearest embedding vector index through the Euclidean distance;
reconstructing the original image information using the decoder based on the nearest embedding vector index;
and training the encoder-decoder based on the reconstruction loss between the high-dimensional image data and the reconstructed image, and obtaining the low-dimensional features corresponding to all the high-dimensional image data after the training is completed.
Further, during the training of the encoder-decoder, the loss function also includes a loss for constraining the consistency of the encoding space and the word embedding space, and a perceptual loss, between the high-dimensional image data and the reconstructed image, for image recovery.
Further, the world model uses a combination of the low-dimensional features and the actions of several time steps to generate the next-state information.
Further, the constructed world model needs to be trained; when the world model is trained, the state transition function and the termination state function are trained with a cross-entropy loss, and the reward function is trained with a mean square error loss.
Further, imagining several steps forward with the constructed world model and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy comprises the following steps:
selecting the optimal n actions according to the current state information, and predicting the next-state information with the world model, including n pieces of state information, n pieces of reward information and n pieces of termination state information, where n is a hyperparameter;
selecting the optimal (n-1) actions for each piece of predicted state information, and predicting the next-state information with the world model, including n-1 pieces of state information, n-1 pieces of reward information, n-1 pieces of termination state information, and the like;
when predicting the information of the i-th step state, selecting the optimal n-i actions for each piece of state information predicted in the previous step, where one action is selected if (n-i) <= 1, until a preset maximum step length is reached;
and obtaining the returns corresponding to all the trajectories of the maximum step length, where the first-step decision corresponding to the optimal return is the optimal strategy.
According to a second aspect of an embodiment of the present application, there is provided a decision device based on model reinforcement learning, including:
The acquisition module is used for acquiring the high-dimensional image data set;
the learning module is used for learning the corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method;
the building module is used for constructing a reinforcement learning world model with a Transformer architecture in the low-dimensional feature space;
and the forward search module is used for imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy.
According to a third aspect of embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
According to a fourth aspect of an embodiment of the present application, there is provided an electronic apparatus including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
According to a fifth aspect of embodiments of the present application there is provided a computer readable storage medium having stored thereon computer instructions which when executed by a processor perform the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
According to the above embodiments, a model is trained with the self-supervised learning method to obtain the low-dimensional features of the high-dimensional image data. Compared with random decision-making, the method reduces the randomness of decisions and can improve decision efficiency; it makes decisions according to the agent's existing decision-making capability, overcomes the drawback of low sample efficiency, increases the capability of handling uncertainty in a dynamic environment, and thereby achieves a better and more robust strategy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a decision method based on model reinforcement learning, according to an example embodiment.
FIG. 2 is a schematic diagram of self-supervised learning according to an exemplary embodiment.
FIG. 3 is a diagram illustrating an ith step world model-based prediction according to an exemplary embodiment.
FIG. 4 is a schematic flow diagram, according to an exemplary embodiment, of an agent performing trajectory collection from a state in the low-dimensional feature space through the learned world model.
FIG. 5 is a block diagram illustrating a decision device based on model reinforcement learning, according to an example embodiment.
Fig. 6 is a schematic diagram of an electronic device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
FIG. 1 is a schematic flow chart of a decision method based on model reinforcement learning, illustrated here with an application to the Atari-Pong game scene and applied to a reinforcement learning agent. The method specifically comprises the following steps:
S1: acquiring a high-dimensional image dataset D in the Atari-Pong game scene.
Specifically, at the beginning the agent model needs to be randomly initialized, since it has not yet been trained. A reinforcement learning agent is characterized by real-time interaction and needs to collect data in real time while training. The agent therefore needs to constantly interact with the simulation environment of the Atari-Pong game scene to collect data.
S2: as shown in fig. 2, the corresponding low-dimensional features are learned from the high-dimensional image dataset using a self-supervised learning method. The method specifically comprises the following substeps:
S21: preprocessing all high-dimensional image data in the high-dimensional image data set;
Specifically, the Pong environment provides an observation whose default space is Box(210, 160, 3), i.e., a 3-channel color image, as shown in FIG. 2. Because the value of each pixel lies between 0 and 255, normalization is needed to facilitate training of the deep neural network (accelerating training, preventing gradients from vanishing or exploding, improving the optimization process, etc.). The normalization makes the deep network model easier to train; the deep network model can extract effective information from high-dimensional data and allows end-to-end learning directly from the input data to the final output, without human intervention in data processing and feature selection.
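As a minimal illustration of this preprocessing step (a sketch assuming a standard Gym-style Pong observation; none of this code comes from the patent), the raw frame can be normalized to the [0, 1] range as follows:

```python
# Minimal preprocessing sketch: scale a raw (210, 160, 3) Atari-Pong frame
# with uint8 pixel values 0-255 into floats in [0, 1].
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Normalize pixel values to [0, 1] before feeding the deep network."""
    assert frame.shape == (210, 160, 3), "Pong observation is Box(210, 160, 3)"
    return frame.astype(np.float32) / 255.0

# A random frame stands in for a real observation here.
dummy_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
x = preprocess_frame(dummy_frame)
print(x.min(), x.max())  # values now lie in [0, 1]
```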
S22: inputting the preprocessed high-dimensional image data into an encoder and a decoder to obtain the corresponding low-dimensional feature information;
Specifically, both the encoder and the decoder are designed as deep network models. The encoder E, which maps an image of shape (210, 160, 3) to K tokens, is realized with a two-dimensional convolutional network that converts the three-dimensional image features into low-dimensional features, which are further converted into K token features with dimension d = 512. The corresponding low-dimensional feature Z is a collection of K tokens, where K is the number of tokens output by the deep network model and d is the dimension of the representation vector of each token.
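A minimal sketch of such an encoder is given below; the convolution layer sizes and the token count K = 16 are illustrative assumptions (the text above only fixes the input resolution (210, 160, 3) and the token dimension d = 512), and this is not the patent's actual network:

```python
# Sketch of a 2-D convolutional encoder mapping a (3, 210, 160) image to
# K tokens of dimension d = 512. Layer widths and K are assumptions.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, k_tokens: int = 16, d: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # Pool the spatial map down to k_tokens positions, each a d-dim vector.
        self.pool = nn.AdaptiveAvgPool2d((4, k_tokens // 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pool(self.conv(x))            # (B, d, 4, k_tokens // 4)
        return h.flatten(2).transpose(1, 2)    # (B, K, d)

encoder = ConvEncoder()
z = encoder(torch.rand(1, 3, 210, 160))
print(z.shape)  # torch.Size([1, 16, 512])
```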
S23: comparing the low-dimensional feature information with a vocabulary of size N, and calculating the nearest embedding vector index through the Euclidean distance;
Specifically, the vocabulary represents the set of N quantized tokens, and the total number N determines the expressive range of the tokens. The low-dimensional feature tensor output by the encoder is compared, by Euclidean distance, with the embedding vector of each token in the vocabulary, and the token with the minimum Euclidean distance to the encoder output is found; the output token indices are the nearest embedding vector indices. When comparing, the last dimension d of the vocabulary embeddings is consistent with the last dimension d of the low-dimensional features; the vocabulary corresponds to an N x d embedding table.
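The nearest-neighbour lookup against the vocabulary can be sketched as follows (an assumed implementation; the vocabulary size N = 512 is an illustrative value, not one stated above):

```python
# Sketch of the Euclidean-distance lookup: each d-dimensional encoder output is
# compared with an N x d embedding table and the index of the closest entry is kept.
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z: (B, K, d) encoder features; codebook: (N, d) vocabulary embedding table.
    Returns (B, K) indices of the nearest vocabulary entries."""
    # torch.cdist gives pairwise Euclidean distances between the two point sets.
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, K, N)
    return dists.argmin(dim=-1)

codebook = torch.randn(512, 512)              # assumed N = 512 entries of dimension d = 512
tokens = quantize(torch.randn(1, 16, 512), codebook)
print(tokens.shape)                            # torch.Size([1, 16]) -- token indices
```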
S24: reconstructing the original image information using the decoder based on the nearest embedding vector index;
Specifically, the decoder D is realized with a deconvolution network and aims to reconstruct the original image, whose dimension is (210, 160, 3), from the tokens. The K low-dimensional token vectors are first converted into a one-dimensional feature vector, and the deconvolution network then reconstructs the original image from it.
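A corresponding decoder sketch is shown below; the fully connected bottleneck and the deconvolution layer shapes are assumptions chosen only so that the output lands back at the (210, 160, 3) Pong resolution:

```python
# Sketch of a deconvolution decoder: the K quantized token vectors are flattened
# into a one-dimensional feature and upsampled back to a (3, 210, 160) image.
import torch
import torch.nn as nn

class DeconvDecoder(nn.Module):
    def __init__(self, k_tokens: int = 16, d: int = 512):
        super().__init__()
        self.fc = nn.Linear(k_tokens * d, 128 * 13 * 10)   # one-dimensional bottleneck
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z_q: torch.Tensor) -> torch.Tensor:
        h = self.fc(z_q.flatten(1)).view(-1, 128, 13, 10)
        x = self.deconv(h)                                 # (B, 3, 104, 80)
        # Resize to the original Pong resolution (210, 160).
        return nn.functional.interpolate(x, size=(210, 160),
                                         mode="bilinear", align_corners=False)

decoder = DeconvDecoder()
recon = decoder(torch.randn(1, 16, 512))
print(recon.shape)  # torch.Size([1, 3, 210, 160])
```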
S25: training the encoder-decoder based on the reconstruction loss between the high-dimensional image data and the reconstructed image, and obtaining the low-dimensional features corresponding to all the high-dimensional image data after the training is completed;
Specifically, the convolutional encoder-decoder that learns the low-dimensional feature vectors and tokens is trained with a reconstruction loss, a commitment loss (used to constrain the consistency of the encoding space and the word embedding space), and a perceptual loss (used for image recovery).
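The three-part objective can be sketched as below; the weighting factor beta and the identity feature extractor used in the toy call are stand-ins, not values or networks specified above:

```python
# Sketch of the encoder-decoder training loss: reconstruction + commitment +
# perceptual terms. `feature_net` stands in for whatever perceptual network is used.
import torch
import torch.nn.functional as F

def vq_losses(x, x_recon, z_e, z_q, feature_net, beta: float = 0.25):
    """x: original images; x_recon: decoder output; z_e: encoder features;
    z_q: the selected vocabulary embeddings (same shape as z_e)."""
    recon_loss = F.mse_loss(x_recon, x)                                  # reconstruction
    commit_loss = F.mse_loss(z_e, z_q.detach())                          # commitment
    perceptual_loss = F.mse_loss(feature_net(x_recon), feature_net(x))   # perceptual
    return recon_loss + beta * commit_loss + perceptual_loss

# Toy call with random tensors and an identity feature extractor.
x = torch.rand(2, 3, 210, 160); x_rec = torch.rand_like(x)
z_e = torch.randn(2, 16, 512); z_q = torch.randn_like(z_e)
print(vq_losses(x, x_rec, z_e, z_q, feature_net=lambda t: t).item())
```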
S3: as shown in FIG. 3, a reinforcement learning world model is constructed with a Transformer architecture in the low-dimensional feature space.
The world model combines the tokens output by the encoder with action tokens to generate the next-state information, and uses a Transformer as its network architecture. The image tokens obtained in S2 are combined with the action token of each step (an action is represented in a discrete action space, e.g., Pong's discrete action space is 0, 1, 2, 3, 4, 5, and each action is one token) into a sequence that is input to the Transformer architecture. That is, the Transformer takes L time steps of low-dimensional features (the low-dimensional features of each time step are K tokens) together with the corresponding action tokens, i.e., L*(K+1) tokens, as input. An action embedding table of size A x D and an image-token embedding table of size N x D are used to embed the input into a tensor, where A denotes the size of the action space, D the dimension of the token representation vectors, and N the size of the state token vocabulary. The tensor passes through M blocks of the Transformer, each composed of self-attention and a nonlinear network, to obtain the image tokens of the next state, and the next-state tokens are then passed through linear networks to obtain the immediate reward and the termination state.
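A minimal sketch of this token layout and of the prediction heads follows; the model width, head count, number of blocks, vocabulary size and the simplified read-out (as well as the omission of a causal attention mask) are illustrative assumptions rather than the patent's architecture:

```python
# Sketch of a Transformer world model: K image tokens plus 1 action token per
# time step are embedded, concatenated into an L*(K+1)-long sequence, and passed
# through M self-attention blocks; linear heads read out next tokens, reward, done.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, n_vocab=512, n_actions=6, d_model=256, n_blocks=4):
        super().__init__()
        self.tok_emb = nn.Embedding(n_vocab, d_model)     # image-token embedding table (N x D)
        self.act_emb = nn.Embedding(n_actions, d_model)   # action embedding table (A x D)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_blocks)  # the M blocks
        self.next_token_head = nn.Linear(d_model, n_vocab)  # next-state image tokens
        self.reward_head = nn.Linear(d_model, 1)            # immediate reward
        self.done_head = nn.Linear(d_model, 2)              # termination state (yes / no)

    def forward(self, image_tokens, actions):
        # image_tokens: (B, L, K) integer indices; actions: (B, L) integer indices
        b, l, k = image_tokens.shape
        seq = torch.cat([self.tok_emb(image_tokens),                  # (B, L, K, D)
                         self.act_emb(actions).unsqueeze(2)], dim=2)  # (B, L, 1, D)
        h = self.blocks(seq.view(b, l * (k + 1), -1))                 # (B, L*(K+1), D)
        last = h[:, -1]                                               # summary hidden state
        return self.next_token_head(h), self.reward_head(last), self.done_head(last)

wm = WorldModel()
logits, reward, done = wm(torch.randint(0, 512, (1, 2, 16)), torch.randint(0, 6, (1, 2)))
print(logits.shape, reward.shape, done.shape)
```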
The constructed world model needs to be trained; when the world model is trained, the state transition function and the termination state function are trained with a cross-entropy loss, and the reward function is trained with a mean square error loss.
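The stated loss combination can be sketched as follows, with tensor shapes following the WorldModel sketch above (assumed shapes, not those of the actual implementation):

```python
# Sketch of the world-model losses: cross-entropy for the next-state tokens and
# the termination flag, mean square error for the reward.
import torch
import torch.nn.functional as F

def world_model_loss(token_logits, target_tokens, done_logits, target_done,
                     reward_pred, target_reward):
    transition_loss = F.cross_entropy(token_logits.flatten(0, 1), target_tokens.flatten())
    termination_loss = F.cross_entropy(done_logits, target_done)
    reward_loss = F.mse_loss(reward_pred.squeeze(-1), target_reward)
    return transition_loss + termination_loss + reward_loss

# Toy call with random tensors of plausible shapes.
loss = world_model_loss(torch.randn(2, 8, 512), torch.randint(0, 512, (2, 8)),
                        torch.randn(2, 2), torch.randint(0, 2, (2,)),
                        torch.randn(2, 1), torch.rand(2))
print(loss.item())
```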
S4: as shown in FIG. 4, several steps are imagined forward with the constructed world model, and a forward search is performed according to the return of the imagined trajectories, so as to obtain the optimal strategy.
Note that, as shown in FIG. 4, at time step t=0 the quantities shown respectively represent the original image, the one-dimensional feature vector, the obtained token features, the reconstructed one-dimensional vector, and the reconstructed image; a represents the policy (action) information; G represents the world model described in S3. The content at time step t=1 represents the first n sets of next-state tokens with the highest immediate rewards obtained by applying different actions at t=0 through the world model G, where n is set manually and represents the maximum number of states retained at each step.
It should be noted that, in order to reduce the amount of training, the present invention abandons the conventional model-free reinforcement learning approach to solving for a policy. To avoid the inefficiency of randomly generated policies, the present invention fully exploits the characteristics of model-based (world model) reinforcement learning. The method removes the step of training a separate policy model and is more efficient than randomly generating policies.
Specifically, the decision process described in step S4 incorporates important features of model-based reinforcement learning: it both uses the immediate rewards of the reward function and takes the effect of future rewards into account. When making a decision, the agent imagines H steps forward with the learned world model to obtain the returns of the imagined trajectories, and obtains the optimal decision according to the optimal return, comprising the following sub-steps:
S41: based on the current state information, the agent selects the best n actions and predicts the next-state information with the world model, including n pieces of state information, n pieces of reward information, n pieces of termination state information, and the like;
When the learned world model makes the first-step prediction, the decisions corresponding to the n best first-step immediate rewards are selected and the relevant information is stored (n is an optional hyperparameter; if the size k of the action space is smaller than n, k pieces of corresponding information are stored). It should be noted that each forward-imagined trajectory constitutes a decision sequence. The trajectory with the maximum return can be obtained through the return function, from which the first-step decision of the optimal trajectory is obtained.
S42: for each piece of predicted state information, the agent selects the best (n-1) actions and predicts the next-step state information with the world model, including n-1 pieces of state information, n-1 pieces of reward information, n-1 pieces of termination state information, and the like;
When the learned world model makes the second-step prediction, the decisions corresponding to the best (n-1) second-step immediate rewards are selected for each piece of state information stored in the first step, and the relevant information is stored (n is an optional hyperparameter; k pieces of related information are selected if the size k of the action space is smaller than (n-1), and 1 piece of corresponding information is stored if (n-1) is smaller than or equal to 1).
S43: when predicting the information of the i-th step state, selecting the optimal n-i actions (if (n-i) <= 1, selecting one action) for each piece of state information predicted in the previous step, until a preset step length H is reached;
As shown in FIG. 3, when the learned world model makes the i-th step prediction, the decisions corresponding to the best (n+1-i) immediate rewards of the i-th step are selected for each piece of state information stored in step (i-1), and the relevant information is stored (n is an optional hyperparameter; k pieces of related information are selected if the size k of the action space is smaller than (n+1-i), and 1 piece of corresponding information is stored if (n+1-i) is smaller than or equal to 1). This continues until the maximum step length H is reached.
S44: obtaining the returns corresponding to all the trajectories of step length H, where the first-step decision corresponding to the optimal return is the optimal strategy;
Specifically, the return has the following expression: R = r_1 + γ·r_2 + … + γ^(H-1)·r_H, where γ is a discount factor used to balance the rewards over the H steps.
An imagined trajectory means that the agent performs an H-step forward rollout according to the learned world model, predicting the state information, reward information, and the like of the future H steps. The decision the agent makes based on the imagined trajectories does not require sample data and takes future effects into account.
It should be noted that, since different decision environments have different action spaces, the hyperparameter n is also adjustable, so the specific number of trajectories also needs to be determined according to the specific environment.
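The search procedure of S41-S44 can be sketched as below against a hypothetical step_fn interface that stands in for one imagined step of the learned world model; the shrinking beam width n, n-1, ..., the ranking by immediate reward and the discounted return follow the description above, while all names and the toy dynamics are assumptions:

```python
# Sketch of the forward search: keep the n best actions at step 1, n-1 at step 2,
# and so on up to horizon H; return the first-step action of the best-return trajectory.
from typing import Callable, List, Tuple

def forward_search(state, step_fn: Callable, n_actions: int,
                   n: int = 3, horizon: int = 2, gamma: float = 0.9) -> int:
    # Each beam entry: (current state, first-step action, accumulated return, finished flag)
    beams: List[Tuple] = [(state, None, 0.0, False)]
    for i in range(1, horizon + 1):
        width = max(n + 1 - i, 1)             # n actions at step 1, n-1 at step 2, ...
        candidates = []
        for s, first_a, ret, done in beams:
            if done:                           # carry a terminated trajectory forward as-is
                candidates.append((s, first_a, ret, done, 0.0))
                continue
            scored = []
            for a in range(n_actions):         # imagine every action one step ahead
                s2, r, d = step_fn(s, a)
                scored.append((s2, a if first_a is None else first_a,
                               ret + (gamma ** (i - 1)) * r, d, r))
            scored.sort(key=lambda t: t[4], reverse=True)  # rank by immediate reward
            candidates.extend(scored[:width])               # keep only the best `width`
        beams = [c[:4] for c in candidates]
    best = max(beams, key=lambda t: t[2])      # trajectory with the highest return
    return best[1]                             # its first-step decision

def toy_step(s, a):
    """Stand-in for one imagined world-model step (made-up dynamics)."""
    return s + a, (1.0 if a == 3 else 0.1 * a), False

print(forward_search(0, toy_step, n_actions=6))  # prints 3 under this toy dynamics
```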
For example, in the Pong game of the Atari suite, the size of the action space is 6, and the agent imagines forward according to the learned world model before making a decision. Assuming n=3 and H=2, the trajectories are P1={(a=3,r=1),(a=2,r=1)}, P2={(a=3,r=1),(a=4,r=0.3)}, P3={(a=5,r=0.8),(a=2,r=0.5)}, P4={(a=5,r=0.8),(a=1,r=0.6)}, P5={(a=2,r=0.4),(a=1,r=0.7)}, and P6={(a=2,r=0.4),(a=3,r=0.2)}. The return value of each trajectory is obtained through the return function, and the trajectory with the maximum return value is P1, so the optimal decision is a=3.
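The same selection can be reproduced numerically; the discount factor γ = 0.9 is an assumed value, since the text does not state which discount it uses:

```python
# Reproducing the worked example: compute R = r1 + gamma*r2 for each trajectory
# and pick the one with the largest return.
gamma = 0.9
trajectories = {
    "P1": [(3, 1.0), (2, 1.0)],
    "P2": [(3, 1.0), (4, 0.3)],
    "P3": [(5, 0.8), (2, 0.5)],
    "P4": [(5, 0.8), (1, 0.6)],
    "P5": [(2, 0.4), (1, 0.7)],
    "P6": [(2, 0.4), (3, 0.2)],
}
returns = {name: sum((gamma ** i) * r for i, (_, r) in enumerate(steps))
           for name, steps in trajectories.items()}
best = max(returns, key=returns.get)
print(best, returns[best])          # P1 has the largest return (1.9)
print(trajectories[best][0][0])     # its first-step action: 3
```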
Compared with random decision-making, the method reduces the randomness of decisions, can improve decision efficiency, and makes decisions according to the agent's existing decision-making capability.
Corresponding to the embodiment of the decision method based on model reinforcement learning, the application also provides an embodiment of the decision device based on model reinforcement learning.
FIG. 5 is a block diagram illustrating a decision device based on model reinforcement learning, according to an example embodiment. Referring to fig. 5, the apparatus may include:
An acquisition module 21 for acquiring a high-dimensional image dataset;
a learning module 22 for learning the corresponding low-dimensional features from the high-dimensional image dataset using a self-supervised learning method;
A building module 23, configured to construct a reinforcement learning world model with a Transformer architecture in the low-dimensional feature space;
A forward search module 24, configured to imagine several steps forward with the constructed world model and perform a forward search according to the return of the imagined trajectories, so as to obtain an optimal policy.
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in the embodiments of the method, and will not be described again herein.
For the device embodiments, since they essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of the present application. Those of ordinary skill in the art can understand and implement the present application without undue burden.
Correspondingly, the application also provides a computer program product, comprising a computer program, wherein the computer program is executed by a processor to realize the decision method based on model reinforcement learning.
Correspondingly, the application also provides an electronic device, which comprises: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the model reinforcement learning-based decision method described above. FIG. 6 is a hardware structure diagram of a device with data processing capability in which a decision apparatus based on model reinforcement learning according to an embodiment of the present application is located. In addition to the processor, memory and network interface shown in FIG. 6, the device generally includes other hardware according to its actual function, which is not described herein again.
Accordingly, the present application also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the model reinforcement learning-based decision method described above. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in the previous embodiments. The computer readable storage medium may also be an external storage device, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used for storing the computer program and the other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof.

Claims (10)

1. A model reinforcement learning-based decision method, comprising:
acquiring a high-dimensional image dataset;
learning corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method;
constructing a reinforcement learning world model with a Transformer architecture in a low-dimensional feature space;
and imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy.
2. The method of claim 1, wherein learning the corresponding low-dimensional features from the high-dimensional image dataset using a self-supervised learning method comprises:
preprocessing all high-dimensional image data in the high-dimensional image data set;
inputting the preprocessed high-dimensional image data into an encoder and a decoder to obtain corresponding low-dimensional feature information;
comparing the low-dimensional feature information with a vocabulary of size N, and calculating the nearest embedding vector index through the Euclidean distance;
reconstructing the original image information using the decoder based on the nearest embedding vector index;
and training the encoder-decoder based on the reconstruction loss between the high-dimensional image data and the reconstructed image, and obtaining the low-dimensional features corresponding to all the high-dimensional image data after the training is completed.
3. The method of claim 2, wherein, during the training of the encoder-decoder, the loss function further comprises a loss for constraining the consistency of the encoding space and the word embedding space, and a perceptual loss, between the high-dimensional image data and the reconstructed image, for image recovery.
4. The method of claim 1, wherein the world model utilizes a combination of low-dimensional features and actions of several time steps to generate the next state information.
5. The method of claim 1, wherein the constructed world model needs to be trained, and wherein, when the world model is trained, the state transition function and the termination state function are trained with a cross-entropy loss, and the reward function is trained with a mean square error loss.
6. The method of claim 1, wherein imagining several steps forward with the constructed world model and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy comprises:
selecting the optimal n actions according to the current state information, and predicting the next-state information with the world model, including n pieces of state information, n pieces of reward information and n pieces of termination state information, where n is a hyperparameter;
selecting the optimal (n-1) actions for each piece of predicted state information, and predicting the next-state information with the world model, including n-1 pieces of state information, n-1 pieces of reward information and n-1 pieces of termination state information;
when predicting the information of the i-th step state, selecting the optimal n-i actions for each piece of state information predicted in the previous step, where one action is selected if (n-i) <= 1, until a preset maximum step length is reached;
and obtaining the returns corresponding to all the trajectories of the maximum step length, where the first-step decision corresponding to the optimal return is the optimal strategy.
7. A model reinforcement learning-based decision-making apparatus, comprising:
The acquisition module is used for acquiring the high-dimensional image data set;
the learning module is used for learning the corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method;
the building module is used for constructing a reinforcement learning world model with a Transformer architecture in a low-dimensional feature space;
and the forward search module is used for imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy.
8. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1-6.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any of claims 1-6.
CN202410355666.2A 2024-03-27 2024-03-27 Decision method based on model reinforcement learning Pending CN117953351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410355666.2A CN117953351A (en) 2024-03-27 2024-03-27 Decision method based on model reinforcement learning

Publications (1)

Publication Number Publication Date
CN117953351A 2024-04-30

Family

ID=90793076

Country Status (1)

Country Link
CN (1) CN117953351A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination