CN117953351A - Decision method based on model reinforcement learning - Google Patents

Decision method based on model reinforcement learning

Info

Publication number
CN117953351A
Authority
CN
China
Prior art keywords
information
dimensional
learning
world model
dimensional image
Prior art date
Legal status
Pending
Application number
CN202410355666.2A
Other languages
Chinese (zh)
Inventor
李帅龙
林峰
张晴
严笑然
薛均晓
马萧
陆亚飞
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202410355666.2A
Publication of CN117953351A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a decision method based on model reinforcement learning, which comprises the following steps: acquiring a high-dimensional image dataset; learning corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method; constructing a reinforcement learning world model with a Transformer architecture in the low-dimensional feature space; and imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy. Compared with random decision-making, the method reduces the randomness of decisions and can improve decision efficiency; it makes decisions according to the agent's existing decision-making capability, overcomes the drawback of low sample efficiency, increases the capability of handling uncertainty in a dynamic environment, and thereby achieves a better and more robust strategy.

Description

Decision method based on model reinforcement learning
Technical Field
The invention belongs to the technical field of deep reinforcement learning, and particularly relates to a decision method based on model reinforcement learning.
Background
Deep reinforcement learning has been applied in many fields and is a very popular research direction in both academia and industry. The widespread attention paid to deep reinforcement learning stems from its strong adaptability and learning capability. Strong adaptability means that a deep reinforcement learning agent can automatically adjust its network structure and parameters according to different environments and tasks. Strong learning capability means that a deep reinforcement learning agent can quickly learn useful knowledge from a large amount of experience data and improve the accuracy and efficiency of decision making. Meanwhile, deep reinforcement learning also has a number of drawbacks, such as unstable training, poor interpretability, and high data requirements.
Model-based reinforcement learning has many advantages over model-free reinforcement learning. It can make decisions using a learned world model rather than relying solely on trial and error; it can effectively exploit past experience to learn the world model; it helps the agent better predict future reward signals and thus formulate better strategies; and it helps the agent better handle uncertainties in a dynamic environment, thereby developing a more robust strategy. However, model errors in the learned world model can cause cascading errors, and an agent that relies only on forward search with the learned world model may end up with a poor strategy. If a model-free reinforcement learning method is combined in, the model-based reinforcement learning agent incurs additional training, which reduces training efficiency.
Therefore, how to improve the training efficiency of model-based reinforcement learning and reduce the training time of the model is an urgent problem to be solved.
Disclosure of Invention
Aiming at the problems existing in the prior art, the embodiment of the application aims to provide a decision method based on model reinforcement learning.
According to a first aspect of an embodiment of the present application, there is provided a decision method based on model reinforcement learning, including:
acquiring a high-dimensional image dataset;
learning corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method;
constructing a reinforcement learning world model with a Transformer architecture in the low-dimensional feature space;
and imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy.
Further, learning the corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method comprises:
preprocessing all high-dimensional image data in the high-dimensional image dataset;
inputting the preprocessed high-dimensional image data into an encoder and a decoder to obtain corresponding low-dimensional feature information;
comparing the low-dimensional feature information with a vocabulary of size N, and calculating the nearest embedding vector index through the Euclidean distance;
reconstructing the original image information using the decoder based on the nearest embedding vector index;
and training the encoder-decoder based on the reconstruction loss between the high-dimensional image data and the reconstructed image, and obtaining the low-dimensional features corresponding to all the high-dimensional image data after the training is completed.
Further, during the training of the encoder-decoder, the loss function also includes a loss for constraining the consistency of the encoding space and the word embedding space, and a perceptual loss, between the high-dimensional image data and the reconstructed image, for image recovery.
Further, the world model uses a combination of the low-dimensional features and the actions of several time steps to generate the next-state information.
Further, the constructed world model needs to be trained; when the world model is trained, the state transition function and the termination state function are trained with a cross-entropy loss, and the reward function is trained with a mean square error loss.
Further, imagining several steps forward with the constructed world model and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy comprises the following steps:
selecting the optimal n actions according to the current state information, and predicting the next-state information with the world model, including n pieces of state information, n pieces of reward information and n pieces of termination state information, where n is a hyperparameter;
selecting the optimal (n-1) actions for each piece of predicted state information, and predicting the next-state information with the world model, including n-1 pieces of state information, n-1 pieces of reward information, n-1 pieces of termination state information, and the like;
when predicting the information of the i-th step state, selecting the optimal n-i actions for each piece of state information predicted in the previous step, where one action is selected if (n-i) <= 1, until a preset maximum step length is reached;
and obtaining the returns corresponding to all the trajectories of the maximum step length, where the first-step decision corresponding to the optimal return is the optimal strategy.
According to a second aspect of an embodiment of the present application, there is provided a decision device based on model reinforcement learning, including:
The acquisition module is used for acquiring the high-dimensional image data set;
the learning module is used for learning the corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method;
the building module is used for constructing a reinforcement learning world model with a Transformer architecture in the low-dimensional feature space;
and the forward search module is used for imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy.
According to a third aspect of embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
According to a fourth aspect of an embodiment of the present application, there is provided an electronic apparatus including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
According to a fifth aspect of embodiments of the present application there is provided a computer readable storage medium having stored thereon computer instructions which when executed by a processor perform the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
According to the above embodiments, a model is trained with the self-supervised learning method to obtain the low-dimensional features of the high-dimensional image data. Compared with random decision-making, the method reduces the randomness of decisions and can improve decision efficiency; it makes decisions according to the agent's existing decision-making capability, overcomes the drawback of low sample efficiency, increases the capability of handling uncertainty in a dynamic environment, and thereby achieves a better and more robust strategy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a decision method based on model reinforcement learning, according to an example embodiment.
FIG. 2 is a schematic diagram of self-supervised learning according to an exemplary embodiment.
FIG. 3 is a diagram illustrating an ith step world model-based prediction according to an exemplary embodiment.
FIG. 4 is a schematic flow diagram, according to an exemplary embodiment, of an agent performing trajectory collection from a state in the low-dimensional feature space through the learned world model.
FIG. 5 is a block diagram illustrating a decision device based on model reinforcement learning, according to an example embodiment.
Fig. 6 is a schematic diagram of an electronic device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
FIG. 1 is a schematic flow chart of a decision method based on model reinforcement learning, illustrated here with an application to the Atari-Pong game scene and applied to a reinforcement learning agent. The method specifically comprises the following steps:
S1: acquiring a high-dimensional image dataset D in the Atari-Pong game scene.
Specifically, at the beginning the agent model needs to be randomly initialized, since it has not yet been trained. A reinforcement learning agent is characterized by real-time interaction and needs to collect data in real time while training. The agent therefore needs to constantly interact with the simulation environment of the Atari-Pong game scene to collect data.
S2: as shown in fig. 2, the corresponding low-dimensional features are learned from the high-dimensional image dataset using a self-supervised learning method. The method specifically comprises the following substeps:
S21: preprocessing all high-dimensional image data in the high-dimensional image data set;
Specifically, the Pong environment provides an observation whose default space is Box(210, 160, 3), i.e., a 3-channel color image, as shown in FIG. 2. Because the value of each pixel lies between 0 and 255, normalization is needed to facilitate training of the deep neural network (accelerating training, preventing gradients from vanishing or exploding, improving the optimization process, etc.). The normalization makes the deep network model easier to train; the deep network model can extract effective information from high-dimensional data and allows end-to-end learning directly from the input data to the final output, without human intervention in data processing and feature selection.
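As a minimal illustration of this preprocessing step (a sketch assuming a standard Gym-style Pong observation; none of this code comes from the patent), the raw frame can be normalized to the [0, 1] range as follows:

```python
# Minimal preprocessing sketch: scale a raw (210, 160, 3) Atari-Pong frame
# with uint8 pixel values 0-255 into floats in [0, 1].
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Normalize pixel values to [0, 1] before feeding the deep network."""
    assert frame.shape == (210, 160, 3), "Pong observation is Box(210, 160, 3)"
    return frame.astype(np.float32) / 255.0

# A random frame stands in for a real observation here.
dummy_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
x = preprocess_frame(dummy_frame)
print(x.min(), x.max())  # values now lie in [0, 1]
```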
S22: inputting the preprocessed high-dimensional image data into an encoder and a decoder to obtain the corresponding low-dimensional feature information;
Specifically, both the encoder and the decoder are designed as deep network models. The encoder E, which maps an image of shape (210, 160, 3) to K tokens, is realized with a two-dimensional convolutional network that converts the three-dimensional image features into low-dimensional features, which are further converted into K token features with dimension d = 512. The corresponding low-dimensional feature Z is a collection of K tokens, where K is the number of tokens output by the deep network model and d is the dimension of the representation vector of each token.
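A minimal sketch of such an encoder is given below; the convolution layer sizes and the token count K = 16 are illustrative assumptions (the text above only fixes the input resolution (210, 160, 3) and the token dimension d = 512), and this is not the patent's actual network:

```python
# Sketch of a 2-D convolutional encoder mapping a (3, 210, 160) image to
# K tokens of dimension d = 512. Layer widths and K are assumptions.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, k_tokens: int = 16, d: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # Pool the spatial map down to k_tokens positions, each a d-dim vector.
        self.pool = nn.AdaptiveAvgPool2d((4, k_tokens // 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pool(self.conv(x))            # (B, d, 4, k_tokens // 4)
        return h.flatten(2).transpose(1, 2)    # (B, K, d)

encoder = ConvEncoder()
z = encoder(torch.rand(1, 3, 210, 160))
print(z.shape)  # torch.Size([1, 16, 512])
```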
S23: comparing the low-dimensional feature information with a vocabulary of size N, and calculating the nearest embedding vector index through the Euclidean distance;
Specifically, the vocabulary represents the set of N quantized tokens, and the total number N determines the expressive range of the tokens. The low-dimensional feature tensor output by the encoder is compared, by Euclidean distance, with the embedding vector of each token in the vocabulary, and the token with the minimum Euclidean distance to the encoder output is found; the output token indices are the nearest embedding vector indices. When comparing, the last dimension d of the vocabulary embeddings is consistent with the last dimension d of the low-dimensional features; the vocabulary corresponds to an N x d embedding table.
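The nearest-neighbour lookup against the vocabulary can be sketched as follows (an assumed implementation; the vocabulary size N = 512 is an illustrative value, not one stated above):

```python
# Sketch of the Euclidean-distance lookup: each d-dimensional encoder output is
# compared with an N x d embedding table and the index of the closest entry is kept.
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z: (B, K, d) encoder features; codebook: (N, d) vocabulary embedding table.
    Returns (B, K) indices of the nearest vocabulary entries."""
    # torch.cdist gives pairwise Euclidean distances between the two point sets.
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, K, N)
    return dists.argmin(dim=-1)

codebook = torch.randn(512, 512)              # assumed N = 512 entries of dimension d = 512
tokens = quantize(torch.randn(1, 16, 512), codebook)
print(tokens.shape)                            # torch.Size([1, 16]) -- token indices
```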
S24: reconstructing the original image information using the decoder based on the nearest embedding vector index;
Specifically, the decoder D is realized with a deconvolution network and aims to reconstruct the original image, whose dimension is (210, 160, 3), from the tokens. The K low-dimensional token vectors are first converted into a one-dimensional feature vector, and the deconvolution network then reconstructs the original image from it.
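A corresponding decoder sketch is shown below; the fully connected bottleneck and the deconvolution layer shapes are assumptions chosen only so that the output lands back at the (210, 160, 3) Pong resolution:

```python
# Sketch of a deconvolution decoder: the K quantized token vectors are flattened
# into a one-dimensional feature and upsampled back to a (3, 210, 160) image.
import torch
import torch.nn as nn

class DeconvDecoder(nn.Module):
    def __init__(self, k_tokens: int = 16, d: int = 512):
        super().__init__()
        self.fc = nn.Linear(k_tokens * d, 128 * 13 * 10)   # one-dimensional bottleneck
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z_q: torch.Tensor) -> torch.Tensor:
        h = self.fc(z_q.flatten(1)).view(-1, 128, 13, 10)
        x = self.deconv(h)                                 # (B, 3, 104, 80)
        # Resize to the original Pong resolution (210, 160).
        return nn.functional.interpolate(x, size=(210, 160),
                                         mode="bilinear", align_corners=False)

decoder = DeconvDecoder()
recon = decoder(torch.randn(1, 16, 512))
print(recon.shape)  # torch.Size([1, 3, 210, 160])
```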
S25: training the encoder-decoder based on the reconstruction loss between the high-dimensional image data and the reconstructed image, and obtaining the low-dimensional features corresponding to all the high-dimensional image data after the training is completed;
Specifically, the convolutional encoder-decoder that learns the low-dimensional feature vectors and tokens is trained with a reconstruction loss, a commitment loss (used to constrain the consistency of the encoding space and the word embedding space), and a perceptual loss (used for image recovery).
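The three-part objective can be sketched as below; the weighting factor beta and the identity feature extractor used in the toy call are stand-ins, not values or networks specified above:

```python
# Sketch of the encoder-decoder training loss: reconstruction + commitment +
# perceptual terms. `feature_net` stands in for whatever perceptual network is used.
import torch
import torch.nn.functional as F

def vq_losses(x, x_recon, z_e, z_q, feature_net, beta: float = 0.25):
    """x: original images; x_recon: decoder output; z_e: encoder features;
    z_q: the selected vocabulary embeddings (same shape as z_e)."""
    recon_loss = F.mse_loss(x_recon, x)                                  # reconstruction
    commit_loss = F.mse_loss(z_e, z_q.detach())                          # commitment
    perceptual_loss = F.mse_loss(feature_net(x_recon), feature_net(x))   # perceptual
    return recon_loss + beta * commit_loss + perceptual_loss

# Toy call with random tensors and an identity feature extractor.
x = torch.rand(2, 3, 210, 160); x_rec = torch.rand_like(x)
z_e = torch.randn(2, 16, 512); z_q = torch.randn_like(z_e)
print(vq_losses(x, x_rec, z_e, z_q, feature_net=lambda t: t).item())
```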
S3: as shown in FIG. 3, a reinforcement learning world model is constructed with a Transformer architecture in the low-dimensional feature space.
The world model combines the tokens output by the encoder with action tokens to generate the next-state information, and uses a Transformer as its network architecture. The image tokens obtained in S2 are combined with the action token of each step (an action is represented in a discrete action space, e.g., Pong's discrete action space is 0, 1, 2, 3, 4, 5, and each action is one token) into a sequence that is input to the Transformer architecture. That is, the Transformer takes L time steps of low-dimensional features (the low-dimensional features of each time step are K tokens) together with the corresponding action tokens, i.e., L*(K+1) tokens, as input. An action embedding table of size A x D and an image-token embedding table of size N x D are used to embed the input into a tensor, where A denotes the size of the action space, D the dimension of the token representation vectors, and N the size of the state token vocabulary. The tensor passes through M blocks of the Transformer, each composed of self-attention and a nonlinear network, to obtain the image tokens of the next state, and the next-state tokens are then passed through linear networks to obtain the immediate reward and the termination state.
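A minimal sketch of this token layout and of the prediction heads follows; the model width, head count, number of blocks, vocabulary size and the simplified read-out (as well as the omission of a causal attention mask) are illustrative assumptions rather than the patent's architecture:

```python
# Sketch of a Transformer world model: K image tokens plus 1 action token per
# time step are embedded, concatenated into an L*(K+1)-long sequence, and passed
# through M self-attention blocks; linear heads read out next tokens, reward, done.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, n_vocab=512, n_actions=6, d_model=256, n_blocks=4):
        super().__init__()
        self.tok_emb = nn.Embedding(n_vocab, d_model)     # image-token embedding table (N x D)
        self.act_emb = nn.Embedding(n_actions, d_model)   # action embedding table (A x D)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_blocks)  # the M blocks
        self.next_token_head = nn.Linear(d_model, n_vocab)  # next-state image tokens
        self.reward_head = nn.Linear(d_model, 1)            # immediate reward
        self.done_head = nn.Linear(d_model, 2)              # termination state (yes / no)

    def forward(self, image_tokens, actions):
        # image_tokens: (B, L, K) integer indices; actions: (B, L) integer indices
        b, l, k = image_tokens.shape
        seq = torch.cat([self.tok_emb(image_tokens),                  # (B, L, K, D)
                         self.act_emb(actions).unsqueeze(2)], dim=2)  # (B, L, 1, D)
        h = self.blocks(seq.view(b, l * (k + 1), -1))                 # (B, L*(K+1), D)
        last = h[:, -1]                                               # summary hidden state
        return self.next_token_head(h), self.reward_head(last), self.done_head(last)

wm = WorldModel()
logits, reward, done = wm(torch.randint(0, 512, (1, 2, 16)), torch.randint(0, 6, (1, 2)))
print(logits.shape, reward.shape, done.shape)
```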
The constructed world model needs to be trained; when the world model is trained, the state transition function and the termination state function are trained with a cross-entropy loss, and the reward function is trained with a mean square error loss.
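The stated loss combination can be sketched as follows, with tensor shapes following the WorldModel sketch above (assumed shapes, not those of the actual implementation):

```python
# Sketch of the world-model losses: cross-entropy for the next-state tokens and
# the termination flag, mean square error for the reward.
import torch
import torch.nn.functional as F

def world_model_loss(token_logits, target_tokens, done_logits, target_done,
                     reward_pred, target_reward):
    transition_loss = F.cross_entropy(token_logits.flatten(0, 1), target_tokens.flatten())
    termination_loss = F.cross_entropy(done_logits, target_done)
    reward_loss = F.mse_loss(reward_pred.squeeze(-1), target_reward)
    return transition_loss + termination_loss + reward_loss

# Toy call with random tensors of plausible shapes.
loss = world_model_loss(torch.randn(2, 8, 512), torch.randint(0, 512, (2, 8)),
                        torch.randn(2, 2), torch.randint(0, 2, (2,)),
                        torch.randn(2, 1), torch.rand(2))
print(loss.item())
```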
S4: as shown in FIG. 4, several steps are imagined forward with the constructed world model, and a forward search is performed according to the return of the imagined trajectories, so as to obtain the optimal strategy.
Note that, as shown in FIG. 4, at time step t=0 the quantities shown respectively represent the original image, the one-dimensional feature vector, the obtained token features, the reconstructed one-dimensional vector, and the reconstructed image; a represents the policy (action) information; G represents the world model described in S3. The content at time step t=1 represents the first n sets of next-state tokens with the highest immediate rewards obtained by applying different actions at t=0 through the world model G, where n is set manually and represents the maximum number of states retained at each step.
It should be noted that, in order to reduce the amount of training, the present invention abandons the conventional model-free reinforcement learning approach to solving for a policy. To avoid the inefficiency of randomly generated policies, the present invention fully exploits the characteristics of model-based (world model) reinforcement learning. The method removes the step of training a separate policy model and is more efficient than randomly generating policies.
Specifically, the decision process described in step S4 incorporates important features of model-based reinforcement learning: it both uses the immediate rewards of the reward function and takes the effect of future rewards into account. When making a decision, the agent imagines H steps forward with the learned world model to obtain the returns of the imagined trajectories, and obtains the optimal decision according to the optimal return, comprising the following sub-steps:
S41: based on the current state information, the agent selects the best n actions and predicts the next-state information with the world model, including n pieces of state information, n pieces of reward information, n pieces of termination state information, and the like;
When the learned world model makes the first-step prediction, the decisions corresponding to the n best first-step immediate rewards are selected and the relevant information is stored (n is an optional hyperparameter; if the size k of the action space is smaller than n, k pieces of corresponding information are stored). It should be noted that each forward-imagined trajectory constitutes a decision sequence. The trajectory with the maximum return can be obtained through the return function, from which the first-step decision of the optimal trajectory is obtained.
S42: for each piece of predicted state information, the agent selects the best (n-1) actions and predicts the next-step state information with the world model, including n-1 pieces of state information, n-1 pieces of reward information, n-1 pieces of termination state information, and the like;
When the learned world model makes the second-step prediction, the decisions corresponding to the best (n-1) second-step immediate rewards are selected for each piece of state information stored in the first step, and the relevant information is stored (n is an optional hyperparameter; k pieces of related information are selected if the size k of the action space is smaller than (n-1), and 1 piece of corresponding information is stored if (n-1) is smaller than or equal to 1).
S43: when predicting the information of the i-th step state, selecting the optimal n-i actions (if (n-i) <= 1, selecting one action) for each piece of state information predicted in the previous step, until a preset step length H is reached;
As shown in FIG. 3, when the learned world model makes the i-th step prediction, the decisions corresponding to the best (n+1-i) immediate rewards of the i-th step are selected for each piece of state information stored in step (i-1), and the relevant information is stored (n is an optional hyperparameter; k pieces of related information are selected if the size k of the action space is smaller than (n+1-i), and 1 piece of corresponding information is stored if (n+1-i) is smaller than or equal to 1). This continues until the maximum step length H is reached.
S44: obtaining the returns corresponding to all the trajectories of step length H, where the first-step decision corresponding to the optimal return is the optimal strategy;
Specifically, the return has the following expression: R = r_1 + γ·r_2 + … + γ^(H-1)·r_H, where γ is a discount factor used to balance the rewards over the H steps.
An imagined trajectory means that the agent performs an H-step forward rollout according to the learned world model, predicting the state information, reward information, and the like of the future H steps. The decision the agent makes based on the imagined trajectories does not require sample data and takes future effects into account.
It should be noted that, since different decision environments have different action spaces, the hyperparameter n is also adjustable, so the specific number of trajectories also needs to be determined according to the specific environment.
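The search procedure of S41-S44 can be sketched as below against a hypothetical step_fn interface that stands in for one imagined step of the learned world model; the shrinking beam width n, n-1, ..., the ranking by immediate reward and the discounted return follow the description above, while all names and the toy dynamics are assumptions:

```python
# Sketch of the forward search: keep the n best actions at step 1, n-1 at step 2,
# and so on up to horizon H; return the first-step action of the best-return trajectory.
from typing import Callable, List, Tuple

def forward_search(state, step_fn: Callable, n_actions: int,
                   n: int = 3, horizon: int = 2, gamma: float = 0.9) -> int:
    # Each beam entry: (current state, first-step action, accumulated return, finished flag)
    beams: List[Tuple] = [(state, None, 0.0, False)]
    for i in range(1, horizon + 1):
        width = max(n + 1 - i, 1)             # n actions at step 1, n-1 at step 2, ...
        candidates = []
        for s, first_a, ret, done in beams:
            if done:                           # carry a terminated trajectory forward as-is
                candidates.append((s, first_a, ret, done, 0.0))
                continue
            scored = []
            for a in range(n_actions):         # imagine every action one step ahead
                s2, r, d = step_fn(s, a)
                scored.append((s2, a if first_a is None else first_a,
                               ret + (gamma ** (i - 1)) * r, d, r))
            scored.sort(key=lambda t: t[4], reverse=True)  # rank by immediate reward
            candidates.extend(scored[:width])               # keep only the best `width`
        beams = [c[:4] for c in candidates]
    best = max(beams, key=lambda t: t[2])      # trajectory with the highest return
    return best[1]                             # its first-step decision

def toy_step(s, a):
    """Stand-in for one imagined world-model step (made-up dynamics)."""
    return s + a, (1.0 if a == 3 else 0.1 * a), False

print(forward_search(0, toy_step, n_actions=6))  # prints 3 under this toy dynamics
```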
For example, in the Pong game of the Atari suite, the size of the action space is 6, and the agent imagines forward according to the learned world model before making a decision. Assuming n=3 and H=2, the trajectories are P1={(a=3,r=1),(a=2,r=1)}, P2={(a=3,r=1),(a=4,r=0.3)}, P3={(a=5,r=0.8),(a=2,r=0.5)}, P4={(a=5,r=0.8),(a=1,r=0.6)}, P5={(a=2,r=0.4),(a=1,r=0.7)}, and P6={(a=2,r=0.4),(a=3,r=0.2)}. The return value of each trajectory is obtained through the return function, and the trajectory with the maximum return value is P1, so the optimal decision is a=3.
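The same selection can be reproduced numerically; the discount factor γ = 0.9 is an assumed value, since the text does not state which discount it uses:

```python
# Reproducing the worked example: compute R = r1 + gamma*r2 for each trajectory
# and pick the one with the largest return.
gamma = 0.9
trajectories = {
    "P1": [(3, 1.0), (2, 1.0)],
    "P2": [(3, 1.0), (4, 0.3)],
    "P3": [(5, 0.8), (2, 0.5)],
    "P4": [(5, 0.8), (1, 0.6)],
    "P5": [(2, 0.4), (1, 0.7)],
    "P6": [(2, 0.4), (3, 0.2)],
}
returns = {name: sum((gamma ** i) * r for i, (_, r) in enumerate(steps))
           for name, steps in trajectories.items()}
best = max(returns, key=returns.get)
print(best, returns[best])          # P1 has the largest return (1.9)
print(trajectories[best][0][0])     # its first-step action: 3
```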
Compared with random decision-making, the method reduces the randomness of decisions, can improve decision efficiency, and makes decisions according to the agent's existing decision-making capability.
Corresponding to the embodiment of the decision method based on model reinforcement learning, the application also provides an embodiment of the decision device based on model reinforcement learning.
FIG. 5 is a block diagram illustrating a decision device based on model reinforcement learning, according to an example embodiment. Referring to fig. 5, the apparatus may include:
An acquisition module 21 for acquiring a high-dimensional image dataset;
a learning module 22 for learning the corresponding low-dimensional features from the high-dimensional image dataset using a self-supervised learning method;
A building module 23, configured to construct a reinforcement learning world model with a Transformer architecture in the low-dimensional feature space;
A forward search module 24, configured to imagine several steps forward with the constructed world model and perform a forward search according to the return of the imagined trajectories, so as to obtain an optimal policy.
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in the embodiments of the method, and will not be described again herein.
For the device embodiments, since they essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of the present application. Those of ordinary skill in the art can understand and implement the present application without undue burden.
Correspondingly, the application also provides a computer program product, comprising a computer program, wherein the computer program is executed by a processor to realize the decision method based on model reinforcement learning.
Correspondingly, the application also provides an electronic device, which comprises: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the model reinforcement learning-based decision method described above. FIG. 6 is a hardware structure diagram of a device with data processing capability in which a decision apparatus based on model reinforcement learning according to an embodiment of the present application is located. In addition to the processor, memory and network interface shown in FIG. 6, the device generally includes other hardware according to its actual function, which is not described herein again.
Accordingly, the present application also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the model reinforcement learning-based decision method described above. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in the previous embodiments. The computer readable storage medium may also be an external storage device, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used for storing the computer program and the other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof.

Claims (10)

1. A model reinforcement learning-based decision method, comprising:
acquiring a high-dimensional image dataset;
learning corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method;
constructing a reinforcement learning world model with a Transformer architecture in a low-dimensional feature space;
and imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy.
2. The method of claim 1, wherein learning the corresponding low-dimensional features from the high-dimensional image dataset using a self-supervised learning method comprises:
preprocessing all high-dimensional image data in the high-dimensional image data set;
inputting the preprocessed high-dimensional image data into an encoder and a decoder to obtain corresponding low-dimensional feature information;
comparing the low-dimensional feature information with a vocabulary of size N, and calculating the nearest embedding vector index through the Euclidean distance;
reconstructing the original image information using the decoder based on the nearest embedding vector index;
and training the encoder-decoder based on the reconstruction loss between the high-dimensional image data and the reconstructed image, and obtaining the low-dimensional features corresponding to all the high-dimensional image data after the training is completed.
3. The method of claim 2, wherein, during the training of the encoder-decoder, the loss function further comprises a loss for constraining the consistency of the encoding space and the word embedding space, and a perceptual loss, between the high-dimensional image data and the reconstructed image, for image recovery.
4. The method of claim 1, wherein the world model utilizes a combination of low-dimensional features and actions of several time steps to generate the next state information.
5. The method of claim 1, wherein the constructed world model needs to be trained, and wherein, when the world model is trained, the state transition function and the termination state function are trained with a cross-entropy loss, and the reward function is trained with a mean square error loss.
6. The method of claim 1, wherein imagining several steps forward with the constructed world model and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy comprises:
selecting the optimal n actions according to the current state information, and predicting the next-state information with the world model, including n pieces of state information, n pieces of reward information and n pieces of termination state information, where n is a hyperparameter;
selecting the optimal (n-1) actions for each piece of predicted state information, and predicting the next-state information with the world model, including n-1 pieces of state information, n-1 pieces of reward information and n-1 pieces of termination state information;
when predicting the information of the i-th step state, selecting the optimal n-i actions for each piece of state information predicted in the previous step, where one action is selected if (n-i) <= 1, until a preset maximum step length is reached;
and obtaining the returns corresponding to all the trajectories of the maximum step length, where the first-step decision corresponding to the optimal return is the optimal strategy.
7. A model reinforcement learning-based decision-making apparatus, comprising:
The acquisition module is used for acquiring the high-dimensional image data set;
the learning module is used for learning the corresponding low-dimensional features from the high-dimensional image dataset by using a self-supervised learning method;
the building module is used for constructing a reinforcement learning world model with a Transformer architecture in a low-dimensional feature space;
and the forward search module is used for imagining several steps forward with the constructed world model, and performing a forward search according to the return of the imagined trajectories to obtain an optimal strategy.
8. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1-6.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any of claims 1-6.
CN202410355666.2A 2024-03-27 2024-03-27 Decision method based on model reinforcement learning Pending CN117953351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410355666.2A CN117953351A (en) 2024-03-27 2024-03-27 Decision method based on model reinforcement learning

Publications (1)

Publication Number Publication Date
CN117953351A 2024-04-30

Family

ID=90793076

Country Status (1)

Country Link
CN (1) CN117953351A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination