CN114265674A - Task planning method based on reinforcement learning under time sequence logic constraint and related device - Google Patents


Publication number
CN114265674A
Authority
CN
China
Prior art keywords: state, task, action, planned, reward
Prior art date
Legal status
Pending
Application number
CN202111155540.3A
Other languages
Chinese (zh)
Inventor
田戴荧
丁玉隆
蒋卓
崔金强
商成思
尉越
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202111155540.3A
Publication of CN114265674A


Abstract

The application discloses a task planning method based on reinforcement learning under the time sequence logic constraint and a related device, wherein the method comprises: converting a task to be planned into a deterministic finite automaton; determining a state action track based on the deterministic finite automaton and an initial strategy; inputting the state action track and the external reward of each state action pair into a feedforward neural network, and outputting the internal reward of each state action pair through the feedforward neural network; and determining a first objective function and a first return value of the initial strategy based on the external rewards and the internal rewards, and updating the strategy parameters of the initial strategy based on the first objective function and the first return value to obtain a target strategy corresponding to the task to be planned. In the application, the time sequence characteristics of the task are captured through the attention mechanism, so that the execution end can quickly learn tasks with time sequence relations in a sparse reward environment; the sparse reward problem under the LTL constraint can thus be solved in different environments, and the optimal strategy can be learned through reinforcement learning.

Description

Task planning method based on reinforcement learning under time sequence logic constraint and related device
Technical Field
The present invention relates to the field of reinforcement learning technologies, and in particular, to a task planning method and related apparatus based on reinforcement learning under sequential logic constraints.
Background
Today, Linear Temporal Logic (LTL) is receiving wide attention and exhibits excellent performance in a wide range of applications. The main advantage of LTL is its expressive power, which allows various high-level behaviors to be defined beyond the scope of traditional motion planning. Many strategy generation methods under the LTL constraint have been proposed, as in the literature (Baier C, Katoen J.-P.).
Much research attention has focused on the application of reinforcement learning to time sequence logic planning, because reinforcement learning can handle more complex tasks and more uncertain environments. However, when linear temporal logic (LTL) is combined with reinforcement learning, the history dependence and temporal ordering of LTL need to be considered in order to construct an appropriate reward; otherwise, because of the delayed and sparse rewards of the LTL task, the reinforcement learning method cannot converge quickly, or even fails to learn an optimal strategy.
In order to solve the above problems, existing reinforcement learning algorithms generally convert the LTL task into a Deterministic Rabin Automaton (DRA) and build a product automaton with the environment to manage the system states and their transitions. To give the agent an instant reward, the acceptable maximal end components, which represent the accepting states, are computed: any transition that can reach an acceptable maximal end component receives an instant reward of 1, and 0 otherwise. However, reinforcement learning algorithms of this kind still have difficulty generating an optimal strategy.
Thus, the prior art has yet to be improved and enhanced.
Disclosure of Invention
The technical problem to be solved by the present application is to provide a task planning method and related apparatus based on reinforcement learning under the constraint of sequential logic, aiming at the deficiencies of the prior art.
In order to solve the above technical problem, a first aspect of the embodiments of the present application provides a task planning method based on reinforcement learning under a sequential logic constraint, where the method includes:
converting the task to be planned into a deterministic finite automaton;
determining a state action track corresponding to the task to be planned based on the deterministic finite automaton and an initial strategy corresponding to the task to be planned, wherein each state action pair in the state action track corresponds to an external reward;
inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network, and outputting the internal reward corresponding to each state action pair through the feedforward neural network, wherein the feedforward neural network is configured with a self-attention mechanism;
determining a first objective function and a first return value corresponding to the initial strategy based on each external reward and each internal reward, and updating strategy parameters of the initial strategy based on the first objective function and the first return value;
and continuing to execute the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned until a target strategy corresponding to the task to be planned is obtained.
The task planning method based on reinforcement learning under the time sequence logic constraint, wherein the determining of the state action trajectory corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned specifically includes:
acquiring the current state of an environment where an execution end in a task to be planned is located;
performing action sampling based on the current state and an initial strategy corresponding to the task to be planned to obtain an action;
controlling an execution end to execute the action to obtain a next state, and checking a conversion state of the next state in the deterministic finite automata;
if the conversion state meets a first preset condition, configuring a corresponding external reward for the state action pair and ending the track to obtain the state action track, wherein the state action pair comprises the state and the action;
if the conversion state meets a second preset condition, configuring a preset external reward for the state action pair;
and taking the next state as the current state, and continuing to execute the step of performing action sampling based on the current state and the initial strategy corresponding to the task to be planned to obtain an action, until the conversion state violates the sequential logic or belongs to the acceptable state set, or the track length of the state action track reaches a preset length threshold.
The task planning method based on reinforcement learning under the time sequence logic constraint is characterized in that the first preset condition is that the conversion state violates time sequence logic or belongs to an acceptable state set; the second predetermined condition is that the transition state does not violate sequential logic and does not belong to a set of acceptable states, or that the transition state of the next state in a deterministic finite automaton is not checked.
The task planning method based on reinforcement learning under the time sequence logic constraint, wherein if the transition state does not violate the time sequence logic and does not belong to an acceptable state set, after configuring a preset external reward for a state action pair composed of the current state and the action, the method further comprises:
resetting the state of the deterministic finite automaton to an initial state of the deterministic finite automaton.
The task planning method based on reinforcement learning under the time sequence logic constraint, wherein the feedforward neural network comprises a self-attention module and a full-connection module; the step of inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network and outputting the internal reward corresponding to each state action pair through the feedforward neural network specifically comprises the following steps:
inputting the state action track and the external reward corresponding to each state action pair into the self-attention module, and outputting the time sequence characteristic vector corresponding to each state action pair through the self-attention module;
and inputting the time sequence characteristic vector corresponding to each state action pair into the full-connection module, and outputting the internal reward corresponding to each state action pair through the full-connection module.
The task planning method based on reinforcement learning under the time sequence logic constraint, wherein the time sequence characteristic vector corresponding to the state action pair is:
y = ωv + x
ω = softmax( q kᵀ / √(dim v) )
wherein v represents the value vector, x represents the time sequence characteristic vector, q represents the query vector, k represents the key vector, softmax represents the softmax function, and dim v represents the spatial dimension of the value vector v.
The task planning method based on reinforcement learning under the time sequence logic constraint, wherein the calculation formulas of the first objective function and the first return value are respectively:
J^{ex+in} = E_π[ Σ_{t=0}^{T} γ^t ( r^{ex}_t + λ r^{in}_t(η) ) ]
G^{ex+in}(s_t, a_t) = Σ_{i=t}^{T} γ^{i-t} ( r^{ex}_i + λ r^{in}_i(η) )
wherein J^{ex+in} represents the first objective function, γ represents the discount factor, λ represents a hyper-parameter, r^{in}_i(η) represents the internal reward at time i, r^{ex}_t represents the external reward at time t, η represents the network parameters of the feedforward neural network, T represents the number of state action pairs, s_t represents the state at time t, a_t represents the action at time t, s_i represents the state at time i, and a_i represents the action at time i.
The task planning method based on reinforcement learning under the sequential logic constraint is characterized in that the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned is continuously executed until a target strategy corresponding to the task to be planned is obtained, and the method further comprises the following steps of:
and determining a second objective function and a second return value corresponding to the initial strategy based on each external reward, and updating network parameters of the feedforward neural network by the second objective function and the second return value.
A second aspect of the embodiments of the present application provides a task planning apparatus based on reinforcement learning under a sequential logic constraint, where the task planning apparatus includes:
the conversion module is used for converting the tasks to be planned into the deterministic finite automata;
the determining module is used for determining a state action track corresponding to the task to be planned based on the deterministic finite automaton and an initial strategy corresponding to the task to be planned, wherein each state action pair in the state action track corresponds to an external reward;
the feedforward network module is used for inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network, and outputting the internal reward corresponding to each state action pair through the feedforward neural network, wherein the feedforward neural network is configured with a self-attention mechanism;
the updating module is used for determining a first objective function and a first return value corresponding to the initial strategy based on each external reward and each internal reward, and updating strategy parameters of the initial strategy based on the first objective function and the first return value;
and the execution module is used for continuously executing the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned until a target strategy corresponding to the task to be planned is obtained.
A third aspect of embodiments of the present application provides a computer readable storage medium storing one or more programs which are executable by one or more processors to implement the steps in the reinforcement learning based mission planning method under sequential logic constraints as described in any of the above.
A fourth aspect of the embodiments of the present application provides a terminal device, including: a processor, a memory and a communication bus; the memory has stored thereon computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the reinforcement learning based mission planning method under the sequential logic constraints as described in any of the above.
Advantageous effects: compared with the prior art, the application provides a task planning method and a related device based on reinforcement learning under the time sequence logic constraint, wherein the method comprises: converting a task to be planned into a deterministic finite automaton; determining a state action track corresponding to the task to be planned based on the deterministic finite automaton and an initial strategy corresponding to the task to be planned; inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network, and outputting the internal reward corresponding to each state action pair through the feedforward neural network; determining a first objective function and a first return value corresponding to the initial strategy based on each external reward and each internal reward, and updating strategy parameters of the initial strategy based on the first objective function and the first return value; and continuing to execute the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned until a target strategy corresponding to the task to be planned is obtained. In the application, the time sequence characteristics of the task are captured through the attention mechanism, so that the execution end can quickly learn tasks with time sequence relations in a sparse reward environment, the sparse reward problem under the LTL constraint can be solved in different environments, and the optimal strategy can be learned through reinforcement learning.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without any inventive work.
Fig. 1 is a flowchart of a task planning method based on reinforcement learning under the temporal logic constraint provided in the present application.
Fig. 2 is a schematic flow chart diagram of a reinforcement learning-based task planning method under the temporal logic constraint provided by the present application.
Fig. 3 is a schematic structural diagram of a task planning apparatus based on reinforcement learning under the constraint of the temporal logic provided in the present application.
Fig. 4 is a schematic structural diagram of a terminal device provided in the present application.
Detailed Description
In order to make the purpose, technical scheme and effect of the present application clearer and clearer, the present application is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that, the sequence numbers and sizes of the steps in this embodiment do not mean the execution sequence, and the execution sequence of each process is determined by its function and inherent logic, and should not constitute any limitation on the implementation process of this embodiment.
The inventor has found through research that Linear Temporal Logic (LTL) is receiving wide attention and exhibits excellent performance in a wide range of applications. The main advantage of LTL is its expressive power, which allows various high-level behaviors to be defined beyond the scope of traditional motion planning. Many strategy generation methods under the LTL constraint have been proposed, as in the literature (Baier C, Katoen J.-P.).
Much research attention has focused on the application of reinforcement learning to time sequence logic planning, because reinforcement learning can handle more complex tasks and more uncertain environments. However, when linear temporal logic (LTL) is combined with reinforcement learning, the history dependence and temporal ordering of LTL need to be considered in order to construct an appropriate reward; otherwise, because of the delayed and sparse rewards of the LTL task, the reinforcement learning method cannot converge quickly, or even fails to learn an optimal strategy.
In order to solve the above problems, existing reinforcement learning algorithms generally convert the LTL task into a Deterministic Rabin Automaton (DRA) and build a product automaton with the environment to manage the system states and their transitions. To give the agent an instant reward, the acceptable maximal end components, which represent the accepting states, are computed: any transition that can reach an acceptable maximal end component receives an instant reward of 1, and 0 otherwise. However, reinforcement learning algorithms of this kind still have difficulty generating an optimal strategy.
In order to make the algorithm converge to an optimal strategy, researchers have considered long-term performance optimization based on a Markov decision process. To reveal the effect of uncertainty on this convergence process, it has also been proposed to first estimate the Markov environment. Since calculating the acceptable maximal end components consumes a lot of computing resources, a popular way to avoid this calculation, and thus reduce the amount of computation, is to collect rewards according to the Rabin acceptance condition of the product automaton. Another equally effective method is to convert the LTL into a deterministic finite automaton, which yields a smaller product automaton relative to the original system. It should be noted that all of the above works require a product automaton, and therefore need full knowledge of the environment, which is not always feasible in practical applications. Therefore, the most common approach is the model-free approach, which approximates the state action values through a neural network and derives the real-time rewards from a deterministic Rabin automaton.
In a real-world complex scenario, although rewards can be specified according to the LTL task, the sparsity of the rewards makes standard reinforcement learning algorithms hard to train and slow to converge. Furthermore, in a sparse reward environment the robot cannot ascertain the temporal relationship between actions and states, since the effect of an action on the final outcome may be transmitted with a delay. This problem can be modeled as a temporal credit assignment problem (Hung C, Lillicrap T, Abramson J, et al. Optimizing agent behavior over long time scales by transporting value [J]. Nature Communications, 2019, 10(1): 1-12.). To address it, reward shaping has been proposed in the literature (Ng A Y, Harada D, Russell S. Policy invariance under reward transformations: Theory and application to reward shaping [C]. ICML, 1999, 99: 278-.). In addition, intrinsic rewards have been proposed to alleviate the sparse reward problem in the temporal credit assignment setting (Cognitive Science Society, 2009: 2601-). However, the intrinsic rewards designed in these works do not fully describe the temporal attributes of the whole task, so the algorithms cannot quickly find the optimal strategy in LTL-type sparse reward environments with time sequence dependence.
Based on the above, in the embodiment of the application, the task to be planned is converted into a deterministic finite automaton; a state action track corresponding to the task to be planned is determined based on the deterministic finite automaton and an initial strategy corresponding to the task to be planned; the state action track and the external reward corresponding to each state action pair are input into a preset feedforward neural network, and the internal reward corresponding to each state action pair is output through the feedforward neural network; a first objective function and a first return value corresponding to the initial strategy are determined based on each external reward and each internal reward, and the strategy parameters of the initial strategy are updated based on the first objective function and the first return value; and the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned is continued until a target strategy corresponding to the task to be planned is obtained. In the application, the time sequence characteristics of the task are captured through the attention mechanism, so that the execution end can quickly learn tasks with time sequence relations in a sparse reward environment, the sparse reward problem under the LTL constraint can be solved in different environments, and the optimal strategy can be learned through reinforcement learning.
The following further describes the content of the application by describing the embodiments with reference to the attached drawings.
The embodiment provides a task planning method based on reinforcement learning under the constraint of sequential logic, as shown in fig. 1, the method includes:
and S10, converting the task to be planned into the deterministic finite automata.
Specifically, the task to be planned is a given task, and it can be declared, through the syntax of linear temporal logic (LTL), in a form that can be understood by a terminal device, where the terminal device may be a mobile phone, a computer, a cloud server, and the like. For example, the task to be planned may be that a robot must first go to the place where a key is stored and pick up the key before moving to the area where a box is stored. In addition, when converting the task to be planned into a deterministic finite automaton, a standard conversion method from linear temporal logic (LTL) to a deterministic finite automaton may be adopted, for example the method described in the literature (Baier C, Katoen J.-P.).
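For illustration only, the following Python sketch shows one possible in-memory representation of such a deterministic finite automaton. The states, labels and transitions are hypothetical examples for a "first reach the key, then reach the box" task and are not taken from a concrete embodiment of the application.

from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set, Tuple


@dataclass
class DFA:
    """Deterministic finite automaton obtained from an LTL task (illustrative)."""
    initial: str
    accepting: Set[str]                                   # acceptable state set
    trap: Set[str]                                        # states that violate the task
    transitions: Dict[Tuple[str, FrozenSet[str]], str]    # (state, label set) -> next state
    state: str = field(default="", init=False)

    def __post_init__(self) -> None:
        self.state = self.initial

    def step(self, labels: FrozenSet[str]) -> Optional[str]:
        """Advance on the labels produced by one environment transition (s, a, s');
        returns the new automaton state, or None if no transition is triggered."""
        nxt = self.transitions.get((self.state, labels))
        if nxt is not None:
            self.state = nxt
        return nxt

    def reset(self) -> None:
        self.state = self.initial


# Hypothetical task: "first pick up the key, then reach the box".
dfa = DFA(
    initial="q0",
    accepting={"q2"},
    trap={"q_bad"},
    transitions={
        ("q0", frozenset()): "q0",
        ("q0", frozenset({"key"})): "q1",
        ("q0", frozenset({"box"})): "q_bad",   # reaching the box before the key violates the task
        ("q1", frozenset()): "q1",
        ("q1", frozenset({"box"})): "q2",      # task completed
    },
)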
S20, determining a state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned.
Specifically, the state action track includes a plurality of state action pairs, each of the plurality of state action pairs includes a state and an action corresponding to the state, each of the plurality of state action pairs corresponds to an external reward, and the external reward is an instant reward obtained by executing the action at the execution end. The initial policy is preset, for example, the initial policy may be a uniformly distributed random policy, that is, each of several actions in the initial policy has the same probability, so that one of the actions is sampled as an action.
In an implementation manner of this embodiment, the determining, based on the deterministic finite automaton and the initial policy corresponding to the task to be planned, the state action trajectory corresponding to the task to be planned specifically includes:
acquiring the current state of an environment where an execution end in a task to be planned is located;
performing action sampling based on the current state and an initial strategy corresponding to the task to be planned to obtain an action;
controlling an execution end to execute the action to obtain a next state, and checking a conversion state of the next state in the deterministic finite automata;
if the conversion state meets a first preset condition, configuring a corresponding external reward for the state action pair and ending the track to obtain the state action track, wherein the state action pair comprises the state and the action;
if the conversion state meets a second preset condition, configuring a preset external reward for the state action pair;
and taking the next state as the current state, and continuing to execute the step of performing action sampling based on the current state and the initial strategy corresponding to the task to be planned to obtain an action, until the conversion state violates the time sequence logic or belongs to the acceptable state set, or the track length of the state action track reaches a preset length threshold.
Specifically, the current state s is a state of the environment where the execution end is located; for example, if the execution end is a robot, the current state is the state of the environment where the robot is located. After the current state s is obtained, an action a is sampled based on the current state s and the initial strategy π, so as to obtain the action a corresponding to the current state s, where the action a is the action to be executed by the execution end under the initial strategy π in the current state s. After acquiring the action, the execution end executes the action a to obtain a next state s', where the next state s' is the state that follows the current state in time; that is, after the execution end executes the action in the current state, the current state changes to the next state.
In the process of obtaining the next state s ' by executing the action, the transition state of the next state s ' in the deterministic finite automata is obtained by using a label function, namely the corresponding transition of the process of (s, a, s ') in the deterministic finite automata B is found, and then the state action pair (s, a) is externally awarded respectively based on the preset conditions met by the transition state, wherein the preset conditions comprise a first preset condition and a second preset condition, and the first preset condition is that the transition state violates the sequential logic or belongs to an acceptable state set; the second predetermined condition is that the transition state does not violate sequential logic and does not belong to an acceptable set of states, or that the transition state of the next state in a deterministic finite automaton is not checked. That is, when a transition state is that the transition state violates sequential logic or belongs to a set of acceptable states, configuring a corresponding external reward for a state action pair and ending the state action pair to obtain a state action trajectory; when a transition state does not violate sequential logic and does not belong to the set of acceptable states, or when the transition state of the next state in the deterministic finite automaton is not checked, an external reward is preset for the state action pair configuration.
In one implementation manner of this embodiment, violating the time sequence logic and belonging to the acceptable state set each correspond to their own external reward: when the conversion state violates the time sequence logic, the external reward corresponding to violating the time sequence logic is configured, and when the conversion state belongs to the acceptable state set, the external reward corresponding to belonging to the acceptable state set is configured. In one specific implementation, the external reward corresponding to violating the time sequence logic is -1, and the external reward corresponding to belonging to the acceptable state set is 1. That is, when the conversion state violates the time sequence logic, indicating that (s, a, s') violates the LTL task constraint, a negative external reward r^{ex} = -1 is assigned to the state action pair and the state action track D is ended; when the conversion state belongs to the acceptable state set, the state of the deterministic finite automaton after executing (s, a, s') is an accepting state, indicating that the task has been completed, and a positive external reward r^{ex} = 1 is assigned to the state action pair and the state action track is terminated. In addition, the preset external reward is pre-configured to facilitate the learning process, e.g., -0.2. Of course, in practical applications, it can be adjusted according to the actual reinforcement learning training process.
Further, when no conversion state of the next state s' is found in the deterministic finite automaton B, that is, (s, a, s') does not trigger a transition in the deterministic finite automaton B, the preset external reward is configured for the state action pair. In addition, in order to avoid an infinite loop caused by the execution end running indefinitely, the track length of the state action track can be monitored while checking the conversion state: if the conversion state has neither violated the time sequence logic nor entered the acceptable state set, and the track length of the state action track has reached the preset length threshold T, the state action track is likewise terminated, and the external reward of the last state action pair is set to the external reward used when the time sequence logic is violated; for example, if the external reward corresponding to violating the time sequence logic is -1, the external reward of the last state action pair is set to -1.
In an implementation manner of this embodiment, after configuring a preset external reward for a state action pair composed of the current state and an action if the transition state does not violate sequential logic and does not belong to an acceptable state set, the method further includes: resetting the state of the deterministic finite automaton to an initial state of the deterministic finite automaton. Therefore, when the state action track is continuously acquired after the strategy parameters of the initial strategy are updated subsequently, the state in the deterministic finite automaton can be ensured to be the initial state, so that the task can be restarted.
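As a non-limiting illustration, the trajectory-collection loop described above can be sketched as follows; the Gym-style env interface, the policy signature and the concrete reward values (+1, -1 and a preset -0.2) are assumptions made for the sketch rather than a reference implementation.

import random


def sample_trajectory(env, policy, dfa, max_len=200,
                      r_accept=1.0, r_violate=-1.0, r_preset=-0.2):
    """Collect one state action track D under the current strategy.

    Assumed interfaces: env.reset() -> state, env.step(a) -> (next_state, labels),
    policy(state) -> list of action probabilities, dfa as sketched above.
    """
    track = []                                  # list of (state, action, external reward)
    s = env.reset()
    dfa.reset()
    for _ in range(max_len):
        probs = policy(s)
        a = random.choices(range(len(probs)), weights=probs)[0]  # sample an action
        s_next, labels = env.step(a)             # labels: atomic propositions of (s, a, s')
        q = dfa.step(frozenset(labels))          # conversion state, or None if none triggered
        if q in dfa.trap:                        # violates the time sequence logic
            track.append((s, a, r_violate))
            break
        if q in dfa.accepting:                   # belongs to the acceptable state set: task done
            track.append((s, a, r_accept))
            break
        track.append((s, a, r_preset))           # no terminal event: preset external reward
        s = s_next
    else:
        # length threshold reached without finishing: treat the last pair as a violation
        s_last, a_last, _ = track[-1]
        track[-1] = (s_last, a_last, r_violate)
    return track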
S30, inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network, and outputting the internal reward corresponding to each state action pair through the feedforward neural network.
Specifically, the feedforward neural network is configured with a self-attention mechanism and is used for acquiring the internal reward corresponding to each state action pair in the state action track; that is, the input items of the feedforward neural network are the state action track and the external reward corresponding to each state action pair, and the output items are the internal rewards corresponding to the state action pairs. Relative to the external reward, the internal reward corresponding to each state action pair is a dense reward that facilitates the learning process; the internal rewards are used only for updating the strategy parameters of the initial strategy, so as to ensure that the accumulated external reward is maximized.
In one implementation of this embodiment, the feedforward neural network comprises a self-attention module and a full-connection module; the inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network and outputting the internal reward corresponding to each state action pair through the feedforward neural network specifically comprises:
inputting the state action track and the external reward corresponding to each state action pair into the self-attention module, and outputting the time sequence characteristic vector corresponding to each state action pair through the self-attention module;
and inputting the time sequence characteristic vector corresponding to each state action pair into the full-connection module, and outputting the internal reward corresponding to each state action pair through the full-connection module.
Specifically, the time sequence characteristic vector is determined by the self-attention module based on the state action track and the external reward corresponding to each state action pair; that is, the state action track together with the external reward of each state action pair encodes the linear temporal logic (LTL) task, and the time sequence characteristics of this encoded representation are captured by the self-attention module, where the self-attention module is configured with a self-attention mechanism. In one specific implementation, three separate linear transformations are performed on the time sequence characteristic vector x_i of each state action pair (s_i, a_i) in the state action track: the self-attention module treats the encoded representation of the state action pair and its corresponding external reward as a query key-value triple (q, k, v), and the dimension of each vector in (q, k, v) is n, where n is the track length of the state action track D.
In addition, the query key-value triple (q, k, v) is randomly initialized, and its parameters are updated synchronously during the training of the feedforward neural network, where (q, k, v) may be computed as:
q = W_q x
k = W_k x
v = W_v x
where q is the query vector, k is the key vector, v is the value vector, x is the time sequence characteristic vector, and W_q, W_k, W_v are weight coefficients.
Further, the output of the self-attention module may be the sum of a weighted average over the value vector v and the time sequence characteristic vector x, where the weight ω of the value vector is determined from the dot products of the query q with all keys k through the softmax function. Based on this, the time sequence characteristic vector y corresponding to the state action pair is:
y = ωv + x
ω = softmax( q kᵀ / √(dim v) )
where v represents the value vector, x represents the time sequence characteristic vector, q represents the query vector, k represents the key vector, softmax represents the softmax function, and dim v represents the spatial dimension of the value vector v.
In one implementation of this embodiment, a preset mask may be applied to the weight matrix before the softmax operation, so that each layer in the feedforward neural network focuses mainly on its own observations. The mask is set according to the different training phases (i.e., different state action pairs) so that, through the mask, the current training phase is not influenced by the other training phases; for example, the masks of the training phases other than the current one are set to 0, so that after multiplication with the weight matrix the other training phases have no influence on the current training. In addition, the mask does not mask future observations, so the feedforward neural network can learn the relationship between the current action and future returns. The fully connected module of the feedforward neural network then uses the output of the self-attention module to generate the internal reward r^{in}, where the internal reward r^{in} may be calculated as:
r^{in} = FC_η(y)
where r^{in} represents the internal reward, y represents the output of the self-attention module, and FC_η(·) represents the fully connected module.
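A minimal PyTorch sketch of such a self-attention plus fully-connected intrinsic-reward network is given below, assuming each row of x encodes one (s_i, a_i, r^{ex}_i) step of the track; the layer sizes and the optional mask handling are illustrative assumptions, not the exact architecture of the embodiment.

import torch
import torch.nn as nn
import torch.nn.functional as F


class IntrinsicRewardNet(nn.Module):
    """Single-head self-attention followed by a fully connected head (sketch)."""

    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.w_q = nn.Linear(feat_dim, feat_dim, bias=False)   # q = W_q x
        self.w_k = nn.Linear(feat_dim, feat_dim, bias=False)   # k = W_k x
        self.w_v = nn.Linear(feat_dim, feat_dim, bias=False)   # v = W_v x
        self.fc = nn.Sequential(                               # r^in = FC_eta(y)
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, x, mask=None):
        # x: [T, feat_dim], one encoded (state, action, external reward) per time step
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / v.shape[-1] ** 0.5  # q k^T / sqrt(dim v)
        if mask is not None:                                   # optional preset mask
            scores = scores.masked_fill(mask == 0, float("-inf"))
        omega = F.softmax(scores, dim=-1)                      # attention weights
        y = omega @ v + x                                      # y = omega v + x (residual)
        return self.fc(y).squeeze(-1)                          # one internal reward per step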
And S40, determining a first objective function and a first return value corresponding to the initial strategy based on the external rewards and the internal rewards, and updating strategy parameters of the initial strategy based on the first objective function and the first return value.
Specifically, the first objective function and the first return value are the objective function and the return value of a policy gradient algorithm, calculated from both the external rewards and the internal rewards; their calculation formulas are respectively:
J^{ex+in} = E_π[ Σ_{t=0}^{T} γ^t ( r^{ex}_t + λ r^{in}_t(η) ) ]
G^{ex+in}(s_t, a_t) = Σ_{i=t}^{T} γ^{i-t} ( r^{ex}_i + λ r^{in}_i(η) )
where J^{ex+in} represents the first objective function, γ represents the discount factor, λ represents a hyper-parameter used to balance the internal reward and the external reward, r^{in}_i(η) represents the internal reward at time i, r^{ex}_t represents the external reward at time t, η represents the network parameters of the feedforward neural network, T represents the number of state action pairs, s_t represents the state at time t, a_t represents the action at time t, s_i represents the state at time i, and a_i represents the action at time i.
Further, after the first objective function and the first return value are obtained, the strategy parameters of the initial strategy are updated in the regular policy gradient manner, where the strategy parameters may be updated according to the formula:
θ' = θ + α Σ_{t=0}^{T} G^{ex+in}(s_t, a_t) ∇_θ log π_θ(a_t | s_t)
where θ represents the strategy parameters before the update, θ' represents the strategy parameters after the update, α represents a weight coefficient (step size), G^{ex+in}(s_t, a_t) is the first return value, π is the initial strategy, s_t represents the state at time t, and a_t represents the action at time t.
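The return computation and the regular policy gradient step can be illustrated with the following sketch; the discount factor, the hyper-parameter λ and the score-function representation are placeholders chosen for the example, not values prescribed by the application.

import numpy as np


def combined_returns(r_ex, r_in, gamma=0.99, lam=0.01):
    """G^{ex+in}(s_t, a_t) = sum_{i>=t} gamma^(i-t) * (r_i^ex + lam * r_i^in)."""
    r = np.asarray(r_ex, dtype=float) + lam * np.asarray(r_in, dtype=float)
    returns = np.zeros_like(r)
    running = 0.0
    for t in reversed(range(len(r))):            # accumulate from the end of the track
        running = r[t] + gamma * running
        returns[t] = running
    return returns


def policy_gradient_step(theta, score_vectors, returns, alpha=1e-3):
    """theta' = theta + alpha * sum_t G^{ex+in}(s_t, a_t) * grad_theta log pi(a_t | s_t);
    score_vectors[t] is assumed to hold grad_theta log pi(a_t | s_t) as an array."""
    update = sum(returns[t] * score_vectors[t] for t in range(len(returns)))
    return theta + alpha * update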
In an implementation manner of this embodiment, the feedforward neural network and the initial strategy are trained synchronously, and accordingly, the step of determining the state action trajectory corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned is continuously performed until the step of obtaining the target strategy corresponding to the task to be planned, where the method further includes:
and determining a second objective function and a second return value corresponding to the initial strategy based on each external reward, and updating network parameters of the feedforward neural network by the second objective function and the second return value.
Specifically, the second objective function and the second return value are the objective function and the return value of a policy gradient algorithm, calculated from the external rewards only; their calculation formulas are respectively:
J^{ex} = E_π[ Σ_{t=0}^{T} γ^t r^{ex}_t ]
G^{ex}(s_t, a_t) = Σ_{i=t}^{T} γ^{i-t} r^{ex}_i
where J^{ex} represents the second objective function, γ represents the discount factor, r^{ex}_t represents the external reward at time t, T represents the number of state action pairs, s_t represents the state at time t, and a_t represents the action at time t.
In addition, when the feedforward neural network is updated, the network parameters of the feedforward neural network are updated using only the second objective function and the second return value determined from the external rewards, so that the internal reward is guaranteed to help maximize the accumulated external reward. In one specific implementation, the gradient with respect to the network parameters η is obtained by the chain rule, differentiating the external objective through the policy update, and the partial derivatives are as follows:
∂J^{ex}/∂η = (∂J^{ex}/∂θ') · (∂θ'/∂η)
∂J^{ex}/∂θ' = Σ_{t=0}^{T} G^{ex}(s_t, a_t) ∇_{θ'} log π_{θ'}(a_t | s_t)
θ' = θ + α Σ_{t=0}^{T} G^{ex+in}(s_t, a_t) ∇_θ log π_θ(a_t | s_t)
∂θ'/∂η = α λ Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) · ( ∂/∂η Σ_{i=t}^{T} γ^{i-t} r^{in}_i(η) )ᵀ
s50, continuing to execute the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned until a target strategy corresponding to the task to be planned is obtained.
Specifically, after the strategy parameters of the initial strategy and the network parameters of the feedforward neural network are updated, the updated initial strategy is used as the initial strategy and the updated feedforward neural network is used as the feedforward neural network for the next round of reinforcement learning, and so on, until the target strategy corresponding to the task to be planned is obtained.
In summary, the present embodiment provides a task planning method based on reinforcement learning under the constraint of sequential logic, where the method includes converting a task to be planned into a deterministic finite automaton; determining a state action track corresponding to the task to be planned based on the deterministic finite automaton and an initial strategy corresponding to the task to be planned; inputting the state action track and the external rewards corresponding to the state actions into a preset feedforward neural network, and outputting the internal rewards corresponding to the state actions through the feedforward neural network; determining a first objective function and a first return value corresponding to the initial strategy based on each external reward and each internal reward, and updating strategy parameters of the initial strategy based on the first objective function and the first return value; and continuing to execute the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned until a target strategy corresponding to the task to be planned is obtained. According to the method and the device, the time sequence characteristic of the task is captured through the attention mechanism, so that the execution end can rapidly learn the task with the time sequence relation in the sparse rewarding environment, the sparse rewarding problem under the LTL constraint can be solved in different environments, and the optimal strategy can be learned through reinforcement learning.
Based on the task planning method based on reinforcement learning under the time sequence logic constraint, the embodiment provides a task planning device based on reinforcement learning under the time sequence logic constraint, as shown in fig. 3, the task planning device includes:
a conversion module 100, configured to convert a task to be planned into a deterministic finite automaton;
a determining module 200, configured to determine, based on the deterministic finite automaton and an initial strategy corresponding to the task to be planned, a state action trajectory corresponding to the task to be planned, where each state action pair in the state action trajectory corresponds to an external reward;
a feedforward network module 300, configured to input the state action trajectory and the external reward corresponding to each state action into a preset feedforward neural network, and output the internal reward corresponding to each state action through the feedforward neural network, where the feedforward neural network is configured with a self-attention mechanism;
an updating module 400, configured to determine, based on each external reward and each internal reward, a first objective function and a first return value corresponding to the initial policy, and update a policy parameter of the initial policy based on the first objective function and the first return value;
and the executing module 500 is configured to continue to execute the step of determining the state action trajectory corresponding to the task to be planned based on the deterministic finite automaton and the initial policy corresponding to the task to be planned until a target policy corresponding to the task to be planned is obtained.
Based on the reinforcement learning based mission planning method under the time sequence logic constraint, the embodiment provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs can be executed by one or more processors to implement the steps in the reinforcement learning based mission planning method under the time sequence logic constraint according to the embodiment.
Based on the above task planning method based on reinforcement learning under the sequential logic constraint, the present application further provides a terminal device, as shown in fig. 4, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory)22, and may further include a communication Interface (Communications Interface)23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through a bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store software programs, computer-executable programs, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional application and the data processing by executing the software program, instructions or modules stored in the memory 22, that is, implements the method in the above-described embodiment.
The memory 22 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high speed random access memory and may also include a non-volatile memory. For example, a variety of media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, may also be transient storage media.
In addition, the specific working process of the above-mentioned training sample set obtaining apparatus, the specific process loaded and executed by the storage medium and the multiple instruction processors in the terminal device are already described in detail in the above-mentioned method, and are not stated herein any more.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (11)

1. A task planning method based on reinforcement learning under the constraint of time sequence logic is characterized by comprising the following steps:
converting the task to be planned into a deterministic finite automaton;
determining a state action track corresponding to the task to be planned based on the deterministic finite automaton and an initial strategy corresponding to the task to be planned, wherein each state action pair in the state action track corresponds to an external reward;
inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network, and outputting the internal reward corresponding to each state action pair through the feedforward neural network, wherein the feedforward neural network is configured with a self-attention mechanism;
determining a first objective function and a first return value corresponding to the initial strategy based on each external reward and each internal reward, and updating strategy parameters of the initial strategy based on the first objective function and the first return value;
and continuing to execute the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned until a target strategy corresponding to the task to be planned is obtained.
2. The reinforcement learning-based task planning method under the sequential logic constraint according to claim 1, wherein the determining a state action trajectory corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned specifically comprises:
acquiring the current state of an environment where an execution end in a task to be planned is located;
performing action sampling based on the current state and an initial strategy corresponding to the task to be planned to obtain an action;
controlling an execution end to execute the action to obtain a next state, and checking the conversion state of the next state in the deterministic finite automata;
if the conversion state meets a first preset condition, configuring a corresponding external reward for the state action pair and ending the track to obtain the state action track, wherein the state action pair comprises the state and the action;
if the conversion state meets a second preset condition, configuring a preset external reward for the state action pair;
and taking the next state as the current state, and continuing to execute the step of performing action sampling based on the current state and the initial strategy corresponding to the task to be planned to obtain an action, until the conversion state violates the sequential logic or belongs to the acceptable state set, or the track length of the state action track reaches a preset length threshold.
3. The method for task planning based on reinforcement learning under the constraint of the sequential logic according to claim 2, wherein the first preset condition is that the transition state violates the sequential logic or belongs to an acceptable state set; the second predetermined condition is that the transition state does not violate sequential logic and does not belong to an acceptable set of states, or that the transition state of the next state in a deterministic finite automaton is not checked.
4. The method of claim 2, wherein if the transition state does not violate the sequential logic and does not belong to the set of acceptable states, then after configuring a preset external reward for the state-action pair consisting of the current state and the action, the method further comprises:
resetting the state of the deterministic finite automaton to an initial state of the deterministic finite automaton.
5. The reinforcement learning-based mission planning method under the sequential logic constraint of claim 1, wherein the feedforward neural network comprises a self-attention module and a fully-connected module; the inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network and outputting the internal reward corresponding to each state action pair through the feedforward neural network specifically comprises:
inputting the state action track and the external reward corresponding to each state action pair into the self-attention module, and outputting the time sequence characteristic vector corresponding to each state action pair through the self-attention module;
and inputting the time sequence characteristic vector corresponding to each state action pair into the fully-connected module, and outputting the internal reward corresponding to each state action pair through the fully-connected module.
6. The method for task planning based on reinforcement learning under the sequential logic constraint of claim 5, wherein the time sequence characteristic vector corresponding to the state action pair is:
y = ωv + x
ω = softmax( q kᵀ / √(dim v) )
wherein v represents the value vector, x represents the time sequence characteristic vector, q represents the query vector, k represents the key vector, softmax represents the softmax function, and dim v represents the spatial dimension of the value vector v.
7. The reinforcement learning-based mission planning method under the sequential logic constraint according to claim 1, wherein the calculation formulas of the first objective function and the return value are respectively:
J^{ex+in} = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\left(r^{ex}_{t} + \lambda\, r^{in}_{t}(s_{t}, a_{t}; \eta)\right)\right]
R_{t} = \sum_{i=t}^{T} \gamma^{i-t}\left(r^{ex}_{i} + \lambda\, r^{in}_{i}(s_{i}, a_{i}; \eta)\right)
wherein J^{ex+in} represents the first objective function, \gamma represents the discount factor, \lambda represents a hyper-parameter, r^{in}_{i}(s_{i}, a_{i}; \eta) represents the internal reward at time i, r^{ex}_{t} represents the external reward at time t, \eta represents the network parameters of the feedforward neural network, T represents the number of state action pairs, s_{t} represents the state at time t, a_{t} represents the action at time t, s_{i} represents the state at time i, and a_{i} represents the action at time i.
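Assuming the reconstructed form above (a discounted sum of external rewards plus λ-weighted internal rewards), the return values can be computed with a simple backward pass; the function below is an illustrative sketch, not the patented computation.

```python
def mixed_returns(ex_rewards, in_rewards, gamma=0.99, lam=0.01):
    """Discounted returns combining external and internal rewards.
    returns[t] corresponds to R_t; returns[0] estimates J^{ex+in} for one track."""
    T = len(ex_rewards)
    returns = [0.0] * T
    g = 0.0
    for t in reversed(range(T)):                       # backward accumulation
        g = (ex_rewards[t] + lam * in_rewards[t]) + gamma * g
        returns[t] = g
    return returns
```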
8. The reinforcement learning-based task planning method under the sequential logic constraint according to claim 1, wherein the continuing to execute the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned, until a target strategy corresponding to the task to be planned is obtained, further comprises:
and determining a second objective function and a second return value corresponding to the initial strategy based on each external reward, and updating the network parameters of the feedforward neural network based on the second objective function and the second return value.
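Putting the pieces together, one training iteration implied by claims 1 and 8 might look like the sketch below, which reuses the hypothetical sample_trajectory and mixed_returns helpers from the earlier sketches; the policy.update and reward_net interfaces, and the use of a purely external return for the second objective, are assumptions since the patent does not publish source code.

```python
def training_iteration(env, dfa, policy, reward_net, gamma=0.99, lam=0.01):
    # 1. Roll out a state action track under the DFA monitor (claim 2).
    traj = sample_trajectory(env, dfa, policy)
    states, actions, ex_rewards = map(list, zip(*traj))

    # 2. Internal rewards from the self-attention feedforward network (claim 5).
    in_rewards = reward_net.intrinsic_rewards(states, actions, ex_rewards)

    # 3. First objective / first return: external + internal rewards (claim 7).
    returns_mixed = mixed_returns(ex_rewards, in_rewards, gamma, lam)
    policy.update(states, actions, returns_mixed)      # update strategy parameters

    # 4. Second objective / second return: external rewards only (claim 8).
    returns_ex = mixed_returns(ex_rewards, [0.0] * len(ex_rewards), gamma, 0.0)
    reward_net.update(states, actions, returns_ex)     # update network parameters η
```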
9. A mission planning device based on reinforcement learning under the constraint of time sequence logic is characterized in that the mission planning device comprises:
the conversion module is used for converting the task to be planned into a deterministic finite automaton;
the determining module is used for determining a state action track corresponding to the task to be planned based on the deterministic finite automaton and an initial strategy corresponding to the task to be planned, wherein each state action pair in the state action track corresponds to an external reward;
the feedforward network module is used for inputting the state action track and the external reward corresponding to each state action pair into a preset feedforward neural network, and outputting the internal reward corresponding to each state action pair through the feedforward neural network, wherein the feedforward neural network is configured with a self-attention mechanism;
the updating module is used for determining a first objective function and a first return value corresponding to the initial strategy based on each external reward and each internal reward, and updating strategy parameters of the initial strategy based on the first objective function and the first return value;
and the execution module is used for continuously executing the step of determining the state action track corresponding to the task to be planned based on the deterministic finite automaton and the initial strategy corresponding to the task to be planned until a target strategy corresponding to the task to be planned is obtained.
10. A computer readable storage medium, storing one or more programs, which are executable by one or more processors, for performing the steps in the reinforcement learning based mission planning method under sequential logic constraints as recited in any of claims 1-8.
11. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in a reinforcement learning based mission planning method under the temporal logic constraints of any of claims 1-8.
CN202111155540.3A 2021-09-29 2021-09-29 Task planning method based on reinforcement learning under time sequence logic constraint and related device Pending CN114265674A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111155540.3A CN114265674A (en) 2021-09-29 2021-09-29 Task planning method based on reinforcement learning under time sequence logic constraint and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111155540.3A CN114265674A (en) 2021-09-29 2021-09-29 Task planning method based on reinforcement learning under time sequence logic constraint and related device

Publications (1)

Publication Number Publication Date
CN114265674A true CN114265674A (en) 2022-04-01

Family

ID=80824664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111155540.3A Pending CN114265674A (en) 2021-09-29 2021-09-29 Task planning method based on reinforcement learning under time sequence logic constraint and related device

Country Status (1)

Country Link
CN (1) CN114265674A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115793657A (en) * 2022-12-09 2023-03-14 常州大学 Distribution robot path planning method based on temporal logic control strategy

Similar Documents

Publication Publication Date Title
Jeerige et al. Comparison of deep reinforcement learning approaches for intelligent game playing
CN110163429B (en) Short-term load prediction method based on similarity day optimization screening
CN112016736B (en) Photovoltaic power generation power control method based on gating convolution and attention mechanism
CN109891438B (en) Numerical quantum experiment method and system
CN111340282A (en) DA-TCN-based method and system for estimating residual service life of equipment
CN114119273A (en) Park comprehensive energy system non-invasive load decomposition method and system
CN111612262A (en) Wind power probability prediction method based on quantile regression
Jichang et al. Water quality prediction model based on GRU hybrid network
CN114781692A (en) Short-term power load prediction method and device and electronic equipment
CN116345469A (en) Power grid power flow adjustment method based on graph neural network
CN114970351A (en) Power grid flow adjustment method based on attention mechanism and deep reinforcement learning
CN115965160A (en) Data center energy consumption prediction method and device, storage medium and electronic equipment
US20220230067A1 (en) Learning device, learning method, and learning program
CN110738363B (en) Photovoltaic power generation power prediction method
CN116151581A (en) Flexible workshop scheduling method and system and electronic equipment
CN114265674A (en) Task planning method based on reinforcement learning under time sequence logic constraint and related device
Zuo Integrated forecasting models based on LSTM and TCN for short-term electricity load forecasting
CN117313953A (en) Load prediction method and device, electronic equipment and storage medium
CN116502774B (en) Time sequence prediction method based on time sequence decomposition and Legend projection
Zhang et al. Brain-inspired experience reinforcement model for bin packing in varying environments
CN114372418A (en) Wind power space-time situation description model establishing method
Jiahui et al. Short-term load forecasting based on GA-PSO optimized extreme learning machine
Li et al. Fast scenario reduction for power systems by deep learning
CN112183814A (en) Short-term wind speed prediction method
Ji et al. An Active Learning based Latency Prediction Approach for Neural Network Architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination