CN115374933A - Intelligent planning and decision-making method for landing behavior of multi-node detector - Google Patents

Intelligent planning and decision-making method for landing behavior of multi-node detector

Info

Publication number
CN115374933A
CN115374933A (application CN202210729574.7A)
Authority
CN
China
Prior art keywords: agent, network, representing, reward, time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210729574.7A
Other languages
Chinese (zh)
Inventor
赵清杰
陈涌泉
耿明灏
王鑫
杨和星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210729574.7A priority Critical patent/CN115374933A/en
Publication of CN115374933A publication Critical patent/CN115374933A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an intelligent planning and decision-making method for the landing behavior of a multi-node detector, and belongs to the technical field of spacecraft control and artificial intelligence. The invention uses multi-agent reinforcement learning to complete the planning and decision-making of the detector landing behavior end to end. It adopts the "centralized training, distributed execution" training paradigm, using local observations as the input of the Actor networks and the global observation as the input of the Critic networks, so that information is exchanged between agents. Meanwhile, a model-based meta-learning method is integrated into the approach: a GRU module is embedded in the Actor and Critic networks to store and exploit historical experience, so that the agents are more robust in the face of uncertain environments. The resulting policy can guide each node of the detector to cope with an uncertain asteroid environment, and the probability of the detector bouncing, rolling, or overturning during landing is significantly reduced.

Description

Intelligent planning and decision-making method for landing behavior of multi-node detector
Technical Field
The invention relates to an intelligent planning and decision-making method for landing behaviors of a multi-node detector, and belongs to the technical field of spacecraft control and artificial intelligence.
Background
In asteroid detection missions, landing control of the detector is both a key link and a technical difficulty. A single-node detector landing under microgravity is prone to bouncing or overturning, and a multi-node flexible-connection design is one of the important measures to solve this problem.
Conventional research has mainly addressed the landing problem of single-node deep-space probe systems. Because the deep-space asteroid environment is unknown and its dynamics cannot be modeled accurately, the traditional navigation, guidance, and control methods of the detector cannot deliver their best performance, and the landing ultimately fails.
Cooperative landing of a multi-node detector, with a multi-agent reinforcement learning method planning and deciding the landing behavior end to end, can effectively cope with the uncertain asteroid environment and improve the landing success rate. Each node of the detector is an agent that obtains environmental information and its own state at the perception layer through on-board sensors (such as a navigation camera and a laser altimeter). Combining the constraints required for a successful landing with a multi-agent reinforcement learning algorithm, a policy is learned for each node that guides it to interact with the environment and produce maneuvering behaviors such as displacement and rotation, so that finally the whole multi-node detector lands at the target point safely and stably.
Disclosure of Invention
The invention aims to solve the technical problem of automatic landing control of a multi-node detector and provides an intelligent planning and decision-making method for the landing behavior of the multi-node detector.
The innovation points of the invention are as follows: a multi-agent meta proximal policy optimization algorithm (Meta-MAPPO) is proposed for the first time on the basis of the proximal policy optimization (PPO) algorithm; a meta-learning technique is adopted to cope with the unknown factors of the deep-space asteroid environment, improving the robustness of the landing-behavior planning and decision-making of the multi-node detector.
The multi-agent meta proximal policy optimization algorithm is shown in Fig. 1. It adopts an Actor-Critic architecture similar to that of the PPO algorithm, and each agent comprises an Actor network and a Critic network. To overcome the instability of the multi-agent environment, a centralized-training, distributed-execution framework is adopted: the Critic network of each agent evaluates that agent's Actor network using global observation information, including the observations of the other agents. The Actor network of each agent then interacts with the environment, using its own local observation as input and outputting its action. To make the learned policy adapt to the uncertain deep-space asteroid environment, a meta-learning scheme is adopted: the training process is divided into an outer-loop layer and an inner-loop layer, so that the agents learn policies that adapt to different environment distributions.
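For illustration only, the following sketch shows one possible realization of this architecture in PyTorch (the class names, layer sizes, and the Gaussian action head are assumptions for the sketch, not part of the invention): each agent's Actor consumes its local observation together with a GRU hidden state, while its Critic consumes the concatenated global observation together with its own GRU hidden state.

```python
# Illustrative sketch only: one possible PyTorch realization of the Actor/Critic
# networks with embedded GRU modules (names and sizes are assumptions).
import torch
import torch.nn as nn


class RecurrentActor(nn.Module):
    """Actor of one agent: local observation + previous GRU hidden state -> action distribution."""

    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # stores historical experience
        self.mu = nn.Linear(hidden_dim, act_dim)        # Gaussian policy head (assumption)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs, h_prev):
        x = torch.relu(self.encoder(obs))
        h = self.gru(x, h_prev)                         # new hidden state h_t
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        return dist, h


class RecurrentCritic(nn.Module):
    """Critic of one agent: global observation (all agents) + previous GRU hidden state -> state value."""

    def __init__(self, global_obs_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(global_obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, global_obs, h_prev):
        x = torch.relu(self.encoder(global_obs))
        h = self.gru(x, h_prev)
        return self.v(h).squeeze(-1), h
```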
The invention is realized by adopting the following technical scheme.
An intelligent planning and decision-making method for landing behaviors of a multi-node detector comprises the following steps:
Step 1: Define each node of the multi-node deep space probe as an agent.
Step 2: a multi-agent reinforcement learning model is constructed by adopting a paradigm of 'centralized training distributed execution' and fusing a model-based meta learning method.
Each agent comprises an Actor network and a Critic network. The Critic network uses global observation information, including the observations of the other agents, to evaluate the Actor network of the agent. The Actor network uses its own local observation as input and outputs an action to interact with the environment.
As shown in fig. 3, GRU modules are embedded in the Actor network and the Critic network to record historical experience, so that the learning strategy can adapt to the uncertain environment of the deep space asteroid. The operating mechanism of the GRU module is shown in formulas 1-4:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \quad (1)$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \quad (2)$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]) \quad (3)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (4)$$
where $r_t$ denotes the reset gate in the GRU module, $z_t$ denotes the update gate, σ denotes the sigmoid function, $W_r$ denotes the reset-gate parameters, $W_z$ denotes the update-gate parameters, $W_{\tilde{h}}$ denotes the parameters of the candidate hidden-layer state, $h_t$ denotes the hidden-layer state at time t, $h_{t-1}$ denotes the hidden-layer state at time t-1, $x_t$ denotes the input of the GRU module, and $\tilde{h}_t$ denotes the candidate hidden-layer state.
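A minimal sketch of Equations 1-4 follows (NumPy, for illustration; the weight shapes are assumptions and bias terms are omitted, matching the equations above):

```python
# Sketch of the GRU update of Equations 1-4 (NumPy; the weight matrices are
# random placeholders here and biases are omitted, matching the equations above).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step returning the new hidden state h_t."""
    concat = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat)                                   # reset gate, Eq. (1)
    z_t = sigmoid(W_z @ concat)                                   # update gate, Eq. (2)
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state, Eq. (3)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # hidden state, Eq. (4)

# Usage with arbitrary illustrative sizes:
obs_dim, hid_dim = 8, 16
rng = np.random.default_rng(0)
W_r, W_z, W_h = (rng.normal(size=(hid_dim, hid_dim + obs_dim)) for _ in range(3))
h1 = gru_step(rng.normal(size=obs_dim), np.zeros(hid_dim), W_r, W_z, W_h)
```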
Step 3: Initialize the parameter θ of the Actor network, the parameter φ of the Critic network, and the hyper-parameters of the algorithm.
Step 4: Initialize an experience pool D = [ ]; data are collected by subsequently executing multiple rounds.
Step 5: Initialize the trajectory list τ = [ ], the hidden states of the Actor networks $h^{1:n}_{0,\pi}$, and the hidden states of the Critic networks $h^{1:n}_{0,V}$; obtain the local observation of each agent from the environment, $o_0 = \{o^1_0, \dots, o^n_0\}$, where $h^n_{0,\pi}$ denotes the initial-moment hidden state of the Actor network of the nth agent and $h^n_{0,V}$ denotes the initial-moment hidden state of the Critic network of the nth agent.
Step 6: Each agent selects an action $a^i_t$ according to its Actor network $\pi^i_\theta(o^i_t, h^i_{t-1,\pi})$, obtaining a new hidden state $h^i_{t,\pi}$, the instant reward $r^i_t = R(s_t, a_t, s_{t+1})$ from interacting with the environment, and a new local observation $o^i_{t+1}$, where $\pi^i_\theta$ denotes the Actor network of the ith agent, R denotes the reward function, $h^i_{t-1,\pi}$ denotes the hidden state of the Actor network of the ith agent at time t-1, $s_t$ denotes the global observation at time t, and $s_{t+1}$ denotes the global observation at time t+1.
Step 7: Concatenate the local observation vectors $o_t$ into the global observation $s_t$; then, according to the Critic network, obtain the state value function $V^i_\phi(s_t, h^i_{t-1,V})$ and the new hidden state $h^i_{t,V}$, where $V^i_\phi$ denotes the Critic network of the ith agent and $h^i_{t-1,V}$ denotes the hidden state of the Critic network of the ith agent at time t-1.
Step 8: Add the tuple $\{o_{t-1}, a_t, r_t, o_t, s_t, h_{t,\pi}, h_{t,V}\}$ to the trajectory τ.
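Steps 6 to 8 together form one interaction step of the data-collection rollout. The sketch below is illustrative glue code built on the hypothetical RecurrentActor/RecurrentCritic classes sketched earlier; the environment interface (a step call returning per-agent observations, rewards, and a done flag) is an assumption.

```python
# Illustrative glue code for steps 6-8 (the env interface and data layout are assumptions).
import torch

def rollout_step(env, actors, critics, obs_list, h_pi, h_v, trajectory):
    """One interaction step for n agents; obs_list holds the local observations o_t^i."""
    actions, new_h_pi = [], []
    for i, actor in enumerate(actors):                   # step 6: each agent acts on its local view
        dist, h = actor(obs_list[i], h_pi[i])
        actions.append(dist.sample())
        new_h_pi.append(h)

    s_t = torch.cat(obs_list, dim=-1)                    # step 7: stitch local observations into s_t
    values, new_h_v = [], []
    for i, critic in enumerate(critics):
        v, h = critic(s_t, h_v[i])
        values.append(v)
        new_h_v.append(h)

    next_obs, rewards, done = env.step(actions)          # instant rewards r_t^i (assumed API)

    trajectory.append({                                  # step 8: record the transition
        "obs": obs_list, "actions": actions, "rewards": rewards,
        "global_obs": s_t, "values": values,
        "h_pi": new_h_pi, "h_v": new_h_v,
    })
    return next_obs, new_h_pi, new_h_v, done
```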
Step 9: If the round is not finished, jump to step 6; otherwise, obtain the data of the whole trajectory τ, compute on this basis the cumulative discounted reward G and the advantage function A of each agent, and put the data into the experience pool D.
Specifically, step 9 includes the steps of:
Step 9.1: Compute the cumulative discounted reward G of each agent. The cumulative discounted reward $G^i$ of agent i is shown in Equation 5:
$$G^i = \sum_{t=0}^{T} \gamma^t r^i_t \quad (5)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_t$ is the instant reward; and t denotes the time step.
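A one-line sketch of Equation 5 (plain Python; the rewards list is assumed to contain one agent's instant rewards over a single episode):

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward of Equation 5: G^i = sum_t gamma^t * r_t^i."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. discounted_return([0.0, -0.5, 10.0], gamma=0.95)
```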
Step 9.2: Compute the advantage function A of each agent.
The advantage function $A^i(o^i_t, a^i_t)$ of agent i is shown in Equation 6:
$$A^i(o^i_t, a^i_t) = \sum_{k=t}^{T} \gamma^{k-t}\left(r^i_k + \gamma V^i_\phi(o^i_{k+1}) - V^i_\phi(o^i_k)\right) \quad (6)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; $V^i_\phi$ is the output value of the Critic network; $o^i_k$ denotes the local observation of agent i at time k; $o^i_{k+1}$ denotes the local observation of agent i at time k+1; and $a^i_k$ denotes the action taken by agent i at time k.
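A sketch of Equation 6 (plain Python; values and next_values are assumed to be the Critic outputs along the trajectory, matching the symbols above):

```python
def advantages(rewards, values, next_values, gamma=0.99):
    """Advantage of Equation 6: A_t = sum_{k>=t} gamma^(k-t) * (r_k + gamma*V_{k+1} - V_k)."""
    deltas = [r + gamma * nv - v for r, v, nv in zip(rewards, values, next_values)]
    adv, running = [0.0] * len(deltas), 0.0
    for t in reversed(range(len(deltas))):    # accumulate the discounted sum from the episode tail
        running = deltas[t] + gamma * running
        adv[t] = running
    return adv
```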
Step 10: If the experience pool D has not collected enough data, jump to step 5; otherwise, the Actor network and the Critic network perform multiple parameter updates on the batch of data.
Specifically, the method comprises the following steps:
Step 10.1: Compute the probability ratio $\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}$ and update θ by maximizing the objective function. The objective function $J^i(\theta)$ is shown in Equation 7:
$$J^i(\theta) = \frac{1}{T}\sum_{t=0}^{T}\min\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}\, A^i(o^i_t, a^i_t),\ \mathrm{clip}\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^i(o^i_t, a^i_t)\right) \quad (7)$$
where i is the agent index; T is the duration of one complete interaction; $\pi_\theta$ denotes the updated policy; $\pi_{\theta_{old}}$ denotes the policy used to gather the batch of data; the advantage function $A^i(o^i_t, a^i_t)$ expresses how good the action $a_t$ output by policy π in state $s_t$ is relative to the average; clip() is the clipping function used to constrain the change between the new and old policies; and ε is the clipping parameter used to control the extent of the constraint.
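A sketch of the clipped surrogate objective of Equation 7 (PyTorch; the per-timestep tensors logp_new, logp_old, and adv are assumed inputs for one agent):

```python
import torch

def ppo_actor_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective J(theta) of Equation 7 (to be maximized)."""
    ratio = torch.exp(logp_new - logp_old)              # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * adv, clipped * adv).mean()

# In practice the parameters are updated by gradient ascent, i.e. by minimizing -J.
```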
Step 10.2: Compute the discounted return $\sum_{k=t}^{T}\gamma^{k-t} r^i_k$ and update φ by minimizing the loss function. The loss function $L^i(\phi)$ is shown in Equation 8:
$$L^i(\phi) = \frac{1}{T}\sum_{t=0}^{T}\left(\sum_{k=t}^{T}\gamma^{k-t} r^i_k - V^i_\phi(s_t)\right)^2 \quad (8)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; and $V^i_\phi(s_t)$ is the output value of the Critic network.
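A sketch of Equation 8 (PyTorch; returns holds the discounted returns-to-go and values the Critic outputs for one agent):

```python
import torch

def returns_to_go(rewards, gamma=0.99):
    """Discounted return-to-go sum_{k>=t} gamma^(k-t) * r_k for every timestep t."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return torch.tensor(list(reversed(out)))

def critic_loss(returns, values):
    """Loss L(phi) of Equation 8: mean squared error between returns and V_phi(s_t)."""
    return ((returns - values) ** 2).mean()
```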
Step 11: Judge whether the maximum number of training iterations has been reached; if so, end the training; otherwise, jump to step 4.
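As an overview, steps 4 to 11 can be tied together in an outer training loop. The sketch below is hypothetical glue code that reuses the illustrative helpers from the earlier sketches; the prepare_batch helper, the optimizers, and the environment interface are assumptions, and the meta-learning inner/outer split is omitted for brevity.

```python
# Hypothetical outer training loop for steps 4-11 (every name below builds on the
# earlier sketches; prepare_batch, the optimizers, and env are assumptions).
import torch

def train(env, actors, critics, actor_opts, critic_opts, hidden_dim=64,
          max_iters=1000, episodes_per_batch=8, update_epochs=4, eps=0.2):
    for _ in range(max_iters):                            # step 11: stop at the iteration limit
        D = []                                            # step 4: experience pool
        for _ in range(episodes_per_batch):
            traj, obs = [], env.reset()                   # step 5: new trajectory and o_0
            h_pi = [torch.zeros(hidden_dim) for _ in actors]
            h_v = [torch.zeros(hidden_dim) for _ in critics]
            done = False
            while not done:                               # steps 6-8: interact and record
                obs, h_pi, h_v, done = rollout_step(env, actors, critics, obs, h_pi, h_v, traj)
            D.append(traj)                                # step 9: G and A are computed from traj
        for _ in range(update_epochs):                    # step 10: several updates per batch
            for i in range(len(actors)):
                logp_new, logp_old, adv, rets, vals = prepare_batch(D, i)  # assumed helper
                actor_opts[i].zero_grad()
                (-ppo_actor_objective(logp_new, logp_old, adv, eps)).backward()
                actor_opts[i].step()
                critic_opts[i].zero_grad()
                critic_loss(rets, vals).backward()
                critic_opts[i].step()
```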
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
1. The invention uses multi-agent reinforcement learning to complete the planning and decision-making of the detector landing behavior end to end. It adopts the "centralized training, distributed execution" training paradigm and achieves information exchange between agents by using local observations as the input of the Actor networks and the global observation as the input of the Critic networks.
2. The invention integrates a model-based meta-learning method: a GRU module is embedded in the Actor and Critic networks to store and exploit historical experience, so that the agents are more robust in the face of uncertain environments. The resulting policy can guide each node of the detector to cope with an uncertain asteroid environment, and the probability of the detector bouncing, rolling, or overturning during landing is remarkably reduced.
Drawings
FIG. 1 is an architectural diagram of the method of the present invention;
FIG. 2 is a training paradigm classification for multi-agent reinforcement learning;
FIG. 3 is a GRU module architecture diagram;
FIG. 4 is a schematic diagram of a clipping function;
FIG. 5 is a graph of the average reward for the example.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings and examples.
Examples
According to the technical background of multi-node deep-space-probe landing-behavior decision-making, this embodiment trains the reinforcement learning model in a simulation of the deep-space environment around the asteroid. The objects in the simulation environment are the asteroid and the multi-node probe. Each node of the detector carries a seeker for tracking the position of the target landing point relative to that node; each node can obtain its position, velocity, angular velocity, and the position of the target landing point, and performs translational and rotational motions by controlling the thrust of its propellers.
An intelligent planning and decision-making method for landing behaviors of a multi-node detector comprises the following steps:
step 1: and defining each node of the multi-node deep space detector as an intelligent agent, and modeling the detector. The overall model architecture is shown in fig. 1.
The initial mass of the detector nodes, the propeller failure probability, the noise of various metrology instruments and the like are randomly set within a certain range, and relevant parameters are shown in table 1:
TABLE 1 setting of random parameters of detector nodes
Step 2: Design a reward function that provides feedback on the action selections of the detector nodes in the simulation environment.
The specific reward function settings for the task are shown in table 2.
TABLE 2 reward function settings
Final landing reward: the rewards are divided into rewards during landing and a reward at final landing. If the multi-node detector lands successfully, a reward of 10 is obtained; otherwise the reward is 0. A successful landing requires that, at the end of the episode, the distance between each node and the target landing site lies within the specified range, the final velocity $v_f$ is less than the threshold $v_{lim}$, and the final angular velocity $w_f$ is less than the threshold $w_{lim}$.
Angular velocity change: when the magnitude of a node's angular velocity w exceeds the limit $w_{lim}$, a negative reward is given; the coefficient $\kappa_w$ adjusts the size of this reward.
Velocity change: when the velocity v of a node changes, a recommended velocity is computed from the time required for the node to reach the target landing point at its current velocity; the difference between the current velocity and the recommended velocity is taken as a negative reward, whose size is adjusted by the coefficient $\kappa_v$.
Attitude change: when the attitude att of a node changes, the difference between the current attitude att and the target attitude $att_{target}$ is taken as a negative reward, whose size is adjusted by the coefficient $\kappa_{att}$.
Inter-node distance change: when the distance d between nodes changes, the deviation from the maximum distance $d_{max}$ or the minimum distance $d_{min}$ is taken as a negative reward, whose size is adjusted by the coefficient $\kappa_d$.
Seeker image center deviating from the target point: when the seeker image changes, the deviation of the target landing point from the center of the image is taken as a negative reward, whose size is adjusted by the coefficient $r_{angle}$.
Fuel consumption: when the propellers work, fuel is consumed; the sum of the thrust of all propellers at the current time step is taken as a negative reward, whose size is adjusted by the coefficient $\kappa_{fuel}$.
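For illustration, the shaping terms above can be combined into one scalar reward per node and time step, as in the following sketch (the coefficient values and the fields of the state dictionary are assumptions for the sketch; the actual values used in the embodiment are those of Table 2):

```python
# Illustrative composition of the shaping rewards of Table 2 (coefficients and the
# fields of `state` are assumptions; they are not the embodiment's actual values).
import numpy as np

def step_reward(state, k_w=0.1, k_v=0.1, k_att=0.05, k_d=0.05,
                r_angle=0.01, k_fuel=0.001, w_lim=1.0):
    r = 0.0
    w = np.linalg.norm(state["angular_velocity"])
    if w > w_lim:                                                   # angular-velocity penalty
        r -= k_w * w
    r -= k_v * abs(state["speed"] - state["recommended_speed"])     # velocity shaping
    r -= k_att * np.linalg.norm(state["attitude"] - state["target_attitude"])  # attitude shaping
    r -= k_d * state["inter_node_distance_violation"]               # deviation from [d_min, d_max]
    r -= r_angle * state["seeker_center_offset"]                    # target point off image center
    r -= k_fuel * state["total_thrust"]                             # fuel consumption
    return r

def final_reward(landed_successfully):
    """Terminal reward: 10 for a successful landing, 0 otherwise."""
    return 10.0 if landed_successfully else 0.0
```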
Step 3: As shown in Fig. 2, a multi-agent reinforcement learning model is constructed by adopting the "centralized training, distributed execution" paradigm and fusing a model-based meta-learning method. The overall model architecture is shown in Fig. 1.
Each agent comprises an Actor network and a Critic network. The Critic network evaluates the Actor network of the agent using global observation information, including the observations of the other agents, and the Actor network uses its own local observation as input and outputs an action to interact with the environment.
As shown in fig. 3, GRU modules are embedded in the Actor network and the Critic network to record historical experience, so that the learning strategy can adapt to the uncertain environment of the deep space asteroid. The operating mechanism of the GRU module is shown in formulas 1-4:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \quad (1)$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \quad (2)$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]) \quad (3)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (4)$$
and 4, step 4: initializing a parameter theta of the Actor network, a parameter phi of the Critic network and a hyper-parameter in the algorithm. As shown in table 3:
TABLE 3 model hyper-parameter settings
Step 5: Initialize an experience pool D = [ ]; data are collected by subsequently executing multiple rounds.
Step 6: Initialize the trajectory list τ = [ ], the hidden states of the Actor networks $h^{1:n}_{0,\pi}$, and the hidden states of the Critic networks $h^{1:n}_{0,V}$; obtain the local observation of each agent from the environment, $o_0 = \{o^1_0, \dots, o^n_0\}$.
Step 7: Each agent selects an action $a^i_t$ according to its Actor network $\pi^i_\theta(o^i_t, h^i_{t-1,\pi})$, obtaining a new hidden state $h^i_{t,\pi}$, the instant reward $r^i_t = R(s_t, a_t, s_{t+1})$ from interacting with the environment, and a new local observation $o^i_{t+1}$.
Step 8: Concatenate the local observation vectors $o_t$ into the global observation $s_t$; then, according to the Critic network, obtain the state value function $V^i_\phi(s_t, h^i_{t-1,V})$ and the new hidden state $h^i_{t,V}$.
Step 9: Add the tuple $\{o_{t-1}, a_t, r_t, o_t, s_t, h_{t,\pi}, h_{t,V}\}$ to the trajectory τ.
Step 10: If the round is not finished, jump to step 7; otherwise, obtain the data of the whole trajectory τ, compute on this basis the cumulative discounted reward G and the advantage function A of each agent, and put the data into the experience pool D.
Step 10.1: Compute the cumulative discounted reward G of each agent. The cumulative discounted reward $G^i$ of agent i is shown in Equation 5:
$$G^i = \sum_{t=0}^{T} \gamma^t r^i_t \quad (5)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; and $r^i_t$ is the instant reward.
Step 10.2: Compute the advantage function A of each agent. The advantage function $A^i(o^i_t, a^i_t)$ of agent i is shown in Equation 6:
$$A^i(o^i_t, a^i_t) = \sum_{k=t}^{T} \gamma^{k-t}\left(r^i_k + \gamma V^i_\phi(o^i_{k+1}) - V^i_\phi(o^i_k)\right) \quad (6)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; and $V^i_\phi$ is the output value of the Critic network.
Step 11: If the experience pool D has not collected enough data, jump to step 6; otherwise, the Actor network and the Critic network perform multiple parameter updates on the batch of data.
Step 11.1: Compute the probability ratio $\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}$ and update θ by maximizing the objective function shown in Equation 7:
$$J^i(\theta) = \frac{1}{T}\sum_{t=0}^{T}\min\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}\, A^i(o^i_t, a^i_t),\ \mathrm{clip}\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^i(o^i_t, a^i_t)\right) \quad (7)$$
where i is the agent index; T is the duration of one complete interaction; $\pi_\theta$ denotes the updated policy; $\pi_{\theta_{old}}$ denotes the policy used to gather the batch of data; the advantage function $A^i(o^i_t, a^i_t)$ expresses how good the action $a_t$ output by policy π in state $s_t$ is relative to the average; clip() is the clipping function, shown in Fig. 4, used to constrain the change between the new and old policies; and ε is the clipping parameter used to control the extent of the constraint.
Step 11.2: Compute the discounted return $\sum_{k=t}^{T}\gamma^{k-t} r^i_k$ and update φ by minimizing the loss function shown in Equation 8:
$$L^i(\phi) = \frac{1}{T}\sum_{t=0}^{T}\left(\sum_{k=t}^{T}\gamma^{k-t} r^i_k - V^i_\phi(s_t)\right)^2 \quad (8)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; and $V^i_\phi(s_t)$ is the output value of the Critic network.
Step 12: Judge whether the maximum number of training iterations has been reached; if so, end the training; otherwise, jump to step 5.
Comparative analysis of experimental results
The experiments mainly compare the average episode rewards of the IPPO, MAPPO, and Meta-MAPPO algorithms. The IPPO algorithm solves the multi-agent problem with a distributed-training, distributed-execution scheme, with no information exchange between agents; MAPPO is a multi-agent extension of the conventional proximal policy optimization algorithm implemented with the centralized-training, distributed-execution paradigm.
As can be seen from Fig. 5, the IPPO algorithm performs worst and struggles to reach a good convergence state: its episode reward finally converges to about -50, a relatively low level, so agents without information exchange have difficulty obtaining a good policy in the multi-node deep-space-probe landing-behavior decision scenario. Both the MAPPO and Meta-MAPPO algorithms converge to a better result, but the convergence result of Meta-MAPPO is better than that of MAPPO, and Meta-MAPPO converges within 13000 episodes whereas MAPPO converges within 23000 episodes. In addition, the Meta-MAPPO reward curve fluctuates less after convergence, which shows that the learned policy is more robust.

Claims (3)

1. An intelligent planning and decision-making method for landing behaviors of a multi-node detector is characterized by comprising the following steps of:
step 1: defining each node of the multi-node deep space probe as an agent;
and 2, step: adopting a paradigm of 'centralized training distributed execution' and fusing a model-based meta learning method to construct a multi-agent reinforcement learning model;
each agent comprises an Actor network and a Critic network, and the Critic network uses global observation information, including the observation information of the other agents, to evaluate the Actor network of the agent; the Actor network uses its own local observation as input and outputs an action to interact with the environment;
a GRU module is embedded in an Actor network and a Critic network to record historical experience, so that a learning strategy can adapt to the uncertain environment of the deep space asteroid;
the operating mechanism of the GRU module is shown in formulas 1-4:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \quad (1)$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \quad (2)$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]) \quad (3)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (4)$$
wherein $r_t$ denotes the reset gate in the GRU module, $z_t$ denotes the update gate, σ denotes the sigmoid function, $W_r$ denotes the reset-gate parameters, $W_z$ denotes the update-gate parameters, $W_{\tilde{h}}$ denotes the parameters of the candidate hidden-layer state, $h_t$ denotes the hidden-layer state at time t, $h_{t-1}$ denotes the hidden-layer state at time t-1, $x_t$ denotes the input of the GRU module, and $\tilde{h}_t$ denotes the candidate hidden-layer state;
step 3: initializing the parameter θ of the Actor network, the parameter φ of the Critic network, and the hyper-parameters of the algorithm;
step 4: initializing an experience pool D = [ ], and collecting data by subsequently executing a plurality of rounds;
step 5: initializing the trajectory list τ = [ ], the hidden states of the Actor networks $h^{1:n}_{0,\pi}$ and the hidden states of the Critic networks $h^{1:n}_{0,V}$, and obtaining the local observation of each agent from the environment, $o_0 = \{o^1_0, \dots, o^n_0\}$, wherein $h^n_{0,\pi}$ denotes the initial-moment hidden state of the Actor network of the nth agent and $h^n_{0,V}$ denotes the initial-moment hidden state of the Critic network of the nth agent;
step 6: each agent selects an action $a^i_t$ according to its Actor network $\pi^i_\theta(o^i_t, h^i_{t-1,\pi})$, obtaining a new hidden state $h^i_{t,\pi}$, the instant reward $r^i_t = R(s_t, a_t, s_{t+1})$ obtained from interacting with the environment, and a new local observation $o^i_{t+1}$, wherein $\pi^i_\theta$ denotes the Actor network of the ith agent, R denotes the reward function, $h^i_{t-1,\pi}$ denotes the hidden state of the Actor network of the ith agent at time t-1, $s_t$ denotes the global observation at time t, and $s_{t+1}$ denotes the global observation at time t+1;
step 7: concatenating the local observation vectors $o_t$ into the global observation $s_t$, and then, according to the Critic network, obtaining the state value function $V^i_\phi(s_t, h^i_{t-1,V})$ and the new hidden state $h^i_{t,V}$, wherein $V^i_\phi$ denotes the Critic network of the ith agent and $h^i_{t-1,V}$ denotes the hidden state of the Critic network of the ith agent at time t-1;
step 8: adding the tuple $\{o_{t-1}, a_t, r_t, o_t, s_t, h_{t,\pi}, h_{t,V}\}$ to the trajectory τ;
step 9: if the round is not finished, jumping to step 6; otherwise, obtaining the data of the whole trajectory τ, calculating on this basis the cumulative discounted reward G and the advantage function A of each agent, and putting the data into the experience pool D;
step 10: if the experience pool D has not collected enough data, jumping to step 5; otherwise, performing multiple parameter updates on the batch of data with the Actor network and the Critic network;
step 11: judging whether the maximum number of training iterations is reached; if so, ending the training, otherwise jumping to step 4.
2. the intelligent planning and decision-making method for landing behavior of multi-node probe as claimed in claim 1, wherein step 9 comprises the following steps:
step 9.1: calculating the cumulative discounted reward G of each agent, the cumulative discounted reward $G^i$ of agent i being shown in Equation 5:
$$G^i = \sum_{t=0}^{T} \gamma^t r^i_t \quad (5)$$
wherein i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient used for measuring how much future rewards are discounted at the current time; $r^i_t$ is the instant reward; and t denotes the time step;
step 9.2: calculating the advantage function A of each agent;
the advantage function $A^i(o^i_t, a^i_t)$ of agent i is shown in Equation 6:
$$A^i(o^i_t, a^i_t) = \sum_{k=t}^{T} \gamma^{k-t}\left(r^i_k + \gamma V^i_\phi(o^i_{k+1}) - V^i_\phi(o^i_k)\right) \quad (6)$$
wherein i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient used for measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; $V^i_\phi$ is the output value of the Critic network; $o^i_k$ denotes the local observation of agent i at time k; $o^i_{k+1}$ denotes the local observation of agent i at time k+1; and $a^i_k$ denotes the action taken by agent i at time k.
3. The intelligent planning and decision-making method for landing behavior of multi-node probe as claimed in claim 1, wherein the step 10 comprises the following steps:
step 10.1: computing the probability ratio $\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}$ and updating θ by maximizing the objective function, the objective function $J^i(\theta)$ being shown in Equation 7:
$$J^i(\theta) = \frac{1}{T}\sum_{t=0}^{T}\min\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}\, A^i(o^i_t, a^i_t),\ \mathrm{clip}\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^i(o^i_t, a^i_t)\right) \quad (7)$$
wherein i is the agent index; T is the duration of one complete interaction; $\pi_\theta$ denotes the updated policy; $\pi_{\theta_{old}}$ denotes the policy used to gather the batch of data; the advantage function $A^i(o^i_t, a^i_t)$ expresses how good the action $a_t$ output by policy π in state $s_t$ is relative to the average; clip() is the clipping function used to constrain the change between the new and old policies; and ε is the clipping parameter used to control the extent of the constraint;
step 10.2: computing the discounted return $\sum_{k=t}^{T}\gamma^{k-t} r^i_k$ and updating φ by minimizing the loss function, the loss function $L^i(\phi)$ being shown in Equation 8:
$$L^i(\phi) = \frac{1}{T}\sum_{t=0}^{T}\left(\sum_{k=t}^{T}\gamma^{k-t} r^i_k - V^i_\phi(s_t)\right)^2 \quad (8)$$
wherein i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient used for measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; and $V^i_\phi(s_t)$ is the output value of the Critic network.
CN202210729574.7A 2022-06-24 2022-06-24 Intelligent planning and decision-making method for landing behavior of multi-node detector Pending CN115374933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210729574.7A CN115374933A (en) 2022-06-24 2022-06-24 Intelligent planning and decision-making method for landing behavior of multi-node detector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210729574.7A CN115374933A (en) 2022-06-24 2022-06-24 Intelligent planning and decision-making method for landing behavior of multi-node detector

Publications (1)

Publication Number Publication Date
CN115374933A true CN115374933A (en) 2022-11-22

Family

ID=84062159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210729574.7A Pending CN115374933A (en) 2022-06-24 2022-06-24 Intelligent planning and decision-making method for landing behavior of multi-node detector

Country Status (1)

Country Link
CN (1) CN115374933A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116562332A (en) * 2023-07-10 2023-08-08 长春工业大学 Robot social movement planning method in man-machine co-fusion environment
CN116562332B (en) * 2023-07-10 2023-09-12 长春工业大学 Robot social movement planning method in man-machine co-fusion environment
CN117215196A (en) * 2023-10-17 2023-12-12 成都正扬博创电子技术有限公司 Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning
CN117215196B (en) * 2023-10-17 2024-04-05 成都正扬博创电子技术有限公司 Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN115374933A (en) Intelligent planning and decision-making method for landing behavior of multi-node detector
Grefenstette et al. Learning sequential decision rules using simulation models and competition
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN111260026B (en) Navigation migration method based on meta reinforcement learning
CN111580544A (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN113821041B (en) Multi-robot collaborative navigation and obstacle avoidance method
CN115016534A (en) Unmanned aerial vehicle autonomous obstacle avoidance navigation method based on memory reinforcement learning
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN116136945A (en) Unmanned aerial vehicle cluster countermeasure game simulation method based on anti-facts base line
CN112146879A (en) Rolling bearing fault intelligent diagnosis method and system
CN113359704B (en) Self-adaptive SAC-PID method suitable for complex unknown environment
CN114083543A (en) Active fault diagnosis method for space manipulator
CN114626505A (en) Mobile robot deep reinforcement learning control method
CN117332693A (en) Slope stability evaluation method based on DDPG-PSO-BP algorithm
Zhang et al. Universal value iteration networks: When spatially-invariant is not universal
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
CN115964898A (en) Bignty game confrontation-oriented BC-QMIX on-line multi-agent behavior decision modeling method
CN114911157A (en) Robot navigation control method and system based on partial observable reinforcement learning
CN114396949A (en) Mobile robot no-priori map navigation decision-making method based on DDPG
Zhang et al. Gliding control of underwater gliding snake-like robot based on reinforcement learning
CN113807005A (en) Bearing residual life prediction method based on improved FPA-DBN
Manela Deep reinforcement learning for complex manipulation tasks with sparse feedback

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination