CN115374933A - Intelligent planning and decision-making method for landing behavior of multi-node detector - Google Patents

Intelligent planning and decision-making method for landing behavior of multi-node detector

Info

Publication number
CN115374933A
CN115374933A (application CN202210729574.7A)
Authority
CN
China
Prior art keywords: agent, network, representing, reward, time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210729574.7A
Other languages
Chinese (zh)
Inventor
赵清杰
陈涌泉
耿明灏
王鑫
杨和星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210729574.7A priority Critical patent/CN115374933A/en
Publication of CN115374933A publication Critical patent/CN115374933A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an intelligent planning and decision-making method for the landing behavior of a multi-node detector, and belongs to the technical field of spacecraft control and artificial intelligence. The invention uses multi-agent reinforcement learning to complete the planning and decision-making of the detector landing behavior end to end. It adopts the "centralized training, distributed execution" training paradigm, using local observations as the input of the Actor networks and the global observation as the input of the Critic networks, so that information is exchanged between agents. Meanwhile, a model-based meta-learning method is integrated into the approach: a GRU module is embedded in the Actor and Critic networks to store and exploit historical experience, so that the agents are more robust in the face of uncertain environments. The resulting policy can guide each node of the detector to cope with an uncertain asteroid environment, and the probability of the detector bouncing, rolling, or overturning during landing is significantly reduced.

Description

Intelligent planning and decision-making method for landing behavior of multi-node detector
Technical Field
The invention relates to an intelligent planning and decision-making method for landing behaviors of a multi-node detector, and belongs to the technical field of spacecraft control and artificial intelligence.
Background
In asteroid detection missions, landing control of the detector is both a key link and a technical difficulty. A single-node detector landing under microgravity is prone to bouncing or overturning, and a multi-node flexible-connection design is one of the important measures to solve this problem.
Conventional research has mainly addressed the landing problem of single-node deep-space probe systems. Because the deep-space asteroid environment is unknown and its dynamics cannot be modeled accurately, the traditional navigation, guidance, and control methods of the detector cannot deliver their best performance, and the landing ultimately fails.
Cooperative landing of a multi-node detector, with a multi-agent reinforcement learning method planning and deciding the landing behavior end to end, can effectively cope with the uncertain asteroid environment and improve the landing success rate. Each node of the detector is an agent that obtains environmental information and its own state at the perception layer through on-board sensors (such as a navigation camera and a laser altimeter). Combining the constraints required for a successful landing with a multi-agent reinforcement learning algorithm, a policy is learned for each node that guides it to interact with the environment and produce maneuvering behaviors such as displacement and rotation, so that finally the whole multi-node detector lands at the target point safely and stably.
Disclosure of Invention
The invention aims to solve the technical problem of automatic landing control of a multi-node detector and provides an intelligent planning and decision-making method for the landing behavior of the multi-node detector.
The innovation points of the invention are as follows: a multi-agent meta proximal policy optimization algorithm (Meta-MAPPO) is proposed for the first time on the basis of the proximal policy optimization (PPO) algorithm; a meta-learning technique is adopted to cope with the unknown factors of the deep-space asteroid environment, improving the robustness of the landing-behavior planning and decision-making of the multi-node detector.
The multi-agent meta proximal policy optimization algorithm is shown in Fig. 1. It adopts an Actor-Critic architecture similar to that of the PPO algorithm, and each agent comprises an Actor network and a Critic network. To overcome the instability of the multi-agent environment, a centralized-training, distributed-execution framework is adopted: the Critic network of each agent evaluates that agent's Actor network using global observation information, including the observations of the other agents. The Actor network of each agent then interacts with the environment, using its own local observation as input and outputting its action. To make the learned policy adapt to the uncertain deep-space asteroid environment, a meta-learning scheme is adopted: the training process is divided into an outer-loop layer and an inner-loop layer, so that the agents learn policies that adapt to different environment distributions.
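For illustration only, the following sketch shows one possible realization of this architecture in PyTorch (the class names, layer sizes, and the Gaussian action head are assumptions for the sketch, not part of the invention): each agent's Actor consumes its local observation together with a GRU hidden state, while its Critic consumes the concatenated global observation together with its own GRU hidden state.

```python
# Illustrative sketch only: one possible PyTorch realization of the Actor/Critic
# networks with embedded GRU modules (names and sizes are assumptions).
import torch
import torch.nn as nn


class RecurrentActor(nn.Module):
    """Actor of one agent: local observation + previous GRU hidden state -> action distribution."""

    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # stores historical experience
        self.mu = nn.Linear(hidden_dim, act_dim)        # Gaussian policy head (assumption)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs, h_prev):
        x = torch.relu(self.encoder(obs))
        h = self.gru(x, h_prev)                         # new hidden state h_t
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        return dist, h


class RecurrentCritic(nn.Module):
    """Critic of one agent: global observation (all agents) + previous GRU hidden state -> state value."""

    def __init__(self, global_obs_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(global_obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, global_obs, h_prev):
        x = torch.relu(self.encoder(global_obs))
        h = self.gru(x, h_prev)
        return self.v(h).squeeze(-1), h
```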
The invention is realized by adopting the following technical scheme.
An intelligent planning and decision-making method for landing behaviors of a multi-node detector comprises the following steps:
Step 1: Define each node of the multi-node deep space probe as an agent.
Step 2: a multi-agent reinforcement learning model is constructed by adopting a paradigm of 'centralized training distributed execution' and fusing a model-based meta learning method.
Each agent comprises an Actor network and a Critic network. The Critic network uses global observation information, including the observations of the other agents, to evaluate the Actor network of the agent. The Actor network uses its own local observation as input and outputs an action to interact with the environment.
As shown in fig. 3, GRU modules are embedded in the Actor network and the Critic network to record historical experience, so that the learning strategy can adapt to the uncertain environment of the deep space asteroid. The operating mechanism of the GRU module is shown in formulas 1-4:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \quad (1)$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \quad (2)$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]) \quad (3)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (4)$$
where $r_t$ denotes the reset gate in the GRU module, $z_t$ denotes the update gate, σ denotes the sigmoid function, $W_r$ denotes the reset-gate parameters, $W_z$ denotes the update-gate parameters, $W_{\tilde{h}}$ denotes the parameters of the candidate hidden-layer state, $h_t$ denotes the hidden-layer state at time t, $h_{t-1}$ denotes the hidden-layer state at time t-1, $x_t$ denotes the input of the GRU module, and $\tilde{h}_t$ denotes the candidate hidden-layer state.
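A minimal sketch of Equations 1-4 follows (NumPy, for illustration; the weight shapes are assumptions and bias terms are omitted, matching the equations above):

```python
# Sketch of the GRU update of Equations 1-4 (NumPy; the weight matrices are
# random placeholders here and biases are omitted, matching the equations above).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step returning the new hidden state h_t."""
    concat = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat)                                   # reset gate, Eq. (1)
    z_t = sigmoid(W_z @ concat)                                   # update gate, Eq. (2)
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state, Eq. (3)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # hidden state, Eq. (4)

# Usage with arbitrary illustrative sizes:
obs_dim, hid_dim = 8, 16
rng = np.random.default_rng(0)
W_r, W_z, W_h = (rng.normal(size=(hid_dim, hid_dim + obs_dim)) for _ in range(3))
h1 = gru_step(rng.normal(size=obs_dim), np.zeros(hid_dim), W_r, W_z, W_h)
```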
Step 3: Initialize the parameter θ of the Actor network, the parameter φ of the Critic network, and the hyper-parameters of the algorithm.
Step 4: Initialize an experience pool D = [ ]; data are collected by subsequently executing multiple rounds.
Step 5: Initialize the trajectory list τ = [ ], the hidden states of the Actor networks $h^{1:n}_{0,\pi}$, and the hidden states of the Critic networks $h^{1:n}_{0,V}$; obtain the local observation of each agent from the environment, $o_0 = \{o^1_0, \dots, o^n_0\}$, where $h^n_{0,\pi}$ denotes the initial-moment hidden state of the Actor network of the nth agent and $h^n_{0,V}$ denotes the initial-moment hidden state of the Critic network of the nth agent.
Step 6: Each agent selects an action $a^i_t$ according to its Actor network $\pi^i_\theta(o^i_t, h^i_{t-1,\pi})$, obtaining a new hidden state $h^i_{t,\pi}$, the instant reward $r^i_t = R(s_t, a_t, s_{t+1})$ from interacting with the environment, and a new local observation $o^i_{t+1}$, where $\pi^i_\theta$ denotes the Actor network of the ith agent, R denotes the reward function, $h^i_{t-1,\pi}$ denotes the hidden state of the Actor network of the ith agent at time t-1, $s_t$ denotes the global observation at time t, and $s_{t+1}$ denotes the global observation at time t+1.
Step 7: Concatenate the local observation vectors $o_t$ into the global observation $s_t$; then, according to the Critic network, obtain the state value function $V^i_\phi(s_t, h^i_{t-1,V})$ and the new hidden state $h^i_{t,V}$, where $V^i_\phi$ denotes the Critic network of the ith agent and $h^i_{t-1,V}$ denotes the hidden state of the Critic network of the ith agent at time t-1.
Step 8: Add the tuple $\{o_{t-1}, a_t, r_t, o_t, s_t, h_{t,\pi}, h_{t,V}\}$ to the trajectory τ.
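Steps 6 to 8 together form one interaction step of the data-collection rollout. The sketch below is illustrative glue code built on the hypothetical RecurrentActor/RecurrentCritic classes sketched earlier; the environment interface (a step call returning per-agent observations, rewards, and a done flag) is an assumption.

```python
# Illustrative glue code for steps 6-8 (the env interface and data layout are assumptions).
import torch

def rollout_step(env, actors, critics, obs_list, h_pi, h_v, trajectory):
    """One interaction step for n agents; obs_list holds the local observations o_t^i."""
    actions, new_h_pi = [], []
    for i, actor in enumerate(actors):                   # step 6: each agent acts on its local view
        dist, h = actor(obs_list[i], h_pi[i])
        actions.append(dist.sample())
        new_h_pi.append(h)

    s_t = torch.cat(obs_list, dim=-1)                    # step 7: stitch local observations into s_t
    values, new_h_v = [], []
    for i, critic in enumerate(critics):
        v, h = critic(s_t, h_v[i])
        values.append(v)
        new_h_v.append(h)

    next_obs, rewards, done = env.step(actions)          # instant rewards r_t^i (assumed API)

    trajectory.append({                                  # step 8: record the transition
        "obs": obs_list, "actions": actions, "rewards": rewards,
        "global_obs": s_t, "values": values,
        "h_pi": new_h_pi, "h_v": new_h_v,
    })
    return next_obs, new_h_pi, new_h_v, done
```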
Step 9: If the round is not finished, jump to step 6; otherwise, obtain the data of the whole trajectory τ, compute on this basis the cumulative discounted reward G and the advantage function A of each agent, and put the data into the experience pool D.
Specifically, step 9 includes the steps of:
Step 9.1: Compute the cumulative discounted reward G of each agent. The cumulative discounted reward $G^i$ of agent i is shown in Equation 5:
$$G^i = \sum_{t=0}^{T} \gamma^t r^i_t \quad (5)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_t$ is the instant reward; and t denotes the time step.
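A one-line sketch of Equation 5 (plain Python; the rewards list is assumed to contain one agent's instant rewards over a single episode):

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward of Equation 5: G^i = sum_t gamma^t * r_t^i."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. discounted_return([0.0, -0.5, 10.0], gamma=0.95)
```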
Step 9.2: Compute the advantage function A of each agent.
The advantage function $A^i(o^i_t, a^i_t)$ of agent i is shown in Equation 6:
$$A^i(o^i_t, a^i_t) = \sum_{k=t}^{T} \gamma^{k-t}\left(r^i_k + \gamma V^i_\phi(o^i_{k+1}) - V^i_\phi(o^i_k)\right) \quad (6)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; $V^i_\phi$ is the output value of the Critic network; $o^i_k$ denotes the local observation of agent i at time k; $o^i_{k+1}$ denotes the local observation of agent i at time k+1; and $a^i_k$ denotes the action taken by agent i at time k.
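A sketch of Equation 6 (plain Python; values and next_values are assumed to be the Critic outputs along the trajectory, matching the symbols above):

```python
def advantages(rewards, values, next_values, gamma=0.99):
    """Advantage of Equation 6: A_t = sum_{k>=t} gamma^(k-t) * (r_k + gamma*V_{k+1} - V_k)."""
    deltas = [r + gamma * nv - v for r, v, nv in zip(rewards, values, next_values)]
    adv, running = [0.0] * len(deltas), 0.0
    for t in reversed(range(len(deltas))):    # accumulate the discounted sum from the episode tail
        running = deltas[t] + gamma * running
        adv[t] = running
    return adv
```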
Step 10: If the experience pool D has not collected enough data, jump to step 5; otherwise, the Actor network and the Critic network perform multiple parameter updates on the batch of data.
Specifically, the method comprises the following steps:
Step 10.1: Compute the probability ratio $\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}$ and update θ by maximizing the objective function. The objective function $J^i(\theta)$ is shown in Equation 7:
$$J^i(\theta) = \frac{1}{T}\sum_{t=0}^{T}\min\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}\, A^i(o^i_t, a^i_t),\ \mathrm{clip}\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^i(o^i_t, a^i_t)\right) \quad (7)$$
where i is the agent index; T is the duration of one complete interaction; $\pi_\theta$ denotes the updated policy; $\pi_{\theta_{old}}$ denotes the policy used to gather the batch of data; the advantage function $A^i(o^i_t, a^i_t)$ expresses how good the action $a_t$ output by policy π in state $s_t$ is relative to the average; clip() is the clipping function used to constrain the change between the new and old policies; and ε is the clipping parameter used to control the extent of the constraint.
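A sketch of the clipped surrogate objective of Equation 7 (PyTorch; the per-timestep tensors logp_new, logp_old, and adv are assumed inputs for one agent):

```python
import torch

def ppo_actor_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective J(theta) of Equation 7 (to be maximized)."""
    ratio = torch.exp(logp_new - logp_old)              # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * adv, clipped * adv).mean()

# In practice the parameters are updated by gradient ascent, i.e. by minimizing -J.
```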
Step 10.2: Compute the discounted return $\sum_{k=t}^{T}\gamma^{k-t} r^i_k$ and update φ by minimizing the loss function. The loss function $L^i(\phi)$ is shown in Equation 8:
$$L^i(\phi) = \frac{1}{T}\sum_{t=0}^{T}\left(\sum_{k=t}^{T}\gamma^{k-t} r^i_k - V^i_\phi(s_t)\right)^2 \quad (8)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; and $V^i_\phi(s_t)$ is the output value of the Critic network.
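A sketch of Equation 8 (PyTorch; returns holds the discounted returns-to-go and values the Critic outputs for one agent):

```python
import torch

def returns_to_go(rewards, gamma=0.99):
    """Discounted return-to-go sum_{k>=t} gamma^(k-t) * r_k for every timestep t."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return torch.tensor(list(reversed(out)))

def critic_loss(returns, values):
    """Loss L(phi) of Equation 8: mean squared error between returns and V_phi(s_t)."""
    return ((returns - values) ** 2).mean()
```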
Step 11: Judge whether the maximum number of training iterations has been reached; if so, end the training; otherwise, jump to step 4.
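As an overview, steps 4 to 11 can be tied together in an outer training loop. The sketch below is hypothetical glue code that reuses the illustrative helpers from the earlier sketches; the prepare_batch helper, the optimizers, and the environment interface are assumptions, and the meta-learning inner/outer split is omitted for brevity.

```python
# Hypothetical outer training loop for steps 4-11 (every name below builds on the
# earlier sketches; prepare_batch, the optimizers, and env are assumptions).
import torch

def train(env, actors, critics, actor_opts, critic_opts, hidden_dim=64,
          max_iters=1000, episodes_per_batch=8, update_epochs=4, eps=0.2):
    for _ in range(max_iters):                            # step 11: stop at the iteration limit
        D = []                                            # step 4: experience pool
        for _ in range(episodes_per_batch):
            traj, obs = [], env.reset()                   # step 5: new trajectory and o_0
            h_pi = [torch.zeros(hidden_dim) for _ in actors]
            h_v = [torch.zeros(hidden_dim) for _ in critics]
            done = False
            while not done:                               # steps 6-8: interact and record
                obs, h_pi, h_v, done = rollout_step(env, actors, critics, obs, h_pi, h_v, traj)
            D.append(traj)                                # step 9: G and A are computed from traj
        for _ in range(update_epochs):                    # step 10: several updates per batch
            for i in range(len(actors)):
                logp_new, logp_old, adv, rets, vals = prepare_batch(D, i)  # assumed helper
                actor_opts[i].zero_grad()
                (-ppo_actor_objective(logp_new, logp_old, adv, eps)).backward()
                actor_opts[i].step()
                critic_opts[i].zero_grad()
                critic_loss(rets, vals).backward()
                critic_opts[i].step()
```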
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
1. The invention uses multi-agent reinforcement learning to complete the planning and decision-making of the detector landing behavior end to end. It adopts the "centralized training, distributed execution" training paradigm and achieves information exchange between agents by using local observations as the input of the Actor networks and the global observation as the input of the Critic networks.
2. The invention integrates a model-based meta-learning method: a GRU module is embedded in the Actor and Critic networks to store and exploit historical experience, so that the agents are more robust in the face of uncertain environments. The resulting policy can guide each node of the detector to cope with an uncertain asteroid environment, and the probability of the detector bouncing, rolling, or overturning during landing is remarkably reduced.
Drawings
FIG. 1 is an architectural diagram of the method of the present invention;
FIG. 2 is a training paradigm classification for multi-agent reinforcement learning;
FIG. 3 is a GRU module architecture diagram;
FIG. 4 is a schematic diagram of a clipping function;
FIG. 5 is a graph of the average reward for the example.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings and examples.
Examples
According to the technical background of multi-node deep-space-probe landing-behavior decision-making, this embodiment trains the reinforcement learning model in a simulation of the deep-space environment around the asteroid. The objects in the simulation environment are the asteroid and the multi-node probe. Each node of the detector carries a seeker for tracking the position of the target landing point relative to that node; each node can obtain its position, velocity, angular velocity, and the position of the target landing point, and performs translational and rotational motions by controlling the thrust of its propellers.
An intelligent planning and decision-making method for landing behaviors of a multi-node detector comprises the following steps:
step 1: and defining each node of the multi-node deep space detector as an intelligent agent, and modeling the detector. The overall model architecture is shown in fig. 1.
The initial mass of the detector nodes, the propeller failure probability, the noise of various metrology instruments and the like are randomly set within a certain range, and relevant parameters are shown in table 1:
TABLE 1 setting of random parameters of detector nodes
Step 2: Design a reward function that provides feedback on the action selections of the detector nodes in the simulation environment.
The specific reward function settings for the task are shown in table 2.
TABLE 2 reward function settings
Final landing reward: the rewards are divided into rewards during landing and a reward at final landing. If the multi-node detector lands successfully, a reward of 10 is obtained; otherwise the reward is 0. A successful landing requires that, at the end of the episode, the distance between each node and the target landing site lies within the specified range, the final velocity $v_f$ is less than the threshold $v_{lim}$, and the final angular velocity $w_f$ is less than the threshold $w_{lim}$.
Angular velocity change: when the magnitude of a node's angular velocity w exceeds the limit $w_{lim}$, a negative reward is given; the coefficient $\kappa_w$ adjusts the size of this reward.
Velocity change: when the velocity v of a node changes, a recommended velocity is computed from the time required for the node to reach the target landing point at its current velocity; the difference between the current velocity and the recommended velocity is taken as a negative reward, whose size is adjusted by the coefficient $\kappa_v$.
Attitude change: when the attitude att of a node changes, the difference between the current attitude att and the target attitude $att_{target}$ is taken as a negative reward, whose size is adjusted by the coefficient $\kappa_{att}$.
Inter-node distance change: when the distance d between nodes changes, the deviation from the maximum distance $d_{max}$ or the minimum distance $d_{min}$ is taken as a negative reward, whose size is adjusted by the coefficient $\kappa_d$.
Seeker image center deviating from the target point: when the seeker image changes, the deviation of the target landing point from the center of the image is taken as a negative reward, whose size is adjusted by the coefficient $r_{angle}$.
Fuel consumption: when the propellers work, fuel is consumed; the sum of the thrust of all propellers at the current time step is taken as a negative reward, whose size is adjusted by the coefficient $\kappa_{fuel}$.
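For illustration, the shaping terms above can be combined into one scalar reward per node and time step, as in the following sketch (the coefficient values and the fields of the state dictionary are assumptions for the sketch; the actual values used in the embodiment are those of Table 2):

```python
# Illustrative composition of the shaping rewards of Table 2 (coefficients and the
# fields of `state` are assumptions; they are not the embodiment's actual values).
import numpy as np

def step_reward(state, k_w=0.1, k_v=0.1, k_att=0.05, k_d=0.05,
                r_angle=0.01, k_fuel=0.001, w_lim=1.0):
    r = 0.0
    w = np.linalg.norm(state["angular_velocity"])
    if w > w_lim:                                                   # angular-velocity penalty
        r -= k_w * w
    r -= k_v * abs(state["speed"] - state["recommended_speed"])     # velocity shaping
    r -= k_att * np.linalg.norm(state["attitude"] - state["target_attitude"])  # attitude shaping
    r -= k_d * state["inter_node_distance_violation"]               # deviation from [d_min, d_max]
    r -= r_angle * state["seeker_center_offset"]                    # target point off image center
    r -= k_fuel * state["total_thrust"]                             # fuel consumption
    return r

def final_reward(landed_successfully):
    """Terminal reward: 10 for a successful landing, 0 otherwise."""
    return 10.0 if landed_successfully else 0.0
```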
Step 3: As shown in Fig. 2, a multi-agent reinforcement learning model is constructed by adopting the "centralized training, distributed execution" paradigm and fusing a model-based meta-learning method. The overall model architecture is shown in Fig. 1.
Each agent comprises an Actor network and a Critic network. The Critic network evaluates the Actor network of the agent using global observation information, including the observations of the other agents, and the Actor network uses its own local observation as input and outputs an action to interact with the environment.
As shown in fig. 3, GRU modules are embedded in the Actor network and the Critic network to record historical experience, so that the learning strategy can adapt to the uncertain environment of the deep space asteroid. The operating mechanism of the GRU module is shown in formulas 1-4:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \quad (1)$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \quad (2)$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]) \quad (3)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (4)$$
and 4, step 4: initializing a parameter theta of the Actor network, a parameter phi of the Critic network and a hyper-parameter in the algorithm. As shown in table 3:
TABLE 3 model hyper-parameter settings
Step 5: Initialize an experience pool D = [ ]; data are collected by subsequently executing multiple rounds.
Step 6: Initialize the trajectory list τ = [ ], the hidden states of the Actor networks $h^{1:n}_{0,\pi}$, and the hidden states of the Critic networks $h^{1:n}_{0,V}$; obtain the local observation of each agent from the environment, $o_0 = \{o^1_0, \dots, o^n_0\}$.
Step 7: Each agent selects an action $a^i_t$ according to its Actor network $\pi^i_\theta(o^i_t, h^i_{t-1,\pi})$, obtaining a new hidden state $h^i_{t,\pi}$, the instant reward $r^i_t = R(s_t, a_t, s_{t+1})$ from interacting with the environment, and a new local observation $o^i_{t+1}$.
Step 8: Concatenate the local observation vectors $o_t$ into the global observation $s_t$; then, according to the Critic network, obtain the state value function $V^i_\phi(s_t, h^i_{t-1,V})$ and the new hidden state $h^i_{t,V}$.
Step 9: Add the tuple $\{o_{t-1}, a_t, r_t, o_t, s_t, h_{t,\pi}, h_{t,V}\}$ to the trajectory τ.
Step 10: If the round is not finished, jump to step 7; otherwise, obtain the data of the whole trajectory τ, compute on this basis the cumulative discounted reward G and the advantage function A of each agent, and put the data into the experience pool D.
Step 10.1: Compute the cumulative discounted reward G of each agent. The cumulative discounted reward $G^i$ of agent i is shown in Equation 5:
$$G^i = \sum_{t=0}^{T} \gamma^t r^i_t \quad (5)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; and $r^i_t$ is the instant reward.
Step 10.2: Compute the advantage function A of each agent. The advantage function $A^i(o^i_t, a^i_t)$ of agent i is shown in Equation 6:
$$A^i(o^i_t, a^i_t) = \sum_{k=t}^{T} \gamma^{k-t}\left(r^i_k + \gamma V^i_\phi(o^i_{k+1}) - V^i_\phi(o^i_k)\right) \quad (6)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; and $V^i_\phi$ is the output value of the Critic network.
Step 11: If the experience pool D has not collected enough data, jump to step 6; otherwise, the Actor network and the Critic network perform multiple parameter updates on the batch of data.
Step 11.1: Compute the probability ratio $\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}$ and update θ by maximizing the objective function shown in Equation 7:
$$J^i(\theta) = \frac{1}{T}\sum_{t=0}^{T}\min\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}\, A^i(o^i_t, a^i_t),\ \mathrm{clip}\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^i(o^i_t, a^i_t)\right) \quad (7)$$
where i is the agent index; T is the duration of one complete interaction; $\pi_\theta$ denotes the updated policy; $\pi_{\theta_{old}}$ denotes the policy used to gather the batch of data; the advantage function $A^i(o^i_t, a^i_t)$ expresses how good the action $a_t$ output by policy π in state $s_t$ is relative to the average; clip() is the clipping function, shown in Fig. 4, used to constrain the change between the new and old policies; and ε is the clipping parameter used to control the extent of the constraint.
Step 11.2: Compute the discounted return $\sum_{k=t}^{T}\gamma^{k-t} r^i_k$ and update φ by minimizing the loss function shown in Equation 8:
$$L^i(\phi) = \frac{1}{T}\sum_{t=0}^{T}\left(\sum_{k=t}^{T}\gamma^{k-t} r^i_k - V^i_\phi(s_t)\right)^2 \quad (8)$$
where i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; and $V^i_\phi(s_t)$ is the output value of the Critic network.
Step 12: Judge whether the maximum number of training iterations has been reached; if so, end the training; otherwise, jump to step 5.
Comparative analysis of experimental results
The experiments mainly compare the average episode rewards of the IPPO, MAPPO, and Meta-MAPPO algorithms. The IPPO algorithm solves the multi-agent problem with a distributed-training, distributed-execution scheme, with no information exchange between agents; MAPPO is a multi-agent extension of the conventional proximal policy optimization algorithm implemented with the centralized-training, distributed-execution paradigm.
As can be seen from Fig. 5, the IPPO algorithm performs worst and struggles to reach a good convergence state: its episode reward finally converges to about -50, a relatively low level, so agents without information exchange have difficulty obtaining a good policy in the multi-node deep-space-probe landing-behavior decision scenario. Both the MAPPO and Meta-MAPPO algorithms converge to a better result, but the convergence result of Meta-MAPPO is better than that of MAPPO, and Meta-MAPPO converges within 13000 episodes whereas MAPPO converges within 23000 episodes. In addition, the Meta-MAPPO reward curve fluctuates less after convergence, which shows that the learned policy is more robust.

Claims (3)

1. An intelligent planning and decision-making method for landing behaviors of a multi-node detector is characterized by comprising the following steps of:
step 1: defining each node of the multi-node deep space probe as an agent;
and 2, step: adopting a paradigm of 'centralized training distributed execution' and fusing a model-based meta learning method to construct a multi-agent reinforcement learning model;
each agent comprises an Actor network and a Critic network, and the Critic network uses global observation information, including the observation information of the other agents, to evaluate the Actor network of the agent; the Actor network uses its own local observation as input and outputs an action to interact with the environment;
a GRU module is embedded in an Actor network and a Critic network to record historical experience, so that a learning strategy can adapt to the uncertain environment of the deep space asteroid;
the operating mechanism of the GRU module is shown in formulas 1-4:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \quad (1)$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \quad (2)$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]) \quad (3)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (4)$$
wherein $r_t$ denotes the reset gate in the GRU module, $z_t$ denotes the update gate, σ denotes the sigmoid function, $W_r$ denotes the reset-gate parameters, $W_z$ denotes the update-gate parameters, $W_{\tilde{h}}$ denotes the parameters of the candidate hidden-layer state, $h_t$ denotes the hidden-layer state at time t, $h_{t-1}$ denotes the hidden-layer state at time t-1, $x_t$ denotes the input of the GRU module, and $\tilde{h}_t$ denotes the candidate hidden-layer state;
step 3: initializing the parameter θ of the Actor network, the parameter φ of the Critic network, and the hyper-parameters of the algorithm;
step 4: initializing an experience pool D = [ ], and collecting data by subsequently executing a plurality of rounds;
step 5: initializing the trajectory list τ = [ ], the hidden states of the Actor networks $h^{1:n}_{0,\pi}$ and the hidden states of the Critic networks $h^{1:n}_{0,V}$, and obtaining the local observation of each agent from the environment, $o_0 = \{o^1_0, \dots, o^n_0\}$, wherein $h^n_{0,\pi}$ denotes the initial-moment hidden state of the Actor network of the nth agent and $h^n_{0,V}$ denotes the initial-moment hidden state of the Critic network of the nth agent;
step 6: each agent selects an action $a^i_t$ according to its Actor network $\pi^i_\theta(o^i_t, h^i_{t-1,\pi})$, obtaining a new hidden state $h^i_{t,\pi}$, the instant reward $r^i_t = R(s_t, a_t, s_{t+1})$ obtained from interacting with the environment, and a new local observation $o^i_{t+1}$, wherein $\pi^i_\theta$ denotes the Actor network of the ith agent, R denotes the reward function, $h^i_{t-1,\pi}$ denotes the hidden state of the Actor network of the ith agent at time t-1, $s_t$ denotes the global observation at time t, and $s_{t+1}$ denotes the global observation at time t+1;
step 7: concatenating the local observation vectors $o_t$ into the global observation $s_t$, and then, according to the Critic network, obtaining the state value function $V^i_\phi(s_t, h^i_{t-1,V})$ and the new hidden state $h^i_{t,V}$, wherein $V^i_\phi$ denotes the Critic network of the ith agent and $h^i_{t-1,V}$ denotes the hidden state of the Critic network of the ith agent at time t-1;
step 8: adding the tuple $\{o_{t-1}, a_t, r_t, o_t, s_t, h_{t,\pi}, h_{t,V}\}$ to the trajectory τ;
step 9: if the round is not finished, jumping to step 6; otherwise, obtaining the data of the whole trajectory τ, calculating on this basis the cumulative discounted reward G and the advantage function A of each agent, and putting the data into the experience pool D;
step 10: if the experience pool D has not collected enough data, jumping to step 5; otherwise, performing multiple parameter updates on the batch of data with the Actor network and the Critic network;
step 11: judging whether the maximum number of training iterations is reached; if so, ending the training, otherwise jumping to step 4.
2. the intelligent planning and decision-making method for landing behavior of multi-node probe as claimed in claim 1, wherein step 9 comprises the following steps:
step 9.1: calculating the cumulative discounted reward G of each agent, the cumulative discounted reward $G^i$ of agent i being shown in Equation 5:
$$G^i = \sum_{t=0}^{T} \gamma^t r^i_t \quad (5)$$
wherein i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient used for measuring how much future rewards are discounted at the current time; $r^i_t$ is the instant reward; and t denotes the time step;
step 9.2: calculating the advantage function A of each agent;
the advantage function $A^i(o^i_t, a^i_t)$ of agent i is shown in Equation 6:
$$A^i(o^i_t, a^i_t) = \sum_{k=t}^{T} \gamma^{k-t}\left(r^i_k + \gamma V^i_\phi(o^i_{k+1}) - V^i_\phi(o^i_k)\right) \quad (6)$$
wherein i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient used for measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; $V^i_\phi$ is the output value of the Critic network; $o^i_k$ denotes the local observation of agent i at time k; $o^i_{k+1}$ denotes the local observation of agent i at time k+1; and $a^i_k$ denotes the action taken by agent i at time k.
3. The intelligent planning and decision-making method for landing behavior of multi-node probe as claimed in claim 1, wherein the step 10 comprises the following steps:
step 10.1: computing the probability ratio $\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}$ and updating θ by maximizing the objective function, the objective function $J^i(\theta)$ being shown in Equation 7:
$$J^i(\theta) = \frac{1}{T}\sum_{t=0}^{T}\min\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)}\, A^i(o^i_t, a^i_t),\ \mathrm{clip}\!\left(\frac{\pi_\theta(a^i_t \mid o^i_t)}{\pi_{\theta_{old}}(a^i_t \mid o^i_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^i(o^i_t, a^i_t)\right) \quad (7)$$
wherein i is the agent index; T is the duration of one complete interaction; $\pi_\theta$ denotes the updated policy; $\pi_{\theta_{old}}$ denotes the policy used to gather the batch of data; the advantage function $A^i(o^i_t, a^i_t)$ expresses how good the action $a_t$ output by policy π in state $s_t$ is relative to the average; clip() is the clipping function used to constrain the change between the new and old policies; and ε is the clipping parameter used to control the extent of the constraint;
step 10.2: computing the discounted return $\sum_{k=t}^{T}\gamma^{k-t} r^i_k$ and updating φ by minimizing the loss function, the loss function $L^i(\phi)$ being shown in Equation 8:
$$L^i(\phi) = \frac{1}{T}\sum_{t=0}^{T}\left(\sum_{k=t}^{T}\gamma^{k-t} r^i_k - V^i_\phi(s_t)\right)^2 \quad (8)$$
wherein i is the agent index; T is the duration of one complete interaction; γ ∈ [0, 1) is the discount coefficient used for measuring how much future rewards are discounted at the current time; $r^i_k$ is the instant reward; and $V^i_\phi(s_t)$ is the output value of the Critic network.
CN202210729574.7A 2022-06-24 2022-06-24 Intelligent planning and decision-making method for landing behavior of multi-node detector Pending CN115374933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210729574.7A CN115374933A (en) 2022-06-24 2022-06-24 Intelligent planning and decision-making method for landing behavior of multi-node detector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210729574.7A CN115374933A (en) 2022-06-24 2022-06-24 Intelligent planning and decision-making method for landing behavior of multi-node detector

Publications (1)

Publication Number Publication Date
CN115374933A true CN115374933A (en) 2022-11-22

Family

ID=84062159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210729574.7A Pending CN115374933A (en) 2022-06-24 2022-06-24 Intelligent planning and decision-making method for landing behavior of multi-node detector

Country Status (1)

Country Link
CN (1) CN115374933A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116562332A (en) * 2023-07-10 2023-08-08 长春工业大学 Robot social movement planning method in man-machine co-fusion environment
CN116562332B (en) * 2023-07-10 2023-09-12 长春工业大学 Robot social movement planning method in man-machine co-fusion environment
CN117215196A (en) * 2023-10-17 2023-12-12 成都正扬博创电子技术有限公司 Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning
CN117215196B (en) * 2023-10-17 2024-04-05 成都正扬博创电子技术有限公司 Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN115374933A (en) Intelligent planning and decision-making method for landing behavior of multi-node detector
Grefenstette et al. Learning sequential decision rules using simulation models and competition
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN111260026B (en) Navigation migration method based on meta reinforcement learning
CN111580544A (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN113821041B (en) Multi-robot collaborative navigation and obstacle avoidance method
CN115016534A (en) Unmanned aerial vehicle autonomous obstacle avoidance navigation method based on memory reinforcement learning
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN116136945A (en) Unmanned aerial vehicle cluster countermeasure game simulation method based on anti-facts base line
CN112146879A (en) Rolling bearing fault intelligent diagnosis method and system
CN113359704B (en) Self-adaptive SAC-PID method suitable for complex unknown environment
CN114083543A (en) Active fault diagnosis method for space manipulator
CN114626505A (en) Mobile robot deep reinforcement learning control method
CN117332693A (en) Slope stability evaluation method based on DDPG-PSO-BP algorithm
Zhang et al. Universal value iteration networks: When spatially-invariant is not universal
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
CN115964898A (en) Bignty game confrontation-oriented BC-QMIX on-line multi-agent behavior decision modeling method
CN114911157A (en) Robot navigation control method and system based on partial observable reinforcement learning
CN114396949A (en) Mobile robot no-priori map navigation decision-making method based on DDPG
Zhang et al. Gliding control of underwater gliding snake-like robot based on reinforcement learning
CN113807005A (en) Bearing residual life prediction method based on improved FPA-DBN
Manela Deep reinforcement learning for complex manipulation tasks with sparse feedback

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination