CN115826601A - Unmanned aerial vehicle path planning method based on reverse reinforcement learning - Google Patents

Unmanned aerial vehicle path planning method based on reverse reinforcement learning

Info

Publication number
CN115826601A
Authority
CN
China
Prior art keywords
expert
function
theta
network
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211437557.2A
Other languages
Chinese (zh)
Inventor
杨秀霞
张毅
王晨蕾
杨林
李文强
姜子劼
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval Aeronautical University
Original Assignee
Naval Aeronautical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval Aeronautical University filed Critical Naval Aeronautical University
Priority to CN202211437557.2A priority Critical patent/CN115826601A/en
Publication of CN115826601A publication Critical patent/CN115826601A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an unmanned aerial vehicle (UAV) path planning method based on reverse reinforcement learning, aiming to solve the problems of slow convergence and difficult reward-function setting that arise when the deep deterministic policy gradient (DDPG) algorithm is used to plan a safe collision-avoidance path for a UAV. First, a demonstration trajectory data set of an expert operating the UAV to avoid obstacles is collected in simulator software. Second, a mixed sampling mechanism fuses high-quality expert demonstration trajectory data into the self-exploration data when updating the network parameters, thereby reducing the exploration cost of the algorithm. Finally, the optimal reward function implicit in the expert experience is solved with the maximum entropy reverse reinforcement learning algorithm, which addresses the difficulty of designing a reward function for complex tasks. Comparative experiments show that the method effectively improves training efficiency and achieves better obstacle-avoidance performance.

Description

Unmanned aerial vehicle path planning method based on reverse reinforcement learning
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle path planning, and particularly relates to an unmanned aerial vehicle path planning method based on reverse reinforcement learning.
Background
With the further opening of the UAV (Unmanned Aerial Vehicle) field, the flight safety of UAVs is greatly threatened by dense dynamic obstacles in complex environments such as cities and mountains. Traditional path planning algorithms, such as heuristic algorithms like A* and D*, and graph-theory-based methods such as the visibility graph method and the Voronoi diagram method, can only deal with simple environments in which the obstacle information is known in advance. However, the terrain of cities and mountains is complex and changeable, and the specific parameters of obstacles are difficult to obtain, so the application range of traditional obstacle avoidance algorithms is limited.
Different from traditional path planning methods, navigation methods based on reinforcement learning draw on the way biological agents acquire perception and develop behavior, continuously optimizing the obstacle-avoidance strategy through interaction with the environment. This avoids the dependence on obstacle modeling and supervised learning and provides strong generalization capability and robustness. In particular, in recent years the strong perception and function-fitting capability of deep reinforcement learning has effectively alleviated the "exponential explosion" of high-dimensional environment state and decision spaces, providing a new approach to UAV path planning in dense dynamic obstacle environments. Deep reinforcement learning algorithms such as the deep deterministic policy gradient (DDPG) algorithm, the asynchronous advantage actor-critic (A3C) algorithm, the trust region policy optimization (TRPO) algorithm and the proximal policy optimization (PPO) algorithm have been proposed successively by Silver and the Google DeepMind team, John Schulman of the University of California, Berkeley, and OpenAI.
Although these methods have significant advantages in UAV path planning, they often need to explore a large number of random obstacle-environment samples to try new strategies and are prone to falling into local optima. The invention therefore proposes a DDPG algorithm that fuses expert experience loss: expert demonstration trajectory samples with high reward values are introduced on the basis of the self-exploration samples to save exploration space, and an expert experience loss gradient is introduced to optimize the network parameters and obtain the optimal strategy.
The DDPG algorithm fused with expert experience loss solves the problem of strategy iteration and optimization, but the design of the reward function remains highly subjective, and the reward obtained through interaction with the environment is usually sparse, which makes the algorithm extremely difficult to converge during training and yields poor path planning. When an expert completes the obstacle-avoidance task, the expert's strategy is usually optimal, so learning the expert experience from the expert demonstration trajectories to construct the reward function matches practical requirements far better than a manually designed reward function. Given the expert trajectories, the algorithm that inversely derives the reward function hidden in the expert experience is Inverse Reinforcement Learning (IRL). IRL can be divided into two broad categories, maximum margin and maximum entropy; maximum-margin methods suffer from ambiguity, in that different reward functions with arbitrary preferences can be derived from the same expert strategy. The maximum entropy model is constructed entirely from known data (i.e., the expert trajectories) and makes no subjective assumption about the distribution of unknown information, so the ambiguity problem is effectively avoided.
Disclosure of Invention
In order to overcome the problems in the prior art, the invention provides an unmanned aerial vehicle path planning method based on reverse reinforcement learning.
The technical scheme for solving the technical problems is as follows:
an unmanned aerial vehicle path planning method based on reverse reinforcement learning comprises the following steps:
step 1, collecting an expert demonstration track data set and a self-exploration track data set of an expert operating UAV obstacle avoidance;
step 2, constructing an experience pool, wherein the experience pool is composed of an expert demonstration track data set and a self-exploration track data set, and a mixed sampling mechanism is adopted to sample the two data sets respectively to form a final training sample;
step 3, based on the DDPG, introducing an expert experience loss function to guide iterative updating of DDPG parameters, and accelerating solving of an optimal strategy;
step 4, constructing a reward function, and solving the reward function based on a maximum entropy reverse reinforcement learning algorithm, namely solving a hidden probability model generating a track under the condition that the track is demonstrated by known experts;
and 5, training the DDPG until the DDPG completes the flight task with the optimal strategy under the optimal reward function hidden by the expert track.
Further, constructing the experience pool in step 2 specifically comprises the following steps:
The experience pool is composed of the expert demonstration trajectory data set T_expert and the self-exploration trajectory data set T_discover, and a mixed sampling mechanism samples from the two data sets respectively to form the final training sample T:
T = α·T_expert + β·T_discover (1)
where α is the proportion sampled from the training set T_expert and β is the proportion sampled from the training set T_discover.
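For illustration, the mixed sampling of equation (1) can be sketched in Python as follows; the flat list-of-tuples buffer layout, the batch size and the default values of α and β are assumptions made for this sketch and are not prescribed by the method.

```python
import random

def sample_mixed_batch(expert_buffer, discover_buffer, batch_size, alpha=0.3, beta=0.7):
    """Draw a training batch T = alpha*T_expert + beta*T_discover (eq. (1)).

    expert_buffer / discover_buffer: lists of (s, a, r, s') tuples.
    alpha, beta: sampling proportions for the two data sets.
    """
    n_expert = int(round(batch_size * alpha / (alpha + beta)))
    n_discover = batch_size - n_expert
    batch = random.sample(expert_buffer, min(n_expert, len(expert_buffer)))
    batch += random.sample(discover_buffer, min(n_discover, len(discover_buffer)))
    random.shuffle(batch)
    return batch
```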
Further, the DDPG algorithm guided by the expert experience loss function introduced in step 3 comprises an online policy network μ(s|θ^μ), an online value-function network Q(s, a|θ^Q), a target policy network μ′(s|θ^{μ′}) and a target value-function network Q′(s, a|θ^{Q′}).
Further, the optimization of the online value-function network parameters θ^Q specifically comprises the following steps:
According to the Bellman equation, at the i-th training time step the action target value y_i of the online value-function network is:
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′}) (2)
The error δ_i between the action target value and the actual output Q(s_i, a_i|θ^Q) of the online value-function network is:
δ_i = y_i − Q(s_i, a_i|θ^Q) (3)
From equation (3), the loss function of the online value-function network is obtained:
J(θ^Q) = (1/N)·Σ_{i=1}^{N} δ_i² (4)
The loss function J(θ^Q) is minimized by gradient descent to optimize and update the online value-function network parameters θ^Q. Differentiating J(θ^Q) with respect to θ^Q gives the gradient:
∇_{θ^Q} J(θ^Q) = −(2/N)·Σ_{i=1}^{N} δ_i·∇_{θ^Q} Q(s_i, a_i|θ^Q) (5)
The online value-function network parameters are updated according to equation (5).
Further, the optimization of the online policy network parameters specifically comprises the following steps:
The online policy network parameters are optimized from two parts: expert demonstration trajectory samples and self-exploration samples.
For the expert demonstration trajectory data, the mean square error J_exp(θ^μ) between the action a_i predicted by the online policy network from the current expert state s_i^E and the real expert action a_i^E is introduced as the expert experience loss, which continuously drives the predicted output strategy of the network toward the expert strategy:
J_exp(θ^μ) = (1/N)·Σ_{i=1}^{N} (a_i − a_i^E)² (6)
where a_i = μ(s_i^E|θ^μ) is the action predicted by the online policy network from the current expert state s_i^E.
Differentiating the expert experience loss J_exp(θ^μ) with respect to the policy network parameters θ^μ gives the gradient:
∇_{θ^μ} J_exp(θ^μ) = (2/N)·Σ_{i=1}^{N} (μ(s_i^E|θ^μ) − a_i^E)·∇_{θ^μ} μ(s_i^E|θ^μ) (7)
For the self-exploration data, the parameters θ^μ are updated with the online policy gradient of the original DDPG algorithm:
∇_{θ^μ} J(θ^μ) = (1/N)·Σ_{i=1}^{N} ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)}·∇_{θ^μ} μ(s|θ^μ)|_{s=s_i} (8)
The parameters of the online policy network are then updated with the fused gradient:
∇_{θ^μ} J_fuse(θ^μ) = λ·∇_{θ^μ} J_exp(θ^μ) + (1 − λ)·∇_{θ^μ} J(θ^μ) (9)
where λ is the fusion gradient adjustment factor.
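A possible realization of the fused policy update in equations (6)–(9) is sketched below, assuming λ weights the expert experience loss against the original DDPG policy objective and that the actor, critic and optimizer objects are PyTorch modules; it is an illustration, not the exact training code.

```python
import torch
import torch.nn.functional as F

def update_actor(actor, critic, actor_opt, batch, expert_batch=None, lam=0.5):
    """Fused policy update: lambda * expert experience loss + (1 - lambda) * DDPG objective."""
    s, _, _, _ = batch
    ddpg_loss = -critic(s, actor(s)).mean()                  # original DDPG policy objective, cf. eq. (8)
    if expert_batch is not None:
        s_e, a_e = expert_batch                              # expert states and real expert actions
        bc_loss = F.mse_loss(actor(s_e), a_e)                # expert experience loss J_exp, eq. (6)
        loss = lam * bc_loss + (1.0 - lam) * ddpg_loss       # fused objective, cf. eq. (9)
    else:
        loss = ddpg_loss
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```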
Further, the target network parameters are updated on the basis of the online network parameters in a soft-update manner:
θ^{Q′} ← τ·θ^Q + (1 − τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1 − τ)·θ^{μ′} (10)
where τ < 1.
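The soft update of equation (10) can be written compactly as follows (PyTorch modules are assumed):

```python
def soft_update(online_net, target_net, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, with tau < 1."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.data.copy_(tau * p_online.data + (1.0 - tau) * p_target.data)
```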
Further, constructing the reward function in step 4 comprises the following steps:
Given a trajectory ζ generated by the expert operating the UAV to avoid obstacles:
ζ = {(s_1, a_1), (s_2, a_2), … (s_n, a_n)} (11)
the reward value r(ζ) of that trajectory is:
r(ζ) = Σ_{i=1}^{n} r(s_i, a_i) (12)
The reward function is fitted with a linear combination of a limited number of significant feature functions f(·):
r(s, a) = Σ_{i=1}^{N} θ_i·f_i(s, a) = θ^T·f(s, a) (13)
where f_i is the i-th feature component of the reward function, θ_i is the i-th component of the reward-function weight vector, and N is the number of feature components in the reward function.
The Euclidean distance d of the UAV relative to the obstacle, the relative-distance heading angle ψ_d, the relative-distance climb angle γ_d, the speed v of the UAV relative to the obstacle, the relative-velocity heading angle ψ_v and the relative-velocity climb angle γ_v are important features of the UAV obstacle-avoidance process, so
f(s, a) = [d, ψ_d, γ_d, v, ψ_v, γ_v]^T (14)
F(ζ) is defined as the sum over the states of the trajectory of the feature components in equation (14):
F(ζ) = Σ_{i=1}^{n} f(s_i, a_i) (15)
Substituting equation (15) into equation (13), the reward value of each trajectory is expressed as:
r(ζ) = θ^T·F(ζ) (16).
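The linear feature-based reward of equations (13)–(16) can be sketched as below; the six-dimensional feature layout follows equation (14), while the function names and the NumPy representation are illustrative assumptions.

```python
import numpy as np

def state_features(d, psi_d, gamma_d, v, psi_v, gamma_v):
    """Feature vector f(s, a) of eq. (14): relative distance/velocity and their angles."""
    return np.array([d, psi_d, gamma_d, v, psi_v, gamma_v], dtype=np.float64)

def trajectory_features(feature_list):
    """F(zeta): sum of the per-state feature vectors along the trajectory, eq. (15)."""
    return np.sum(np.stack(feature_list), axis=0)

def trajectory_reward(theta, feature_list):
    """r(zeta) = theta^T F(zeta), eq. (16)."""
    return float(theta @ trajectory_features(feature_list))
```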
further, the step 4 of solving the reward function based on the maximum entropy inverse reinforcement learning algorithm specifically includes the following steps:
given m expert tracks, the characteristic expectation of an expert is:
Figure BDA0003947318330000055
with the expert trajectories known, assume a potential probability distribution of p (ζ) i | θ), then the expert trajectory is characterized by:
Figure BDA0003947318330000056
the maximum entropy model is constructed based on the expert track in the formula, and the problem of solving the maximum entropy is converted into an optimization problem:
Figure BDA0003947318330000057
wherein p = p (ζ) i |θ);
Converting the optimization problem into a dual form:
Figure BDA0003947318330000058
in the formula, λ j 、λ 0 Is a lagrange multiplier;
let the loss function L (p) derive the expert demonstration trajectory distribution probability p, and obtain:
Figure BDA0003947318330000061
and (3) making the above expression equal to 0, obtaining a maximum entropy probability model of the expert demonstration track:
Figure BDA0003947318330000062
in the formula of j A weight vector theta corresponding to a feature function in the reward function;
Figure BDA0003947318330000063
in the formula, Z (theta) is a distribution item, namely the sum of all possible expert track probabilities;
in the probability model shown in the above formula, the higher the probability of the occurrence of the expert trajectory, i.e. the higher the Z (θ), the closer the reward function setting is to the optimal strategy implied in the expert example; the method can convert the optimal reward function solving into the entropy of the maximized expert track distribution for optimization:
Figure BDA0003947318330000064
and (3) converting the above equation into a minimized negative log-likelihood function solution loss quantity of the characteristic component weight theta of the reward function:
Figure BDA0003947318330000065
and (3) calculating an expert track prediction distribution function Z (theta) under the current strategy:
Figure BDA0003947318330000066
in the formula, T samp To representThe expert tracks under the current strategy, and n represents the number of the expert tracks under the current strategy;
for continuous expert state in sampled expert track
Figure BDA0003947318330000067
And corresponding real expert strategy
Figure BDA0003947318330000068
Discretization is performed, and random batch sampling is performed from the discretization, and the formula (25) is converted into:
Figure BDA0003947318330000069
in the above equation, the loss function J (θ) is:
Figure BDA0003947318330000071
the loss function J (θ) is differentiated from the weight θ of the reward function, and the optimal reward function is solved by a gradient descent method, so that:
Figure BDA0003947318330000072
in conclusion, the global optimal solution r of the reward function is finally learned through the equation (29) * (s i ,a i )。
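A compact sketch of the gradient-descent update on the reward weights θ implied by equations (22)–(29) follows; the partition term is estimated from trajectories sampled under the current strategy, and the array shapes and learning rate are assumptions of the sketch.

```python
import numpy as np

def maxent_irl_step(theta, expert_F, sampled_F, lr=0.01):
    """One gradient step on the negative log-likelihood of the expert trajectories.

    expert_F:  array [m, k] of trajectory features F(zeta) from T_expert
    sampled_F: array [n, k] of trajectory features from T_samp (current strategy)
    """
    # p(zeta | theta) over sampled trajectories, eq. (22), with Z(theta) as in eq. (26)
    logits = sampled_F @ theta
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    # gradient: model feature expectation minus expert feature expectation, cf. eq. (29)
    grad = p @ sampled_F - expert_F.mean(axis=0)
    return theta - lr * grad
```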
Further, training the DDPG in step 5 until it completes the flight mission with the optimal strategy under the optimal reward function implied by the expert trajectories comprises the following steps:
Randomly initialize the online policy network μ(s|θ^μ) and the online value-function network Q(s, a|θ^Q) with parameters θ^μ and θ^Q, initialize the target networks μ′ and Q′ and their weights, and initialize the reward-function weights θ_i; initialize the experience pool and store the collected expert demonstration trajectory data set T_expert into it;
a) The online policy network obtains an action a = π_θ(s) + η_t based on the current state s, where η_t is random noise and the action-selection strategy π depends on the design of the reward function;
b) Execute action a by interacting with the environment to obtain the new state s′ and the immediate reward value r;
c) Store the self-exploration sample data (s, a, r, s′) generated by interacting with the environment, i.e. T_discover, into the experience pool;
d) s = s′;
e) Randomly sample N samples from the experience pool for training, estimate the distribution function Z(θ) according to equation (26), minimize the objective shown in equation (28), and optimize the reward-function weights θ_i to obtain the optimal reward function;
f) Update the value-function network parameters θ^Q according to equation (5); if the training data T come from T_expert, update the policy network parameters according to equation (7); if the training data T come from T_discover, update the policy network parameters θ^μ according to equation (8);
g) Update the target network parameters θ^{Q′} and θ^{μ′} according to equation (10);
h) When s′ is a terminal state, end the current iteration; otherwise go to step a).
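The training procedure a)–h) can be organized as in the schematic loop below. It reuses the helper sketches given earlier (sample_mixed_batch, maxent_irl_step, update_critic, update_actor, soft_update); the environment interface, the feature extractor and the application of the fused update to every mixed batch are simplifying assumptions, so the sketch only illustrates the control flow.

```python
import numpy as np
import torch

def to_tensors(batch):
    """Stack a list of (s, a, r, s') tuples into batched float tensors."""
    s, a, r, s2 = zip(*batch)
    t = lambda x: torch.as_tensor(np.asarray(x), dtype=torch.float32)
    return t(s), t(a), t(r).reshape(-1, 1), t(s2)

def train(env, nets, opts, expert_buffer, expert_sa, expert_F, theta,
          compute_features, episodes=2000, batch_size=64, noise_std=0.1):
    """Schematic training loop for steps a)-h).

    nets = (actor, critic, actor_target, critic_target); opts = (actor_opt, critic_opt).
    env exposes reset() -> s and step(a) -> (s', done); compute_features(s, a) -> feature vector;
    expert_sa is a (states, actions) tensor pair and expert_F an array of expert trajectory features.
    """
    actor, critic, actor_t, critic_t = nets
    actor_opt, critic_opt = opts
    discover_buffer = []
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            with torch.no_grad():                                   # a) act with exploration noise
                a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy()
            a = a + noise_std * np.random.randn(*a.shape)
            s_next, done = env.step(a)                              # b) interact with the environment
            r = float(theta @ compute_features(s, a))               # reward from learned weights theta
            discover_buffer.append((s, a, r, s_next))               # c) store self-exploration sample
            s = s_next                                              # d)
            if len(discover_buffer) < batch_size:
                continue
            T = sample_mixed_batch(expert_buffer, discover_buffer, batch_size)   # eq. (1)
            samp_F = np.stack([compute_features(si, ai) for si, ai, *_ in T])
            theta = maxent_irl_step(theta, expert_F, samp_F)        # e) optimize the reward weights
            update_critic(critic, critic_t, actor_t, critic_opt, to_tensors(T))  # f) value network
            update_actor(actor, critic, actor_opt, to_tensors(T), expert_batch=expert_sa)
            soft_update(actor, actor_t); soft_update(critic, critic_t)           # g) target networks
    return actor, theta                                             # h) episode ends when done
```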
Compared with the prior art, the invention has the following technical effects:
the invention provides an unmanned aerial vehicle path planning method based on reverse reinforcement learning, aiming at the problem of UAV path planning. Firstly, acquiring a demonstration track data set of an expert-controlled UAV obstacle avoidance based on simulator software; secondly, a mixed sampling mechanism is adopted, high-quality expert demonstration track data are fused in self-exploration data, and network parameters are updated, so that the exploration cost is reduced; and finally, solving the optimal reward function according to the maximum entropy reverse reinforcement learning algorithm, and solving the problems that the reward function is difficult to design and the experience of the invisible expert cannot be fully mined.
Drawings
FIG. 1 is a schematic diagram of a simulator assembly and data display according to the present invention;
FIG. 2 is a block diagram of simulator training of the present invention;
FIG. 3 is a diagram of an expert demonstration trace data set of the present invention;
FIG. 4 is a schematic diagram of a testing environment of the present invention;
FIG. 5 is a block diagram of a hybrid sampling mechanism of the present invention;
FIG. 6 is a diagram of the improved DDPG algorithm training framework of the present invention;
FIG. 7 is a graph of reward values for fused expert experience loss in accordance with the present invention;
FIG. 8 is a graph of reward values for the present invention based on maximum entropy inverse reinforcement learning;
FIG. 9 is a graph of reward values for the integrated improved DDPG algorithm of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
In order to solve the problems of slow convergence and difficult reward-function setting that arise when the deep deterministic policy gradient algorithm is used to plan a safe collision-avoidance path for an unmanned aerial vehicle, the invention provides an unmanned aerial vehicle path planning method based on reverse reinforcement learning, which comprises the following steps: step 1, collecting an expert demonstration trajectory data set of an expert operating the UAV to avoid obstacles and a self-exploration trajectory data set; step 2, constructing an experience pool composed of the expert demonstration trajectory data set and the self-exploration trajectory data set, and sampling from the two data sets with a mixed sampling mechanism to form the final training sample; step 3, based on the DDPG, introducing an expert experience loss function to guide the iterative updating of the DDPG parameters and accelerate solving the optimal strategy; step 4, constructing a reward function and solving it based on the maximum entropy reverse reinforcement learning algorithm, i.e. solving the hidden probability model that generates the trajectories given the known expert demonstration trajectories; and step 5, training the DDPG until it completes the flight mission with the optimal strategy under the optimal reward function implied by the expert trajectories.
The following is a description of the above steps:
step 1, collecting an expert demonstration track data set and a self-exploration track data set of an expert operating UAV obstacle avoidance;
acquiring an expert demonstration track based on a complex obstacle scene carried by a simulator Phoenix RC software, and constructing a simple obstacle scene based on Python to generate a self-exploration data sample by interaction of the UAV and the environment.
The expert demonstration trajectories are collected with the professional radio-controlled flight simulation software Phoenix RC developed by Runtime Games. Three components of the Phoenix RC simulator are mainly used: 1) the remote controller module, which matches each channel of the remote controller to the functions of the UAV model, including the stick travel of the rudder, elevator, throttle and ailerons, so as to control the motion of the UAV; 2) the UAV model module, which includes multiple types such as fixed-wing and rotary-wing aircraft; 3) the scene module, which provides hundreds of three-dimensional simulated obstacle environments and allows variable simulation conditions such as wind and illumination to be customized. In the simulation environment, data such as the UAV flight speed, heading angle, GPS position, gyroscope and barometer readings can be acquired through an interface function and displayed in real time. The components and data display are shown in FIG. 1.
The framework for obstacle-avoidance training with a manually operated UAV model in the Phoenix RC simulator is shown in FIG. 2. After the environment state information, including obstacle positions and the distances between the UAV and the obstacles, is obtained in the three-dimensional simulated obstacle environment, the expert manually operates the rudder, elevator, throttle and aileron sticks of the remote controller and continuously adjusts the heading angle, pitch angle and flight speed of the UAV model to avoid the obstacles.
Part of the expert demonstration trajectories collected from the Phoenix RC simulator is shown in FIG. 3. The benefits of collecting the obstacle-environment data set in the simulator are: 1) the obstacle settings and scene types in the simulator are complex and varied and close to the real world; 2) training is carried out entirely in the simulated scene, and the UAV can be manually operated to perform various maneuvers in order to determine the optimal flight strategy; 3) the simulator visually displays the monocular RGB image at each moment of the obstacle-avoidance process together with parameters such as the heading angle and flight speed of the UAV, without requiring complex sensors for perception and measurement; 4) collision damage and safety issues need not be considered.
A three-dimensional obstacle environment as shown in FIG. 4 is constructed for testing the performance of the algorithm, so that the UAV generates a self-exploration data set in interaction with the environment. The obstacle environment has length L, width W and height H; dynamic and static obstacles with different threat degrees exist in the environment, and the spatial positions, movement speeds and influence ranges of the obstacles are unknown.
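For reference, a minimal representation of such a test environment might look as follows; the field names, the random obstacle generator and the default dimensions (chosen to match the 200 × 250 × 400 test area described in the experiments below) are illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Obstacle:
    position: np.ndarray      # unknown to the planner at run time
    velocity: np.ndarray      # zero vector for static obstacles
    radius: float             # influence range / threat radius

@dataclass
class ObstacleEnv3D:
    length: float = 200.0
    width: float = 250.0
    height: float = 400.0
    obstacles: list = field(default_factory=list)

    def add_random_obstacle(self, max_speed=5.0, max_radius=20.0):
        """Place a dynamic or static obstacle at a random position inside the volume."""
        pos = np.random.rand(3) * np.array([self.length, self.width, self.height])
        vel = np.random.uniform(-max_speed, max_speed, size=3)
        self.obstacles.append(Obstacle(pos, vel, float(np.random.uniform(1.0, max_radius))))
```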
Step 2, constructing an experience pool, wherein the experience pool is composed of an expert demonstration trajectory data set and a self-exploration trajectory data set, and a mixed sampling mechanism is adopted to sample from the two data sets respectively to form the final training sample.
In UAV obstacle-avoidance training, in order to avoid the resource waste caused by random, inefficient exploration in the initial training phase, to diversify the samples as much as possible, and to break through the upper limit implied by the expert strategy, the experience pool of this embodiment, as shown in FIG. 5, is composed of the expert demonstration trajectory data set T_expert and the self-exploration trajectory data set T_discover, and a mixed sampling mechanism samples from the two data sets respectively to form the final training sample T:
T = α·T_expert + β·T_discover (1)
where α is the proportion sampled from the training set T_expert and β is the proportion sampled from the training set T_discover.
Step 3, based on the DDPG, introducing an expert experience loss function to guide the iterative updating of the DDPG parameters and accelerate solving the optimal strategy.
Aiming at the defects of a large exploration space and low sample reward values in the initial stage of the original DDPG algorithm, an improved DDPG algorithm that fuses expert experience loss is proposed to optimize the strategy iteration. The training samples of the original DDPG algorithm contain only the data set generated by self-exploration through interaction with the environment; this embodiment adopts a mixed sampling mechanism and introduces some expert demonstration trajectory samples on the basis of the self-exploration samples. For the expert trajectory data set, an expert experience loss function is introduced to guide the iterative updating of the DDPG parameters and accelerate solving the optimal strategy; the self-exploration data samples are still updated according to the original DDPG algorithm.
The DDPG algorithm guided by the expert experience loss function comprises four networks: the online policy network μ(s|θ^μ), the online value-function network Q(s, a|θ^Q), the target policy network μ′(s|θ^{μ′}) and the target value-function network Q′(s, a|θ^{Q′}).
According to the Bellman equation, at the i-th training time step the action target value y_i of the online value-function network is:
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′}) (2)
The error δ_i between the action target value and the actual output Q(s_i, a_i|θ^Q) of the online value-function network is:
δ_i = y_i − Q(s_i, a_i|θ^Q) (3)
From equation (3), the loss function of the online value-function network can be obtained:
J(θ^Q) = (1/N)·Σ_{i=1}^{N} δ_i² (4)
The loss function J(θ^Q) is minimized by gradient descent to optimize and update the online value-function network parameters θ^Q. Differentiating J(θ^Q) with respect to θ^Q gives the gradient:
∇_{θ^Q} J(θ^Q) = −(2/N)·Σ_{i=1}^{N} δ_i·∇_{θ^Q} Q(s_i, a_i|θ^Q) (5)
The online value-function network parameters are updated according to equation (5).
The optimization of the online policy network parameters is divided into two parts: expert demonstration trajectory samples and self-exploration samples. For the expert demonstration trajectory data, the mean square error J_exp(θ^μ) between the action a_i predicted by the online policy network from the current expert state s_i^E and the real expert action a_i^E is introduced into the policy network as the expert experience loss, so that the predicted output strategy of the network continuously tends toward the expert strategy:
J_exp(θ^μ) = (1/N)·Σ_{i=1}^{N} (a_i − a_i^E)² (6)
where a_i = μ(s_i^E|θ^μ) is the action predicted by the online policy network from the current expert state s_i^E.
Differentiating the expert experience loss J_exp(θ^μ) with respect to the policy network parameters θ^μ gives the gradient:
∇_{θ^μ} J_exp(θ^μ) = (2/N)·Σ_{i=1}^{N} (μ(s_i^E|θ^μ) − a_i^E)·∇_{θ^μ} μ(s_i^E|θ^μ) (7)
Because the trajectories of the expert strategy are limited and cannot cover the whole state and action space, the UAV should explore a larger space in its interaction with the environment, break through the upper limit implied by the expert strategy, and improve the stability of the algorithm. Therefore, while the expert experience loss gradient is introduced into the iterative optimization of the online policy, the self-exploration trajectory data set still updates the parameters θ^μ with the online policy gradient of the original DDPG algorithm:
∇_{θ^μ} J(θ^μ) = (1/N)·Σ_{i=1}^{N} ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)}·∇_{θ^μ} μ(s|θ^μ)|_{s=s_i} (8)
By adopting both the online policy gradient containing the expert experience loss, ∇_{θ^μ}J_exp(θ^μ), and the original online policy gradient, ∇_{θ^μ}J(θ^μ), the expert experience loss method on the one hand introduces high-quality expert strategies to save exploration space in the initial stage and improve convergence efficiency, and on the other hand continues to learn through self-exploration so as to try to acquire better strategies not covered by the expert trajectories. Finally, the parameters of the online policy network are updated with the fused gradient:
∇_{θ^μ} J_fuse(θ^μ) = λ·∇_{θ^μ} J_exp(θ^μ) + (1 − λ)·∇_{θ^μ} J(θ^μ) (9)
where λ is the fusion gradient adjustment factor.
The target network parameters are updated on the basis of the online network parameters in a soft-update manner:
θ^{Q′} ← τ·θ^Q + (1 − τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1 − τ)·θ^{μ′} (10)
where τ < 1.
Step 4, constructing a reward function and solving it based on the maximum entropy reverse reinforcement learning algorithm, i.e. solving the hidden probability model that generates the trajectories given the known expert demonstration trajectories.
Given a trajectory ζ generated by the expert operating the UAV to avoid obstacles:
ζ = {(s_1, a_1), (s_2, a_2), … (s_n, a_n)} (11)
the reward value r(ζ) of that trajectory is:
r(ζ) = Σ_{i=1}^{n} r(s_i, a_i) (12)
The reward function is fitted with a linear combination of a limited number of significant feature functions f(·):
r(s, a) = Σ_{i=1}^{N} θ_i·f_i(s, a) = θ^T·f(s, a) (13)
where f_i is the i-th feature component of the reward function, θ_i is the i-th component of the reward-function weight vector, and N is the number of feature components in the reward function.
In the process of operating the UAV to avoid obstacles, the expert operator usually makes decisions according to the current UAV flight speed, the azimuth and distance between the UAV and the obstacle, and so on. Thus the Euclidean distance d of the UAV relative to the obstacle, the relative-distance heading angle ψ_d, the relative-distance climb angle γ_d, the speed v of the UAV relative to the obstacle, the relative-velocity heading angle ψ_v and the relative-velocity climb angle γ_v are important features of the UAV obstacle-avoidance process, so
f(s, a) = [d, ψ_d, γ_d, v, ψ_v, γ_v]^T (14)
F(ζ) is defined as the sum over the states of the trajectory of the feature components in equation (14):
F(ζ) = Σ_{i=1}^{n} f(s_i, a_i) (15)
Substituting equation (15) into equation (13), the reward value of each trajectory can be expressed as:
r(ζ) = θ^T·F(ζ) (16)
Given m expert trajectories, the feature expectation of the expert is:
f̂ = (1/m)·Σ_{ζ∈T_expert} F(ζ) (17)
With the expert trajectories known, assume an underlying probability distribution p(ζ_i|θ); then the feature expectation of the expert trajectories satisfies:
Σ_i p(ζ_i|θ)·F(ζ_i) = f̂ (18)
The maximum entropy model constructed from the above formula is based entirely on known data (i.e., the expert trajectories) and makes no subjective assumption about unknown conditions, so the ambiguity problem of a user-defined reward function can be effectively avoided. Solving the maximum entropy problem is converted into the optimization problem:
max_p −Σ_i p·log p, s.t. Σ_i p·F(ζ_i) = f̂, Σ_i p = 1 (19)
where p = p(ζ_i|θ).
The optimization problem is converted into its dual (Lagrangian) form:
L(p) = Σ_i p·log p + Σ_j λ_j·(Σ_i p·F_j(ζ_i) − f̂_j) + λ_0·(Σ_i p − 1) (20)
where λ_j and λ_0 are Lagrange multipliers.
Differentiating the loss function L(p) with respect to the expert demonstration trajectory distribution probability p gives:
∂L/∂p = log p + 1 + Σ_j λ_j·F_j(ζ_i) + λ_0 (21)
Setting the above expression equal to 0 yields the maximum entropy probability model of the expert demonstration trajectories:
p(ζ_i|θ) = (1/Z(θ))·exp(θ^T·F(ζ_i)) (22)
where λ_j corresponds to the weight vector θ of the feature functions in the reward function, and
Z(θ) = Σ_ζ exp(θ^T·F(ζ)) (23)
where Z(θ) is the partition term, i.e., the normalizing sum over all possible expert trajectories.
In the probability model shown above, the greater the probability of the expert trajectories, i.e. the greater Z(θ), the closer the reward function is to the optimal strategy implied in the expert examples. Solving the optimal reward function can therefore be converted into maximizing the likelihood of the expert trajectory distribution:
θ* = argmax_θ Σ_{ζ∈T_expert} log p(ζ|θ) (24)
which is equivalent to minimizing the negative log-likelihood with respect to the feature-component weights θ of the reward function:
min_θ J(θ) = −(1/m)·Σ_{ζ∈T_expert} log p(ζ|θ) (25)
The expert trajectory prediction distribution function Z(θ) under the current strategy is calculated as:
Z(θ) ≈ Σ_{ζ∈T_samp} exp(θ^T·F(ζ)) (26)
where T_samp represents the expert trajectories under the current strategy and n represents the number of such trajectories.
Since expert cognition differs to a certain extent, in order to reduce the fitting variance of the weights θ, the continuous expert states s_i^E and the corresponding real expert actions a_i^E in the sampled expert trajectories are discretized and randomly batch-sampled, and equation (25) is converted into:
θ* = argmin_θ J(θ) (27)
where the loss function J(θ) is:
J(θ) = −(1/m)·Σ_{ζ∈T_expert} θ^T·F(ζ) + log Z(θ) (28)
Differentiating the loss function J(θ) with respect to the reward-function weights θ and solving the optimal reward function by gradient descent gives:
∇_θ J(θ) = −(1/m)·Σ_{ζ∈T_expert} F(ζ) + Σ_{ζ∈T_samp} p(ζ|θ)·F(ζ) (29)
In conclusion, the globally optimal reward function r*(s_i, a_i) can finally be learned through equation (29).
Step 5, training the DDPG until it completes the flight mission with the optimal strategy under the optimal reward function hidden in the expert trajectories.
The obstacle-avoidance problem using the improved DDPG algorithm can be described as follows: at a series of successive decision moments, the network makes a decision based on the current UAV state s; after the decision is executed, the network obtains an immediate reward value according to the reward function designed by reverse reinforcement learning, the reward corresponding to the network decision and the environment state; the network then enters the state at the next moment corresponding to the decision and updates the network parameters forward through the DDPG algorithm fused with the expert supervision loss; at the next training moment, the network executes a new decision according to the current new state and obtains a new reward value, and this process is repeated until the network completes the flight mission with the optimal strategy under the optimal reward function hidden in the expert trajectories.
The specific training comprises the following steps:
1. Randomly initialize the online policy network μ(s|θ^μ) and the online value-function network Q(s, a|θ^Q) with parameters θ^μ and θ^Q, and initialize the target networks μ′ and Q′ and their weights;
2. Construct the reward function and initialize the reward-function weights;
3. Initialize the experience pool and store the collected expert demonstration trajectory data set T_expert into it;
4. Perform network training for M iterations:
a) The online policy network obtains an action a = π_θ(s) + η_t based on the current state s, where η_t is random noise and the action-selection strategy π depends on the design of the reward function;
b) Execute action a by interacting with the environment to obtain the new state s′ and the immediate reward value r;
c) Store the self-exploration sample data (s, a, r, s′) generated by interacting with the environment, i.e. T_discover, into the experience pool;
d) s = s′;
e) Randomly sample N samples from the experience pool for training, estimate the distribution function Z(θ), minimize the objective shown in equation (28), and optimize the reward-function weights θ_i to obtain the optimal reward function;
f) Update the value-function network parameters θ^Q according to equation (5); if the training data T come from T_expert, update the policy network parameters according to equation (7); if the training data T come from T_discover, update the policy network parameters θ^μ according to equation (8);
g) Update the target network parameters θ^{Q′} and θ^{μ′} according to equation (10);
h) When s′ is a terminal state, end the current iteration; otherwise go to step a).
The obstacle-avoidance performance of the algorithm is tested in a simulation environment. The simulation experiments are run under Python 3.7 on an Intel Core i5 processor with a 2.42 GHz main frequency and the Windows 10 operating system. The test scene is the dense dynamic and static obstacle environment shown in FIG. 4; the simulation environment is set as a 200 × 250 × 400 three-dimensional area in which several dynamic obstacles move at different speeds along different heading-angle and climb-angle directions.
In order to test the strategy-learning performance of the DDPG algorithm fused with expert experience loss, the invention compares the influence of the improved algorithm and the original algorithm on the network convergence speed while keeping the reward function consistent. The designed reward function is shown in equation (30), where R is the reward, d_min is the minimum of the d_i in equation (14), i.e. the distance between the UAV and the nearest obstacle, which urges the UAV away from obstacles, d_tar is the distance between the UAV and the goal, which urges the UAV to fly toward the goal, and k_1 = 2, k_2 = 0.5:
R = k_1·d_min − k_2·d_tar (30)
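A sketch of this hand-crafted comparison reward is given below; since equation (30) is only described qualitatively here, the exact linear combination of the two distance terms is an assumption of the sketch.

```python
def baseline_reward(d_min, d_tar, k1=2.0, k2=0.5):
    """Hand-crafted reward for the comparison experiment (reconstructed form).

    d_min: distance between the UAV and the nearest obstacle (encourages keeping clear)
    d_tar: distance between the UAV and the goal (encourages flying toward the goal)
    """
    return k1 * d_min - k2 * d_tar
```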
The reward values during UAV training are shown in FIG. 7. As can be seen from the figure, with the original DDPG algorithm the strategy-learning speed of the UAV rises approximately linearly between iterations 100 and 200; between iterations 200 and 600 the model falls into a locally optimal solution, making the network difficult to converge; between iterations 600 and 1600 the strategy-learning speed continues to increase, but more slowly; after 1600 iterations the reward value gradually converges, settling at roughly 110 but with large fluctuations. With the DDPG algorithm fused with expert experience loss, the UAV's strategy-learning speed increases continuously between iterations 100 and 300; after 300 iterations the increase slows somewhat, but no stagnation occurs; after 1300 iterations the network converges and the reward value stabilizes at approximately 125, higher than the original DDPG algorithm and with a smaller fluctuation amplitude. In conclusion, the DDPG algorithm fused with experience loss converges faster, is more stable, and achieves a better obstacle-avoidance effect.
To analyze the influence of the maximum entropy reverse reinforcement learning algorithm on the obstacle-avoidance effect, the reward-value curve of the UAV trained with the optimal reward function solved by maximum entropy reverse reinforcement learning is shown in FIG. 8. The UAV trained with the optimal reward function maintains a high strategy-learning speed within iterations 100 to 500; after 500 iterations the rate of increase slows slightly; the reward value converges to approximately 170 after 1300 iterations. Compared with the original DDPG algorithm, the UAV trained with the optimal reward function obtains a significantly higher reward value, learns the strategy faster, and is more stable.
The expert experience loss and reverse reinforcement learning are then integrated to improve the DDPG algorithm. FIG. 9 shows the reward-value curve of the improved DDPG algorithm that combines expert experience loss and reverse reinforcement learning. As can be seen from the figure, combining the two improvements allows their respective advantages to complement each other: the strategy-learning speed is faster in the initial stage, the obstacle-avoidance effect is better, and the reward value converges to 175 in about 1300 iterations.
The invention improves the DDPG algorithm for the UAV path planning problem. By introducing the expert experience loss function to optimize the iterative process of the policy network, the initial exploration cost of the original DDPG algorithm is saved and the network convergence is accelerated; meanwhile, the optimal reward function hidden in the expert demonstration trajectories is solved based on the maximum entropy reverse reinforcement learning algorithm, which solves the problem that the reward function is difficult to set manually in complex tasks. Comparative experiments show that the improved DDPG algorithm converges faster and achieves a better obstacle-avoidance effect.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. An unmanned aerial vehicle path planning method based on reverse reinforcement learning is characterized by comprising the following steps:
step 1, collecting an expert demonstration track data set and a self-exploration track data set of an UAV obstacle avoidance operated by an expert;
step 2, constructing an experience pool, wherein the experience pool is composed of an expert demonstration track data set and a self-exploration track data set, and a mixed sampling mechanism is adopted to sample the two data sets respectively to form a final training sample;
step 3, based on the DDPG, introducing an expert experience loss function to guide iterative updating of DDPG parameters, and accelerating solving of an optimal strategy;
step 4, constructing a reward function, and solving the reward function based on a maximum entropy reverse reinforcement learning algorithm, namely solving a hidden probability model generating a track under the condition that the track is demonstrated by known experts;
and 5, training the DDPG until the DDPG completes the flight task with the optimal strategy under the optimal reward function hidden by the expert track.
2. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 1, wherein constructing the experience pool in step 2 specifically comprises the following steps:
the experience pool is composed of the expert demonstration trajectory data set T_expert and the self-exploration trajectory data set T_discover, and a mixed sampling mechanism samples from the two data sets respectively to form the final training sample T:
T = α·T_expert + β·T_discover (1)
where α is the proportion sampled from the training set T_expert and β is the proportion sampled from the training set T_discover.
3. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 1, wherein the DDPG algorithm guided by the expert experience loss function introduced in step 3 comprises an online policy network μ(s|θ^μ), an online value-function network Q(s, a|θ^Q), a target policy network μ′(s|θ^{μ′}) and a target value-function network Q′(s, a|θ^{Q′}).
4. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 3, wherein the optimization of the online value-function network parameters θ^Q specifically comprises the following steps:
according to the Bellman equation, at the i-th training time step the action target value y_i of the online value-function network is:
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′}) (2)
the error δ_i between the action target value and the actual output Q(s_i, a_i|θ^Q) of the online value-function network is:
δ_i = y_i − Q(s_i, a_i|θ^Q) (3)
from equation (3), the loss function of the online value-function network is obtained:
J(θ^Q) = (1/N)·Σ_{i=1}^{N} δ_i² (4)
the loss function J(θ^Q) is minimized by gradient descent to optimize and update the online value-function network parameters θ^Q; differentiating J(θ^Q) with respect to θ^Q gives the gradient:
∇_{θ^Q} J(θ^Q) = −(2/N)·Σ_{i=1}^{N} δ_i·∇_{θ^Q} Q(s_i, a_i|θ^Q) (5)
the online value-function network parameters are updated according to equation (5).
5. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 3, wherein the optimization of the online policy network parameters specifically comprises the following steps:
the online policy network parameters are optimized from two parts: expert demonstration trajectory samples and self-exploration samples;
for the expert demonstration trajectory data, the mean square error J_exp(θ^μ) between the action a_i predicted by the online policy network from the current expert state s_i^E and the real expert action a_i^E is introduced as the expert experience loss, which continuously drives the predicted output strategy of the network toward the expert strategy:
J_exp(θ^μ) = (1/N)·Σ_{i=1}^{N} (a_i − a_i^E)² (6)
where a_i = μ(s_i^E|θ^μ) is the action predicted by the online policy network from the current expert state s_i^E;
differentiating the expert experience loss J_exp(θ^μ) with respect to the policy network parameters θ^μ gives the gradient:
∇_{θ^μ} J_exp(θ^μ) = (2/N)·Σ_{i=1}^{N} (μ(s_i^E|θ^μ) − a_i^E)·∇_{θ^μ} μ(s_i^E|θ^μ) (7)
for the self-exploration data, the parameters θ^μ are updated with the online policy gradient of the original DDPG algorithm:
∇_{θ^μ} J(θ^μ) = (1/N)·Σ_{i=1}^{N} ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)}·∇_{θ^μ} μ(s|θ^μ)|_{s=s_i} (8)
the parameters of the online policy network are then updated with the fused gradient:
∇_{θ^μ} J_fuse(θ^μ) = λ·∇_{θ^μ} J_exp(θ^μ) + (1 − λ)·∇_{θ^μ} J(θ^μ) (9)
where λ is the fusion gradient adjustment factor.
6. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 3, wherein the target network parameters are updated on the basis of the online network parameters in a soft-update manner:
θ^{Q′} ← τ·θ^Q + (1 − τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1 − τ)·θ^{μ′} (10)
where τ < 1.
7. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 6, wherein constructing the reward function in step 4 comprises the following steps:
given a trajectory ζ generated by the expert operating the UAV to avoid obstacles:
ζ = {(s_1, a_1), (s_2, a_2), … (s_n, a_n)} (11)
the reward value r(ζ) of that trajectory is:
r(ζ) = Σ_{i=1}^{n} r(s_i, a_i) (12)
the reward function is fitted with a linear combination of a limited number of significant feature functions f(·):
r(s, a) = Σ_{i=1}^{N} θ_i·f_i(s, a) = θ^T·f(s, a) (13)
where f_i is the i-th feature component of the reward function, θ_i is the i-th component of the reward-function weight vector, and N is the number of feature components in the reward function;
the Euclidean distance d of the UAV relative to the obstacle, the relative-distance heading angle ψ_d, the relative-distance climb angle γ_d, the speed v of the UAV relative to the obstacle, the relative-velocity heading angle ψ_v and the relative-velocity climb angle γ_v are important features of the UAV obstacle-avoidance process, so
f(s, a) = [d, ψ_d, γ_d, v, ψ_v, γ_v]^T (14)
F(ζ) is defined as the sum over the states of the trajectory of the feature components in equation (14):
F(ζ) = Σ_{i=1}^{n} f(s_i, a_i) (15)
substituting equation (15) into equation (13), the reward value of each trajectory is expressed as:
r(ζ) = θ^T·F(ζ) (16).
8. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 7, wherein solving the reward function based on the maximum entropy reverse reinforcement learning algorithm in step 4 specifically comprises the following steps:
given m expert trajectories, the feature expectation of the expert is:
f̂ = (1/m)·Σ_{ζ∈T_expert} F(ζ) (17)
with the expert trajectories known, assume an underlying probability distribution p(ζ_i|θ); then the feature expectation of the expert trajectories satisfies:
Σ_i p(ζ_i|θ)·F(ζ_i) = f̂ (18)
the maximum entropy model is constructed on the expert trajectories in the above formula, and solving for the maximum entropy is converted into the optimization problem:
max_p −Σ_i p·log p, s.t. Σ_i p·F(ζ_i) = f̂, Σ_i p = 1 (19)
where p = p(ζ_i|θ);
the optimization problem is converted into its dual form:
L(p) = Σ_i p·log p + Σ_j λ_j·(Σ_i p·F_j(ζ_i) − f̂_j) + λ_0·(Σ_i p − 1) (20)
where λ_j and λ_0 are Lagrange multipliers;
differentiating the loss function L(p) with respect to the expert demonstration trajectory distribution probability p gives:
∂L/∂p = log p + 1 + Σ_j λ_j·F_j(ζ_i) + λ_0 (21)
setting the above expression equal to 0 yields the maximum entropy probability model of the expert demonstration trajectories:
p(ζ_i|θ) = (1/Z(θ))·exp(θ^T·F(ζ_i)) (22)
where λ_j corresponds to the weight vector θ of the feature functions in the reward function, and
Z(θ) = Σ_ζ exp(θ^T·F(ζ)) (23)
where Z(θ) is the partition term, i.e., the normalizing sum over all possible expert trajectories;
in the probability model shown above, the greater the probability of the expert trajectories, i.e. the greater Z(θ), the closer the reward function is to the optimal strategy implied in the expert examples; solving the optimal reward function is therefore converted into maximizing the likelihood of the expert trajectory distribution:
θ* = argmax_θ Σ_{ζ∈T_expert} log p(ζ|θ) (24)
which is equivalent to minimizing the negative log-likelihood with respect to the feature-component weights θ of the reward function:
min_θ J(θ) = −(1/m)·Σ_{ζ∈T_expert} log p(ζ|θ) (25)
the expert trajectory prediction distribution function Z(θ) under the current strategy is calculated as:
Z(θ) ≈ Σ_{ζ∈T_samp} exp(θ^T·F(ζ)) (26)
where T_samp represents the expert trajectories under the current strategy and n represents the number of such trajectories;
the continuous expert states s_i^E and the corresponding real expert actions a_i^E in the sampled expert trajectories are discretized and randomly batch-sampled, and equation (25) is converted into:
θ* = argmin_θ J(θ) (27)
where the loss function J(θ) is:
J(θ) = −(1/m)·Σ_{ζ∈T_expert} θ^T·F(ζ) + log Z(θ) (28)
differentiating the loss function J(θ) with respect to the reward-function weights θ and solving the optimal reward function by gradient descent gives:
∇_θ J(θ) = −(1/m)·Σ_{ζ∈T_expert} F(ζ) + Σ_{ζ∈T_samp} p(ζ|θ)·F(ζ) (29)
in conclusion, the globally optimal reward function r*(s_i, a_i) is finally learned through equation (29).
9. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 8, wherein training the DDPG in step 5 until it completes the flight mission with the optimal strategy under the optimal reward function implied by the expert trajectories comprises the following steps:
randomly initialize the online policy network μ(s|θ^μ) and the online value-function network Q(s, a|θ^Q) with parameters θ^μ and θ^Q, initialize the target networks μ′ and Q′ and their weights, and initialize the reward-function weights θ_i; initialize the experience pool and store the collected expert demonstration trajectory data set T_expert into it;
a) the online policy network obtains an action a = π_θ(s) + η_t based on the current state s, where η_t is random noise and the action-selection strategy π depends on the design of the reward function;
b) execute action a by interacting with the environment to obtain the new state s′ and the immediate reward value r;
c) store the self-exploration sample data (s, a, r, s′) generated by interacting with the environment, i.e. T_discover, into the experience pool;
d) s = s′;
e) randomly sample N samples from the experience pool for training, estimate the distribution function Z(θ) according to equation (26), minimize the objective shown in equation (28), and optimize the reward-function weights θ_i to obtain the optimal reward function;
f) update the value-function network parameters θ^Q according to equation (5); if the training data T come from T_expert, update the policy network parameters according to equation (7); if the training data T come from T_discover, update the policy network parameters θ^μ according to equation (8);
g) update the target network parameters θ^{Q′} and θ^{μ′} according to equation (10);
h) when s′ is a terminal state, end the current iteration; otherwise go to step a).
CN202211437557.2A 2022-11-17 2022-11-17 Unmanned aerial vehicle path planning method based on reverse reinforcement learning Pending CN115826601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211437557.2A CN115826601A (en) 2022-11-17 2022-11-17 Unmanned aerial vehicle path planning method based on reverse reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211437557.2A CN115826601A (en) 2022-11-17 2022-11-17 Unmanned aerial vehicle path planning method based on reverse reinforcement learning

Publications (1)

Publication Number Publication Date
CN115826601A true CN115826601A (en) 2023-03-21

Family

ID=85528602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211437557.2A Pending CN115826601A (en) 2022-11-17 2022-11-17 Unmanned aerial vehicle path planning method based on reverse reinforcement learning

Country Status (1)

Country Link
CN (1) CN115826601A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523154A (en) * 2023-03-22 2023-08-01 中国科学院西北生态环境资源研究院 Model training method, route planning method and related devices
CN116523154B (en) * 2023-03-22 2024-03-29 中国科学院西北生态环境资源研究院 Model training method, route planning method and related devices
CN117273225A (en) * 2023-09-26 2023-12-22 西安理工大学 Pedestrian path prediction method based on space-time characteristics
CN117273225B (en) * 2023-09-26 2024-05-03 西安理工大学 Pedestrian path prediction method based on space-time characteristics
CN117193008A (en) * 2023-10-07 2023-12-08 航天科工集团智能科技研究院有限公司 Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium
CN117193008B (en) * 2023-10-07 2024-03-01 航天科工集团智能科技研究院有限公司 Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium
CN118034065A (en) * 2024-04-11 2024-05-14 北京航空航天大学 Training method and device for unmanned aerial vehicle decision network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination