CN112947575B - Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning - Google Patents

Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning

Info

Publication number
CN112947575B
Authority
CN
China
Prior art keywords: unmanned aerial vehicle, UAV, target, cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110285074.4A
Other languages
Chinese (zh)
Other versions
CN112947575A (en)
Inventor
刘志宏
李杰
周文宏
王祥科
相晓嘉
丛一睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110285074.4A
Publication of CN112947575A
Application granted
Publication of CN112947575B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning. The method comprises the following steps: S1, modeling the multi-target search of the unmanned aerial vehicle cluster by constructing a cluster decision model, kinematic models of the unmanned aerial vehicles and the ground targets, a communication model among the unmanned aerial vehicles and an observation model of the unmanned aerial vehicles on the ground targets, and independently configuring a neural network model for each unmanned aerial vehicle in the cluster; S2, when multi-target searching is carried out, the neural network model of each unmanned aerial vehicle in the cluster takes as input that unmanned aerial vehicle's ground-target observation data and the communication data exchanged with the other unmanned aerial vehicles, and each unmanned aerial vehicle obtains the action to be executed from the output of its neural network model and updates the parameters of that model. The method has the advantages of simple implementation, low computational complexity, strong robustness, high search efficiency and good real-time performance.

Description

Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning
Technical Field
The invention relates to the technical field of unmanned aerial vehicle cluster control, in particular to an unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning.
Background
In unmanned aerial vehicle cluster cooperative control, a plurality of unmanned aerial vehicle cooperative search methods exist currently, such as a method based on cooperative task planning, a method based on intelligent optimization, a method based on consistency control and the like. However, most of the above searching methods are implemented by means of establishing a mathematical model and then solving a feasible solution through numerical calculation, which has the following problems:
1) Accurate modeling is difficult. The searching performance of the various searching methods needs to depend on an accurate model, but the problem of multi-target searching of the unmanned aerial vehicle cluster relates to the aspects of the unmanned aerial vehicle self state, the target state, the multi-machine cooperation state, the information interaction state and the like, so that accurate modeling is difficult to realize, meanwhile, the influence of each aspect on decision making is difficult to quantitatively analyze, and the dependence of the searching method on the accurate model is difficult to support.
2) Insufficient scalability. The state space of the problem grows exponentially with both the number of unmanned aerial vehicles and the decision horizon, so as the number of nodes in the unmanned aerial vehicle cluster increases, the above search methods face a state-space explosion and their scalability in application is insufficient.
3) Difficult to solve. The execution environment of the unmanned aerial vehicles changes in real time and the unmanned aerial vehicles need to make decisions in real time; meanwhile, the unmanned aerial vehicle cluster is generally large in scale, so it is difficult for the above search methods to obtain solutions quickly.
The multi-target search of the unmanned aerial vehicle cluster is that the unmanned aerial vehicle cluster searches a plurality of targets. Aiming at the multi-target search problem of unmanned aerial vehicle clusters, the following problems mainly exist at present:
(1) Solvability. In an unmanned aerial vehicle cluster, a large number of unmanned aerial vehicles make decisions through mutual cooperation, and the scale of the state space grows exponentially with both the number of unmanned aerial vehicles and the decision horizon; therefore, in unmanned aerial vehicle cluster control, how to solve such a large-scale distributed decision problem is one of the difficult problems.
(2) Uncertainty. In the application of target searching, the prior knowledge about the number, distribution, motion state and the like of the targets is often less or may be difficult to obtain, and meanwhile, noise and errors exist in the detection of the sensor, which can cause uncertainty. The uncertainty not only increases the problem calculation difficulty, but also seriously affects the effectiveness and stability of the search. In the prior art, most of target distribution and sensor detection success rate are assumed by adopting a probability distribution model, but when the probability distribution model is changed, the method is difficult to be applied.
(3) Real-time performance. During the execution of the search task, the environment is dynamically changed, such as sudden appearance or disappearance of the target, massive aggregation of the target, value change of the target, and unexpected events such as faults, damages and the like of the unmanned aerial vehicle at any time. In this regard, the unmanned aerial vehicle cluster needs to have the ability to adjust decisions in real time and to cope with dynamic changes in the task environment in time. However, the dynamic change of the task environment further increases the complexity of problem solving, so that online solving becomes extremely difficult.
Many challenges and difficult problems such as the above aspects are still to be broken through for multi-objective searching of unmanned aerial vehicle clusters, and no solution for effectively solving the above aspects is available. Therefore, it is needed to provide a multi-objective search method for unmanned aerial vehicle clusters, so as to have the capability of effectively reducing the problem solving scale, to process interaction coordination problems among unmanned aerial vehicles, and to have the capability of robust processing of information uncertainty, thereby realizing efficient multi-objective search for unmanned aerial vehicle clusters.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems existing in the prior art, the invention provides the unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning, which have the advantages of simple implementation method, low computational complexity, strong robust processing capability, high searching efficiency and good instantaneity, and can efficiently realize unmanned aerial vehicle cluster multi-target distributed collaborative searching.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a unmanned aerial vehicle cluster multi-target searching method based on deep reinforcement learning comprises the following steps:
s1, modeling multi-target search of unmanned aerial vehicle clusters, constructing a cluster decision model, a kinematic model of unmanned aerial vehicles and ground targets, a communication model among the unmanned aerial vehicles and a ground observation model of the unmanned aerial vehicles on the ground targets, and respectively and independently configuring a neural network model for each unmanned aerial vehicle in the clusters to construct a mapping relation between input data and output actions of each unmanned aerial vehicle;
s2, when multi-target searching is carried out, the respective neural network model of each unmanned aerial vehicle in the cluster respectively inputs observation data of the unmanned aerial vehicle on a ground target and communication data between each unmanned aerial vehicle and a neighbor unmanned aerial vehicle, each unmanned aerial vehicle obtains actions to be executed according to output actions of the respective neural network model, and the parameters of the respective neural network model are updated according to the obtained actions.
Further, the neural network model comprises six layers, wherein the six layers are sequentially connected, the first layer is an input layer for inputting the observation data of the unmanned aerial vehicle on the ground target and the communication data sent between the unmanned aerial vehicle and the neighbor unmanned aerial vehicle, the second layer and the third layer are two full-connection layers for respectively extracting the characteristics of the full-connection layers of the observation data and the communication data, and two parts of output are obtained; the fourth layer is used for connecting and combining the two output parts of the third layer and then accessing a full-connection layer for processing; the fifth layer is used for processing the output result of the fourth layer to obtain the action output of the unmanned aerial vehicle, and the sixth layer is an output layer and is used for outputting the action obtained by the fifth layer.
Further, the input data are represented with a fixed length in the first layer by means of a feature representation, wherein the observed quantity of unmanned aerial vehicle UAV i is o_i, and the observation data of unmanned aerial vehicle UAV i on ground target k is o_i^k = [x_k, y_k, v_xk, v_yk]. The observed quantity of UAV i expressed by the feature representation is:
o_i = [x_t, y_t, v_xt, v_yt, (1/z)Σ_{k=1..z} x_k, (1/z)Σ_{k=1..z} y_k, (1/z)Σ_{k=1..z} v_xk, (1/z)Σ_{k=1..z} v_yk]
where z is the number of observed targets, x_t, y_t are the position coordinates of UAV i in the x-y plane, v_xt, v_yt is the velocity of UAV i in the x-y plane, x_k, y_k are the position coordinates of ground target k in the x-y plane, and v_xk, v_yk is the linear velocity of ground target k in the x-y plane;
the communication data sent by UAV i to its neighbor unmanned aerial vehicles is c_i = [x_c, y_c, v_xc, v_yc, a_c], and the communication data sent by unmanned aerial vehicle UAV j to unmanned aerial vehicle UAV i is c^(i,j) = [x_j, y_j, v_xj, v_yj, a_j]. The communication input expressed by the feature representation is:
[x_c, y_c, v_xc, v_yc, a_c, (1/z')Σ_{j=1..z'} x_j, (1/z')Σ_{j=1..z'} y_j, (1/z')Σ_{j=1..z'} v_xj, (1/z')Σ_{j=1..z'} v_yj, (1/z')Σ_{j=1..z'} a_j]
where z' is the number of communication neighbors, x_c, y_c are the x-y position coordinates sent by UAV i to its neighbor unmanned aerial vehicles, v_xc, v_yc is the x-y velocity sent by UAV i to its neighbor unmanned aerial vehicles, a_c is the action sent by UAV i to its neighbor unmanned aerial vehicles, x_j, y_j are the x-y position coordinates sent by UAV j to UAV i, v_xj, v_yj is the x-y velocity sent by UAV j to UAV i, and a_j is the action sent by UAV j to UAV i.
Further, the output result of the fourth layer is processed in the fifth layer by using a dueling network structure, wherein the state-action value function used is:
Q(s, a) = V(s) + Adv(s, a) − (1/|A|)Σ_{a'∈A} Adv(s, a')
wherein V(s) is the state value function, Adv(s, a) is the advantage function, s is the state of the unmanned aerial vehicle cluster system, a is the action of the unmanned aerial vehicle, and A is the joint action space of the unmanned aerial vehicle cluster.
Further, in the step S2, the network weight updating method based on D3QN uses a dual-network structure to update the weights of the neural network model, where the dual-network structure comprises an evaluation network Q_E(s, a, θ) and a target network Q_T(s′, a′, θ′), in which s, a, θ are the state, action and network weight of the evaluation network, respectively, and s′, a′, θ′ are the state, action and network weight of the target network, respectively.
Further, in the step S1, a cluster decision model is constructed based on a partially observable Markov decision model, where the cluster decision model is:
(S,A,T,R,O,Z,G)
S is the joint state space of the unmanned aerial vehicle cluster system, A is the joint action space of the unmanned aerial vehicle cluster, T is the transition probability function, and R is the reward function, wherein the reward function comprises the perception state of the targets, the overlapping-search state with neighbor unmanned aerial vehicles and the out-of-range state with respect to the boundary of the region; O is the joint observation space of the unmanned aerial vehicle cluster, Z is the observation function, and G: G(N, E) is the communication topology of the unmanned aerial vehicle cluster, where N is the set of all unmanned aerial vehicles and E is the set of edges between unmanned aerial vehicles that can communicate directly.
Further, the step of constructing the reward function comprises:
constructing the reward for searching a ground target as:
[equation image in the source: per-target search reward r_tar(i, k), defined as a function of the relative horizontal distance d_i,k]
wherein d_i,k is the relative horizontal distance between unmanned aerial vehicle UAV i and ground target k, and the reward of UAV i for searching m targets is:
r_tar(i) = Σ_{k=1..m} r_tar(i, k)
constructing the reward for search overlap as:
[equation image in the source: pairwise overlap reward r_overlap(i, j) between UAV i and neighbor UAV j]
the search-overlap reward of UAV i is:
r_overlap(i) = Σ_{j=1..n, j≠i} r_overlap(i, j)
wherein n is the cluster size of the unmanned aerial vehicles;
constructing the reward for the boundary as:
[equation image in the source: boundary reward r_bound(i), applied when UAV i crosses the boundary of the search region]
finally, the reward function is constructed as:
R_i = r_tar(i) + r_overlap(i) + r_bound(i)
further, the constructed kinematic model of the unmanned aerial vehicle is as follows:
Figure BDA0002980116630000047
wherein x is u ,y u V is the position of the unmanned aerial vehicle on the x-y plane u Is the linear velocity of the unmanned aerial vehicle,
Figure BDA0002980116630000051
is the yaw rate of the unmanned aerial vehicle, a is the motion of the unmanned aerial vehicle and is used as the input of the kinematic model; the status of unmanned plane i is +.>
Figure BDA0002980116630000052
When a kinematic model of the ground target is constructed, the state of the ground target k is defined as
Figure BDA0002980116630000053
Further, when the communication model is constructed, the communication message sent by unmanned aerial vehicle UAV j to unmanned aerial vehicle UAV i is defined as c^(i,j) = [x_j, y_j, v_xj, v_yj, a_j], wherein x_j, y_j are the position coordinates of UAV j in the x-y plane, v_xj, v_yj is the linear velocity of UAV j in the x-y plane, and a_j is the action of UAV j; when the ground observation model is constructed, the ground observation model of UAV i is defined as o_i^k = [x_k, y_k, v_xk, v_yk], wherein k is a ground target and v_xk, v_yk is the linear velocity of target k in the x-y plane.
The unmanned aerial vehicle cluster multi-target distributed search system based on deep reinforcement learning comprises a controller, a memory and a plurality of unmanned aerial vehicles, wherein each unmanned aerial vehicle is respectively connected with the controller, the memory is used for storing a computer program, the controller is used for executing the computer program, and the controller is used for executing the method so as to control each unmanned aerial vehicle to move.
Compared with the prior art, the invention has the advantages that:
1. According to the invention, from the perspective of intelligent learning, a deep reinforcement learning method is introduced: each unmanned aerial vehicle in the cluster is independently configured with its own neural network model, and each unmanned aerial vehicle obtains its actions and makes autonomous decisions by learning this independent neural network model. Based on deep reinforcement learning, the method can effectively reduce the scale of the problem to be solved, quickly and effectively handle the interaction and coordination among unmanned aerial vehicles, and robustly handle information uncertainty, so that distributed collaborative search of multiple ground targets by the unmanned aerial vehicle cluster can be realized, i.e. the unmanned aerial vehicle cluster is controlled to complete the multi-target search collaboratively.
2. According to the invention, the unmanned aerial vehicle cluster multi-target search is realized by combining deep reinforcement learning, and the unmanned aerial vehicle cluster cooperation strategy is learned in a trial-and-error mode with the environment, so that the advantage of deep reinforcement learning can be fully exerted, and the unmanned aerial vehicle cluster multi-target cooperation search can be realized rapidly and efficiently without carrying out accurate mathematical modeling.
3. The reward function of the cluster decision model further comprehensively considers the perception state of the target, the overlapping search state of the neighbor unmanned aerial vehicle and the out-of-range state of the boundary of the region, and can realize a cooperation strategy for guiding the unmanned aerial vehicle to learn as many targets as possible, to overlap with the neighbor as few as possible and to cross the boundary as few as possible.
Drawings
Fig. 1 is a schematic flow chart of an implementation of the unmanned aerial vehicle cluster multi-target searching method based on deep reinforcement learning in this embodiment.
Fig. 2 is a schematic diagram of the neural network structure adopted in the present embodiment.
Detailed Description
The invention is further described below in connection with the drawings and the specific preferred embodiments, but the scope of protection of the invention is not limited thereby.
As shown in fig. 1, the steps of the unmanned aerial vehicle cluster multi-target searching method based on deep reinforcement learning in this embodiment include:
s1, modeling multi-target search of unmanned aerial vehicle clusters, constructing a cluster decision model, a kinematic model of unmanned aerial vehicles and ground targets, a communication model among the unmanned aerial vehicles and a ground observation model of the unmanned aerial vehicles on the ground targets, and respectively and independently configuring a neural network model for each unmanned aerial vehicle in the clusters to construct a mapping relation between input data and output actions of each unmanned aerial vehicle;
s2, when multi-target searching is carried out, the respective neural network model of each unmanned aerial vehicle in the cluster respectively inputs observation data of the unmanned aerial vehicle on a ground target and communication data between each unmanned aerial vehicle and a neighbor unmanned aerial vehicle, each unmanned aerial vehicle obtains actions to be executed according to output actions of the respective neural network model, and the parameters of the respective neural network model are updated according to the obtained actions.
Deep reinforcement learning, as an emerging intelligent learning method, opens up a new way of solving perception and decision problems in complex high-dimensional state spaces. Deep reinforcement learning combines the ability of deep learning to understand complex high-dimensional data with the general learning ability of reinforcement learning to learn autonomously through a trial-and-error mechanism. In addition, deep reinforcement learning can convert problems that traditionally need to be solved online into problems solved through a large amount of offline training. With the continuous development of deep reinforcement learning, Multi-Agent Deep Reinforcement Learning (MADRL) oriented to distributed systems has also made significant progress. By considering the action decisions of other agents in the learning process of each agent, multi-agent deep reinforcement learning can form strategies of mutual cooperation or competition among multiple agents.
The method exploits these characteristics of deep reinforcement learning for the multi-target search problem of the unmanned aerial vehicle cluster. By introducing a deep reinforcement learning method, each unmanned aerial vehicle in the cluster is independently configured with its own neural network model, and each unmanned aerial vehicle obtains actions and makes autonomous decisions by learning this independent network. On this basis, the method effectively reduces the scale of the problem to be solved, handles the interaction and coordination among unmanned aerial vehicles quickly and effectively, and processes information uncertainty robustly, so that distributed collaborative search of multiple ground targets by the unmanned aerial vehicle cluster can be realized, i.e. the unmanned aerial vehicle cluster is controlled to complete the multi-target search collaboratively.
The method first performs formal modeling of the unmanned aerial vehicle cluster multi-target search problem based on a partially observable Markov decision model, then designs a deep neural network to construct the mapping from state inputs to action outputs, and finally designs a network updating method to update the parameters, as sketched below.
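For illustration only, the overall flow of these three stages can be sketched in Python as follows; Env, the per-UAV agent objects and their methods (reset, step, act, store, update) are hypothetical placeholders and are not defined in the patent:

```python
# Sketch of the overall loop: each UAV holds its own independent network (agent)
# and learns from its own observations, communications, actions and rewards.
def train(env, agents, episodes=1000, max_steps=200):
    for ep in range(episodes):
        obs_list = env.reset()                      # one observation per UAV
        for t in range(max_steps):
            actions = [ag.act(ob) for ag, ob in zip(agents, obs_list)]
            next_obs_list, rewards, done = env.step(actions)
            for ag, ob, a, r, nob in zip(agents, obs_list, actions, rewards, next_obs_list):
                ag.store(ob, a, r, nob, done)       # per-UAV replay buffer
                ag.update()                          # D3QN-style weight update (see below)
            obs_list = next_obs_list
            if done:
                break
```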
In step S1 of this embodiment, when formalized modeling is performed on the unmanned aerial vehicle and the ground target, the method specifically includes:
(1) Cluster decision model
For the unmanned aerial vehicle cluster search of ground targets, a cluster decision model is constructed based on a partially observable Markov decision model, specifically as follows:
(S,A,T,R,O,Z,G) (1)
S is the joint state space of the unmanned aerial vehicle cluster system, s is a joint state, and s ∈ S.
A is the joint action space of the whole unmanned aerial vehicle cluster, where A_i ∈ A, A_i is the action space of unmanned aerial vehicle UAV i, a_i ∈ A_i, a_i is the action of unmanned aerial vehicle UAV i, and a = (a_1, a_2, …, a_i, …, a_n) is the action vector of all unmanned aerial vehicles.
T is the transition probability function, specifically P(s′ | s, a) → [0, 1], where s′ is the state at the next time step.
R is the reward function. Considering that in the searching process the unmanned aerial vehicle cluster should keep the targets within its detection range as far as possible, the reward function of this embodiment specifically comprises the perception state of the targets, the overlapping-search state with neighbor unmanned aerial vehicles, and the out-of-range state with respect to the region boundary.
O is the joint observation space of the unmanned aerial vehicle cluster, where O_i ∈ O, O_i is the observation space of unmanned aerial vehicle UAV i, o_i ∈ O_i, o_i is the observed quantity of UAV i, and o = (o_1, o_2, …, o_i, …, o_n) is the observation vector of the unmanned aerial vehicle cluster at a given moment;
Z is the observation function, specifically expressed as Z(s, a): S × A → O, representing the probability that the unmanned aerial vehicle cluster obtains the observation o when the system executes action a in state s and the state transition occurs at the next moment. The above constitutes the partially observable part of the model.
G: G(N, E) is the communication topology of the unmanned aerial vehicle cluster, where N is the set of all unmanned aerial vehicles and E is the set of edges between unmanned aerial vehicles that can communicate directly. Unmanned aerial vehicles whose relative distance is smaller than d_c can communicate with each other, i.e. if d_i,j ≤ d_c then there is an edge e_i,j ∈ E, where d_i,j is the horizontal distance between unmanned aerial vehicle UAV i and unmanned aerial vehicle UAV j and d_c is a preset threshold. If UAV i is within the communication range of UAV j, the communication message sent by UAV j to UAV i is c^(i,j). A sketch of this distance-based topology is given below.
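As an illustrative sketch (not taken from the patent), the distance-based communication topology G(N, E) could be built from the UAV positions as follows; the function and variable names are assumptions:

```python
# Sketch of the distance-based communication topology: an edge e_ij exists
# whenever the horizontal distance d_ij <= d_c.
import numpy as np

def build_comm_graph(positions, d_c):
    """positions: (n, 2) array of UAV x-y coordinates; returns a boolean adjacency matrix."""
    n = len(positions)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = np.linalg.norm(positions[i] - positions[j])
            if d_ij <= d_c:
                adj[i, j] = adj[j, i] = True
    return adj
```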
In this embodiment, the action of unmanned aerial vehicle UAV i is specifically the yaw rate of the unmanned aerial vehicle, and a discrete action space is adopted, i.e. the interval [−ψ̇_max, ψ̇_max] is discretized into N_a candidate yaw rates, where N_a is the discretization granularity, ψ̇_max is the maximum yaw rate, and both N_a and ψ̇_max are hyperparameters input in advance, as sketched below.
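A minimal sketch of such a discretized yaw-rate action space follows, assuming the N_a candidate values are evenly spaced over [−ψ̇_max, ψ̇_max]; this even spacing is an assumption, since the patent only states that N_a and ψ̇_max are preset hyperparameters:

```python
# Sketch of the discrete yaw-rate action set for one UAV.
import numpy as np

def yaw_rate_actions(n_a, psi_dot_max):
    """Return N_a evenly spaced yaw-rate actions in [-psi_dot_max, +psi_dot_max]."""
    return np.linspace(-psi_dot_max, psi_dot_max, n_a)

# e.g. yaw_rate_actions(5, 0.5) -> [-0.5, -0.25, 0.0, 0.25, 0.5] rad/s
```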
In this embodiment, the specific steps of constructing the reward function include:
constructing the reward for searching a ground target as:
[equation image in the source: per-target search reward r_tar(i, k), defined as a function of the relative horizontal distance d_i,k]
where d_i,k is the relative horizontal distance between unmanned aerial vehicle UAV i and target k, and the reward of UAV i for searching multiple (m) targets is:
r_tar(i) = Σ_{k=1..m} r_tar(i, k)
Meanwhile, in order to avoid the waste of resources caused by overlapping search spaces of the unmanned aerial vehicles, this embodiment constructs the reward for search overlap as:
[equation image in the source: pairwise overlap reward r_overlap(i, j) between UAV i and neighbor UAV j]
For unmanned aerial vehicle UAV i, if the cluster size is n, the search-overlap reward is constructed as:
r_overlap(i) = Σ_{j=1..n, j≠i} r_overlap(i, j)
In addition, since the search task is restricted to a given area, the unmanned aerial vehicle cluster should also avoid crossing the area boundary, and this embodiment constructs the reward for the boundary as:
[equation image in the source: boundary reward r_bound(i), applied when UAV i crosses the boundary of the search area]
Combining the constructed reward models, the above three parts form the final reward function, which is constructed as:
R_i = r_tar(i) + r_overlap(i) + r_bound(i)
the reward function in the constructed cluster decision model can realize a collaborative strategy for guiding the unmanned aerial vehicle to learn as many targets as possible, as few overlaps with neighbors as possible and crosses the boundary as few as possible by comprehensively considering the perception state of the targets, the overlapping search state of the neighbor unmanned aerial vehicle and the out-of-range state of the boundary of the region.
(2) Kinematic models of the unmanned aerial vehicle and the ground target
In this embodiment, it is assumed that the unmanned aerial vehicle flies at a fixed altitude and a fixed speed, and that the detector carried by the unmanned aerial vehicle keeps a downward-looking viewing angle to search for and track targets, so the motion of the unmanned aerial vehicle can be described by a kinematic model in the x-y plane. The kinematic model of the unmanned aerial vehicle is constructed as:
dx_u/dt = v_u·cos(ψ_u), dy_u/dt = v_u·sin(ψ_u), dψ_u/dt = a
where (x_u, y_u) is the position of the unmanned aerial vehicle in the x-y plane, v_u is the linear velocity of the unmanned aerial vehicle, ψ_u is the heading angle, and the yaw rate dψ_u/dt is the input of the kinematic model, i.e. the action a of the unmanned aerial vehicle. The state of unmanned aerial vehicle i is s_i = [x_u, y_u, v_u, ψ_u]. For a ground target, the kinematic model is similar to that of the unmanned aerial vehicle, i.e. a kinematic model in the x-y plane, and this embodiment defines the state of ground target k as s_k = [x_k, y_k, v_xk, v_yk]. A one-step sketch of this model is given below.
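For illustration, one Euler integration step of this planar kinematic model could look as follows; the integration time step dt is an assumption not stated in the patent:

```python
# Sketch of one discrete-time step of the planar UAV kinematic model:
# constant speed v_u, yaw rate given by the action a.
import numpy as np

def step_uav(x_u, y_u, psi_u, v_u, a, dt=0.1):
    x_u += v_u * np.cos(psi_u) * dt
    y_u += v_u * np.sin(psi_u) * dt
    psi_u += a * dt          # the action is the yaw rate
    return x_u, y_u, psi_u
```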
(3) Communication model
This embodiment assumes that an unmanned aerial vehicle can communicate with other unmanned aerial vehicles whose relative distance is less than d_c, and the communication message sent by unmanned aerial vehicle UAV j to unmanned aerial vehicle UAV i is c^(i,j) = [x_j, y_j, v_xj, v_yj, a_j], where (x_j, y_j) are the x-y plane position coordinates of UAV j, (v_xj, v_yj) is the linear velocity of UAV j, and a_j is the action of UAV j.
(4) Earth observation model
This embodiment assumes that the unmanned aerial vehicle is able to observe ground targets within an area of radius d_o around itself, and the ground observation model of unmanned aerial vehicle UAV i is constructed as o_i^k = [x_k, y_k, v_xk, v_yk], where k is a ground target, (x_k, y_k) is the position of target k, and (v_xk, v_yk) is the linear velocity of target k.
When the neural network model is built in step S1 of the embodiment, the neural network model specifically includes six layers, each layer is sequentially connected, wherein the first layer is an input layer for inputting the observation data (perception target data information) of the unmanned aerial vehicle on the ground target and the communication data sent between the unmanned aerial vehicle and the neighboring unmanned aerial vehicle, namely, the observation data and the communication data on the ground target are used as the input of the neural network; the second layer and the third layer are two full-connection layers and are used for respectively carrying out feature extraction on the full-connection layers on the observation data and the communication data to obtain two parts of output, wherein the second layer and the third layer respectively comprise two full-connection layers for respectively accessing the observation data and the communication data to carry out feature extraction; the fourth layer comprises a connection merging layer and a full connection layer, and is used for connecting and merging the two output parts of the third layer and then connecting the two output parts into the full connection layer for processing; the fifth layer is used for processing the output result of the fourth layer by adopting the competition network to obtain the action output of the unmanned aerial vehicle, and the sixth layer is an output layer and is used for outputting the action obtained by the fifth layer.
The neural network structure adopted in this embodiment is specifically shown in fig. 2, and each layer is specifically:
(1) The first layer is an input layer, and the input data mainly comprises two parts: ground observation data and communication data. Taking unmanned aerial vehicle UAV i as an example, the ground-target observation data are specifically the positions, velocities and other quantities of the ground targets observed by UAV i. Suppose the observed quantity of UAV i is o_i; from the above ground observation model, the observation data of UAV i on ground target k is o_i^k = [x_k, y_k, v_xk, v_yk]. Since the number of observable targets varies, this embodiment employs a feature representation method to represent the observation data with a fixed length.
In this embodiment, the observation data are specifically represented by an average representation; if the number of observed targets is z, then:
o_i = [x_t, y_t, v_xt, v_yt, (1/z)Σ_{k=1..z} x_k, (1/z)Σ_{k=1..z} y_k, (1/z)Σ_{k=1..z} v_xk, (1/z)Σ_{k=1..z} v_yk]
where (x_t, y_t) are the position coordinates of UAV i in the x-y plane, (v_xt, v_yt) is the velocity of UAV i in the x-y plane, (x_k, y_k) are the position coordinates of ground target k in the x-y plane, and (v_xk, v_yk) is the linear velocity of ground target k in the x-y plane;
also taking unmanned aerial vehicle UAV i as an example, the communication data are specifically the data sent to UAV i by its communication neighbors. From the communication model, assuming that UAV i is within the communication range of UAV j, the communication data sent by UAV j to UAV i is c^(i,j) = [x_j, y_j, v_xj, v_yj, a_j]. Since the number of communication neighbors is variable, in this embodiment the communication data are also represented with a fixed length by the feature representation method.
In this embodiment, the communication data are represented by an average representation; if the number of communication neighbors of UAV i is z', then the communication input is:
[x_c, y_c, v_xc, v_yc, a_c, (1/z')Σ_{j=1..z'} x_j, (1/z')Σ_{j=1..z'} y_j, (1/z')Σ_{j=1..z'} v_xj, (1/z')Σ_{j=1..z'} v_yj, (1/z')Σ_{j=1..z'} a_j]
A sketch of this averaging is given below.
(2) And the second layer and the third layer respectively conduct feature extraction of the full connection layer on the observation data and the communication data.
(3) And the fourth layer is used for combining the two output parts of the third layer, and then the full-connection layer is accessed for processing.
(4) A dueling network is introduced in the fifth layer to process the output result of the fourth layer.
In this embodiment, in the dueling network used in the fifth layer, V(s) is the state value function and Adv(s, a) is the advantage function, and the state-action value function adopted is:
Q(s, a) = V(s) + Adv(s, a) − (1/|A|)Σ_{a'∈A} Adv(s, a')
(5) The sixth layer is an output layer that outputs Q(s, a), where the actions a are the discretized yaw rates obtained by dividing [−ψ̇_max, ψ̇_max] into N_a values; N_a is the discretization granularity, ψ̇_max is the maximum yaw rate, and both are hyperparameters input in advance. A sketch of this network structure is given below.
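A sketch of such a six-layer network, with two fully connected input branches, a merge layer and a dueling head, is given below in PyTorch; the layer widths and activation functions are assumptions not specified in the patent:

```python
# Sketch of the per-UAV network: observation branch, communication branch,
# merge + fully connected layer, and a dueling head combining V(s) and Adv(s, a).
import torch
import torch.nn as nn

class UAVDuelingNet(nn.Module):
    def __init__(self, obs_dim, comm_dim, n_actions, hidden=64):
        super().__init__()
        self.obs_fc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())    # layers 2-3, obs branch
        self.comm_fc = nn.Sequential(nn.Linear(comm_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())   # layers 2-3, comm branch
        self.merge_fc = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())  # layer 4
        self.value = nn.Linear(hidden, 1)              # layer 5: state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # layer 5: advantage Adv(s, a)

    def forward(self, obs, comm):
        h = self.merge_fc(torch.cat([self.obs_fc(obs), self.comm_fc(comm)], dim=-1))
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=-1, keepdim=True)   # layer 6: Q(s, a) per yaw-rate action
```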
In this embodiment, the D3QN (Dueling Double Deep Q-learning Network) weight updating method is adopted to update the weights of the network. Specifically, a dual-network structure of an evaluation network and a target network is adopted, denoted Q_E(s, a, θ) and Q_T(s′, a′, θ′) respectively, where s, a, θ are the state, action and network weights of the evaluation network, and s′, a′, θ′ are the state, action and network weights of the target network. The network structures of the evaluation network and the target network are both as shown in fig. 2.
When the weights are updated in this embodiment, the loss function is defined as:
L(θ) = E[(y − Q_E(s, a, θ))²]
where y = r + γ·max_{a′} Q_T(s′, a′, θ′), and γ is the discount factor, a preset hyperparameter.
Then, the network weights of the evaluation network are updated by a gradient method:
θ ← θ − α·∂L(θ)/∂θ
where α is the learning rate, a preset hyperparameter.
The weights of the target network are updated by the following formula:
θ′ = τθ + (1 − τ)θ′ (13)
where τ ∈ (0, 1) is the update rate of the target network weights and is a preset hyperparameter.
According to the embodiment, the parameters of the neural network model of the unmanned aerial vehicle are updated by adopting the D3QN network weight updating method based on the double-network structure, and the network parameters can be quickly and effectively updated by combining the real-time actions of each unmanned aerial vehicle, so that the coordination of the unmanned aerial vehicle cluster system is ensured.
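A sketch of one such update step is given below, following the target y = r + γ·max_a′ Q_T, a gradient step on the evaluation network, and the soft target update θ′ = τθ + (1 − τ)θ′; the optimizer, batch layout and hyperparameter values are assumptions:

```python
# Sketch of one D3QN-style weight update with evaluation network Q_E and target network Q_T.
import torch
import torch.nn.functional as F

def d3qn_update(q_eval, q_target, optimizer, batch, gamma=0.99, tau=0.01):
    obs, comm, a, r, next_obs, next_comm, done = batch
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(next_obs, next_comm).max(dim=-1).values
    q_sa = q_eval(obs, comm).gather(-1, a.unsqueeze(-1)).squeeze(-1)
    loss = F.mse_loss(q_sa, y)                 # L(theta) = E[(y - Q_E(s, a, theta))^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # theta <- theta - alpha * dL/dtheta
    for p_t, p_e in zip(q_target.parameters(), q_eval.parameters()):
        p_t.data.copy_(tau * p_e.data + (1 - tau) * p_t.data)   # soft target-network update
    return loss.item()
```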
The embodiment also comprises a deep reinforcement learning-based unmanned aerial vehicle cluster multi-target distributed search system, which comprises a controller, a memory and a plurality of unmanned aerial vehicles, wherein each unmanned aerial vehicle is respectively connected with the controller, the memory is used for storing a computer program, the controller is used for executing the computer program, and the controller is used for executing the method so as to control each unmanned aerial vehicle to move.
The deep reinforcement learning has the understanding capability of the deep learning on complex high-dimensional data and the general learning capability of the reinforcement learning for self learning through an error trial and error mechanism. And the deep reinforcement learning can skillfully convert the problem which needs to be solved on line in the traditional concept into the problem which is solved through a large amount of off-line training. According to the invention, the unmanned aerial vehicle cluster multi-target search is realized by combining deep reinforcement learning, and the unmanned aerial vehicle cluster cooperation strategy is learned in a trial-and-error mode with the environment, so that the advantage of deep reinforcement learning can be fully exerted, and the unmanned aerial vehicle cluster multi-target cooperation search can be realized rapidly and efficiently without carrying out accurate mathematical modeling.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention shall fall within the scope of the technical solution of the present invention.

Claims (8)

1. The unmanned aerial vehicle cluster multi-target searching method based on deep reinforcement learning is characterized by comprising the following steps of:
s1, modeling multi-target search of unmanned aerial vehicle clusters, constructing a cluster decision model, a kinematic model of unmanned aerial vehicles and ground targets, a communication model among the unmanned aerial vehicles and a ground observation model of the unmanned aerial vehicles on the ground targets, and respectively and independently configuring a neural network model for each unmanned aerial vehicle in the clusters to construct a mapping relation between input data and output actions of each unmanned aerial vehicle;
s2, when multi-target searching is carried out, the respective neural network model of each unmanned aerial vehicle in the cluster respectively inputs observation data of the unmanned aerial vehicle on a ground target and communication data between each unmanned aerial vehicle and a neighbor unmanned aerial vehicle, and each unmanned aerial vehicle acquires actions to be executed according to output actions of the respective neural network model and updates parameters of the respective neural network model according to the acquired actions;
in the step S1, a cluster decision model is built based on a partially observable Markov decision model, and the cluster decision model is:
(S,A,T,R,O,Z,G)
S is the joint state space of the unmanned aerial vehicle cluster system, A is the joint action space of the unmanned aerial vehicle cluster, T is the transition probability function, and R is the reward function, wherein the reward function comprises the perception state of the targets, the overlapping-search state with neighbor unmanned aerial vehicles and the out-of-range state with respect to the boundary of the region; O is the joint observation space of the unmanned aerial vehicle cluster, Z is the observation function, and G: G(N, E) is the communication topology graph of the unmanned aerial vehicle cluster, N is the set of all unmanned aerial vehicles, and E is the set of edges between unmanned aerial vehicles that can communicate directly;
the step of constructing the reward function comprises:
constructing the reward for searching a ground target as:
[equation image in the source: per-target search reward r_tar(i, k), defined as a function of the relative horizontal distance d_i,k]
wherein d_i,k is the relative horizontal distance between unmanned aerial vehicle UAV i and ground target k, and the reward of UAV i for searching m targets is:
r_tar(i) = Σ_{k=1..m} r_tar(i, k)
constructing the reward for search overlap as:
[equation image in the source: pairwise overlap reward r_overlap(i, j) between UAV i and neighbor UAV j]
the search-overlap reward of UAV i is:
r_overlap(i) = Σ_{j=1..n, j≠i} r_overlap(i, j)
wherein n is the cluster size of the unmanned aerial vehicles;
constructing the reward for the boundary as:
[equation image in the source: boundary reward r_bound(i), applied when UAV i crosses the boundary of the search region]
finally, the reward function is constructed as:
R_i = r_tar(i) + r_overlap(i) + r_bound(i)
2. the unmanned aerial vehicle cluster multi-target search method based on deep reinforcement learning according to claim 1, wherein: the neural network model comprises six layers, wherein the layers are sequentially connected, the first layer is an input layer for inputting the observation data of the unmanned aerial vehicle on a ground target and the communication data sent between the unmanned aerial vehicle and the neighbor unmanned aerial vehicle, the second layer and the third layer are two full-connection layers for respectively extracting the characteristics of the full-connection layers of the observation data and the communication data, and two parts of output are obtained; the fourth layer is used for connecting and combining the two output parts of the third layer and then accessing a full-connection layer for processing; the fifth layer is used for processing the output result of the fourth layer to obtain the action output of the unmanned aerial vehicle, and the sixth layer is an output layer and is used for outputting the action obtained by the fifth layer.
3. The deep reinforcement learning-based unmanned aerial vehicle cluster multi-objective search method of claim 2, wherein the input data is represented in the first layer with a fixed length by means of a feature representation, wherein the observed quantity of unmanned aerial vehicle UAV i is o_i and the observation data of UAV i on ground target k is o_i^k = [x_k, y_k, v_xk, v_yk]; the observed quantity of UAV i expressed by the feature representation is:
o_i = [x_t, y_t, v_xt, v_yt, (1/z)Σ_{k=1..z} x_k, (1/z)Σ_{k=1..z} y_k, (1/z)Σ_{k=1..z} v_xk, (1/z)Σ_{k=1..z} v_yk]
where z is the number of observed targets, x_t, y_t are the position coordinates of UAV i in the x-y plane, v_xt, v_yt is the velocity of UAV i in the x-y plane, x_k, y_k are the position coordinates of ground target k in the x-y plane, and v_xk, v_yk is the linear velocity of ground target k in the x-y plane;
the communication data sent by UAV i to its neighbor unmanned aerial vehicles is c_i = [x_c, y_c, v_xc, v_yc, a_c], and the communication data sent by unmanned aerial vehicle UAV j to unmanned aerial vehicle UAV i is c^(i,j) = [x_j, y_j, v_xj, v_yj, a_j]; the communication input expressed by the feature representation is:
[x_c, y_c, v_xc, v_yc, a_c, (1/z')Σ_{j=1..z'} x_j, (1/z')Σ_{j=1..z'} y_j, (1/z')Σ_{j=1..z'} v_xj, (1/z')Σ_{j=1..z'} v_yj, (1/z')Σ_{j=1..z'} a_j]
where z' is the number of communication neighbors, x_c, y_c are the x-y position coordinates sent by UAV i to its neighbor unmanned aerial vehicles, v_xc, v_yc is the x-y velocity sent by UAV i to its neighbor unmanned aerial vehicles, a_c is the action sent by UAV i to its neighbor unmanned aerial vehicles, x_j, y_j are the x-y position coordinates sent by UAV j to UAV i, v_xj, v_yj is the x-y velocity sent by UAV j to UAV i, and a_j is the action sent by UAV j to UAV i.
4. The deep reinforcement learning-based unmanned aerial vehicle cluster multi-objective search method of claim 2, wherein the output result of the fourth layer is processed in the fifth layer using a dueling network structure, wherein the state-action value function used is:
Q(s, a) = V(s) + Adv(s, a) − (1/|A|)Σ_{a'∈A} Adv(s, a')
wherein V(s) is the state value function, Adv(s, a) is the advantage function, s is the state of the unmanned aerial vehicle cluster system, a is the action of the unmanned aerial vehicle, and A is the joint action space of the unmanned aerial vehicle cluster.
5. The unmanned aerial vehicle cluster multi-target searching method based on deep reinforcement learning according to any one of claims 1 to 4, wherein the network weight updating method based on D3QN in the step S2 uses a dual-network structure to update the weights of the neural network model, wherein the dual-network structure comprises an evaluation network Q_E(s, a, θ) and a target network Q_T(s′, a′, θ′), where s, a, θ are the state, action and network weight of the evaluation network, respectively, and s′, a′, θ′ are the state, action and network weight of the target network, respectively.
6. The unmanned aerial vehicle cluster multi-target searching method based on deep reinforcement learning according to any one of claims 1 to 4, wherein the constructed kinematic model of the unmanned aerial vehicle is:
dx_u/dt = v_u·cos(ψ_u), dy_u/dt = v_u·sin(ψ_u), dψ_u/dt = a
wherein x_u, y_u is the position of the unmanned aerial vehicle in the x-y plane, v_u is the linear velocity of the unmanned aerial vehicle, ψ_u is the heading angle and dψ_u/dt is the yaw rate of the unmanned aerial vehicle, and a is the action of the unmanned aerial vehicle and serves as the input of the kinematic model; the state of unmanned aerial vehicle i is s_i = [x_u, y_u, v_u, ψ_u]; when the kinematic model of the ground target is constructed, the state of ground target k is defined as s_k = [x_k, y_k, v_xk, v_yk].
7. The unmanned aerial vehicle cluster multi-target search method based on deep reinforcement learning according to any one of claims 1 to 4, wherein: when the communication model is constructed, the communication message sent by unmanned aerial vehicle UAV j to unmanned aerial vehicle UAV i is defined as c^(i,j) = [x_j, y_j, v_xj, v_yj, a_j], wherein x_j, y_j are the position coordinates of UAV j in the x-y plane, v_xj, v_yj is the linear velocity of UAV j in the x-y plane, and a_j is the action of UAV j; when the ground observation model is constructed, the ground observation model of UAV i is defined as o_i^k = [x_k, y_k, v_xk, v_yk], wherein k is a ground target and v_xk, v_yk is the linear velocity of target k in the x-y plane.
8. An unmanned aerial vehicle cluster multi-target search system based on deep reinforcement learning, comprising a controller, a memory and a plurality of unmanned aerial vehicles, wherein each unmanned aerial vehicle is respectively connected with the controller, the memory is used for storing a computer program, and the controller is used for executing the computer program, and is characterized in that the controller is used for executing the method according to any one of claims 1-7 so as to control each unmanned aerial vehicle to operate.
CN202110285074.4A 2021-03-17 2021-03-17 Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning Active CN112947575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110285074.4A CN112947575B (en) 2021-03-17 2021-03-17 Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110285074.4A CN112947575B (en) 2021-03-17 2021-03-17 Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112947575A CN112947575A (en) 2021-06-11
CN112947575B true CN112947575B (en) 2023-05-16

Family

ID=76228738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110285074.4A Active CN112947575B (en) 2021-03-17 2021-03-17 Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112947575B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113625569B (en) * 2021-08-12 2022-02-08 中国人民解放军32802部队 Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN114415735B (en) * 2022-03-31 2022-06-14 天津大学 Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114722946B (en) * 2022-04-12 2022-12-20 中国人民解放军国防科技大学 Unmanned aerial vehicle asynchronous action and cooperation strategy synthesis method based on probability model detection
CN117349599A (en) * 2023-12-05 2024-01-05 中国人民解放军国防科技大学 Unmanned aerial vehicle attitude estimation method, device, equipment and medium based on genetic algorithm

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196605B (en) * 2019-04-26 2022-03-22 大连海事大学 Method for cooperatively searching multiple dynamic targets in unknown sea area by reinforcement learning unmanned aerial vehicle cluster
CN110958680B (en) * 2019-12-09 2022-09-13 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111260031B (en) * 2020-01-14 2022-03-01 西北工业大学 Unmanned aerial vehicle cluster target defense method based on deep reinforcement learning
CN111786713B (en) * 2020-06-04 2021-06-08 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN111857184B (en) * 2020-07-31 2023-06-23 中国人民解放军国防科技大学 Fixed wing unmanned aerial vehicle group collision prevention method and device based on deep reinforcement learning
CN111859816A (en) * 2020-08-03 2020-10-30 南京航空航天大学 Simulated physical method and DDQN combined unmanned aerial vehicle cluster air combat decision method
CN112256056B (en) * 2020-10-19 2022-03-01 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning

Also Published As

Publication number Publication date
CN112947575A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN112947575B (en) Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning
Hu et al. Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning
Phung et al. Enhanced discrete particle swarm optimization path planning for UAV vision-based surface inspection
CN113495578B (en) Digital twin training-based cluster track planning reinforcement learning method
Wang et al. A survey of underwater search for multi-target using Multi-AUV: Task allocation, path planning, and formation control
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN111240356B (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
Qiang et al. Resilience optimization for multi-UAV formation reconfiguration via enhanced pigeon-inspired optimization
Yan et al. Collision-avoiding flocking with multiple fixed-wing UAVs in obstacle-cluttered environments: a task-specific curriculum-based MADRL approach
CN114815882B (en) Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning
Yan et al. PASCAL: population-specific curriculum-based MADRL for collision-free flocking with large-scale fixed-wing UAV swarms
Fernando Online flocking control of UAVs with mean-field approximation
Xia et al. Cooperative multi-target hunting by unmanned surface vehicles based on multi-agent reinforcement learning
Wang et al. Oracle-guided deep reinforcement learning for large-scale multi-UAVs flocking and navigation
Van Nguyen et al. Game theory-based optimal cooperative path planning for multiple UAVs
Mañas-Álvarez et al. Robotic park: Multi-agent platform for teaching control and robotics
Xu et al. Algorithms and applications of intelligent swarm cooperative control: A comprehensive survey
Feiyu et al. Autonomous localized path planning algorithm for UAVs based on TD3 strategy
Li et al. A warm-started trajectory planner for fixed-wing unmanned aerial vehicle formation
Zhang et al. Multi-UAV cooperative short-range combat via attention-based reinforcement learning using individual reward shaping
CN116203987A (en) Unmanned aerial vehicle cluster collaborative obstacle avoidance method based on deep reinforcement learning
CN114326826B (en) Multi-unmanned aerial vehicle formation transformation method and system
Zhao et al. Graph-based multi-agent reinforcement learning for large-scale UAVs swarm system control
Kou et al. Autonomous Navigation of UAV in Dynamic Unstructured Environments via Hierarchical Reinforcement Learning
Zhang et al. Real-time path planning algorithms for autonomous UAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant