CN117311392A - Unmanned aerial vehicle group countermeasure control method and system


Info

Publication number
CN117311392A
Authority
CN
China
Prior art keywords: unmanned aerial vehicle group, attack
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311512441.5A
Other languages
Chinese (zh)
Inventor
刘祥龙
刘艾杉
李海南
方书艳
洪日昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Data Space Research Institute
Original Assignee
Data Space Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Data Space Research Institute


Abstract

The invention discloses an unmanned aerial vehicle group countermeasure control method and system. The method comprises: constructing an unmanned aerial vehicle group countermeasure environment based on the AirSim simulation environment; training in the unmanned aerial vehicle group countermeasure environment with a multi-agent reinforcement learning training strategy to obtain an unmanned aerial vehicle group model; attacking the unmanned aerial vehicle group model in the countermeasure environment with unmanned aerial vehicle group attack methods, comprising policy-based attacks, observation-based attacks, reward-function-based attacks, minority-group-based attacks and majority-group-based attacks; and performing simulation verification of the unmanned aerial vehicle group in the countermeasure environment and evaluating the behavior of the unmanned aerial vehicle group model after the attacks. The invention can effectively carry out unmanned aerial vehicle group combat exercises, helps to test the effect of unmanned aerial vehicle group attack algorithms, and improves the robustness of the unmanned aerial vehicle group.

Description

Unmanned aerial vehicle group countermeasure control method and system
Technical Field
The invention relates to the technical field of unmanned aerial vehicle control, in particular to an unmanned aerial vehicle group countermeasure control method and system.
Background
Multi-agent systems are systems composed of multiple interacting agents in the same environment and are often used to solve problems that a single agent can hardly handle. As a typical multi-agent system, unmanned aerial vehicle groups are actively being given a certain level of intelligence by means of reinforcement learning algorithms. For example, patent document CN113589842A proposes an unmanned cluster task cooperation method based on multi-agent reinforcement learning, patent application CN116520884A proposes an unmanned aerial vehicle cluster countermeasure strategy optimization method based on hierarchical reinforcement learning, and related academic work (such as research on reinforcement-learning-based unmanned aerial vehicle cluster countermeasure simulation from Northwestern Polytechnical University) focuses on training the optimal cooperation strategy of the unmanned aerial vehicle group during group confrontation. The essence of reinforcement learning is that an agent learns a strategy while interacting with the environment so as to maximize rewards or achieve specific goals. However, because of the vulnerability of multi-agent reinforcement learning and the uncertainty of the environment and the agents, the unmanned aerial vehicle group model may be attacked in multiple phases (training phase and testing phase) and from multiple angles (state space, action space, reward function and environment transition probability), and is therefore neither safe nor robust.
To address the problem that the unmanned aerial vehicle group model is vulnerable to adversarial attacks in multiple phases and from multiple angles, policy-based attacks, observation-based attacks and reward-function-based attacks are mainly used at present. Multi-agent reinforcement learning, as the theoretical basis of the unmanned aerial vehicle group model, is prone to security threats, which puts the safety of the unmanned aerial vehicle group at risk. Classified according to the perturbable components of the four-tuple of the Markov decision process, adversarial attacks can perturb the policy, the state and the reward function to influence the decision of the model. Policy-based attacks affect other agents' decisions by training an agent with an adversarial policy; observation-based attacks induce an agent to make decisions that minimize its objective function by changing the agent's observations or adding noise to the environment; reward-function-based attacks affect the global policy of an agent by slightly perturbing the reward function during its training.
A large number of experiments have shown that such attacks can easily throw the attacked agent's policy into confusion, so that the attacked agent cannot complete its assigned task, and the ever-evolving malicious attack means seriously threaten the safety of the unmanned aerial vehicle group model. It should be noted that all of the above attacks directly manipulate one or more unmanned aerial vehicles in the attacked group, whereas indirect attacks that manipulate the opposing unmanned aerial vehicle group have rarely been explored. In fact, attacks of the latter kind are far more common in reality. Therefore, after the own side's optimal cooperation strategy has been trained, it is necessary to simulate attacks on the own unmanned aerial vehicle group, analyze the abnormal behavior the group exhibits when attacked, and reveal the intrinsic causes of the vulnerability of the unmanned aerial vehicle group model under adversarial attack, which contributes to the adversarial attack and defense evaluation of the unmanned aerial vehicle group.
In addition, current multi-agent reinforcement learning environments focus on strategy games (such as SMAC, GRF and Hanabi), particle environments (such as MPE, LBF and RWARE) and robot environments (such as MAMuJoCo), whose scenario tasks are simple; there is no open-source multi-agent reinforcement learning environment that supports unmanned aerial vehicle group combat exercises. Accordingly, the complexity of their action spaces is low: only discrete actions (strategy games and most particle environments) or only continuous actions (a few particle environments and robot environments) are needed, never discrete and continuous actions at the same time, so in terms of task complexity these environments are far from sufficient for simulating the combat behavior of an unmanned aerial vehicle group and cannot provide a reliable evaluation of algorithms applied to it. Therefore, current multi-agent reinforcement learning environments lack unmanned aerial vehicle groups or similar settings, and researchers designing and training unmanned aerial vehicle group algorithms rarely consider how robust such algorithms are when subjected to various attacks in real scenarios.
Disclosure of Invention
The technical problem to be solved by the invention is how to effectively carry out unmanned aerial vehicle group combat exercises, which helps to test the effect of unmanned aerial vehicle group attack algorithms and to improve the robustness of the unmanned aerial vehicle group.
The invention solves the technical problems by the following technical means:
in a first aspect, the present invention provides a method for controlling the countermeasures of a group of unmanned aerial vehicles, the method comprising the steps of:
constructing an unmanned aerial vehicle group countermeasure environment based on the AirSim simulation environment;
training in an unmanned aerial vehicle group countermeasure environment by using a multi-agent reinforcement learning training strategy to obtain an unmanned aerial vehicle group model;
an unmanned aerial vehicle group attack method is adopted in an unmanned aerial vehicle group countermeasure environment to attack the unmanned aerial vehicle group model, and comprises a strategy-based attack, an observation-based attack, a reward function-based attack, a minority group-based attack and a majority group-based attack;
and performing simulation verification of the unmanned aerial vehicle group in an unmanned aerial vehicle group countermeasure environment, and evaluating the behavior of the unmanned aerial vehicle group model after attack.
Further, the constructing the unmanned aerial vehicle group countermeasure environment based on the AirSim simulation environment comprises the following steps:
setting the flight of the unmanned aerial vehicles as two-dimensional planar motion with each unmanned aerial vehicle at a different altitude, and setting the observation range, attack range, movement information and firing information of each unmanned aerial vehicle;
initializing each unmanned aerial vehicle with a full health value, a full magazine and an initial speed of 0, and spawning it at a random position on the map according to a Gaussian distribution;
setting an observation space and a state space for each unmanned aerial vehicle, wherein the observation space comprises movement features, enemy-aircraft features, friendly-aircraft features and own features, and the state space is either a complex state space or a simple state space obtained by concatenating the observation spaces of all unmanned aerial vehicles of the same camp;
setting a hybrid action space for each unmanned aerial vehicle, in which a discrete action space is used for the firing action and a continuous action space is used for the movement action;
setting the reward function as a sparse reward function or a dense reward function.
Further, the Gaussian distribution is expressed as:

x ~ Gaussian(μ, σ), Gaussian(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))

μ ~ Uniform{(X_L, Y_L), (X_R, Y_R)} + Uniform(−2, 2), σ ~ Uniform(1, 4)

wherein x is the initial position of an unmanned aerial vehicle in the group; (X_L, Y_L) are the coordinates of the left-side central birth point of the unmanned aerial vehicle group; (X_R, Y_R) are the coordinates of the right-side central birth point of the unmanned aerial vehicle group; Gaussian(x) is a Gaussian distribution over the random variable x with mean μ and standard deviation σ; Uniform(−a, b) is a uniform distribution over [−a, b]; exp() is the exponential function; μ ~ Uniform{(X_L, Y_L), (X_R, Y_R)} + Uniform(−2, 2) means that the mean of the unmanned aerial vehicle's position follows a Gaussian distribution centered near (X_L, Y_L) or (X_R, Y_R); σ ~ Uniform(1, 4) means that the standard deviation follows a uniform distribution between 1 and 4.
Further, in the observation space, the movement features are the distances from the current unmanned aerial vehicle to the four boundaries of the exercise field, and the enemy-aircraft features comprise whether the enemy aircraft is within the attack range, the distance to the enemy aircraft, the X and Y coordinates relative to the enemy aircraft, and the X-direction and Y-direction component speeds relative to the enemy aircraft;
the friendly-aircraft features comprise whether the friendly aircraft is visible, the distance to the friendly aircraft, the X and Y coordinates relative to it, and the X-direction and Y-direction component speeds relative to it;
the own features comprise the unmanned aerial vehicle's own health value, ammunition quantity, weapon cooldown time and a one-hot vector of its own id.
Further, the complex state space comprises: the health value of each unmanned aerial vehicle, its X and Y coordinates relative to the center of the exercise field, its absolute component speeds in the X and Y directions, its ammunition quantity and its weapon cooldown time.
Further, the multi-agent reinforcement learning training strategy comprises a first training strategy and a second training strategy, wherein the first training strategy sets a discrete Actor network corresponding to the firing action, a movement Actor network corresponding to the movement action, and a Critic network; the discrete Actor network and the movement Actor network share a state-encoding network and respectively output the logits of each action dimension of the discrete action space and the Gaussian-distribution mean of the continuous action space, while the Critic network estimates the state value used for training;
the second training strategy, on the basis of the first training strategy, additionally feeds the action output by the discrete Actor network, as a one-hot vector together with the observed feature vector, into the movement Actor network.
Further, the expression formula of the policy-based attack is:
v(π) = π_α(o),  max_{π_α} E[ Σ_t γ^t · r′_t ],  r′_t = −r_t

wherein v(π) is the perturbation applied to the unmanned aerial vehicle group strategy π; r′_t = −r_t is the attacker's reward function; π_α(o) is the policy used by the adversarial attacker; o is the attacker's observation; γ is the discount factor; γ^t is the discount factor raised to the power of the time step t; t is the time step.
Further, the expression formula of the observation-based attack is:
v(o) = ε · sign(∇_o π(a*|o))

wherein v(o) is the adversarial noise added to the observation o; ∇ is the gradient operator; π(a*|o) is the probability that the attacked unmanned aerial vehicle selects the attack target action a* under observation o; sign() is the sign function; ε is the constraint on the added noise; o is the unmanned aerial vehicle's observation.
Further, the formulation of the reward function based attack is:
v(r) = −2·r, if r ≥ r_thresh; v(r) = 0, otherwise

wherein v(r) is the adversarial perturbation added to the reward function r, so that r + v(r) flips the sign of the rewards above the threshold; r_thresh is the reward threshold.
Further, the minority-group-based attack fixes the own side's optimal cooperation strategy and uses the MAPPO algorithm to train a minority subset of the opposing unmanned aerial vehicle group;
the majority-group-based attack fixes the own side's optimal cooperation strategy and uses the MAPPO algorithm to train a majority subset of the opposing unmanned aerial vehicle group.
In a second aspect, the invention also provides an unmanned aerial vehicle group countermeasure control system, comprising an unmanned aerial vehicle group countermeasure environment constructed based on the AirSim simulation environment, an unmanned aerial vehicle group training module, an unmanned aerial vehicle group attack module, a simulation verification module, a deployed training strategy algorithm library and an unmanned aerial vehicle group attack algorithm library, wherein:
the unmanned aerial vehicle group training module is used for calling a multi-agent reinforcement learning training strategy in the training strategy algorithm library to train in the unmanned aerial vehicle group countermeasure environment to obtain an unmanned aerial vehicle group model;
the unmanned aerial vehicle group attack module is used for calling an unmanned aerial vehicle group attack method in an unmanned aerial vehicle group attack algorithm library to attack the unmanned aerial vehicle group model in the unmanned aerial vehicle group countermeasure environment, and the unmanned aerial vehicle group attack method comprises attack based on strategies, attack based on observation, attack based on rewarding functions, attack based on minority groups and attack based on majority groups;
the simulation verification module is used for performing simulation verification of the unmanned aerial vehicle group in an unmanned aerial vehicle group countermeasure environment and evaluating the behavior of the unmanned aerial vehicle group model after attack.
Further, the system communicates with a Python program through the msgpack-RPC protocol over TCP/IP, and the simulation verification module is configured to listen for requests on the port corresponding to the Python program, so that the Python program sends, in msgpack serialization format, an RPC data packet containing the action of each unmanned aerial vehicle in the group to that port.
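For illustration only, the following minimal Python sketch shows how such a Python program could send per-drone actions to the listening port using the msgpack-rpc-python package; the RPC method name setDroneActions, the port number and the action layout are assumptions and are not part of the disclosed interface.

# Illustrative sketch: method name "setDroneActions", port 41451 and the
# [vx, vy, fire_target] layout are assumptions, not the disclosed interface.
import msgpackrpc  # msgpack-rpc-python package

client = msgpackrpc.Client(msgpackrpc.Address("127.0.0.1", 41451), timeout=10)

# one entry per unmanned aerial vehicle: [vx, vy, fire_target] (0 = do not fire)
actions = [
    [0.5, -0.2, 0],
    [1.0,  0.0, 3],
]
client.call("setDroneActions", actions)  # serialized and sent as a msgpack RPC data packet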
The invention has the advantages that:
(1) The invention designs an unmanned aerial vehicle group combat simulation environment based on AirSim, keeping it as general and extensible as possible while staying close to the physical environment and real scenarios. A multi-agent reinforcement learning algorithm is then adapted to the unmanned aerial vehicle group and used to train it. After an intelligent unmanned aerial vehicle group model is obtained, it is attacked comprehensively with the unmanned aerial vehicle group attack methods; these methods are more comprehensive than existing ones and, in particular, creatively provide attack algorithms based on minority groups and majority groups, which realize indirect attacks on the own unmanned aerial vehicle group by manipulating the opposing group. After the attacks, simulation verification of the unmanned aerial vehicle group is carried out, the behavior of the model under various conditions is comprehensively evaluated and analyzed, and the vulnerability mechanism of the model is revealed, which helps to test the effect of unmanned aerial vehicle group attack algorithms and lays a foundation for improving the robustness of the unmanned aerial vehicle group.
(2) The invention builds on current multi-agent reinforcement learning algorithms and, when deploying them, simultaneously adjusts the algorithm design, the environment design and the parameters of both according to the training results, so that the behavior of the unmanned aerial vehicle group model becomes more intelligent.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a schematic flow chart of a method for controlling the countermeasures of a group of unmanned aerial vehicles according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a population of unmanned aerial vehicles constructed in an embodiment of the present invention;
FIG. 3 is a schematic view of an observation space of a group of unmanned aerial vehicles constructed in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the unmanned aerial vehicle group action space constructed in an embodiment of the invention;
FIG. 5 is a schematic diagram of the reward functions of the unmanned aerial vehicle group constructed in an embodiment of the invention;
FIG. 6 is a schematic diagram of a training strategy for a drone swarm according to an embodiment of the present invention;
FIG. 7 is a code structure diagram of a drone swarm attack algorithm according to an embodiment of the present invention;
FIG. 8 is a code structure diagram of a drone swarm countermeasure platform according to an embodiment of the invention;
FIG. 9 is an overview of a map of unmanned air vehicle group engagement in an embodiment of the invention;
FIG. 10 is a schematic diagram of an unmanned aerial vehicle group countermeasure control system according to an embodiment of the present invention;
FIG. 11 is a block diagram of the overall framework of the unmanned aerial vehicle group countermeasure control system according to an embodiment of the present invention;
Fig. 12 is a schematic diagram of the communication connection between the unmanned aerial vehicle group countermeasure control system and Python in an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, a first embodiment of the present invention discloses a method for controlling the countermeasure of a group of unmanned aerial vehicles, which includes the steps of:
s10, constructing an unmanned aerial vehicle group countermeasure environment based on an AirSim simulation environment;
It should be noted that this embodiment designs the combat environment of the unmanned aerial vehicle group on the basis of the AirSim simulation environment, which has high fidelity, can face real-world unmanned aerial vehicle group combat scenarios, and can support more realistic training and evaluation of unmanned aerial vehicle group algorithms. In order for the environment to provide a unified external interface, the invention complies with the OpenAI Gym interface format.
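As a purely illustrative aid, a minimal Python sketch of an environment skeleton exposing the OpenAI Gym interface mentioned above is given below; the class name, constructor arguments and observation dimension are assumptions, not the actual implementation.

# Minimal sketch of a Gym-style drone-swarm environment skeleton.
# DroneSwarmEnv, n_allies, n_enemies and obs_dim are illustrative assumptions.
import gym
import numpy as np

class DroneSwarmEnv(gym.Env):
    def __init__(self, n_allies=3, n_enemies=3, obs_dim=64):
        self.n_allies, self.n_enemies, self.obs_dim = n_allies, n_enemies, obs_dim

    def reset(self):
        # one local observation per allied drone (decentralized execution)
        return [np.zeros(self.obs_dim, dtype=np.float32) for _ in range(self.n_allies)]

    def step(self, actions):
        # actions: one hybrid (fire, move) action per drone; placeholder dynamics
        obs = [np.zeros(self.obs_dim, dtype=np.float32) for _ in range(self.n_allies)]
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info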
S20, training by using a multi-agent reinforcement learning training strategy in an unmanned aerial vehicle group countermeasure environment to obtain an unmanned aerial vehicle group model;
It should be noted that the task execution and interaction process of an unmanned aerial vehicle group is a complex and precise system, comprising the execution of each individual unmanned aerial vehicle's policy, the communication and interaction within the group, and the central control policy that guides the whole group; behind this complex behavior lies a series of policy learning and optimization steps based on the multi-agent game. An unmanned aerial vehicle group trained with a single-agent algorithm easily drifts toward selfish or locally optimal strategies because it lacks the guidance of a cooperative strategy, which does not match the functional assumption of an unmanned aerial vehicle group; therefore, a multi-agent reinforcement learning algorithm is selected for training the unmanned aerial vehicle group model.
It should be noted that, similar to choosing among different algorithms in supervised learning, the invention comprehensively analyzes current mainstream multi-agent reinforcement learning algorithms and selects a suitable one to adapt to the unmanned aerial vehicle group combat environment, striving to obtain an intelligent unmanned aerial vehicle group model.
S30, an unmanned aerial vehicle group attack method is adopted to attack the unmanned aerial vehicle group model in an unmanned aerial vehicle group countermeasure environment, wherein the unmanned aerial vehicle group attack method comprises attack based on strategies, attack based on observation, attack based on rewarding functions, attack based on minority groups and attack based on majority groups;
It should be noted that this embodiment provides two indirect attack methods that manipulate the opposing unmanned aerial vehicle group, namely the minority-group-based attack and the majority-group-based attack, making the set of attack methods more comprehensive.
S40, performing simulation verification of the unmanned aerial vehicle group in an unmanned aerial vehicle group countermeasure environment, and evaluating the behavior of the unmanned aerial vehicle group model after attack.
This embodiment designs an unmanned aerial vehicle group combat simulation environment based on AirSim, keeping it as general and extensible as possible while staying close to the physical environment and real scenarios. A multi-agent reinforcement learning algorithm is then adapted to the unmanned aerial vehicle group and used to train it; after an intelligent unmanned aerial vehicle group model is obtained, the model is attacked comprehensively with the unmanned aerial vehicle group attack methods, which creatively include attack algorithms based on minority groups and majority groups and thereby realize indirect attacks on the own unmanned aerial vehicle group by manipulating the opposing group. After the attacks, simulation verification of the unmanned aerial vehicle group is carried out, the behavior of the model under various conditions is comprehensively evaluated and analyzed, and the vulnerability mechanism of the model is revealed, which helps to test the effect of unmanned aerial vehicle group attack algorithms and lays a foundation for improving the robustness of the unmanned aerial vehicle group.
In one embodiment, the step S10: constructing an unmanned aerial vehicle group countermeasure environment based on an AirSim simulation environment, comprising the following steps:
S11, setting the flight of the unmanned aerial vehicles as two-dimensional planar motion with each unmanned aerial vehicle at a different altitude, and setting the observation range, attack range, movement information and firing information of each unmanned aerial vehicle;
S12, initializing each unmanned aerial vehicle with a full health value, a full magazine and an initial speed of 0, and spawning it at a random position on the map according to a Gaussian distribution;
S13, setting an observation space and a state space for each unmanned aerial vehicle, wherein the observation space comprises movement features, enemy-aircraft features, friendly-aircraft features and own features, and the state space is either a complex state space or a simple state space obtained by concatenating the observation spaces of all unmanned aerial vehicles of the same camp;
S14, setting a hybrid action space for each unmanned aerial vehicle, in which a discrete action space is used for the firing action and a continuous action space is used for the movement action;
S15, setting the reward function as a sparse reward function or a dense reward function.
It should be noted that the multi-agent reinforcement learning environment designed in this embodiment mainly comprises: the task scenario, unit modeling, initialization, the observation space, the state space, the action space and the reward function.
Specifically, in step S11, the flight of the unmanned aerial vehicles is set as two-dimensional planar motion with each unmanned aerial vehicle at a different altitude, and the observation range, attack range, movement information and firing information are set as follows: the red side and the blue side respectively have n_allies and n_enemies unmanned aerial vehicles of identical specification carrying identical weapons; the unmanned aerial vehicles of both sides fight in an open space of a specified size, and the goal of each unmanned aerial vehicle is to shoot down the opposing unmanned aerial vehicles and thereby win the engagement. In order to let the unmanned aerial vehicle group focus on the combat task as much as possible, this embodiment limits the flight of all unmanned aerial vehicles to two-dimensional planar motion. Considering the possible conflict between the movement strategy required by the combat task and collision avoidance, this embodiment sets each unmanned aerial vehicle at a different altitude, so that collisions between unmanned aerial vehicles do not interfere with the cluster behavior required for combat; this assumption does not affect the realism of the unmanned aerial vehicle group simulation.
Each unmanned aerial vehicle has an observation range of 12 units and an attack range of 6 units; only other unmanned aerial vehicles (friendly or enemy) within the observation range can be observed by it, and likewise only enemy unmanned aerial vehicles within the attack range can be attacked by it.
Each unmanned aerial vehicle is set to withstand 12 hits from enemy weapons. The magazine of each unmanned aerial vehicle holds 5 units of ammunition; when the magazine is empty it takes 2 time units to reload, during which no ammunition can be fired at the enemy. Within 1 time unit, each unmanned aerial vehicle can move at most 1 unit of distance in an arbitrary direction, but it must not leave the specified exercise field of 32 units; as soon as it crosses the boundary, the unmanned aerial vehicle crashes immediately. A schematic diagram of the unmanned aerial vehicle group countermeasure scenario is shown in FIG. 2.
Further, in step S12, the unmanned aerial vehicles are spawned at random positions on the map according to a Gaussian distribution, specifically:

x ~ Gaussian(μ, σ), Gaussian(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))

μ ~ Uniform{(X_L, Y_L), (X_R, Y_R)} + Uniform(−2, 2), σ ~ Uniform(1, 4)

wherein x is the initial position of an unmanned aerial vehicle in the group; (X_L, Y_L) are the coordinates of the left-side central birth point of the unmanned aerial vehicle group; (X_R, Y_R) are the coordinates of the right-side central birth point of the unmanned aerial vehicle group; Gaussian(x) is a Gaussian distribution over the random variable x with mean μ and standard deviation σ; Uniform(−a, b) is a uniform distribution over [−a, b]; exp() is the exponential function; μ ~ Uniform{(X_L, Y_L), (X_R, Y_R)} + Uniform(−2, 2) means that the mean of the unmanned aerial vehicle's position follows a Gaussian distribution centered near (X_L, Y_L) or (X_R, Y_R); σ ~ Uniform(1, 4) means that the standard deviation follows a uniform distribution between 1 and 4.
In order to maximize the randomness of the environment, randomness is introduced in the choice of the left or right spawn side, the mean of the Gaussian distribution and its variance, and is combined with the randomness of Gaussian sampling itself, ensuring that the trained unmanned aerial vehicle group model is not tied to one or a few particular initial position distributions.
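For illustration, a minimal Python sketch of this spawning procedure under the stated distributions is given below; the concrete coordinates of the two birth points are assumptions.

# Sketch of the randomized spawn: the swarm mean is drawn near the left or
# right birth point with Uniform(-2, 2) jitter, the spread follows Uniform(1, 4).
import numpy as np

rng = np.random.default_rng()

def spawn_positions(n_drones, left=(-12.0, 0.0), right=(12.0, 0.0)):  # assumed birth points
    center = np.array(left if rng.random() < 0.5 else right)
    mu = center + rng.uniform(-2.0, 2.0, size=2)   # mean ~ Uniform{left, right} + Uniform(-2, 2)
    sigma = rng.uniform(1.0, 4.0)                  # standard deviation ~ Uniform(1, 4)
    return rng.normal(loc=mu, scale=sigma, size=(n_drones, 2))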
In one embodiment, in step S13: the unmanned aerial vehicle group countermeasure environment designed in this embodiment follows the logic of the centralized-training, decentralized-execution framework; the observation space of each unmanned aerial vehicle contains only the local environment information around it and is used for decentralized execution, whereas the global information of the environment exists only in the state space and is used for centralized training. Compared with giving every unmanned aerial vehicle in the group a global observation, restricting each unmanned aerial vehicle to a local observation better reflects the limited perception capability of a single unmanned aerial vehicle in the real world.
In general, the observation space of each unmanned aerial vehicle comprises movement features, enemy-aircraft features, friendly-aircraft features and own features; a schematic diagram of the observation space is shown in FIG. 3, wherein:
(1) Movement features: so that the unmanned aerial vehicle can recognize that the exercise field has boundaries of a certain size and learn the rule that it must not leave the field, the distances from the current unmanned aerial vehicle to the four boundaries of the exercise field are included in the observation space.
(2) Enemy-aircraft features: each unmanned aerial vehicle can only observe enemy aircraft that are currently alive and within its observation range, including whether the enemy aircraft is within its attack range, the distance to the enemy aircraft, the X and Y coordinates relative to the enemy aircraft, and the X-direction and Y-direction component speeds relative to it.
For a real unmanned aerial vehicle these values are easy to obtain from onboard sensors, so the observation space must include the above fields. In addition, considering the characteristics of different unmanned aerial vehicles, this embodiment supports three parameters, obs_end_health, obs_end_capacity and obs_end_cooldown, to determine whether the enemy aircraft's health value, ammunition quantity and weapon cooldown time are included in the observation space. For an enemy aircraft that is outside the observation range or has crashed, the corresponding enemy-aircraft feature part is 0.
(3) Friendly-aircraft features: each unmanned aerial vehicle can only observe friendly aircraft that are currently alive and within its observation range, including whether the friendly aircraft is visible, the distance to it, the X and Y coordinates relative to it, and the X-direction and Y-direction component speeds relative to it.
For a real unmanned aerial vehicle these values are easy to obtain from onboard sensors, so the observation space must include the above fields. In addition, considering the characteristics of different unmanned aerial vehicles, the invention supports three parameters, obs_all_health, obs_all_capacity and obs_all_cooldown, to determine whether the friendly aircraft's health value, ammunition quantity and weapon cooldown time are included in the observation space. For a friendly aircraft that is outside the observation range or has crashed, the corresponding friendly-aircraft feature part is 0.
(4) Own features: each unmanned aerial vehicle can observe its own health value, ammunition quantity and weapon cooldown time, together with a one-hot vector of its own id, which helps the shared unmanned aerial vehicle Actor network to distinguish different unmanned aerial vehicles.
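For illustration, a minimal Python sketch of how one unmanned aerial vehicle's local observation could be assembled from the four feature groups above is given below; the exact field layout and dimensions are assumptions.

# Sketch: concatenate movement, enemy, friendly and own features into one vector.
import numpy as np

def build_observation(move_feats, enemy_feats, ally_feats, own_feats):
    # move_feats : distances to the four field boundaries, shape (4,)
    # enemy_feats: per enemy [in_attack_range, dist, dx, dy, dvx, dvy], zeros if unseen/crashed
    # ally_feats : per ally  [visible, dist, dx, dy, dvx, dvy], zeros if unseen/crashed
    # own_feats  : [health, ammo, weapon_cooldown] + one-hot id
    return np.concatenate([np.ravel(move_feats), np.ravel(enemy_feats),
                           np.ravel(ally_feats), np.ravel(own_feats)]).astype(np.float32)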
In an embodiment, when setting the state space of the unmanned aerial vehicles, two state spaces are provided through the obs_instead_of_state parameter. When obs_instead_of_state=0, the observation spaces of the unmanned aerial vehicles of the same camp are directly concatenated as a simple state space; when obs_instead_of_state=1, the more complex state space is used: for each surviving unit it contains the unmanned aerial vehicle's health value, its X and Y coordinates relative to the center of the exercise field, its absolute component speeds in the X and Y directions, its ammunition quantity and its weapon cooldown time, and the part corresponding to a crashed unit is 0.
In addition, a state_last_action parameter can be set: when state_last_action=0, the last action of each unit is not added to the state; when state_last_action=1, it is added. Similar to the design of the observation space, the state space contains a one-hot vector of each unit's own id, which helps the shared unmanned aerial vehicle Critic network to distinguish different unmanned aerial vehicles.
In an embodiment, in step S14, a hybrid action space is set for each unmanned aerial vehicle, in which the firing action uses a discrete action space and the movement action uses a continuous action space, specifically:
the firing action uses a discrete action space in which 0 means not firing and each remaining dimension means attacking the enemy aircraft with the corresponding number; this embodiment assumes that attacks are directional and always hit. The movement action uses a continuous action space, and flight control is performed with the component speeds in the X and Y directions.
A schematic diagram of the unmanned aerial vehicle group action space is shown in FIG. 4. The hybrid action space set in this embodiment differs from the purely discrete and purely continuous action spaces of mainstream environments; because an unmanned aerial vehicle can fire and move at the same time, the environment adopts a hybrid action space design.
Further, this embodiment adds an available-action mask to the action space, which limits the action space of the unmanned aerial vehicle and prohibits illegal actions (for example, attacking an enemy unmanned aerial vehicle that is outside the attack range or has crashed, or firing when there is no ammunition). For the movement action, since the invention has not found a way to limit out-of-bounds movement through this mask, it is left unconstrained and the corresponding values are set to 1. For the firing action, dimension 0 is always 1, meaning the unmanned aerial vehicle may always choose not to fire, allowing it to save ammunition in the current time unit in exchange for a possibly larger long-term reward; each remaining dimension is 1 if and only if the enemy aircraft with the corresponding number is within the attack range and alive and the unmanned aerial vehicle has ammunition. When an unmanned aerial vehicle fires, the environment sets its movement speed to 0.1 times the original speed, simulating the aiming process during launch. A crashed unmanned aerial vehicle cannot execute any action, and its action space and available-action mask have no meaning.
In each time unit, the environment receives an action input from the outside indicating the actions of the unmanned aerial vehicle group in the current time unit, and executes the firing and movement actions of every unmanned aerial vehicle simultaneously. After the actions of the current time unit are finished, the position of each unmanned aerial vehicle is checked, out-of-bounds unmanned aerial vehicles are removed, the survival and health changes of both sides are counted, it is determined whether the engagement has ended, and the reward function of the current time step is calculated. To reduce the number of invalid actions that may occur in the late stage of a game, the current game ends when it reaches step 100, regardless of whether either side has won.
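For illustration, the following Python sketch shows one way to declare such a hybrid action space and its available-action mask with the Gym spaces API; the number of enemies and the speed limit are assumptions.

# Sketch of the hybrid action space: a discrete firing head (0 = hold fire,
# i = attack enemy i) plus a continuous (vx, vy) movement head, with a mask.
import numpy as np
from gym import spaces

n_enemies, max_speed = 3, 1.0
fire_space = spaces.Discrete(n_enemies + 1)
move_space = spaces.Box(low=-max_speed, high=max_speed, shape=(2,), dtype=np.float32)

def fire_mask(has_ammo, enemy_in_range_and_alive):
    mask = np.zeros(n_enemies + 1, dtype=np.int8)
    mask[0] = 1                              # "do not fire" is always available
    if has_ammo:
        mask[1:] = enemy_in_range_and_alive  # per-enemy availability (0/1 vector)
    return mask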
In one embodiment, step S15: the reward function is set as a sparse reward function or a dense reward function, specifically through the reward_sparse parameter.
Specifically, the environment defines victory as: within the time-step limit, all enemy unmanned aerial vehicles have crashed while at least one own unmanned aerial vehicle survives; defeat as: within the time-step limit, all own unmanned aerial vehicles have crashed while at least one enemy unmanned aerial vehicle survives; and a tie as: at some time step within the limit, all own and all enemy unmanned aerial vehicles have crashed. For an engagement that does not end within the time-step limit, the outcome is not defined. The dense reward function lets the unmanned aerial vehicle group obtain a reward at every time step, which avoids the credit-assignment problem caused by a sparse reward function and makes the group strategy easier to learn.
As shown in FIG. 5, for the sparse reward function the reward is 1 only at the moment of victory, −1 at the moment of defeat, and 0 at all other times; for the dense reward function, a reward_only_positive parameter can further be set to choose between a one-sided and a two-sided reward function.
One-sided reward function: the reward of the current time step equals the number of enemy aircraft newly crashed at the current time step. If victory is achieved at the current time step, the victory item reward_win is added.
Two-sided reward function: on the basis of the one-sided reward function, symmetrical items are added: the number of own unmanned aerial vehicles crashed at the current time step weighted by reward_death_value, and the amount of damage suffered by the own unmanned aerial vehicles at the current time step scaled by reward_negative_scale. If defeat occurs at the current time step, the defeat item reward_defeat is added.
To encourage coordination within the unmanned aerial vehicle group, the invention gives the same reward function to all unmanned aerial vehicles of the same side. Furthermore, the invention supports setting the reward_scale and reward_scale_rate parameters to scale the reward function so that its value stays within a certain range.
It should be noted that for out-of-bounds behavior the method does not directly define a dedicated reward term; instead, by forcibly crashing an out-of-bounds unmanned aerial vehicle, the other reward terms indirectly guide the unmanned aerial vehicle group model to respect the boundary, so that it learns the rule that it must not move beyond the boundary.
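For illustration, a minimal Python sketch of a dense, shared two-sided reward of the kind described above is given below; the parameter names mirror the reward_* options mentioned in this section, but the exact combination of terms and the default values are assumptions.

# Sketch of a dense shared reward; every drone of the same side receives this value.
def dense_reward(enemies_killed, own_killed, damage_dealt, damage_taken, won, lost,
                 reward_win=200.0, reward_defeat=0.0, reward_death_value=10.0,
                 reward_negative_scale=0.5, reward_only_positive=False):
    r = reward_death_value * enemies_killed + damage_dealt
    if won:
        r += reward_win
    if not reward_only_positive:                      # two-sided variant
        r -= reward_negative_scale * (reward_death_value * own_killed + damage_taken)
        if lost:
            r -= reward_defeat
    return r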
In an embodiment, as shown in FIG. 6, the multi-agent reinforcement learning training strategy comprises a first training strategy and a second training strategy. The first training strategy sets a discrete Actor network corresponding to the firing action, a movement Actor network corresponding to the movement action, and a Critic network; the discrete Actor network and the movement Actor network share a state-encoding network and respectively output the logits of each action dimension of the discrete action space and the Gaussian-distribution mean of the continuous action space, while the Critic network estimates the state value used for training.
The second training strategy, on the basis of the first training strategy, additionally feeds the action output by the discrete Actor network, as a one-hot vector together with the observed feature vector, into the movement Actor network.
It should be noted that the design of the unmanned aerial vehicle group action space greatly influences the choice of the multi-agent reinforcement learning algorithm. The QMIX algorithm is limited to discrete action spaces, and the MADDPG algorithm can be extended from continuous to discrete action spaces through the reparameterization trick, but its effectiveness and training efficiency cannot meet the complexity of unmanned aerial vehicle group combat scenarios. The MAPPO algorithm is applicable to both discrete and continuous action spaces and achieves good results in representative environments of both kinds. Therefore, this embodiment selects the MAPPO algorithm as the main algorithm for training the unmanned aerial vehicle group model.
It should be noted that, because the action-space logic of the unmanned aerial vehicle differs greatly from the four common hybrid action spaces (parameterized, multi-discrete, multi-binary and hierarchical action spaces), directly applying the MAPPO algorithm does not yield ideal results in this embodiment. For example, since the movement Actor network and the firing Actor network share the same reward function, in many cases the reward cannot be accurately attributed to one particular Actor, which ultimately results in a trained unmanned aerial vehicle group model that never learns to retreat from enemy attacks when it has no ammunition. Therefore, the invention adopts two special schemes that organically combine the two otherwise unrelated action spaces. The training strategies for the unmanned aerial vehicle group are of two kinds:
Scheme A (first training strategy): because both Actor networks need to extract features from the unmanned aerial vehicle's observation, the bottom-level feature-extraction networks of the two Actors are shared, and the upper-level action-output heads respectively give the logits of each action dimension of the discrete action space and the Gaussian-distribution mean of the continuous action space.
Scheme B (second training strategy): on the basis of scheme A, considering that the firing action can guide the movement action (for example, firing is only possible when moving within the attack range, and the enemy's attack range should be escaped as much as possible while the weapon is cooling down), the action output by the discrete Actor network is fed, as a one-hot vector together with the observed feature vector, into the continuous Actor network, so that the continuous Actor network obtains more information and acts more intelligently.
In addition, in order to better transfer information between unmanned aerial vehicles, the invention shares the parameters of the firing Actor networks and the movement Actor networks of all unmanned aerial vehicles of the same camp, so that the unmanned aerial vehicle group can communicate implicitly, and the shared policy is optimized with the shared reward function.
Further, the training strategy set in this embodiment supports the two main training modes in reinforcement learning, rule-based training and Self-play-based training, and a trainer can select a suitable mode according to its needs. For rule-based training, the invention embeds a manually designed rule: a surviving unmanned aerial vehicle preferentially selects the nearest enemy aircraft as its target; if the target is not within its attack range, it approaches the target at maximum speed, with the velocity pointing along the line connecting the two unmanned aerial vehicles toward the target; otherwise, if ammunition is currently available, it attacks the target; if it is currently reloading, it moves away from the target at maximum speed, without crossing the boundary, with the velocity pointing along the connecting line away from the target.
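For illustration, a minimal Python sketch of this built-in rule is given below; the geometry helpers, the constants and the boundary handling (omitted here) are assumptions.

# Sketch of the rule-based opponent: attack the nearest in-range enemy if
# ammunition is available, close in otherwise, and back off while reloading.
import numpy as np

ATTACK_RANGE, MAX_SPEED = 6.0, 1.0

def rule_policy(own_pos, has_ammo, reloading, enemy_positions):
    deltas = enemy_positions - own_pos
    dists = np.linalg.norm(deltas, axis=1)
    target = int(np.argmin(dists))                    # nearest surviving enemy
    direction = deltas[target] / (dists[target] + 1e-8)
    if dists[target] > ATTACK_RANGE:
        return 0, MAX_SPEED * direction               # approach the target, hold fire
    if has_ammo:
        return target + 1, np.zeros(2)                # fire at the target
    if reloading:
        return 0, -MAX_SPEED * direction              # retreat while reloading
    return 0, np.zeros(2)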
For Self-play-based training, the unmanned aerial vehicle group environment can switch between the perspectives of the two sides, i.e., concepts such as the observation space and the reward function can be provided from the red side's perspective and from the blue side's perspective respectively.
In addition, the invention enables the opponent-model sampling technique by default during Self-play training: while training one side's model, the opponent's model is randomly selected from different moments (not only the latest one), so that the performance of both models keeps improving and the model does not learn only how to counter the latest opponent while forgetting how to counter earlier ones.
It should be noted that adversarial attacks on multi-agent reinforcement learning are still at an early stage, and attack means covering different agents, observations, actions and reward functions are still lacking; the invention therefore provides an all-around attack algorithm to assist the adversarial evaluation of the unmanned aerial vehicle group. The unmanned aerial vehicle group attack principle adopted in this embodiment is shown in FIG. 7, and the adopted unmanned aerial vehicle group attack methods are described below:
(1) Policy-based attack
A policy-based attack means that, in the unmanned aerial vehicle countermeasure scenario, there is an attacked "insider" in the own unmanned aerial vehicle group that affects the normal behavior of the own group by taking abnormal actions. A policy-based attack performs a black-box attack in the test phase of the attacked model, but the attack process involves training an adversarial policy, implemented by replacing the policy of one agent in the multi-agent environment. A policy-based attack does not require changing the environment or the reward function; it only requires placing an agent with an adversarial policy to cooperate or compete with the normal agents.
Assume that the policy of the agent with the adversarial policy is π_α and the policy of the attacked normal agent is π_ν. During training of the adversarial policy, π_ν is fixed; the adversarial agent treats the attacked agent as part of the environment, and a reward function is designed that rewards the adversarial agent for degrading the performance of the attacked normal agent. The optimization target can be expressed by the following formula:

max_{π_α} E[ Σ_t γ^t · R_adv(s_t, a_t^α) ],  a_t^α ~ π_α(·|o_t)

wherein R_adv is the reward function of the adversarial agent, which assigns rewards according to the specific adversarial objective; s_t is the state at the current moment; a_t^α is the adversarial action; γ is the discount factor; π_α(·|o_t) is the policy of the adversarial agent; γ^t is the discount factor raised to the power of the time step; a_t^α is the adversarial action sampled from the policy π_α.
In this embodiment, the policy-based adversarial sample generation trains the adversarial policy with a reinforcement learning method; the adversarial policy is generated against a specific target policy and therefore has a targeted attack effect.
Further, the policy-based attack uses a single-agent or multi-agent reinforcement learning algorithm to train one or more unmanned aerial vehicle models with adversarial policies, called "insider" unmanned aerial vehicles, whose goal is to make the own side lose the game as far as possible. The usual adversarial policy is only applicable to two-player game scenarios; here, the policy of one unmanned aerial vehicle in the group is replaced by the adversarial policy so that it becomes the "insider", and training the adversarial policy can be regarded as a single-agent reinforcement learning task. Assuming the reward of the adversarial policy is r′, the policy-based attack can be formulated as follows:
v(π) = π_α(o),  max_{π_α} E[ Σ_t γ^t · r′_t ],  r′_t = −r_t

wherein v(π) is the perturbation applied to the unmanned aerial vehicle group strategy π; r′_t = −r_t is the attacker's reward function; π_α(o) is the policy of the adversarial agent; γ^t is the discount factor raised to the power of the time step; t is the time step.
This embodiment uses a reinforcement learning method to generate the adversarial policy, training it while interacting with the attacked agents, so that its attack is optimal from a long-term perspective. In addition, the attack assumes a two-player zero-sum game between the attacker and the defender, so the attacker can interfere with the defender's objective to the greatest extent.
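For illustration, a minimal Python sketch of the data collection for this "insider" training is given below: one drone of the attacked swarm is controlled by the adversary and is rewarded with the negated team reward, while the other drones keep their fixed trained policies; the environment and trainer interfaces are assumptions.

# Sketch of policy-based ("insider") attack rollouts; r' = -r for the adversary.
def rollout_with_insider(env, fixed_policies, adversary, insider_idx):
    obs = env.reset()
    trajectory, done = [], False
    while not done:
        actions = [pi(o) for pi, o in zip(fixed_policies, obs)]
        actions[insider_idx] = adversary.act(obs[insider_idx])
        next_obs, team_reward, done, _ = env.step(actions)
        # the adversary is rewarded for the team's failure
        trajectory.append((obs[insider_idx], actions[insider_idx], -team_reward))
        obs = next_obs
    return trajectory  # fed to a single-agent RL update of the adversarial policy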
(2) Observation-based attack
An observation-based attack means that the state observed by the unmanned aerial vehicle group is disturbed so that the group takes wrong actions, thereby negatively affecting the final reward. The observation-based attack can be expressed by the following formula:
min_{ν} E[ Σ_t γ^t · r_t ]

s.t. ν(s) ∈ B(s), s′ ~ P(s′|s, {a_i}), {a_i} ~ π(·|ν(s))

wherein ν(s) is the perturbation applied to the state space; B(s) is the space within which the state can be perturbed; π(·|ν(s)) is the policy of the unmanned aerial vehicles; a_i is the action of the i-th unmanned aerial vehicle; P(s′|s, {a_i}) is the state transition probability; s′ is the state of the next time step; γ^t is the discount factor raised to the power of the time step; r_t is the reward function of the cooperating agents; ~ denotes sampling from a probability distribution.
It should be noted that the observation-based attack is only carried out in the test phase of the model, since a multi-agent algorithm following the CTDE framework takes the global state as input during training but only the local observation of each agent during testing. The invention adopts gradient-based attack methods commonly used in the computer vision community (e.g., the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD)), which add adversarial noise to the local observation space of one or more unmanned aerial vehicles in the test phase.
Compared with existing observation-based attacks, the attack mode proposed in this embodiment uses reinforcement learning to generate the perturbation target, so the result of the observation-based attack is optimal in the long term. In addition, because the observation space of the unmanned aerial vehicle group has a high dimension and is difficult to optimize directly with a reinforcement learning algorithm, the method first uses the policy-based attack to generate the adversarial policy π_α(·|s_t), and then, given the output of the adversarial policy, uses the PGD algorithm to generate the adversarial noise.
A gradient-based attack determines the perturbation direction by computing the gradient of the network's loss function with respect to the input, and then adds the adversarial perturbation to the input by gradient ascent or similar means. The FGSM algorithm adds a one-shot perturbation to the input according to the sign of the gradient; it is computationally efficient but its attack capability is relatively weak. The PGD algorithm introduces an iterative idea on the basis of FGSM: each iteration projects a tiny perturbation into a specified range, and the superposition of multiple perturbations finally yields the adversarial sample. Because the attack capability of PGD is much stronger than that of FGSM, this embodiment selects the PGD algorithm as the framework of the observation-based attack; a nearly optimal first-order adversarial sample can be obtained within acceptable computation time.
In current multi-agent reinforcement learning, observation-based attacks mainly focus on QMIX, where the objective is to make the attacked agent take the action that minimizes Q_total. However, MAPPO has no network that evaluates the quality of individual actions, and a strong attack target is often not easy to find. For this reason, this embodiment selects the action of the "insider" unmanned aerial vehicle as the attack target: by applying a certain perturbation in the observation space, the attacked unmanned aerial vehicle model is made to execute the "insider" action as far as possible.
Specifically, define a* = argmax_a π′(a|s) as the optimal attack action output by the "insider" unmanned aerial vehicle policy, ν(o) as the adversarial noise added to the observation o, ∇ as the gradient operator, and π_o(a*|o) as the probability that the attacked unmanned aerial vehicle selects action a* under observation o. The observation-based attack can then be expressed as follows:
$$\nu(o) = \epsilon \cdot \operatorname{sign}\big(\nabla_{o}\, \pi_{o}(a^{*}\mid o)\big)$$
wherein ν(o) is the adversarial noise added to the observation o; ∇ is the gradient sign; π_o(a*|o) is the probability that the attacked unmanned aerial vehicle selects action a* under observation o; sign() is the sign function; ε constrains the added noise and is set to 0.01; o is the observation of each unmanned aerial vehicle. The PGD algorithm applies this gradient-sign update iteratively, projecting the accumulated noise back into the ε-ball after each step.
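For illustration, the following is a minimal PyTorch sketch of this observation-based attack. It assumes a differentiable per-drone policy network that maps a local observation to discrete-action logits (only the discrete firing head is attacked here); the policy interface, step size and iteration count are illustrative assumptions, and only ε = 0.01 follows the embodiment.

```python
# Minimal PGD sketch of the observation-based attack: perturb one drone's local
# observation inside an eps-ball so that its policy prefers the "insider" action a*.
import torch
import torch.nn.functional as F

def pgd_observation_attack(policy, obs, target_action, eps=0.01, alpha=0.0025, steps=10):
    """policy: maps an observation tensor to discrete-action logits (firing head).
    Returns the perturbed observation obs + nu(o)."""
    obs = obs.detach()
    noise = torch.zeros_like(obs)
    for _ in range(steps):
        perturbed = (obs + noise).requires_grad_(True)
        log_probs = F.log_softmax(policy(perturbed), dim=-1)
        # Targeted attack: raise the log-probability of the insider action a*.
        loss = log_probs[..., target_action].sum()
        grad = torch.autograd.grad(loss, perturbed)[0]
        # One FGSM-style step in the gradient-sign direction, then project the
        # accumulated noise back into the eps-ball (the PGD projection step).
        noise = (noise + alpha * grad.sign()).clamp(-eps, eps).detach()
    return obs + noise
```

Here target_action would be the insider action a* = argmax_a π′(a|s) produced by the policy-based attack; setting steps = 1 and alpha = eps recovers the single-step FGSM variant.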
(3) Attack based on reward function
Attacks based on the reward function aim to disturb the internal information of the unmanned aerial vehicle group and thereby disturb its behavior. Unlike policy-based and observation-based attacks, which are performed in the test phase of the model, reward-function-based attacks are performed in the training phase of the model. The reward-function-based attack provided by this embodiment better supports adversarial attacks on the robustness of the unmanned aerial vehicle group, so that the robustness of the unmanned aerial vehicle group algorithm can be evaluated.
The reward-function-based attack is a black-box attack: no model structure information needs to be known, and only control over a certain reward function is required. By perturbing the reward signal fed back to the unmanned aerial vehicle group by the environment, the group receives wrong reward information during training, and the resulting policy performs poorly. The reward-function-based attack may be formalized as follows:
$$\min_{\nu}\; G_{t} = \mathbb{E}\Big[\sum_{t}\gamma^{t} r_{t}\Big]\quad \text{s.t.}\;\; \nu(r)\in B(r),\;\; s' \sim P(s'\mid s,\{a_i\}),\;\; \{a_i\} \sim \pi(\cdot\mid s)$$
wherein ν(r) is the perturbation applied to the reward function; B(r) is the range of allowed perturbations to the reward function; P(s′|s,{a_i}) is the state transition function; G_t is the objective function of reinforcement learning; r_t is the reward function; γ is the discount factor and γ^t is the discount factor that increases exponentially with time; π(·|s) is the agent policy; s′ is the state reached after the agent selects action a in state s; {a_i} ~ π(·|o) denotes each agent i sampling its action from the policy; a_i is the action of agent i; s is the current state of the unmanned aerial vehicle.
In the unmanned aerial vehicle group countermeasure environment designed in this embodiment, all unmanned aerial vehicles in the same camp share the reward function, so the attack can be completed by disturbing only this shared reward function; this embodiment flips the sign of the k% of rewards with the largest values over all time steps of one match.
It should be noted that most current attacks based on reward functions are applied to off-policy algorithms, where the attacked data can be placed directly into the replay buffer; MAPPO, however, is an on-policy algorithm and has no replay buffer. For this reason, this embodiment adds the disturbance directly to the reward fed back by the environment. Assuming ν(r) is the adversarial disturbance applied to the reward function r, and the threshold of the k% rewards with the largest values is r_thresh, the reward-function-based attack may be formalized as follows:
$$\nu(r) = \begin{cases} -r, & r \geq r_{\text{thresh}} \\ r, & \text{otherwise} \end{cases}$$
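A minimal NumPy sketch of this sign-flip perturbation is given below; computing the threshold r_thresh as a percentile of the match's rewards and the particular value of k are illustrative assumptions consistent with flipping the top-k% rewards.

```python
# Sketch of the reward-function attack: flip the sign of the k% largest shared
# rewards collected in one match before they are used for policy updates.
import numpy as np

def flip_topk_rewards(rewards, k_percent=10.0):
    """rewards: 1-D array of the team's shared rewards over one match."""
    rewards = np.asarray(rewards, dtype=np.float64).copy()
    if rewards.size == 0:
        return rewards
    # r_thresh: the value above which a reward belongs to the top k%.
    r_thresh = np.percentile(rewards, 100.0 - k_percent)
    mask = rewards >= r_thresh
    rewards[mask] = -rewards[mask]   # the sign flip is the perturbation nu(r)
    return rewards

# Example: with k_percent = 30, the three largest of these ten rewards are flipped.
print(flip_topk_rewards([0.1, 0.5, 2.0, -0.2, 1.5, 0.0, 0.3, 3.0, 0.4, 0.2], 30.0))
```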
(4) Minority group-based attacks
Attacks based on a minority group aim to strengthen, in a targeted way, a minority of the unmanned aerial vehicles in the opposing group, thereby reducing the intelligence of the own unmanned aerial vehicle group's behavior to a small extent. Unlike the previous three attack modes, which directly disturb the policy, state or reward function of one or more own unmanned aerial vehicles, the minority-group-based attack fixes the optimal cooperative policy of the own unmanned aerial vehicles and uses the MAPPO algorithm to normally train a minority of the opposing unmanned aerial vehicles to interact competitively with the own group; at test time it is a black-box attack.
Specifically, the minority-group-based attack fixes the policy of the own unmanned aerial vehicle group and retrains the policies of a minority (fewer than half of the total number) of the unmanned aerial vehicles in the opposing group. Notably, the policies of the non-attacker unmanned aerial vehicles in the opposing group are also fixed.
(5) Attack based on majority population
Attacks based on a majority group aim to strengthen, in a targeted way, a majority of the unmanned aerial vehicles in the opposing group, thereby greatly reducing the intelligence of the own unmanned aerial vehicle group's behavior. Similar to the minority-group-based attack, the majority-group-based attack does not directly disturb any element of the own unmanned aerial vehicle group: the optimal cooperative policy of the own unmanned aerial vehicles is fixed, the MAPPO algorithm is used to normally train a majority of the opposing unmanned aerial vehicles to interact competitively with the own group, and at test time it is likewise a black-box attack.
Specifically, the majority-group-based attack is similar to the minority-group-based attack: the policies of the own unmanned aerial vehicles are fixed, and a majority (more than half of the total number) of the unmanned aerial vehicles in the opposing group are retrained. In the invention, the majority group is defined as all of the opposing unmanned aerial vehicles in order to obtain the strongest attack capability. Because the number of opposing unmanned aerial vehicles whose policies are changed differs between the two attacks, the own unmanned aerial vehicles exhibit different misleading behaviors when attacked.
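The following structural sketch (not the patent's MAPPO implementation) illustrates the common setup of both attacks: the own swarm's trained cooperative policies are frozen, and only the chosen subset of opposing drones receives gradient updates; the agent counts, network shapes and placeholder linear policies are assumptions for illustration.

```python
# Structural sketch of the minority-/majority-group attacks: freeze the own
# swarm's policies and make only a chosen subset of opposing drones trainable.
import torch
import torch.nn as nn

N_OWN, N_OPP, OBS_DIM, N_ACTIONS = 5, 5, 32, 6

own_policies = [nn.Linear(OBS_DIM, N_ACTIONS) for _ in range(N_OWN)]
opp_policies = [nn.Linear(OBS_DIM, N_ACTIONS) for _ in range(N_OPP)]

# The own side keeps its fixed optimal cooperative strategy.
for pi in own_policies:
    pi.requires_grad_(False)

# Minority attack: retrain fewer than half of the opponents, e.g. 2 of 5.
# Majority attack: retrain all of them (attackers = list(range(N_OPP))).
attackers = [0, 1]
optimizer = torch.optim.Adam(
    [p for i in attackers for p in opp_policies[i].parameters()], lr=3e-4)

def act(obs):
    """Own drones and non-attacker opponents act without gradients; only the
    attacker subset keeps gradients, so policy updates reach it alone."""
    own = [pi(obs).argmax(-1) for pi in own_policies]
    opp = []
    for i, pi in enumerate(opp_policies):
        logits = pi(obs) if i in attackers else pi(obs).detach()
        opp.append(torch.distributions.Categorical(logits=logits).sample())
    return own, opp
```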
Specifically, the unmanned aerial vehicle group attack method is as follows:
TABLE 1 unmanned aerial vehicle group attack method
The control principle of the unmanned aerial vehicle group countermeasure provided by this embodiment is shown in fig. 8. From the perspectives of task scene, observation space, state space, action space and reward function design, a training and testing environment for the unmanned aerial vehicle group model is provided, as shown in fig. 9. The unmanned aerial vehicle group model is obtained by training with the training strategy; after a comprehensive analysis of current mainstream multi-agent reinforcement learning algorithms, the unmanned-aerial-vehicle-group attack algorithms are adapted to the MAPPO algorithm, which is currently the most effective on the unmanned aerial vehicle group task, and the scattered reinforcement learning attack algorithms are integrated and migrated to MAPPO. The policy-based, observation-based and reward-function-based attack algorithms are designed in a unified way, the minority-group-based and majority-group-based attack algorithms are proposed, and the unmanned aerial vehicle group model is attacked with them. Finally, based on the AirSim open-source simulation environment, an unmanned aerial vehicle group adversarial exercise field is constructed, the behavior of the unmanned aerial vehicle group model under various conditions is comprehensively evaluated and analyzed, and the vulnerability mechanism of the unmanned aerial vehicle group model is revealed.
As shown in fig. 10, a second embodiment of the present invention discloses an unmanned aerial vehicle group countermeasure control system. The system comprises an unmanned aerial vehicle group countermeasure environment constructed based on the AirSim simulation environment, an unmanned aerial vehicle group training module, an unmanned aerial vehicle group attack module, a simulation verification module, a deployed training strategy algorithm library and an unmanned aerial vehicle group attack algorithm library, wherein:
the unmanned aerial vehicle group training module 10 is used for calling a multi-agent reinforcement learning training strategy in a training strategy algorithm library to train in the unmanned aerial vehicle group countermeasure environment to obtain an unmanned aerial vehicle group model;
the unmanned aerial vehicle group attack module 20 is configured to invoke an unmanned aerial vehicle group attack method in an unmanned aerial vehicle group attack algorithm library to attack the unmanned aerial vehicle group model in the unmanned aerial vehicle group countermeasure environment, where the unmanned aerial vehicle group attack method includes a policy-based attack, an observation-based attack, a reward function-based attack, a minority group-based attack, and a majority group-based attack;
the simulation verification module 30 is used for performing simulation verification of the unmanned aerial vehicle group in the unmanned aerial vehicle group countermeasure environment, and evaluating the behavior of the unmanned aerial vehicle group model after attack.
It should be noted that, for the multi-agent reinforcement learning training strategy and the unmanned aerial vehicle group attack method adopted in the unmanned aerial vehicle group countermeasure control system of the present invention, reference may be made to the above method embodiments, and they are not repeated here.
In one embodiment, as shown in fig. 11, the system communicates with the Python program through the msgpack-RPC protocol over TCP/IP, and the simulation verification module is configured to listen for requests on the port corresponding to the Python program, so that the Python program sends RPC data packets containing the action of each unmanned aerial vehicle in the group to the port in the msgpack serialization format.
This embodiment uses the msgpack-RPC protocol over TCP/IP to complete the communication between the Python program and AirSim. A port number is specified for each run, AirSim listens for requests on that port throughout the simulation, and the Python program sends RPC data packets containing the actions of each unmanned aerial vehicle in the group to the port in the msgpack serialization format for interactive control with AirSim. In this way, the AirSim and Python programs are isolated from each other and do not interfere with each other. In addition, with corresponding port-number configuration, multi-machine simulation can be realized: one machine runs the Python program and one or more other machines run AirSim.
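As an illustration, a minimal sketch of this interaction using the AirSim Python client is given below; the port number, vehicle names and the velocity command are assumptions for illustration rather than values fixed by the invention.

```python
# Connect to an AirSim instance listening on a chosen RPC port and send one
# velocity command per drone over msgpack-RPC.
import airsim

PORT = 41451                       # AirSim's default RPC port; configurable per run
client = airsim.MultirotorClient(ip="127.0.0.1", port=PORT)
client.confirmConnection()         # blocks until the msgpack-RPC server answers

for name in ["Drone1", "Drone2"]:  # vehicle names defined in AirSim's settings.json
    client.enableApiControl(True, vehicle_name=name)
    client.armDisarm(True, vehicle_name=name)
    # Planar motion at a fixed altitude, matching the two-dimensional flight setting.
    client.moveByVelocityZAsync(1.0, 0.0, -10.0, 0.5, vehicle_name=name)

state = client.getMultirotorState(vehicle_name="Drone1")
print(state.kinematics_estimated.position)
```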
It should be noted that, in this embodiment, simulation verification of the unmanned aerial vehicle group is performed based on AirSim, and the unmanned aerial vehicle group in the AirSim simulation environment is controlled by the multi-agent reinforcement learning algorithm, so that the algorithm gradually updates its own policy in the trial-and-error process and obtains higher rewards.
Further, whereas most environments and code frameworks only support a reinforcement-learning-vs-rule mode, the unmanned aerial vehicle group combat environment provided by this embodiment supports both reinforcement-learning-vs-rule and reinforcement-learning-vs-reinforcement-learning modes, the accompanying code supports both training modes, and the environment is compatible with multiple mainstream attack modes. As shown in fig. 12, the code repository of the unmanned aerial vehicle group countermeasure platform comprises an attack algorithm library and a corresponding model library; it is only necessary to select a configuration file in the configuration library, and all attack algorithms can be trained and tested following the same procedure.
It should be noted that the MAPPO algorithm implemented in this framework outperforms other frameworks in model performance and is also excellent in training efficiency. Because MAPPO is an on-policy algorithm, its sample utilization is low and a large number of samples must be collected during model training. Therefore, the invention uses Python's multi-threading mechanism to run several AirSim environment instances on different ports in parallel; each communicates with a corresponding Python sub-thread to complete the sampling process, and the samples are then passed to the Python main thread for training the unmanned aerial vehicle group model. With sufficient hardware resources such as CPU, memory and graphics cards, the training time of the algorithm can be greatly shortened.
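A minimal sketch of this parallel sampling scheme is shown below; the port list, thread layout and rollout logic are illustrative assumptions, and each port is assumed to correspond to an already running AirSim instance (e.g. configured via the ApiServerPort entry of its settings.json).

```python
# Run one AirSim client per port in a worker thread and collect the sampled
# transitions in a queue that the main (training) thread reads.
import queue
import threading
import airsim

PORTS = [41451, 41452, 41453]      # one AirSim instance per port (assumed)
sample_queue = queue.Queue()

def rollout_worker(port, n_steps=200):
    client = airsim.MultirotorClient(ip="127.0.0.1", port=port)
    client.confirmConnection()
    for _ in range(n_steps):
        state = client.getMultirotorState(vehicle_name="Drone1")
        obs = state.kinematics_estimated.position
        # A real sampler would query the MAPPO policy here; this sketch just hovers.
        client.moveByVelocityZAsync(0.0, 0.0, -10.0, 0.2, vehicle_name="Drone1").join()
        sample_queue.put((port, obs))

threads = [threading.Thread(target=rollout_worker, args=(p,), daemon=True) for p in PORTS]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"collected {sample_queue.qsize()} samples from {len(PORTS)} environments")
```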
In order to obtain better visual effects (such as a custom view, a custom unmanned aerial vehicle model and the like), as shown in fig. 9, the invention builds a new map on a Windows system based on Unreal Engine 4.27 and installs the compiled AirSim v1.7.0 as a plug-in into the Unreal Engine project; the Python program controls the movement of the unmanned aerial vehicles and acquires their states by calling AirSim's multirotor-related APIs.
Compared with the prior art, the invention has the advantages that:
(1) A brand-new unmanned aerial vehicle group fight task environment is designed, a current multi-agent reinforcement learning algorithm is improved based on the characteristics of the environment, and the unmanned aerial vehicle group obtained through training shows a certain level of intelligence in behavior. Compared with the current simpler multi-agent reinforcement learning environment, the environment designed by the invention can be used for testing the effect of the multi-agent reinforcement learning algorithm in a complex scene.
(2) The invention integrates the attack algorithms based on strategies, observation and rewarding functions in multi-agent reinforcement learning, and creatively proposes the attack algorithms based on minority groups and majority groups. The platform provides all codes and models, is beneficial to testing the effect of an unmanned aerial vehicle group attack algorithm, and lays a foundation for improving the robustness of the unmanned aerial vehicle group.
(3) Based on a large number of experiments, the method starts from two angles of single behavior and group behavior, and the behavior of the unmanned aerial vehicle group model under normal conditions and under attack is analyzed in detail. Aiming at abnormal behaviors shown by the unmanned aerial vehicle group when being attacked, the intrinsic cause of vulnerability of the unmanned aerial vehicle group model when being against the attack is revealed, and a visual method is adopted to demonstrate the specific behaviors of the unmanned aerial vehicle group so as to verify the rationality of analysis.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (12)

1. An unmanned aerial vehicle group countermeasure control method, characterized in that the method comprises:
constructing an unmanned aerial vehicle group countermeasure environment based on the AirSim simulation environment;
training in an unmanned aerial vehicle group countermeasure environment by using a multi-agent reinforcement learning training strategy to obtain an unmanned aerial vehicle group model;
an unmanned aerial vehicle group attack method is adopted in an unmanned aerial vehicle group countermeasure environment to attack the unmanned aerial vehicle group model, and comprises a strategy-based attack, an observation-based attack, a reward function-based attack, a minority group-based attack and a majority group-based attack;
and performing simulation verification of the unmanned aerial vehicle group in an unmanned aerial vehicle group countermeasure environment, and evaluating the behavior of the unmanned aerial vehicle group model after attack.
2. The method of claim 1, wherein the constructing the drone swarm challenge environment based on the AirSim simulation environment comprises:
setting the flight of the unmanned aerial vehicles as two-dimensional planar motion with the unmanned aerial vehicles at mutually different heights, and setting the observation range, attack range, movement information and firing information of each unmanned aerial vehicle;
initializing each unmanned aerial vehicle with a full health value and a full magazine, randomly spawning it in the map according to a Gaussian distribution, with an initial speed of 0;
setting an observation space and a state space of each unmanned aerial vehicle, wherein the observation space comprises a movement characteristic, an enemy-aircraft characteristic, a friendly-aircraft characteristic and a self characteristic, and the state space is a complex state space or a simple state space obtained by splicing the observation spaces of the unmanned aerial vehicles of the same camp in the unmanned aerial vehicle group;
setting a mixed action space of each unmanned aerial vehicle, wherein in the mixed action space, discrete action space is used for firing action, and continuous action space is used for moving action;
setting the bonus function as a sparse bonus function or a dense bonus function.
3. The unmanned aerial vehicle group countermeasure control method of claim 2, wherein the Gaussian distribution is expressed as:
$$\mathrm{Gaussian}(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big),\quad \mu \sim \mathrm{Uniform}\{(X_L, Y_L), (X_R, Y_R)\} + \mathrm{Uniform}(-2, 2),\quad \sigma \sim \mathrm{Uniform}(1, 4)$$
wherein x is the initial position of the unmanned aerial vehicle group; (X_L, Y_L) are the coordinates of the left-side central spawn point of the unmanned aerial vehicle group; (X_R, Y_R) are the coordinates of the right-side central spawn point of the unmanned aerial vehicle group; Gaussian(x) is a Gaussian distribution of the random variable x with mean μ and standard deviation σ; Uniform(-a, b) is a uniform distribution over the range [-a, b]; exp() is the exponential function; μ ~ Uniform{(X_L, Y_L), (X_R, Y_R)} + Uniform(-2, 2) means that the mean of the unmanned aerial vehicle's position is chosen uniformly from the coordinates (X_L, Y_L) and (X_R, Y_R) plus uniform noise in (-2, 2), and the standard deviation satisfies a uniform distribution between 1 and 4.
4. The unmanned aerial vehicle group countermeasure control method of claim 2, wherein, in the set observation space, the movement characteristic is the distances from the current unmanned aerial vehicle to the four boundaries of the drill field, and the enemy-aircraft characteristic includes whether the enemy aircraft is within its attack range, the distance to the enemy aircraft, the X coordinate and Y coordinate relative to the enemy aircraft, and the X-direction and Y-direction component speeds relative to the enemy aircraft;
the friendly-aircraft characteristic includes whether the friendly aircraft is visible, the distance to the friendly aircraft, the X coordinate and Y coordinate relative to the friendly aircraft, and the X-direction and Y-direction component speeds relative to the friendly aircraft;
the self characteristic includes the drone's own blood volume value, ammunition quantity, weapon cooling time, and a one-hot vector of its own id.
5. The unmanned aerial vehicle group countermeasure control method of claim 2, wherein the complex state space includes: the blood volume value of the unmanned aerial vehicle, the X coordinate relative to the center of the drill field, the Y coordinate relative to the center of the drill field, the absolute component speed in the X direction, the absolute component speed in the Y direction, the ammunition quantity, and weapon cooling time information.
6. The unmanned aerial vehicle group countermeasure control method of claim 1, wherein the multi-agent reinforcement learning training strategy comprises a first training strategy and a second training strategy; the first training strategy sets a discrete Actor network corresponding to the firing action, a mobile Actor network corresponding to the movement action, and a Critic network, wherein the discrete Actor network and the mobile Actor network share a state coding network and respectively output the logits corresponding to each action dimension of the discrete action space and the Gaussian distribution mean corresponding to the continuous action space;
the second training strategy, on the basis of the first training strategy, feeds the action output by the discrete Actor network, in the form of a one-hot vector, into the mobile Actor network together with the observation feature vector.
7. The unmanned aerial vehicle group countermeasure control method according to claim 1, wherein the policy-based attack is formulated as:
$$\nu(\pi) = \pi_{\alpha}(o)$$
wherein ν(π) is the perturbation to the unmanned aerial vehicle group policy π; r′_t = −r_t is the reward function of the attacker; π_α(o) is the policy of the adversarial agent; γ^t is the discount factor that increases exponentially with time; T is the time step.
8. The unmanned aerial vehicle group countermeasure control method according to claim 1, wherein the expression of the observation-based attack is:
$$\nu(o) = \epsilon \cdot \operatorname{sign}\big(\nabla_{o}\, \pi_{o}(a^{*}\mid o)\big)$$
wherein ν(o) is the adversarial noise added to the observation o; ∇ is the gradient sign; π_o(a*|o) is the probability that the attacked unmanned aerial vehicle selects action a* under observation o; sign() is the sign function; ε is the constraint on the added noise; o is the observation of the agent.
9. The unmanned aerial vehicle group countermeasure control method according to claim 1, wherein the formula for the reward-function-based attack is expressed as:
$$\nu(r) = \begin{cases} -r, & r \geq r_{\text{thresh}} \\ r, & \text{otherwise} \end{cases}$$
wherein ν(r) is the adversarial disturbance applied to the reward function r; r_thresh is the reward threshold.
10. The unmanned aerial vehicle group countermeasure control method according to claim 1, wherein the minority-group-based attack fixes the optimal cooperative strategy of the own unmanned aerial vehicles and trains the minority group in the opposing unmanned aerial vehicle group using the MAPPO algorithm;
the majority-group-based attack fixes the optimal cooperative strategy of the own unmanned aerial vehicles and trains the majority group in the opposing unmanned aerial vehicle group using the MAPPO algorithm.
11. The unmanned aerial vehicle group countermeasure control system is characterized by comprising an unmanned aerial vehicle group countermeasure environment, an unmanned aerial vehicle group training module, an unmanned aerial vehicle group attack module, a simulation verification module, a deployed training strategy algorithm library and an unmanned aerial vehicle group attack algorithm library which are constructed based on an AirSim simulation environment, wherein:
The unmanned aerial vehicle group training module is used for calling a multi-agent reinforcement learning training strategy in the training strategy algorithm library to train in the unmanned aerial vehicle group countermeasure environment to obtain an unmanned aerial vehicle group model;
the unmanned aerial vehicle group attack module is used for calling an unmanned aerial vehicle group attack method in an unmanned aerial vehicle group attack algorithm library to attack the unmanned aerial vehicle group model in the unmanned aerial vehicle group countermeasure environment, and the unmanned aerial vehicle group attack method comprises attack based on strategies, attack based on observation, attack based on rewarding functions, attack based on minority groups and attack based on majority groups;
the simulation verification module is used for performing simulation verification of the unmanned aerial vehicle group in an unmanned aerial vehicle group countermeasure environment and evaluating the behavior of the unmanned aerial vehicle group model after attack.
12. The unmanned aerial vehicle group countermeasure control system of claim 11, wherein the system communicates with the Python program through the msgpack-RPC protocol over TCP/IP, and the simulation verification module is configured to listen for requests on the port corresponding to the Python program, so that the Python program sends RPC data packets containing the action of each unmanned aerial vehicle in the group to the port in the msgpack serialization format.
CN202311512441.5A 2023-11-10 2023-11-10 Unmanned aerial vehicle group countermeasure control method and system Pending CN117311392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311512441.5A CN117311392A (en) 2023-11-10 2023-11-10 Unmanned aerial vehicle group countermeasure control method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311512441.5A CN117311392A (en) 2023-11-10 2023-11-10 Unmanned aerial vehicle group countermeasure control method and system

Publications (1)

Publication Number Publication Date
CN117311392A true CN117311392A (en) 2023-12-29

Family

ID=89250082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311512441.5A Pending CN117311392A (en) 2023-11-10 2023-11-10 Unmanned aerial vehicle group countermeasure control method and system

Country Status (1)

Country Link
CN (1) CN117311392A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117572893A (en) * 2024-01-15 2024-02-20 白杨时代(北京)科技有限公司 Unmanned plane cluster countermeasure strategy acquisition method based on reinforcement learning and related equipment
CN117572893B (en) * 2024-01-15 2024-03-19 白杨时代(北京)科技有限公司 Unmanned plane cluster countermeasure strategy acquisition method based on reinforcement learning and related equipment

Similar Documents

Publication Publication Date Title
Churchill et al. Fast heuristic search for RTS game combat scenarios
CN113396428B (en) Learning system, computer program product and method for multi-agent application
US9248372B2 (en) Using and exporting experience gained in a video game
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN117311392A (en) Unmanned aerial vehicle group countermeasure control method and system
Safadi et al. Artificial intelligence in video games: Towards a unified framework
CN113962012B (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN112870721B (en) Game interaction method, device, equipment and storage medium
McGee et al. Real-time team-mate AI in games: A definition, survey, & critique
CN115047912A (en) Unmanned aerial vehicle cluster self-adaptive self-reconstruction method and system based on reinforcement learning
CN115047907B (en) Air isomorphic formation command method based on multi-agent PPO algorithm
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
Zheng et al. One4All: Manipulate one agent to poison the cooperative multi-agent reinforcement learning
CN113741186A (en) Double-machine air combat decision method based on near-end strategy optimization
CN113509726A (en) Interactive model training method and device, computer equipment and storage medium
Lau et al. Coordination guided reinforcement learning.
Soleyman et al. Multi-agent mission planning with reinforcement learning
Avery et al. Evolving coordinated spatial tactics for autonomous entities using influence maps
Androulakakis et al. Evolutionary design of engagement strategies for turn-constrained agents
CN115888119A (en) Game AI training method, device, electronic equipment and storage medium
CN117441168A (en) Method and apparatus for resistance attack in deep reinforcement learning
CN115310257A (en) Situation estimation method and device based on artificial potential field
TR2021014085A2 (en) AUTONOMOUS VIRTUAL SIMULATOR ASSETS THAT CONTINUOUSLY LEARN THROUGH EXPERIENCE
Zuo A deep reinforcement learning methods based on deterministic policy gradient for multi-agent cooperative competition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination