CN112131786B - Target detection and distribution method and device based on multi-agent reinforcement learning - Google Patents

Info

Publication number
CN112131786B
Authority
CN
China
Prior art keywords
reinforcement learning
model
environment
combat
behavior model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010959038.7A
Other languages
Chinese (zh)
Other versions
CN112131786A (en)
Inventor
伊山
魏晓龙
鹿涛
黄谦
齐智敏
蔡春晓
赵昊
张帅
亢原平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Aerospace System Simulation Technology Co ltd Beijing
Evaluation Argument Research Center Academy Of Military Sciences Pla China
Original Assignee
China Aerospace System Simulation Technology Co ltd Beijing
Evaluation Argument Research Center Academy Of Military Sciences Pla China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Aerospace System Simulation Technology Co ltd Beijing, Evaluation Argument Research Center Academy Of Military Sciences Pla China filed Critical China Aerospace System Simulation Technology Co ltd Beijing
Priority to CN202010959038.7A
Publication of CN112131786A
Application granted
Publication of CN112131786B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F 2119/14 Force analysis or force optimisation, e.g. static or dynamic forces

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a target detection and distribution method and device based on multi-agent reinforcement learning, comprising: constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model; and training the artificial intelligence behavior model with a combat simulation engine, outputting an optimized model. The MADDPG reinforcement learning algorithm is integrated into the wargame deduction system, a simulation environment is constructed from simple to complex, and the reinforcement learning convergence speed is optimized, effectively solving the problem of agent optimization convergence speed in wargame deduction systems.

Description

Target detection and distribution method and device based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of simulation, and particularly relates to a target detection and distribution method and device based on multi-agent reinforcement learning.
Background
With the development of artificial intelligence, the era in which tactics were studied and military plans drawn up purely by hand is gradually receding. When computers were first applied to wargame deduction simulation, differential equations and war theory were used to simulate the course of war effectively, greatly raising the army's combat level. Artificial intelligence will now play an even more important role in wargame deduction. Multi-agent modeling has advantages over traditional modeling methods, both in its ability to describe complex systems and in its ability to model behavior in dynamic environments. The emergence of multi-agent systems provides a new platform for the further expansion of wargame deduction systems.
In wargame simulation and deduction, an experienced commander can judge and predict the combat task an enemy unit is executing from information such as the enemy's state, combat capability, and combat rules. With the continuous development and improvement of wargame systems, their simulated combat tasks face several new changes. First, the number of combat units has increased dramatically, and the workload of analyzing and determining each target's combat task one by one is too heavy for a commander, making it difficult to grasp the battlefield situation comprehensively and accurately. Second, the continuous development of information technology keeps accelerating the evolution of the battlefield situation, and relying purely on manual identification of enemy air tasks seriously affects one's own response time and reduces battlefield efficiency. Finally, the vast amount of battlefield data is often incomplete, untimely, and inaccurate, even deceptive, and it is difficult for commanders to extract the key situation hidden within it. This series of profound changes makes air mission identification more difficult, and the traditional method of manual identification can hardly adapt to a highly complex, rapidly changing battlefield. Researching intelligent battlefield identification methods, freeing commanders from multi-source, complex, heterogeneous battlefield data, and letting them devote more energy to command decisions is therefore a major trend in the future development of intelligent wargame systems.
With the continuous development of multi-agent reinforcement learning, agents have acquired capabilities of autonomous learning, distributed coordination, and organization: by cooperating with other agents, an agent plans its own behavior, changes its own state information, and finally accomplishes the goal efficiently. A multi-agent system can not only fully replace a single agent in completing a task, but can also exceed the single agent's efficiency, reflecting the principle that many hands make light work. Making multi-agent teams cooperate as people do is a new topic. Deep reinforcement learning often uses an asynchronous framework to train multiple agents, each independent of the others; if the agents cannot be separated equally, the asynchronous framework is unsuitable. In some multi-agent algorithms the agents interact in a fully connected fashion, which increases algorithmic complexity and makes real-world application harder, so the optimization convergence speed of combat behavior models in wargame deduction systems is slow.
Disclosure of Invention
In view of the above, the invention aims to overcome the defects of the prior art by providing a target detection and distribution method and device based on multi-agent reinforcement learning, so as to solve the prior-art problem of slow optimization convergence of combat behavior models in wargame deduction systems.
In order to achieve the above purpose, the invention adopts the following technical scheme: a target detection and distribution method based on multi-agent reinforcement learning comprises the following steps:
constructing a combat behavior model and a reinforcement learning training environment;
training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model;
training the artificial intelligence behavior model with a combat simulation engine, and outputting an optimized model.
Further, constructing the reinforcement learning training environment includes:
mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm.
Further, mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm comprises:
mapping the combat behavior model in the combat simulation engine into a plurality of agents in the reinforcement learning training environment, the agents serving as the training objects;
mapping the perception model in the combat simulation engine into a perception agent module in the reinforcement learning training environment, the perception agent module being used to acquire the current battlefield situation;
mapping the decision model in the combat simulation engine into a decision agent module in the reinforcement learning training environment, the decision agent module being used to select the action to be executed according to the current battlefield situation;
mapping the action model in the combat simulation engine into an action agent module in the reinforcement learning training environment, used to execute the selected action;
mapping the memory model in the combat simulation engine into a memory agent module in the reinforcement learning training environment, used to store the battlefield situation.
Further, training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain the artificial intelligence behavior model, includes:
initializing the agents;
the perception agent module acquiring environment information, determining the current battlefield situation, and storing it in the memory agent module;
the decision agent module selecting the action to be executed according to the current battlefield situation;
the action agent module executing the selected action;
the reinforcement learning training environment feeding the battlefield environment back to the agents for optimization according to the action result;
judging whether the agents have converged, and outputting the artificial intelligence behavior model once they have converged.
Further, training the artificial intelligence behavior model with the combat simulation engine and outputting the optimized model includes:
initializing the artificial intelligence behavior model;
the perception model acquiring environment information, determining the current battlefield situation, and storing it in the memory model;
the decision model selecting the action to be executed according to the current battlefield situation;
the action model executing the selected action;
the combat simulation engine feeding the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
judging whether the artificial intelligence behavior model has converged, and outputting the optimized model once it has converged.
Further, before judging whether convergence has been reached, the method further includes:
judging whether a preset training end time has been reached;
ending and exiting if the training end time has been reached, otherwise continuing training.
Further, the reinforcement learning training environment trains the combat behavior model in a centralized manner while the MADDPG algorithm runs in a distributed manner.
Further, the number of agents is 3.
Furthermore, the combat behavior model adopts a multi-agent artificial neural network.
The embodiment of the application provides a target detection and distribution device based on multi-agent reinforcement learning, comprising:
a construction module, used to construct a combat behavior model and a reinforcement learning training environment;
an acquisition module, used to train the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model;
and an output module, used to train the artificial intelligence behavior model with a combat simulation engine and output an optimized model.
By adopting the above technical scheme, the invention has the following beneficial effects:
The invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, comprising: constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model; and training the artificial intelligence behavior model with a combat simulation engine, outputting an optimized model. The MADDPG reinforcement learning algorithm is integrated into the wargame deduction system, a simulation environment is constructed from simple to complex, and the reinforcement learning convergence speed is optimized, effectively solving the problem of agent optimization convergence speed in wargame deduction systems.
The invention applies the MADDPG idea to the field of military simulation, so that each combat unit becomes an independent, mutually cooperating agent. An agent leaves its own pheromone after acting, and over time the multi-agent system learns how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction among multiple agents, the agents optimize their own strategies, and even if the environment changes, they can still achieve their goals well according to the learned strategies.
The invention applies the MADDPG idea to the field of military simulation, making each combat unit an independent, mutually cooperating agent. To address the convergence speed problem of multiple agents trained with the MADDPG algorithm, the MPE (multiagent-particle-envs) environment developed by OpenAI is adopted as the basis: most of the combat-model mathematical operations are removed, while most of the functionality of the combat simulation deduction engine is retained. After each stage finishes, the experience the agents have learned is inherited by the wargame simulation deduction system, which trains them again, effectively solving the problem of agent optimization convergence speed.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required for the embodiments or for the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram showing the steps of the target detection and distribution method based on multi-agent reinforcement learning according to the present invention;
FIG. 2 is a schematic view of a combat simulation scenario of the present invention;
FIG. 3 is a block diagram of the MADDPG algorithm of the present invention;
FIG. 4 is a schematic structural diagram of the target detection and distribution device based on multi-agent reinforcement learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described in detail below. It is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the protection scope of the invention.
Inspired by the MADDPG (Multi-Agent Deep Deterministic Policy Gradient) multi-agent algorithm, the policy gradient algorithm is improved in a series of ways so that it can be applied to complex multi-agent scenarios that traditional algorithms cannot handle. The MADDPG algorithm has the following three features:
1. Through the learned optimal policy, the optimal action can be given using only local information at application time.
2. No knowledge of the environment's dynamics model is required, and no special communication requirements are imposed.
3. The algorithm can be used not only in cooperative environments but also in competitive environments.
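To make these three features concrete, the following is a minimal sketch of the centralized-critic, decentralized-actor structure that MADDPG rests on, written in PyTorch-style Python. It is an illustration under assumptions, not the patent's implementation; the names Actor, CentralCritic, obs_dim, act_dim, and n_agents, and the layer sizes, are all illustrative.

    # Minimal sketch of MADDPG's structure (illustrative, not the patent's code).
    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Decentralized policy: at application time each agent needs only
        its LOCAL observation (feature 1 above)."""
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.ReLU(),
                nn.Linear(64, act_dim), nn.Tanh())  # continuous action in [-1, 1]

        def forward(self, obs):
            return self.net(obs)

    class CentralCritic(nn.Module):
        """Centralized critic: during training it sees all agents'
        observations and actions, so no environment dynamics model or
        special communication channel is needed (feature 2), and the same
        scheme works in cooperative or competitive settings (feature 3)."""
        def __init__(self, n_agents, obs_dim, act_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_agents * (obs_dim + act_dim), 128), nn.ReLU(),
                nn.Linear(128, 1))  # scalar Q-value for one agent

        def forward(self, all_obs, all_acts):
            return self.net(torch.cat([all_obs, all_acts], dim=-1))

Only the Actor is used at execution time; the CentralCritic exists during training only, which is what allows training to be centralized while execution stays distributed.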
The invention applies the MADDPG idea to the field of military simulation, so that each combat unit becomes an independent, mutually cooperating agent. An agent leaves its own pheromone after acting, and over time the multi-agent system learns how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction among multiple agents, the agents optimize their own strategies, and even if the environment changes, they can still achieve their goals well according to the learned strategies.
The following describes a specific target detection and distribution method and device based on multi-agent reinforcement learning according to an embodiment of the present application with reference to the accompanying drawings.
As shown in FIG. 1, the target detection and distribution method based on multi-agent reinforcement learning provided in an embodiment of the present application includes:
s101, constructing an operational behavior model and a reinforcement learning training environment;
As shown in FIG. 2, for convenience of study, the following settings are made for the combat behavior model:
(1) targets do not overlap with one another;
(2) targets do not overlap with the radar detection range;
(3) the unmanned aerial vehicle cluster flies at a fixed, stable altitude, which ensures the measurement precision and the ground resolution of the magnetic detector.
A group of unmanned aerial vehicles detects dynamic and static targets in a large-scale unstructured environment containing obstacles; the quality of the process is measured by an appropriate objective function, and the detector's perception radius changes in real time with the environment and the targets. Here the expected observation time is taken as the task objective function: the overall quality is optimized by minimizing the time required to find a given static target, or by maximizing the average number of dynamic targets found within a given search time.
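One way to formalize this task objective function is the following; this is a reading of the passage above rather than the patent's own notation, and the symbols \pi, T_{\mathrm{find}}, and N_{\mathrm{found}} are introduced here for illustration:

    \pi^{*} = \arg\min_{\pi} \mathbb{E}[T_{\mathrm{find}}] \quad \text{(static targets)}, \qquad
    \pi^{*} = \arg\max_{\pi} \frac{1}{T}\,\mathbb{E}[N_{\mathrm{found}}(T)] \quad \text{(dynamic targets)},

where \pi denotes the joint search policy of the UAV group, T_{\mathrm{find}} the time needed to find a given static target, and N_{\mathrm{found}}(T) the number of dynamic targets found within the search time T.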
The combat behavior model corresponds to a multi-agent artificial neural network; it is the core that generates intelligence and the object of reinforcement learning training.
S102, training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model;
The reinforcement learning training environment is an environment set up according to the combat simulation engine. For example: the combat simulation engine is the large environment, and the reinforcement learning training environment is a small environment made by extracting the necessary factors from the large environment. The combat behavior model is mapped into multiple agents, and the multiple agents are trained to obtain the optimized artificial intelligence behavior model. This application uses 3 agents; it will be understood that 4, 5, or 6 agents could also be used, and the application is not limited in this respect.
S103, training the artificial intelligence behavior model with a combat simulation engine, and outputting an optimized model.
The artificial intelligence behavior model obtained after pre-training in the small environment is placed into the large environment for compensation training, finally yielding the optimized model. A simulation environment is thus constructed from simple to complex, optimizing the reinforcement learning convergence speed.
The working principle of the target detection and distribution method based on multi-agent reinforcement learning is as follows: construct a combat behavior model and a reinforcement learning training environment; train the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model; train the artificial intelligence behavior model with the combat simulation engine, and output the optimized model. Training is divided into two phases, "small-environment pre-training" and "large-environment compensation training", which improves the adaptability of the artificial intelligence behavior model; a sketch of this schedule follows.
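A minimal Python sketch of that two-phase schedule, under assumptions: train_until_converged stands in for the MADDPG training loop, and the two environment arguments stand in for the MPE-based small environment and the combat simulation engine; none of these names are interfaces from the patent.

    # Hedged sketch of the two-phase schedule; all names are stand-ins.
    def train_two_phase(combat_behavior_model, small_env, large_env,
                        train_until_converged):
        # Phase 1: "small-environment pre-training" in the MPE-style
        # wargame deduction small environment.
        ai_behavior_model = train_until_converged(combat_behavior_model, small_env)
        # Phase 2: "large-environment compensation training" -- the learned
        # experience is inherited and trained again in the real engine.
        optimized_model = train_until_converged(ai_behavior_model, large_env)
        return optimized_model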
In some embodiments, the MADDPG algorithm is employed to map the combat simulation engine to the reinforcement learning training environment.
Preferably, mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm includes:
mapping the combat behavior model in the combat simulation engine into a plurality of agents in the reinforcement learning training environment, the agents serving as the training objects;
mapping the perception model in the combat simulation engine into a perception agent module in the reinforcement learning training environment, the perception agent module being used to acquire the current battlefield situation;
mapping the decision model in the combat simulation engine into a decision agent module in the reinforcement learning training environment, the decision agent module being used to select the action to be executed according to the current battlefield situation;
mapping the action model in the combat simulation engine into an action agent module in the reinforcement learning training environment, used to execute the selected action;
mapping the memory model in the combat simulation engine into a memory agent module in the reinforcement learning training environment, used to store the battlefield situation.
Because the reinforcement learning training environment and the combat simulation engine differ in operating environment and programming language, it is difficult to integrate them directly, so the state-of-the-art multi-agent reinforcement learning algorithm MADDPG is adopted; its framework is shown in FIG. 3. Because the data volume and computational load of the real simulation engine are huge, a simple battlefield environment is first constructed externally on the basis of OpenAI MPE (multi-agent particle environments), called the wargame deduction small environment; it can provide simple geographic information data and generate simple deduction process data. A minimal sketch of this mapping is given below.
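As a reading of the mapping just described, a small Python sketch follows; the module fields and the sense/select/execute method names are assumptions made for illustration, not the engine's actual interfaces.

    # Illustrative mapping of the engine's four models onto agent modules.
    from dataclasses import dataclass, field

    @dataclass
    class MappedAgent:
        """One combat behavior model mapped into the small environment."""
        perception: object   # perception agent module (from the perception model)
        decision: object     # decision agent module   (from the decision model)
        action: object       # action agent module     (from the action model)
        memory: list = field(default_factory=list)  # memory agent module

        def step(self, env):
            situation = self.perception.sense(env)    # acquire battlefield situation
            self.memory.append(situation)             # store it in the memory module
            chosen = self.decision.select(situation)  # choose action from situation
            return self.action.execute(env, chosen)   # execute the selected action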
In some embodiments, training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain the artificial intelligence behavior model, includes:
initializing the agents;
the perception agent module acquiring environment information, determining the current battlefield situation, and storing it in the memory agent module;
the decision agent module selecting the action to be executed according to the current battlefield situation;
the action agent module executing the selected action;
the reinforcement learning training environment feeding the battlefield environment back to the agents for optimization according to the action result;
judging whether the agents have converged, and outputting the artificial intelligence behavior model once they have converged.
Specifically, training is divided into two stages. The first is small-environment pre-training: an agent is placed in the reinforcement learning training environment, and the perception module calls the simulated sensor and simulated communication interfaces to acquire environment information, determine the current battlefield situation, and store it in the memory agent module. Here the sensor is a radar, from which the positions of teammates, the positions of enemies, and the like can be acquired. Actions are selected according to positional relationships, for example a move to the left or a move to the right. The reinforcement learning training environment gives the agent feedback according to the action result and determines the reward function, so as to optimize the agent.
In the reward function, 100 is a reward and -100 is a penalty, the penalty being imposed when an agent collides.
Optimization continues until the agents converge, and the artificial intelligence behavior model is output. A sketch of this pre-training loop follows.
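A hedged Python sketch of that pre-training loop, assuming the MappedAgent above, an env object with reset/feed_back/converged methods (assumed names, not a real API), and that an agent step reports whether a target was found and whether a collision occurred; the +100/-100 values are the ones stated above.

    # Sketch of small-environment pre-training (all interfaces assumed).
    def reward(found_target, collided):
        # -100 penalty when an agent collides; 100 reward on task success.
        if collided:
            return -100.0
        return 100.0 if found_target else 0.0

    def pretrain(agents, env, max_steps=1000):
        env.reset()                                # initialize the agents
        for _ in range(max_steps):                 # preset training end time
            for agent in agents:
                found, collided = agent.step(env)  # perceive, decide, act
                env.feed_back(agent, reward(found, collided))
            if env.converged(agents):              # convergence check
                break
        return agents                              # AI behavior model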
The above is a single combat simulation training pass; through training on many samples, the combat behavior model gradually converges, producing an artificial intelligence behavior model that can react to its opponent. Because the opponents in the first training stage are traditional reactive behavior models whose behavior logic is fixed, a second training stage is needed to enlarge the training sample space and improve the adaptability of the artificial intelligence behavior model.
In some embodiments, training the artificial intelligence behavior model with the combat simulation engine and outputting the optimized model includes:
initializing the artificial intelligence behavior model;
the perception model acquiring environment information, determining the current battlefield situation, and storing it in the memory model;
the decision model selecting the action to be executed according to the current battlefield situation;
the action model executing the selected action;
the combat simulation engine feeding the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
judging whether the artificial intelligence behavior model has converged, and outputting the optimized model once it has converged.
The artificial intelligence behavior model is then placed into the large environment for compensation training. Because the data collected in the small-environment pre-training stage has a relatively high probability of error, and the data processing in the pre-training stage is not perfect, the real combat simulation engine is adopted as the environment for retraining after the small-environment pre-training. Training uses the perception model, decision model, memory model, and behavior model of the combat simulation engine; the training process is the same as in the reinforcement learning training environment and is not repeated here.
Specifically, before judging whether convergence has been reached, the method further comprises:
judging whether a preset training end time has been reached;
ending and exiting if the training end time has been reached, otherwise continuing training.
The training process of the present application can end in two ways: one is reaching the maximum run time, i.e., running at most one thousand steps or one thousand frame periods; the other is achieving the intended optimization objective. These two exit conditions can be combined as in the sketch below.
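A small illustrative helper combining the two exit conditions; reached_goal is an assumed predicate standing for "the intended optimization objective is met", not part of the patent.

    def should_stop(step, reached_goal, max_steps=1000):
        # Exit at the maximum run time (at most one thousand steps / frame
        # periods) or once the intended optimization objective is achieved.
        return step >= max_steps or reached_goal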
Preferably, the reinforcement learning training environment trains the combat behavior model in a centralized manner while the MADDPG algorithm runs in a distributed manner.
Preferably, the combat behavior model adopts a multi-agent artificial neural network.
As shown in FIG. 4, an embodiment of the present application provides a target detection and distribution device based on multi-agent reinforcement learning, comprising:
a construction module 401, used to construct a combat behavior model and a reinforcement learning training environment;
an acquisition module 402, used to train the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model;
and an output module 403, used to train the artificial intelligence behavior model with a combat simulation engine and output an optimized model.
The working principle of the target detection and distribution device based on multi-agent reinforcement learning provided by the application is as follows: the construction module 401 constructs a combat behavior model and a reinforcement learning training environment; the acquisition module 402 trains the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model; the output module 403 trains the artificial intelligence behavior model with the combat simulation engine and outputs the optimized model.
An embodiment of the application provides computer equipment comprising a processor and a memory connected to the processor;
the memory is used to store a computer program for executing the target detection and distribution method based on multi-agent reinforcement learning provided by any one of the above embodiments;
the processor is used to call and execute the computer program in the memory.
In summary, the invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, which integrates the MADDPG reinforcement learning algorithm into the wargame deduction system, constructs a simulation environment from simple to complex, optimizes the reinforcement learning convergence speed, and effectively solves the problem of agent optimization convergence speed in wargame deduction systems. The invention applies the MADDPG idea to the field of military simulation, so that each combat unit becomes an independent, mutually cooperating agent; an agent leaves its own pheromone after acting, and over time the multi-agent system learns how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction among multiple agents, the agents optimize their own strategies, and even if the environment changes they can still achieve their goals well according to the learned strategies. To address the convergence speed problem of multiple agents trained with the MADDPG algorithm, the MPE (multiagent-particle-envs) environment developed by OpenAI is adopted as the basis: most of the combat-model mathematical operations are removed, while most of the functionality of the combat simulation deduction engine is retained. After each stage finishes, the experience the agents have learned is inherited by the wargame simulation deduction system, which trains them again, effectively solving the problem of agent optimization convergence speed.
It can be understood that the method embodiments provided above correspond to the apparatus embodiments described above; corresponding specific details may be referred to in either place and are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A target detection and distribution method based on multi-agent reinforcement learning, characterized by comprising the following steps:
constructing a combat behavior model and a reinforcement learning training environment;
training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model;
training the artificial intelligence behavior model with a combat simulation engine, and outputting an optimized model; wherein the combat simulation engine is the large environment, and the reinforcement learning training environment is a small environment made by extracting the necessary factors from the large environment.
2. The method of claim 1, wherein constructing the reinforcement learning training environment comprises:
mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm.
3. The method of claim 2, wherein mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm comprises:
mapping the combat behavior model in the combat simulation engine into a plurality of agents in the reinforcement learning training environment, the agents serving as the training objects;
mapping the perception model in the combat simulation engine into a perception agent module in the reinforcement learning training environment, the perception agent module being used to acquire the current battlefield situation;
mapping the decision model in the combat simulation engine into a decision agent module in the reinforcement learning training environment, the decision agent module being used to select the action to be executed according to the current battlefield situation;
mapping the action model in the combat simulation engine into an action agent module in the reinforcement learning training environment, used to execute the selected action;
mapping the memory model in the combat simulation engine into a memory agent module in the reinforcement learning training environment, used to store the battlefield situation.
4. The method of claim 3, wherein training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain the artificial intelligence behavior model, comprises:
initializing the agents;
the perception agent module acquiring environment information, determining the current battlefield situation, and storing it in the memory agent module;
the decision agent module selecting the action to be executed according to the current battlefield situation;
the action agent module executing the selected action;
the reinforcement learning training environment feeding the battlefield environment back to the agents for optimization according to the action result;
judging whether the agents have converged, and outputting the artificial intelligence behavior model once they have converged.
5. The method of claim 4, wherein training the artificial intelligence behavior model with the combat simulation engine and outputting the optimized model comprises:
initializing the artificial intelligence behavior model;
the perception model acquiring environment information, determining the current battlefield situation, and storing it in the memory model;
the decision model selecting the action to be executed according to the current battlefield situation;
the action model executing the selected action;
the combat simulation engine feeding the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
judging whether the artificial intelligence behavior model has converged, and outputting the optimized model once it has converged.
6. The method of claim 4 or 5, further comprising, before judging whether convergence has been reached:
judging whether a preset training end time has been reached;
ending and exiting if the training end time has been reached, otherwise continuing training.
7. The method of claim 1, wherein the reinforcement learning training environment trains the combat behavior model in a centralized manner while the MADDPG algorithm runs in a distributed manner.
8. The method of claim 3, wherein the number of agents is 3.
9. The method of claim 1, wherein the combat behavior model adopts a multi-agent artificial neural network.
10. A target detection and distribution device based on multi-agent reinforcement learning, characterized by comprising:
a construction module, used to construct a combat behavior model and a reinforcement learning training environment;
an acquisition module, used to train the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model;
an output module, used to train the artificial intelligence behavior model with a combat simulation engine and output an optimized model; wherein the combat simulation engine is the large environment, and the reinforcement learning training environment is a small environment made by extracting the necessary factors from the large environment.
CN202010959038.7A 2020-09-14 2020-09-14 Target detection and distribution method and device based on multi-agent reinforcement learning Active CN112131786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010959038.7A CN112131786B (en) 2020-09-14 2020-09-14 Target detection and distribution method and device based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010959038.7A CN112131786B (en) 2020-09-14 2020-09-14 Target detection and distribution method and device based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN112131786A CN112131786A (en) 2020-12-25
CN112131786B (en) 2024-05-31

Family

ID=73846639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010959038.7A Active CN112131786B (en) 2020-09-14 2020-09-14 Target detection and distribution method and device based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN112131786B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222106B (en) * 2021-02-10 2024-04-30 西北工业大学 Intelligent soldier chess deduction method based on distributed reinforcement learning
CN112905166B (en) * 2021-03-04 2024-04-05 青岛海科智汇信息科技有限公司 Artificial intelligence programming system, computer device, and computer-readable storage medium
CN112633519B (en) * 2021-03-11 2021-07-27 中国科学院自动化研究所 Man-machine antagonistic action prediction method, device, electronic equipment and storage medium
CN113469853A (en) * 2021-05-13 2021-10-01 航天科工空间工程发展有限公司 Method for accelerating command control of fighting and artificial intelligence device
CN113435598B (en) * 2021-07-08 2022-06-21 中国人民解放军国防科技大学 Knowledge-driven intelligent strategy deduction decision method
TR2021014085A2 (en) * 2021-09-08 2021-09-21 Havelsan Hava Elektronik San Ve Tic A S AUTONOMOUS VIRTUAL SIMULATOR ASSETS THAT CONTINUOUSLY LEARN THROUGH EXPERIENCE
CN113723013B (en) * 2021-09-10 2024-06-18 中国人民解放军国防科技大学 Multi-agent decision-making method for continuous space soldier chess deduction
CN114327916B (en) * 2022-03-10 2022-06-17 中国科学院自动化研究所 Training method, device and equipment of resource allocation system
CN114611669B (en) * 2022-03-14 2023-10-13 三峡大学 Intelligent decision-making method for chess deduction based on double experience pool DDPG network
CN115906673B (en) * 2023-01-10 2023-11-03 中国人民解放军陆军工程大学 Combat entity behavior model integrated modeling method and system
CN116739323B (en) * 2023-08-16 2023-11-10 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling
CN117217100B (en) * 2023-11-08 2024-01-30 中国人民解放军63963部队 Intelligent modeling method and simulation system for certain team numbers based on reinforcement learning

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964019A (en) * 2010-09-10 2011-02-02 北京航空航天大学 Against behavior modeling simulation platform and method based on Agent technology
KR101179074B1 (en) * 2011-12-13 2012-09-05 국방과학연구소 Airburst simulation apparatus and method of simulation for airbrust
RU2562096C1 (en) * 2014-06-25 2015-09-10 Федеральное государственное казённое военное образовательное учреждение высшего профессионального образования "Военная академия воздушно-космической обороны им. Маршала Советского Союза Г.К. Жукова" Министерства обороны Российской Федерации Training command post of main rocket attack warning centre
CN105677443A (en) * 2015-12-29 2016-06-15 中国人民解放军空军指挥学院 Heterogeneous simulation system
WO2018175551A1 (en) * 2017-03-22 2018-09-27 Circadence Corporation Mission-based, game-implemented cyber training system and method
CN108629422A (en) * 2018-05-10 2018-10-09 浙江大学 A kind of intelligent body learning method of knowledge based guidance-tactics perception
CN108646589A (en) * 2018-07-11 2018-10-12 北京晶品镜像科技有限公司 A kind of battle simulation training system and method for the formation of attack unmanned plane
CN109636699A (en) * 2018-11-06 2019-04-16 中国电子科技集团公司第五十二研究所 A kind of unsupervised intellectualized battle deduction system based on deeply study
CN109740283A (en) * 2019-01-17 2019-05-10 清华大学 Autonomous multiple agent confronting simulation method and system
CN110147883A (en) * 2019-05-28 2019-08-20 航天科工***仿真科技(北京)有限公司 Training method, device, equipment and the storage medium of model for emulation of fighting
CN110428057A (en) * 2019-05-06 2019-11-08 南京大学 A kind of intelligent game playing system based on multiple agent deeply learning algorithm
CN110766169A (en) * 2019-10-31 2020-02-07 深圳前海微众银行股份有限公司 Transfer training optimization method and device for reinforcement learning, terminal and storage medium
CN110929871A (en) * 2019-11-15 2020-03-27 南京星火技术有限公司 Game decision method and system
CN111027862A (en) * 2019-12-11 2020-04-17 中国舰船研究设计中心 Multidimensional-based hierarchical aggregation combat simulation training evaluation method
CN111632387A (en) * 2020-06-12 2020-09-08 南京大学 Command control system based on interstellar dispute II

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964019A (en) * 2010-09-10 2011-02-02 北京航空航天大学 Against behavior modeling simulation platform and method based on Agent technology
KR101179074B1 (en) * 2011-12-13 2012-09-05 국방과학연구소 Airburst simulation apparatus and method of simulation for airbrust
RU2562096C1 (en) * 2014-06-25 2015-09-10 Федеральное государственное казённое военное образовательное учреждение высшего профессионального образования "Военная академия воздушно-космической обороны им. Маршала Советского Союза Г.К. Жукова" Министерства обороны Российской Федерации Training command post of main rocket attack warning centre
CN105677443A (en) * 2015-12-29 2016-06-15 中国人民解放军空军指挥学院 Heterogeneous simulation system
WO2018175551A1 (en) * 2017-03-22 2018-09-27 Circadence Corporation Mission-based, game-implemented cyber training system and method
CN108629422A (en) * 2018-05-10 2018-10-09 浙江大学 A kind of intelligent body learning method of knowledge based guidance-tactics perception
CN108646589A (en) * 2018-07-11 2018-10-12 北京晶品镜像科技有限公司 A kind of battle simulation training system and method for the formation of attack unmanned plane
CN109636699A (en) * 2018-11-06 2019-04-16 中国电子科技集团公司第五十二研究所 A kind of unsupervised intellectualized battle deduction system based on deeply study
CN109740283A (en) * 2019-01-17 2019-05-10 清华大学 Autonomous multiple agent confronting simulation method and system
CN110428057A (en) * 2019-05-06 2019-11-08 南京大学 A kind of intelligent game playing system based on multiple agent deeply learning algorithm
CN110147883A (en) * 2019-05-28 2019-08-20 航天科工***仿真科技(北京)有限公司 Training method, device, equipment and the storage medium of model for emulation of fighting
CN110766169A (en) * 2019-10-31 2020-02-07 深圳前海微众银行股份有限公司 Transfer training optimization method and device for reinforcement learning, terminal and storage medium
CN110929871A (en) * 2019-11-15 2020-03-27 南京星火技术有限公司 Game decision method and system
CN111027862A (en) * 2019-12-11 2020-04-17 中国舰船研究设计中心 Multidimensional-based hierarchical aggregation combat simulation training evaluation method
CN111632387A (en) * 2020-06-12 2020-09-08 南京大学 Command control system based on interstellar dispute II

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Efficient training techniques for multi-agent reinforcement learning in combat tasks;Zhang Guanyu 等;《IEEE ACCESS》;20190821;第7卷;109301-109310 *
Recurrent MADDPG for Object Detection and Assignment in Combat Tasks;Wei Xiaolong 等;《IEEE ACCESS》;20200918;第8卷;163334-163343 *
Intelligent evaluation model of wargame entity decision effect based on deep learning; Ou Wei et al.; Military Operations Research and Assessment; 20181231; vol. 32, no. 4; 29-34 *
Research on UAV position deployment and energy optimization mechanism for air-ground cooperative networking; Gao Fuxiao; China Master's Theses Full-text Database, Engineering Science and Technology II; 20200115; no. 01, 2020; C034-1016 *
Design and key technologies of a carrier-based aircraft anti-sea combat training simulation system; Wang Shuyun et al.; Command Control & Simulation; 20200630; vol. 43, no. 3; 81-86 *

Also Published As

Publication number Publication date
CN112131786A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN112131786B (en) Target detection and distribution method and device based on multi-agent reinforcement learning
Zhang et al. Deep interactive reinforcement learning for path following of autonomous underwater vehicle
Liu et al. A deep reinforcement learning based intelligent decision method for UCAV air combat
Lin et al. Evolutionary digital twin: A new approach for intelligent industrial product development
CN112215350B (en) Method and device for controlling agent based on reinforcement learning
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN112329948A (en) Multi-agent strategy prediction method and device
CN101083019A (en) Rapid evaluating system based on roomage state sensing
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN112947575B (en) Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN115952736A (en) Multi-agent target collaborative search method and system
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
Li et al. Improved Q-learning based route planning method for UAVs in unknown environment
Ruifeng et al. Research progress and application of behavior tree technology
Wang et al. Data-driven path-following control of underactuated ships based on antenna mutation beetle swarm predictive reinforcement learning
CN117590867A (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
CN117518907A (en) Control method, device, equipment and storage medium of intelligent agent
CN116933948A (en) Prediction method and system based on improved seagull algorithm and back propagation neural network
CN116663416A (en) CGF decision behavior simulation method based on behavior tree
Montana et al. Towards a unified framework for learning from observation
CN116430891A (en) Deep reinforcement learning method oriented to multi-agent path planning environment
Wang et al. A review of deep reinforcement learning methods and military application research
Cummings et al. Development of a hybrid machine learning agent based model for optimization and interpretability
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant