CN112131786B - Target detection and distribution method and device based on multi-agent reinforcement learning - Google Patents

Info

Publication number
CN112131786B
Authority
CN
China
Prior art keywords
reinforcement learning
model
environment
combat
behavior model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010959038.7A
Other languages
Chinese (zh)
Other versions
CN112131786A (en)
Inventor
伊山
魏晓龙
鹿涛
黄谦
齐智敏
蔡春晓
赵昊
张帅
亢原平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Aerospace System Simulation Technology Co ltd Beijing
Evaluation Argument Research Center Academy Of Military Sciences Pla China
Original Assignee
China Aerospace System Simulation Technology Co ltd Beijing
Evaluation Argument Research Center Academy Of Military Sciences Pla China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Aerospace System Simulation Technology Co ltd Beijing, Evaluation Argument Research Center Academy Of Military Sciences Pla China filed Critical China Aerospace System Simulation Technology Co ltd Beijing
Priority to CN202010959038.7A
Publication of CN112131786A
Application granted
Publication of CN112131786B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F 2119/14 Force analysis or force optimisation, e.g. static or dynamic forces

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a target detection and distribution method and device based on multi-agent reinforcement learning, comprising: constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model; and training the artificial intelligence behavior model with a combat simulation engine, outputting an optimized model. The MADDPG reinforcement learning algorithm is integrated into the wargame deduction system, a simulation environment is constructed from simple to complex, and the reinforcement learning convergence speed is optimized, effectively solving the problem of agent optimization convergence speed in wargame deduction systems.

Description

Target detection and distribution method and device based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of simulation, and particularly relates to a target detection and distribution method and device based on multi-agent reinforcement learning.
Background
With the development of artificial intelligence, the era in which tactics were studied and military plans drawn up purely by hand is gradually receding. When computers were first applied to wargame deduction simulation, differential equations and war theory were used to simulate the course of war effectively, greatly raising the army's combat level. Artificial intelligence will now play an even more important role in wargame deduction. Multi-agent modeling has advantages over traditional modeling methods, both in its ability to describe complex systems and in its ability to model behavior in dynamic environments. The emergence of multi-agent systems provides a new platform for the further expansion of wargame deduction systems.
In wargame simulation and deduction, an experienced commander can judge and predict the combat task an enemy unit is executing from information such as the enemy's state, combat capability, and combat rules. With the continuous development and improvement of wargame systems, their simulated combat tasks face several new changes. First, the number of combat units has increased dramatically, and the workload of analyzing and determining each target's combat task one by one is too heavy for a commander, making it difficult to grasp the battlefield situation comprehensively and accurately. Second, the continuous development of information technology keeps accelerating the evolution of the battlefield situation, and relying purely on manual identification of enemy air tasks seriously affects one's own response time and reduces battlefield efficiency. Finally, the vast amount of battlefield data is often incomplete, untimely, and inaccurate, even deceptive, and it is difficult for commanders to extract the key situation hidden within it. This series of profound changes makes air mission identification more difficult, and the traditional method of manual identification can hardly adapt to a highly complex, rapidly changing battlefield. Researching intelligent battlefield identification methods, freeing commanders from multi-source, complex, heterogeneous battlefield data, and letting them devote more energy to command decisions is therefore a major trend in the future development of intelligent wargame systems.
With the continuous development of multi-agent reinforcement learning, agents have acquired capabilities of autonomous learning, distributed coordination, and organization: by cooperating with other agents, an agent plans its own behavior, changes its own state information, and finally accomplishes the goal efficiently. A multi-agent system can not only fully replace a single agent in completing a task, but can also exceed the single agent's efficiency, reflecting the principle that many hands make light work. Making multi-agent teams cooperate as people do is a new topic. Deep reinforcement learning often uses an asynchronous framework to train multiple agents, each independent of the others; if the agents cannot be separated equally, the asynchronous framework is unsuitable. In some multi-agent algorithms the agents interact in a fully connected fashion, which increases algorithmic complexity and makes real-world application harder, so the optimization convergence speed of combat behavior models in wargame deduction systems is slow.
Disclosure of Invention
In view of the above, the invention aims to overcome the defects of the prior art by providing a target detection and distribution method and device based on multi-agent reinforcement learning, so as to solve the prior-art problem of slow optimization convergence of combat behavior models in wargame deduction systems.
In order to achieve the above purpose, the invention adopts the following technical scheme: a target detection and distribution method based on multi-agent reinforcement learning comprises the following steps:
constructing a combat behavior model and a reinforcement learning training environment;
training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model;
training the artificial intelligence behavior model with a combat simulation engine, and outputting an optimized model.
Further, constructing the reinforcement learning training environment includes:
mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm.
Further, mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm comprises:
mapping the combat behavior model in the combat simulation engine into a plurality of agents in the reinforcement learning training environment, the agents serving as the training objects;
mapping the perception model in the combat simulation engine into a perception agent module in the reinforcement learning training environment, the perception agent module being used to acquire the current battlefield situation;
mapping the decision model in the combat simulation engine into a decision agent module in the reinforcement learning training environment, the decision agent module being used to select the action to be executed according to the current battlefield situation;
mapping the action model in the combat simulation engine into an action agent module in the reinforcement learning training environment, used to execute the selected action;
mapping the memory model in the combat simulation engine into a memory agent module in the reinforcement learning training environment, used to store the battlefield situation.
Further, training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain the artificial intelligence behavior model, includes:
initializing the agents;
the perception agent module acquiring environment information, determining the current battlefield situation, and storing it in the memory agent module;
the decision agent module selecting the action to be executed according to the current battlefield situation;
the action agent module executing the selected action;
the reinforcement learning training environment feeding the battlefield environment back to the agents for optimization according to the action result;
judging whether the agents have converged, and outputting the artificial intelligence behavior model once they have converged.
Further, training the artificial intelligence behavior model with the combat simulation engine and outputting the optimized model includes:
initializing the artificial intelligence behavior model;
the perception model acquiring environment information, determining the current battlefield situation, and storing it in the memory model;
the decision model selecting the action to be executed according to the current battlefield situation;
the action model executing the selected action;
the combat simulation engine feeding the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
judging whether the artificial intelligence behavior model has converged, and outputting the optimized model once it has converged.
Further, before judging whether convergence has been reached, the method further includes:
judging whether a preset training end time has been reached;
ending and exiting if the training end time has been reached, otherwise continuing training.
Further, the reinforcement learning training environment trains the combat behavior model in a centralized manner while the MADDPG algorithm runs in a distributed manner.
Further, the number of agents is 3.
Furthermore, the combat behavior model adopts a multi-agent artificial neural network.
The embodiment of the application provides a target detection and distribution device based on multi-agent reinforcement learning, comprising:
a construction module, used to construct a combat behavior model and a reinforcement learning training environment;
an acquisition module, used to train the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model;
and an output module, used to train the artificial intelligence behavior model with a combat simulation engine and output an optimized model.
By adopting the above technical scheme, the invention has the following beneficial effects:
The invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, comprising: constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model; and training the artificial intelligence behavior model with a combat simulation engine, outputting an optimized model. The MADDPG reinforcement learning algorithm is integrated into the wargame deduction system, a simulation environment is constructed from simple to complex, and the reinforcement learning convergence speed is optimized, effectively solving the problem of agent optimization convergence speed in wargame deduction systems.
The invention applies the MADDPG idea to the field of military simulation, so that each combat unit becomes an independent, mutually cooperating agent. An agent leaves its own pheromone after acting, and over time the multi-agent system learns how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction among multiple agents, the agents optimize their own strategies, and even if the environment changes, they can still achieve their goals well according to the learned strategies.
The invention applies the MADDPG idea to the field of military simulation, making each combat unit an independent, mutually cooperating agent. To address the convergence speed problem of multiple agents trained with the MADDPG algorithm, the MPE (multiagent-particle-envs) environment developed by OpenAI is adopted as the basis: most of the combat-model mathematical operations are removed, while most of the functionality of the combat simulation deduction engine is retained. After each stage finishes, the experience the agents have learned is inherited by the wargame simulation deduction system, which trains them again, effectively solving the problem of agent optimization convergence speed.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required for the embodiments or for the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram showing the steps of the target detection and distribution method based on multi-agent reinforcement learning according to the present invention;
FIG. 2 is a schematic view of a combat simulation scenario of the present invention;
FIG. 3 is a block diagram of the MADDPG algorithm of the present invention;
FIG. 4 is a schematic structural diagram of the target detection and distribution device based on multi-agent reinforcement learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described in detail below. It is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the protection scope of the invention.
Inspired by the MADDPG (Multi-Agent Deep Deterministic Policy Gradient) multi-agent algorithm, the policy gradient algorithm is improved in a series of ways so that it can be applied to complex multi-agent scenarios that traditional algorithms cannot handle. The MADDPG algorithm has the following three features:
1. Through the learned optimal policy, the optimal action can be given using only local information at application time.
2. No knowledge of the environment's dynamics model is required, and no special communication requirements are imposed.
3. The algorithm can be used not only in cooperative environments but also in competitive environments.
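To make these three features concrete, the following is a minimal sketch of the centralized-critic, decentralized-actor structure that MADDPG rests on, written in PyTorch-style Python. It is an illustration under assumptions, not the patent's implementation; the names Actor, CentralCritic, obs_dim, act_dim, and n_agents, and the layer sizes, are all illustrative.

    # Minimal sketch of MADDPG's structure (illustrative, not the patent's code).
    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Decentralized policy: at application time each agent needs only
        its LOCAL observation (feature 1 above)."""
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.ReLU(),
                nn.Linear(64, act_dim), nn.Tanh())  # continuous action in [-1, 1]

        def forward(self, obs):
            return self.net(obs)

    class CentralCritic(nn.Module):
        """Centralized critic: during training it sees all agents'
        observations and actions, so no environment dynamics model or
        special communication channel is needed (feature 2), and the same
        scheme works in cooperative or competitive settings (feature 3)."""
        def __init__(self, n_agents, obs_dim, act_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_agents * (obs_dim + act_dim), 128), nn.ReLU(),
                nn.Linear(128, 1))  # scalar Q-value for one agent

        def forward(self, all_obs, all_acts):
            return self.net(torch.cat([all_obs, all_acts], dim=-1))

Only the Actor is used at execution time; the CentralCritic exists during training only, which is what allows training to be centralized while execution stays distributed.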
The invention applies the MADDPG idea to the field of military simulation, so that each combat unit becomes an independent, mutually cooperating agent. An agent leaves its own pheromone after acting, and over time the multi-agent system learns how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction among multiple agents, the agents optimize their own strategies, and even if the environment changes, they can still achieve their goals well according to the learned strategies.
The following describes a specific target detection and distribution method and device based on multi-agent reinforcement learning according to an embodiment of the present application with reference to the accompanying drawings.
As shown in FIG. 1, the target detection and distribution method based on multi-agent reinforcement learning provided in an embodiment of the present application includes:
s101, constructing an operational behavior model and a reinforcement learning training environment;
As shown in FIG. 2, for convenience of study, the following settings are made for the combat behavior model:
(1) targets do not overlap with one another;
(2) targets do not overlap with the radar detection range;
(3) the unmanned aerial vehicle cluster flies at a fixed, stable altitude, which ensures the measurement precision and the ground resolution of the magnetic detector.
A group of unmanned aerial vehicles detects dynamic and static targets in a large-scale unstructured environment containing obstacles; the quality of the process is measured by an appropriate objective function, and the detector's perception radius changes in real time with the environment and the targets. Here the expected observation time is taken as the task objective function: the overall quality is optimized by minimizing the time required to find a given static target, or by maximizing the average number of dynamic targets found within a given search time.
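One way to formalize this task objective function is the following; this is a reading of the passage above rather than the patent's own notation, and the symbols \pi, T_{\mathrm{find}}, and N_{\mathrm{found}} are introduced here for illustration:

    \pi^{*} = \arg\min_{\pi} \mathbb{E}[T_{\mathrm{find}}] \quad \text{(static targets)}, \qquad
    \pi^{*} = \arg\max_{\pi} \frac{1}{T}\,\mathbb{E}[N_{\mathrm{found}}(T)] \quad \text{(dynamic targets)},

where \pi denotes the joint search policy of the UAV group, T_{\mathrm{find}} the time needed to find a given static target, and N_{\mathrm{found}}(T) the number of dynamic targets found within the search time T.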
The combat behavior model corresponds to a multi-agent artificial neural network; it is the core that generates intelligence and the object of reinforcement learning training.
S102, training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model;
The reinforcement learning training environment is an environment set up according to the combat simulation engine. For example: the combat simulation engine is the large environment, and the reinforcement learning training environment is a small environment made by extracting the necessary factors from the large environment. The combat behavior model is mapped into multiple agents, and the multiple agents are trained to obtain the optimized artificial intelligence behavior model. This application uses 3 agents; it will be understood that 4, 5, or 6 agents could also be used, and the application is not limited in this respect.
S103, training the artificial intelligence behavior model with a combat simulation engine, and outputting an optimized model.
The artificial intelligence behavior model obtained after pre-training in the small environment is placed into the large environment for compensation training, finally yielding the optimized model. A simulation environment is thus constructed from simple to complex, optimizing the reinforcement learning convergence speed.
The working principle of the target detection and distribution method based on multi-agent reinforcement learning is as follows: construct a combat behavior model and a reinforcement learning training environment; train the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model; train the artificial intelligence behavior model with the combat simulation engine, and output the optimized model. Training is divided into two phases, "small-environment pre-training" and "large-environment compensation training", which improves the adaptability of the artificial intelligence behavior model; a sketch of this schedule follows.
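A minimal Python sketch of that two-phase schedule, under assumptions: train_until_converged stands in for the MADDPG training loop, and the two environment arguments stand in for the MPE-based small environment and the combat simulation engine; none of these names are interfaces from the patent.

    # Hedged sketch of the two-phase schedule; all names are stand-ins.
    def train_two_phase(combat_behavior_model, small_env, large_env,
                        train_until_converged):
        # Phase 1: "small-environment pre-training" in the MPE-style
        # wargame deduction small environment.
        ai_behavior_model = train_until_converged(combat_behavior_model, small_env)
        # Phase 2: "large-environment compensation training" -- the learned
        # experience is inherited and trained again in the real engine.
        optimized_model = train_until_converged(ai_behavior_model, large_env)
        return optimized_model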
In some embodiments, the MADDPG algorithm is employed to map the combat simulation engine to the reinforcement learning training environment.
Preferably, mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm includes:
mapping the combat behavior model in the combat simulation engine into a plurality of agents in the reinforcement learning training environment, the agents serving as the training objects;
mapping the perception model in the combat simulation engine into a perception agent module in the reinforcement learning training environment, the perception agent module being used to acquire the current battlefield situation;
mapping the decision model in the combat simulation engine into a decision agent module in the reinforcement learning training environment, the decision agent module being used to select the action to be executed according to the current battlefield situation;
mapping the action model in the combat simulation engine into an action agent module in the reinforcement learning training environment, used to execute the selected action;
mapping the memory model in the combat simulation engine into a memory agent module in the reinforcement learning training environment, used to store the battlefield situation.
Because the reinforcement learning training environment and the combat simulation engine differ in operating environment and programming language, it is difficult to integrate them directly, so the state-of-the-art multi-agent reinforcement learning algorithm MADDPG is adopted; its framework is shown in FIG. 3. Because the data volume and computational load of the real simulation engine are huge, a simple battlefield environment is first constructed externally on the basis of OpenAI MPE (multi-agent particle environments), called the wargame deduction small environment; it can provide simple geographic information data and generate simple deduction process data. A minimal sketch of this mapping is given below.
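As a reading of the mapping just described, a small Python sketch follows; the module fields and the sense/select/execute method names are assumptions made for illustration, not the engine's actual interfaces.

    # Illustrative mapping of the engine's four models onto agent modules.
    from dataclasses import dataclass, field

    @dataclass
    class MappedAgent:
        """One combat behavior model mapped into the small environment."""
        perception: object   # perception agent module (from the perception model)
        decision: object     # decision agent module   (from the decision model)
        action: object       # action agent module     (from the action model)
        memory: list = field(default_factory=list)  # memory agent module

        def step(self, env):
            situation = self.perception.sense(env)    # acquire battlefield situation
            self.memory.append(situation)             # store it in the memory module
            chosen = self.decision.select(situation)  # choose action from situation
            return self.action.execute(env, chosen)   # execute the selected action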
In some embodiments, training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain the artificial intelligence behavior model, includes:
initializing the agents;
the perception agent module acquiring environment information, determining the current battlefield situation, and storing it in the memory agent module;
the decision agent module selecting the action to be executed according to the current battlefield situation;
the action agent module executing the selected action;
the reinforcement learning training environment feeding the battlefield environment back to the agents for optimization according to the action result;
judging whether the agents have converged, and outputting the artificial intelligence behavior model once they have converged.
Specifically, training is divided into two stages. The first is small-environment pre-training: an agent is placed in the reinforcement learning training environment, and the perception module calls the simulated sensor and simulated communication interfaces to acquire environment information, determine the current battlefield situation, and store it in the memory agent module. Here the sensor is a radar, from which the positions of teammates, the positions of enemies, and the like can be acquired. Actions are selected according to positional relationships, for example a move to the left or a move to the right. The reinforcement learning training environment gives the agent feedback according to the action result and determines the reward function, so as to optimize the agent.
In the reward function, 100 is a reward and -100 is a penalty, the penalty being imposed when an agent collides.
Optimization continues until the agents converge, and the artificial intelligence behavior model is output. A sketch of this pre-training loop follows.
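A hedged Python sketch of that pre-training loop, assuming the MappedAgent above, an env object with reset/feed_back/converged methods (assumed names, not a real API), and that an agent step reports whether a target was found and whether a collision occurred; the +100/-100 values are the ones stated above.

    # Sketch of small-environment pre-training (all interfaces assumed).
    def reward(found_target, collided):
        # -100 penalty when an agent collides; 100 reward on task success.
        if collided:
            return -100.0
        return 100.0 if found_target else 0.0

    def pretrain(agents, env, max_steps=1000):
        env.reset()                                # initialize the agents
        for _ in range(max_steps):                 # preset training end time
            for agent in agents:
                found, collided = agent.step(env)  # perceive, decide, act
                env.feed_back(agent, reward(found, collided))
            if env.converged(agents):              # convergence check
                break
        return agents                              # AI behavior model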
The above is a single combat simulation training pass; through training on many samples, the combat behavior model gradually converges, producing an artificial intelligence behavior model that can react to its opponent. Because the opponents in the first training stage are traditional reactive behavior models whose behavior logic is fixed, a second training stage is needed to enlarge the training sample space and improve the adaptability of the artificial intelligence behavior model.
In some embodiments, training the artificial intelligence behavior model with the combat simulation engine and outputting the optimized model includes:
initializing the artificial intelligence behavior model;
the perception model acquiring environment information, determining the current battlefield situation, and storing it in the memory model;
the decision model selecting the action to be executed according to the current battlefield situation;
the action model executing the selected action;
the combat simulation engine feeding the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
judging whether the artificial intelligence behavior model has converged, and outputting the optimized model once it has converged.
The artificial intelligence behavior model is then placed into the large environment for compensation training. Because the data collected in the small-environment pre-training stage has a relatively high probability of error, and the data processing in the pre-training stage is not perfect, the real combat simulation engine is adopted as the environment for retraining after the small-environment pre-training. Training uses the perception model, decision model, memory model, and behavior model of the combat simulation engine; the training process is the same as in the reinforcement learning training environment and is not repeated here.
Specifically, before judging whether convergence has been reached, the method further comprises:
judging whether a preset training end time has been reached;
ending and exiting if the training end time has been reached, otherwise continuing training.
The training process of the present application can end in two ways: one is reaching the maximum run time, i.e., running at most one thousand steps or one thousand frame periods; the other is achieving the intended optimization objective. These two exit conditions can be combined as in the sketch below.
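A small illustrative helper combining the two exit conditions; reached_goal is an assumed predicate standing for "the intended optimization objective is met", not part of the patent.

    def should_stop(step, reached_goal, max_steps=1000):
        # Exit at the maximum run time (at most one thousand steps / frame
        # periods) or once the intended optimization objective is achieved.
        return step >= max_steps or reached_goal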
Preferably, the reinforcement learning training environment trains the combat behavior model in a centralized manner while the MADDPG algorithm runs in a distributed manner.
Preferably, the combat behavior model adopts a multi-agent artificial neural network.
As shown in FIG. 4, an embodiment of the present application provides a target detection and distribution device based on multi-agent reinforcement learning, comprising:
a construction module 401, used to construct a combat behavior model and a reinforcement learning training environment;
an acquisition module 402, used to train the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model;
and an output module 403, used to train the artificial intelligence behavior model with a combat simulation engine and output an optimized model.
The working principle of the target detection and distribution device based on multi-agent reinforcement learning provided by the application is as follows: the construction module 401 constructs a combat behavior model and a reinforcement learning training environment; the acquisition module 402 trains the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model; the output module 403 trains the artificial intelligence behavior model with the combat simulation engine and outputs the optimized model.
An embodiment of the application provides computer equipment comprising a processor and a memory connected to the processor;
the memory is used to store a computer program for executing the target detection and distribution method based on multi-agent reinforcement learning provided by any one of the above embodiments;
the processor is used to call and execute the computer program in the memory.
In summary, the invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, which integrates the MADDPG reinforcement learning algorithm into the wargame deduction system, constructs a simulation environment from simple to complex, optimizes the reinforcement learning convergence speed, and effectively solves the problem of agent optimization convergence speed in wargame deduction systems. The invention applies the MADDPG idea to the field of military simulation, so that each combat unit becomes an independent, mutually cooperating agent; an agent leaves its own pheromone after acting, and over time the multi-agent system learns how to reinforce good pheromones and attenuate poor ones. Thus, by increasing the interaction among multiple agents, the agents optimize their own strategies, and even if the environment changes they can still achieve their goals well according to the learned strategies. To address the convergence speed problem of multiple agents trained with the MADDPG algorithm, the MPE (multiagent-particle-envs) environment developed by OpenAI is adopted as the basis: most of the combat-model mathematical operations are removed, while most of the functionality of the combat simulation deduction engine is retained. After each stage finishes, the experience the agents have learned is inherited by the wargame simulation deduction system, which trains them again, effectively solving the problem of agent optimization convergence speed.
It can be understood that the method embodiments provided above correspond to the apparatus embodiments described above; corresponding specific details may be referred to in either place and are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A target detection and distribution method based on multi-agent reinforcement learning, characterized by comprising the following steps:
constructing a combat behavior model and a reinforcement learning training environment;
training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain an artificial intelligence behavior model;
training the artificial intelligence behavior model with a combat simulation engine, and outputting an optimized model; wherein the combat simulation engine is the large environment, and the reinforcement learning training environment is a small environment made by extracting the necessary factors from the large environment.
2. The method of claim 1, wherein constructing the reinforcement learning training environment comprises:
mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm.
3. The method of claim 2, wherein mapping the combat simulation engine to the reinforcement learning training environment using the MADDPG algorithm comprises:
mapping the combat behavior model in the combat simulation engine into a plurality of agents in the reinforcement learning training environment, the agents serving as the training objects;
mapping the perception model in the combat simulation engine into a perception agent module in the reinforcement learning training environment, the perception agent module being used to acquire the current battlefield situation;
mapping the decision model in the combat simulation engine into a decision agent module in the reinforcement learning training environment, the decision agent module being used to select the action to be executed according to the current battlefield situation;
mapping the action model in the combat simulation engine into an action agent module in the reinforcement learning training environment, used to execute the selected action;
mapping the memory model in the combat simulation engine into a memory agent module in the reinforcement learning training environment, used to store the battlefield situation.
4. The method of claim 3, wherein training the combat behavior model in the reinforcement learning training environment until the model converges, to obtain the artificial intelligence behavior model, comprises:
initializing the agents;
the perception agent module acquiring environment information, determining the current battlefield situation, and storing it in the memory agent module;
the decision agent module selecting the action to be executed according to the current battlefield situation;
the action agent module executing the selected action;
the reinforcement learning training environment feeding the battlefield environment back to the agents for optimization according to the action result;
judging whether the agents have converged, and outputting the artificial intelligence behavior model once they have converged.
5. The method of claim 4, wherein training the artificial intelligence behavior model with the combat simulation engine and outputting the optimized model comprises:
initializing the artificial intelligence behavior model;
the perception model acquiring environment information, determining the current battlefield situation, and storing it in the memory model;
the decision model selecting the action to be executed according to the current battlefield situation;
the action model executing the selected action;
the combat simulation engine feeding the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
judging whether the artificial intelligence behavior model has converged, and outputting the optimized model once it has converged.
6. The method of claim 4 or 5, further comprising, before judging whether convergence has been reached:
judging whether a preset training end time has been reached;
ending and exiting if the training end time has been reached, otherwise continuing training.
7. The method of claim 1, wherein the reinforcement learning training environment trains the combat behavior model in a centralized manner while the MADDPG algorithm runs in a distributed manner.
8. The method of claim 3, wherein the number of agents is 3.
9. The method of claim 1, wherein the combat behavior model adopts a multi-agent artificial neural network.
10. A target detection and distribution device based on multi-agent reinforcement learning, characterized by comprising:
a construction module, used to construct a combat behavior model and a reinforcement learning training environment;
an acquisition module, used to train the combat behavior model in the reinforcement learning training environment until the model converges, obtaining an artificial intelligence behavior model;
an output module, used to train the artificial intelligence behavior model with a combat simulation engine and output an optimized model; wherein the combat simulation engine is the large environment, and the reinforcement learning training environment is a small environment made by extracting the necessary factors from the large environment.
CN202010959038.7A 2020-09-14 2020-09-14 Target detection and distribution method and device based on multi-agent reinforcement learning Active CN112131786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010959038.7A CN112131786B (en) 2020-09-14 2020-09-14 Target detection and distribution method and device based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010959038.7A CN112131786B (en) 2020-09-14 2020-09-14 Target detection and distribution method and device based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN112131786A CN112131786A (en) 2020-12-25
CN112131786B (en) 2024-05-31

Family

ID=73846639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010959038.7A Active CN112131786B (en) 2020-09-14 2020-09-14 Target detection and distribution method and device based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN112131786B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222106B (en) * 2021-02-10 2024-04-30 西北工业大学 Intelligent soldier chess deduction method based on distributed reinforcement learning
CN112905166B (en) * 2021-03-04 2024-04-05 青岛海科智汇信息科技有限公司 Artificial intelligence programming system, computer device, and computer-readable storage medium
CN112633519B (en) * 2021-03-11 2021-07-27 中国科学院自动化研究所 Man-machine antagonistic action prediction method, device, electronic equipment and storage medium
CN113469853A (en) * 2021-05-13 2021-10-01 航天科工空间工程发展有限公司 Method for accelerating command control of fighting and artificial intelligence device
CN113435598B (en) * 2021-07-08 2022-06-21 中国人民解放军国防科技大学 Knowledge-driven intelligent strategy deduction decision method
TR2021014085A2 (en) * 2021-09-08 2021-09-21 Havelsan Hava Elektronik San Ve Tic A S AUTONOMOUS VIRTUAL SIMULATOR ASSETS THAT CONTINUOUSLY LEARN THROUGH EXPERIENCE
CN113723013B (en) * 2021-09-10 2024-06-18 中国人民解放军国防科技大学 Multi-agent decision-making method for continuous space soldier chess deduction
CN114327916B (en) * 2022-03-10 2022-06-17 中国科学院自动化研究所 Training method, device and equipment of resource allocation system
CN114611669B (en) * 2022-03-14 2023-10-13 三峡大学 Intelligent decision-making method for chess deduction based on double experience pool DDPG network
CN115906673B (en) * 2023-01-10 2023-11-03 中国人民解放军陆军工程大学 Combat entity behavior model integrated modeling method and system
CN116739323B (en) * 2023-08-16 2023-11-10 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling
CN117217100B (en) * 2023-11-08 2024-01-30 中国人民解放军63963部队 Intelligent modeling method and simulation system for certain team numbers based on reinforcement learning

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964019A (en) * 2010-09-10 2011-02-02 北京航空航天大学 Against behavior modeling simulation platform and method based on Agent technology
KR101179074B1 (en) * 2011-12-13 2012-09-05 국방과학연구소 Airburst simulation apparatus and method of simulation for airbrust
RU2562096C1 (en) * 2014-06-25 2015-09-10 Федеральное государственное казённое военное образовательное учреждение высшего профессионального образования "Военная академия воздушно-космической обороны им. Маршала Советского Союза Г.К. Жукова" Министерства обороны Российской Федерации Training command post of main rocket attack warning centre
CN105677443A (en) * 2015-12-29 2016-06-15 中国人民解放军空军指挥学院 Heterogeneous simulation system
WO2018175551A1 (en) * 2017-03-22 2018-09-27 Circadence Corporation Mission-based, game-implemented cyber training system and method
CN108629422A (en) * 2018-05-10 2018-10-09 浙江大学 A kind of intelligent body learning method of knowledge based guidance-tactics perception
CN108646589A (en) * 2018-07-11 2018-10-12 北京晶品镜像科技有限公司 A kind of battle simulation training system and method for the formation of attack unmanned plane
CN109636699A (en) * 2018-11-06 2019-04-16 中国电子科技集团公司第五十二研究所 A kind of unsupervised intellectualized battle deduction system based on deeply study
CN109740283A (en) * 2019-01-17 2019-05-10 清华大学 Autonomous multiple agent confronting simulation method and system
CN110147883A (en) * 2019-05-28 2019-08-20 航天科工***仿真科技(北京)有限公司 Training method, device, equipment and the storage medium of model for emulation of fighting
CN110428057A (en) * 2019-05-06 2019-11-08 南京大学 A kind of intelligent game playing system based on multiple agent deeply learning algorithm
CN110766169A (en) * 2019-10-31 2020-02-07 深圳前海微众银行股份有限公司 Transfer training optimization method and device for reinforcement learning, terminal and storage medium
CN110929871A (en) * 2019-11-15 2020-03-27 南京星火技术有限公司 Game decision method and system
CN111027862A (en) * 2019-12-11 2020-04-17 中国舰船研究设计中心 Multidimensional-based hierarchical aggregation combat simulation training evaluation method
CN111632387A (en) * 2020-06-12 2020-09-08 南京大学 Command control system based on interstellar dispute II

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964019A (en) * 2010-09-10 2011-02-02 北京航空航天大学 Against behavior modeling simulation platform and method based on Agent technology
KR101179074B1 (en) * 2011-12-13 2012-09-05 국방과학연구소 Airburst simulation apparatus and method of simulation for airbrust
RU2562096C1 (en) * 2014-06-25 2015-09-10 Федеральное государственное казённое военное образовательное учреждение высшего профессионального образования "Военная академия воздушно-космической обороны им. Маршала Советского Союза Г.К. Жукова" Министерства обороны Российской Федерации Training command post of main rocket attack warning centre
CN105677443A (en) * 2015-12-29 2016-06-15 中国人民解放军空军指挥学院 Heterogeneous simulation system
WO2018175551A1 (en) * 2017-03-22 2018-09-27 Circadence Corporation Mission-based, game-implemented cyber training system and method
CN108629422A (en) * 2018-05-10 2018-10-09 浙江大学 A kind of intelligent body learning method of knowledge based guidance-tactics perception
CN108646589A (en) * 2018-07-11 2018-10-12 北京晶品镜像科技有限公司 A kind of battle simulation training system and method for the formation of attack unmanned plane
CN109636699A (en) * 2018-11-06 2019-04-16 中国电子科技集团公司第五十二研究所 A kind of unsupervised intellectualized battle deduction system based on deeply study
CN109740283A (en) * 2019-01-17 2019-05-10 清华大学 Autonomous multiple agent confronting simulation method and system
CN110428057A (en) * 2019-05-06 2019-11-08 南京大学 A kind of intelligent game playing system based on multiple agent deeply learning algorithm
CN110147883A (en) * 2019-05-28 2019-08-20 航天科工***仿真科技(北京)有限公司 Training method, device, equipment and the storage medium of model for emulation of fighting
CN110766169A (en) * 2019-10-31 2020-02-07 深圳前海微众银行股份有限公司 Transfer training optimization method and device for reinforcement learning, terminal and storage medium
CN110929871A (en) * 2019-11-15 2020-03-27 南京星火技术有限公司 Game decision method and system
CN111027862A (en) * 2019-12-11 2020-04-17 中国舰船研究设计中心 Multidimensional-based hierarchical aggregation combat simulation training evaluation method
CN111632387A (en) * 2020-06-12 2020-09-08 南京大学 Command control system based on interstellar dispute II

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Efficient training techniques for multi-agent reinforcement learning in combat tasks;Zhang Guanyu 等;《IEEE ACCESS》;20190821;第7卷;109301-109310 *
Recurrent MADDPG for Object Detection and Assignment in Combat Tasks;Wei Xiaolong 等;《IEEE ACCESS》;20200918;第8卷;163334-163343 *
Intelligent evaluation model of wargame entity decision effect based on deep learning; Ou Wei et al.; Military Operations Research and Assessment; 20181231; vol. 32, no. 4; 29-34 *
Research on UAV position deployment and energy optimization mechanism for air-ground cooperative networking; Gao Fuxiao; China Master's Theses Full-text Database, Engineering Science and Technology II; 20200115; no. 01, 2020; C034-1016 *
Design and key technologies of a carrier-based aircraft anti-sea combat training simulation system; Wang Shuyun et al.; Command Control & Simulation; 20200630; vol. 43, no. 3; 81-86 *

Also Published As

Publication number Publication date
CN112131786A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN112131786B (en) Target detection and distribution method and device based on multi-agent reinforcement learning
Zhang et al. Deep interactive reinforcement learning for path following of autonomous underwater vehicle
Liu et al. A deep reinforcement learning based intelligent decision method for UCAV air combat
Lin et al. Evolutionary digital twin: A new approach for intelligent industrial product development
CN112215350B (en) Method and device for controlling agent based on reinforcement learning
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN112329948A (en) Multi-agent strategy prediction method and device
CN101083019A (en) Rapid evaluating system based on roomage state sensing
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN112947575B (en) Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN115952736A (en) Multi-agent target collaborative search method and system
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
Li et al. Improved Q-learning based route planning method for UAVs in unknown environment
Ruifeng et al. Research progress and application of behavior tree technology
Wang et al. Data-driven path-following control of underactuated ships based on antenna mutation beetle swarm predictive reinforcement learning
CN117590867A (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
CN117518907A (en) Control method, device, equipment and storage medium of intelligent agent
CN116933948A (en) Prediction method and system based on improved seagull algorithm and back propagation neural network
CN116663416A (en) CGF decision behavior simulation method based on behavior tree
Montana et al. Towards a unified framework for learning from observation
CN116430891A (en) Deep reinforcement learning method oriented to multi-agent path planning environment
Wang et al. A review of deep reinforcement learning methods and military application research
Cummings et al. Development of a hybrid machine learning agent based model for optimization and interpretability
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant