CN112819144B - Method for improving convergence and training speed of neural network with multiple agents

Method for improving convergence and training speed of neural network with multiple agents

Info

Publication number
CN112819144B
Authority
CN
China
Prior art keywords
agent
intelligent
agents
neural network
rewards
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110192255.2A
Other languages
Chinese (zh)
Other versions
CN112819144A (en)
Inventor
陈晨 (Chen Chen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XIAMEN G-BITS NETWORK TECHNOLOGY CO LTD
Original Assignee
XIAMEN G-BITS NETWORK TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XIAMEN G-BITS NETWORK TECHNOLOGY CO LTD filed Critical XIAMEN G-BITS NETWORK TECHNOLOGY CO LTD
Priority to CN202110192255.2A priority Critical patent/CN112819144B/en
Publication of CN112819144A publication Critical patent/CN112819144A/en
Application granted granted Critical
Publication of CN112819144B publication Critical patent/CN112819144B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G06N20/00 - Machine learning
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 - Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67 - Generating or modifying game content before or while executing the game program adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • A63F2300/00 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 - Methods for processing data by generating or executing the game program
    • A63F2300/6027 - Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Abstract

The invention relates to a method, an apparatus and a storage medium for improving the convergence and training speed of a neural network with multiple agents, in which directional rewards/penalties are applied to the rewards of the multiple agents: for each single agent under the multi-agent task, an agent that has currently made the optimal decision is encouraged and retained, while an agent that has made a wrong decision is given a directional penalty, without affecting the neural network optimization of the other agents. On this basis, the multi-agent AI of the invention knows exactly which agent object was wrong during back propagation, so that only that object is penalized when computing gradients, which accelerates the convergence and training speed of the neural network and further improves the effectiveness of the multi-agent AI.

Description

Method for improving convergence and training speed of neural network with multiple agents
Technical Field
The invention relates to the technical field of artificial intelligence and reinforcement learning, and in particular to a method for improving the convergence and training speed of a neural network with multiple agents.
Background
As shown in fig. 1, reinforcement learning is learning by an agent (Agent) in a "trial and error" manner: the agent obtains rewards by interacting with the environment, and its goal is to obtain the largest reward. Reinforcement learning differs from supervised learning in connectionist learning mainly in the reinforcement signal: the reinforcement signal provided by the environment evaluates how good the generated action is, rather than telling the reinforcement learning system RLS (reinforcement learning system) how to generate the correct action. Since the external environment provides little information, the RLS must learn from its own experience. In this way, the RLS acquires knowledge in the action-evaluation environment and improves its action plan to suit the environment.
If a certain behavior strategy of an agent leads to a positive reward (reinforcement signal) from the environment, the agent's tendency to adopt this behavior strategy later is strengthened. The goal of the agent is to find, in each discrete state, the optimal strategy that maximizes the expected sum of discounted rewards. Reinforcement learning treats learning as a heuristic evaluation process: the agent selects an action for the environment, the state of the environment changes after receiving the action, and a reinforcement signal (reward or penalty) is generated and fed back to the agent; the agent then selects the next action according to the reinforcement signal and the current state of the environment, with the selection principle of increasing the probability of receiving positive reinforcement (reward). The selected action affects not only the immediate reinforcement signal, but also the state of the environment at the next moment and the final reinforcement signal. The learning goal of the reinforcement learning system is to dynamically adjust its parameters to maximize the reinforcement signal. For example, in the artificial intelligence training of Go (weiqi), if an artificial intelligence AI places a stone on a position where a stone already exists, a penalty needs to be applied to that action policy, so as to guide the AI to optimize. (In this invention, positive scores are called rewards and negative deductions are called penalties.)
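To make the trial-and-error loop above concrete, the following is a minimal illustrative sketch, not part of the patented method: tabular Q-learning on a hypothetical five-cell chain environment, where the agent receives a positive reinforcement signal only when it reaches the goal cell and adjusts its action values accordingly.

```python
import random

# Toy illustration of the trial-and-error loop described above: tabular Q-learning
# on a hypothetical 5-cell chain where the agent must walk right to reach a goal.
# Everything here (environment, parameters, names) is an illustrative assumption,
# not the patented method.

N_STATES = 5
ACTIONS = [0, 1]                              # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
q = [[0.0, 0.0] for _ in range(N_STATES)]     # action-value table

def step(state, action):
    """Environment dynamics: returns (next_state, reinforcement_signal, done)."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    if nxt == N_STATES - 1:
        return nxt, 1.0, True                 # positive reinforcement (reward) at the goal
    return nxt, 0.0, False

for episode in range(200):
    s, done = 0, False
    while not done:
        # Selection principle: mostly exploit the current best action, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < EPSILON else max(ACTIONS, key=lambda x: q[s][x])
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(q[s2])
        q[s][a] += ALPHA * (target - q[s][a])  # adjust parameters toward larger expected reward
        s = s2

print(q)   # action 1 (right) ends up with the higher value in every non-terminal state
```

After a few hundred episodes the value of moving right dominates in every state, which is the strengthening of a rewarded behavior strategy described above.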
In reinforcement learning artificial intelligence training, there is a reward (Reward) setting problem for multiple agents. As shown in fig. 2, in the existing technical solution for handling multiple agents, the reward (Reward) is computed for the multi-agent AI as a whole, and the neural network is then optimized by back propagation according to this reward or penalty. The disadvantage of computing one unified reward under the multi-agent problem is that, when the multiple agents are optimized, it is in fact not known which agent performed better and which performed worse, so the multi-agent AI cannot be guided to make more effective optimization. Because of this disadvantage, the multi-agent AI will not, during optimization, allow a single agent to issue instructions whose benefit departs from that of the team as a whole, which limits the possibility that the multi-agent AI explores the best strategy and loses many opportunities to train exploration (curiosity).
For example, in a turn-based game, the multi-agent AI operates the characters of an entire team. In turn-based games, such as games with fog of war, the multi-agent AI plays a team of characters; because their viewing angles differ, each character observes a different fog-of-war state, and the multi-agent AI pieces these states together into global information, then makes further decisions so that each character executes a different instruction. If there are multiple characters to operate in the team, then the wrong instructions they issue need to be passed to the multi-agent AI so that a penalty can be applied. The problem is that, in the existing technology, the reward setting of the AI shares one reward (Reward) report: the multiple operated characters rise and fall together, and one character's error incurs a penalty for the whole team; even if only one character in the whole team is wrong, the reward is still penalized (although the penalty is lower than when the whole team is wrong).
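As a sketch of the shared-reward drawback just described (all names here are hypothetical stand-ins, not taken from this patent or the cited documents): a single team reward is broadcast to every agent's parameters, so an agent that decided correctly is still penalized whenever a teammate errs.

```python
# Prior-art style shared reward (illustrative sketch; the toy per-agent
# "weights" are stand-ins for each agent's neural network parameters).
def shared_reward_update(agent_weights, team_reward, lr=0.01):
    # Every agent is pushed in the same direction by the same scalar,
    # with no way to tell which agent caused the gain or the loss.
    return [w + lr * team_reward for w in agent_weights]

# Even if only one of three roles made a mistake, all three are penalized together.
print(shared_reward_update([0.0, 0.0, 0.0], team_reward=-30.0))  # [-0.3, -0.3, -0.3]
```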
In view of the above, the present inventors have conducted intensive study of the above problems and have made the present invention.
Disclosure of Invention
The invention aims to provide a method for improving the convergence and training speed of a neural network with multiple agents, which improves the convergence and training speed of the neural network through directional rewards/penalties.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the method is realized based on a multi-agent system, wherein the multi-agent system comprises a multi-agent master control and N agents, and a buried point is arranged in feedback of each agent and is used for judging whether instructions of the agents are wrong or not and making an excellent decision; the method comprises the following steps:
inputting state information and transmitting the current state information to the N agents;
each agent outputs its own instruction according to its own neural network combined with the current state information;
according to the instruction result, combined with the buried-point judgment in the feedback, each agent is given reward/penalty feedback;
transmitting the reward/penalty list of the N agents to the multi-agent master control;
the multi-agent master control back-propagates updates to the neural network of each agent according to the reward/penalty list.
An apparatus for improving the convergence and training speed of a neural network with multiple agents, the apparatus comprising a processor and a memory;
the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method as described above.
A computer readable storage medium having instructions stored therein which, when run on a terminal device, cause the terminal device to perform a method as described above.
A computer software program product which, when run on a terminal device, causes the terminal device to perform a method as described above.
After the above scheme is adopted, the invention applies directional rewards/penalties to the rewards of multiple agents: for each single agent under the multi-agent task, an agent that has currently made the optimal decision is encouraged and retained, while an agent that has made a wrong decision is given a directional penalty, without affecting the neural network optimization of the other agents. On this basis, the multi-agent AI of the invention knows exactly which agent object was wrong during back propagation, so that only that object is penalized when computing gradients, which accelerates the convergence and training speed of the neural network and further improves the effectiveness of the multi-agent AI. The invention splits the multi-agent task from the team into independent individuals, so that the multi-agent AI is richer in global strategy, which is especially notable in the training of game AI.
Drawings
FIG. 1 is a schematic diagram of reinforcement learning;
FIG. 2 is a schematic diagram of a multi-agent reinforcement learning method in the prior art;
FIG. 3 is a schematic diagram of a multi-agent reinforcement learning method according to the present invention;
FIG. 4 is a schematic diagram of a learning method according to an embodiment of the invention.
Detailed Description
As shown in fig. 3, the present invention discloses a method for improving the convergence and training speed of a neural network with multiple agents, which is implemented on the basis of a multi-agent system, wherein the multi-agent system comprises a multi-agent master control and N agents, and a buried point (an instrumentation hook) is arranged in the feedback of each agent for judging whether the agent's instruction is wrong and whether an excellent decision has been made. The method comprises the following steps:
inputting state information and transmitting the current state information to the N agents;
each agent outputs its own instruction according to its own neural network combined with the current state information;
according to the instruction result, combined with the buried-point judgment in the feedback, each agent is given reward/penalty feedback;
transmitting the reward/penalty list of the N agents to the multi-agent master control;
the multi-agent master control back-propagates updates to the neural network of each agent according to the reward/penalty list.
Fig. 4 shows an embodiment of the present invention with a total of three agents, namely agent A, agent B and agent C. After the state information is input, agent A, agent B and agent C output instructions according to their respective neural networks combined with the input state information; then, according to the instruction results and the buried-point judgments, reward/penalty values are given to the agents. In this embodiment, the reward value of agent A is 1, the reward value of agent B is 50, and the penalty value of agent C is 100. Agent A, agent B and agent C summarize their respective rewards and penalties into a reward/penalty list { +1, +50, -100 } and transmit it to the multi-agent master control, and the multi-agent master control back-propagates updates to each agent's network according to the reward/penalty list: agent A updates its own neural network according to the reward value 1 in the list, agent B updates its own neural network according to the reward value 50 in the list, and agent C updates its own neural network according to the penalty value 100 in the list.
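A minimal sketch of this embodiment follows. The class and function names, the scalar "weight" standing in for each agent's neural network, and the hard-coded scores are illustrative assumptions rather than the patented implementation; the sketch only shows the directional flow: per-agent buried-point scoring, a reward/penalty list { +1, +50, -100 }, and an update that touches each agent's own parameters only.

```python
# Directional reward/penalty sketch for the A/B/C embodiment (illustrative
# assumptions only: the Agent class, its toy "network" and the scoring rules
# are hypothetical stand-ins, not the patented implementation).

class Agent:
    def __init__(self, name):
        self.name = name
        self.weight = 0.0                      # stand-in for a neural network parameter

    def decide(self, state):
        return f"{self.name}-instruction"      # instruction from its own network + shared state

    def update(self, reward_or_penalty, lr=0.01):
        # Back-propagation stand-in: only this agent's parameters move, and only
        # by its own directional reward/penalty value.
        self.weight += lr * reward_or_penalty

def buried_point_score(agent_name):
    # Per-agent buried-point judgment from the embodiment: A = +1, B = +50, C = -100.
    return {"A": +1.0, "B": +50.0, "C": -100.0}[agent_name]

def master_control_step(agents, state):
    instructions = [a.decide(state) for a in agents]        # each agent outputs its instruction
    rewards = [buried_point_score(a.name) for a in agents]  # reward/penalty list { +1, +50, -100 }
    for agent, r in zip(agents, rewards):                   # directional update: Mi goes to agent i only
        agent.update(r)
    return instructions, rewards

agents = [Agent("A"), Agent("B"), Agent("C")]
print(master_control_step(agents, state={"global": "fog-of-war observations"}))
```

Because each agent's update uses only its own entry Mi from the list, the toy weight adjustment driven by agent C's penalty never touches agents A and B, which is the directional property described above.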
The invention applies directional rewards/penalties to the rewards of multiple agents: for each single agent under the multi-agent task, an agent that has currently made the optimal decision is encouraged and retained, while an agent that has made a wrong decision is penalized directionally, without affecting the neural network optimization of the other agents. On this basis, the multi-agent AI of the invention knows exactly which agent object was wrong during back propagation, so that only that object is penalized when computing gradients, which accelerates the convergence and training speed of the neural network and further improves the effectiveness of the multi-agent AI. The invention splits the multi-agent task from the team into independent individuals, so that the multi-agent AI is richer in global strategy, which is especially notable in the training of game AI.
The invention is very suitable for decision tasks under the multi-agent AI problem that depend heavily on a single agent making decisions with individual character. For example, in a team cooperation game, a single agent may need to make a sacrifice in order to preserve the revenue of the entire team. Under such a premise, the old solution does not allow the single agent to adopt an extreme strategy that departs from the team, since the team shares one reward and the neural network will not regard this as an excellent decision made by this single agent. The directional reward/penalty scheme of the invention gives each agent an independent reward value, regards the agent that sacrifices itself as having made a very intelligent decision, gives correct reward guidance, and encourages the multi-agent AI to continue to take such decisions in the same state in the future, so that the multi-agent AI is optimized correctly.
Based on the same inventive concept, the invention also discloses a device for improving convergence and training speed of the neural network with multiple intelligent agents, wherein the device comprises a processor and a memory; the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method as described above.
The invention also discloses a computer readable storage medium having instructions stored therein which, when run on a terminal device, cause the terminal device to perform the method as described above.
The invention also discloses a computer software program product which, when run on a terminal device, causes the terminal device to perform the method as described above.
The foregoing embodiments do not limit the technical scope of the present invention, and therefore any minor modifications, equivalent variations and modifications made to the above embodiments according to the technical principles of the present invention still fall within the scope of the technical solution of the present invention.

Claims (3)

1. A method for improving the convergence and training speed of a neural network with multiple agents, characterized in that: the method is implemented on the basis of a multi-agent system, the multi-agent system comprises a multi-agent master control and N agents, and a buried point is arranged in the feedback of each agent for judging whether the agent's instruction is wrong and whether an excellent decision has been made; the method comprises the following steps:
inputting state information and transmitting the current state information to the N agents;
each agent outputs its own instruction according to its own neural network combined with the current state information;
according to the instruction result, combined with the buried-point judgment in the feedback, agent i is given reward/penalty feedback Mi, where i = 1, 2, ..., N;
the rewards/penalties of the N agents are assembled into a reward/penalty list { M1, M2, ..., Mi, ..., MN } and transmitted to the multi-agent master control;
the multi-agent master control back-propagates updates to the neural network of each agent according to the reward/penalty list { M1, M2, ..., Mi, ..., MN }, that is, each agent i updates its own neural network according to its reward/penalty value Mi.
2. An apparatus for improving the convergence and training speed of a neural network with multiple agents, characterized in that: the apparatus comprises a processor and a memory;
the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of claim 1.
3. A computer-readable storage medium, characterized by: the computer readable storage medium has stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of claim 1.
CN202110192255.2A 2021-02-20 2021-02-20 Method for improving convergence and training speed of neural network with multiple agents Active CN112819144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110192255.2A CN112819144B (en) 2021-02-20 2021-02-20 Method for improving convergence and training speed of neural network with multiple agents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110192255.2A CN112819144B (en) 2021-02-20 2021-02-20 Method for improving convergence and training speed of neural network with multiple agents

Publications (2)

Publication Number Publication Date
CN112819144A CN112819144A (en) 2021-05-18
CN112819144B true CN112819144B (en) 2024-02-13

Family

ID=75864251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110192255.2A Active CN112819144B (en) 2021-02-20 2021-02-20 Method for improving convergence and training speed of neural network with multiple agents

Country Status (1)

Country Link
CN (1) CN112819144B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020000399A1 (en) * 2018-06-29 2020-01-02 东莞理工学院 Multi-agent deep reinforcement learning proxy method based on intelligent grid
CN111079717A (en) * 2020-01-09 2020-04-28 西安理工大学 Face recognition method based on reinforcement learning
CN112286203A (en) * 2020-11-11 2021-01-29 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm

Also Published As

Publication number Publication date
CN112819144A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN110991545B (en) Multi-agent confrontation oriented reinforcement learning training optimization method and device
CN110882544B (en) Multi-agent training method and device and electronic equipment
CN108211362B (en) Non-player character combat strategy learning method based on deep Q learning network
Loiacono et al. The 2009 simulated car racing championship
Hwang et al. Cooperative strategy based on adaptive Q-learning for robot soccer systems
CN109794937B (en) Football robot cooperation method based on reinforcement learning
CN113952733A (en) Multi-agent self-adaptive sampling strategy generation method
CN112149344B (en) Football robot with ball strategy selection method based on reinforcement learning
CN112488320A (en) Training method and system for multiple intelligent agents under complex conditions
CN112044076B (en) Object control method and device and computer readable storage medium
Andou Refinement of soccer agents' positions using reinforcement learning
Kose et al. Q-learning based market-driven multi-agent collaboration in robot soccer
CN116187777A (en) Unmanned aerial vehicle air combat autonomous decision-making method based on SAC algorithm and alliance training
CN115409158A (en) Robot behavior decision method and device based on layered deep reinforcement learning model
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN112819144B (en) Method for improving convergence and training speed of neural network with multiple agents
Wang et al. Experience sharing based memetic transfer learning for multiagent reinforcement learning
CN111814988B (en) Testing method of multi-agent cooperative environment reinforcement learning algorithm
CN112540614A (en) Unmanned ship track control method based on deep reinforcement learning
CN116991067A (en) Pulse type track-chasing-escaping-blocking cooperative game intelligent decision control method
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
Packard et al. Learning behavior from limited demonstrations in the context of games
Ali et al. Evolving emergent team strategies in robotic soccer using enhanced cultural algorithms
Kim et al. Deep q-network for ai soccer
MARTINS Exploring multi-agent deep reinforcement learning in IEEE very small size soccer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant