CN114048833B - Multi-person and large-scale incomplete information game method and device based on neural network fictitious self-play


Info

Publication number
CN114048833B
CN114048833B
Authority
CN
China
Prior art keywords: game, experience, learning, priority, agent
Prior art date
Legal status: Active
Application number
CN202111303688.7A
Other languages
Chinese (zh)
Other versions
CN114048833A (en
Inventor
王轩
漆舒汉
张加佳
于梓元
刘洋
唐琳琳
夏文
廖清
蒋琳
张丹丹
Current Assignee: Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202111303688.7A
Publication of CN114048833A
Application granted
Publication of CN114048833B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-person, large-scale incomplete information game method and device based on neural-network fictitious self-play. On top of the conventional neural fictitious self-play (NFSP) algorithm, a prioritized experience sampling mechanism and a priority-weighted learning-degree control mechanism are introduced: priorities are set according to the learning value of experience segments to filter the experiences in the memory bank, and a summation-tree data structure is adopted for storing and sampling the prioritized experiences, realizing prioritized experience sampling with O(log n) time complexity, reducing the cost of interaction with the environment during NFSP training, and accelerating the solving speed. Meanwhile, a Markov decision process is used to re-model the extensive game, converting a multi-player game into an interaction process between a single agent and the environment, which can be regarded as a two-player game between the single agent and the environment; this extends the applicable scope of NFSP to multi-player games and enhances the generality of the algorithm.

Description

Multi-person and large-scale incomplete information game method and device based on neural network fictitious self-play
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a multi-person, large-scale incomplete information game method and device based on neural-network fictitious self-play.
Background
Machine game playing has been intertwined with artificial intelligence since the earliest days of the field and has long served as a yardstick for measuring the level of AI development. Research on machine games has given rise to many important theories and methods in artificial intelligence and has had a profound academic and social impact.
Machine games can generally be divided into games under complete information conditions (complete information games for short) and games under incomplete information conditions (incomplete information games for short). In the former, players participate in the game with access to all information about the game environment, as in most board games such as chess and Go. In recent years, complete information games, represented by the Go program AlphaGo, have been solved well, and academia has turned its research focus to incomplete information machine games. In an incomplete information game, the game information can only be partially observed, and the opponent's state and strategy cannot be known precisely; for example, in Texas hold'em poker a player can only see his own hole cards and the community cards, but not the opponents' hole cards. The complexity of incomplete information games is closely related to the number of players, the degree of information uncertainty, the game rules and so on, making them a challenging branch of machine game research. Meanwhile, incomplete information situations are ubiquitous in real-world strategic decision making, such as military and business strategy, negotiation, and financial investment, which makes research on incomplete information games especially important.
One of the main directions of incomplete information game research is the study of game algorithms that combine game theory with machine learning. This direction is characterized by both the theoretical support of game-theoretic Nash equilibrium and the ability to learn adaptively from scratch.
Disclosure of Invention
The invention mainly aims to overcome the defects in the prior art and provide a multi-person, large-scale incomplete information game method and device based on neural-network fictitious self-play, in which the approximate optimal response of a multi-player game is obtained by solving an approximate solution of the Markov decision process determined by the opponents' strategy profile, and the final game strategy is obtained after training and improvement.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a multi-person, large-scale incomplete information game method based on neural-network fictitious self-play, which comprises the following steps:
an agent plays against a virtual opponent in the game environment to generate experience segments, which are stored in an optimal-response memory bank for training and updating the optimal-response network; if the agent selects the optimal response according to the probability, the optimal-response experience segment is also stored in the average-strategy memory bank for learning and updating the average strategy; the optimal response is realized by the reinforcement learning algorithm DQN within the neural fictitious self-play (NFSP) algorithm, and the average strategy is obtained by supervised learning within the NFSP algorithm;
the agent plays in the game environment and accumulates experience segments; when the experience segments in the average-strategy memory bank reach a certain number, training and improvement of the agent's output strategy are started;
a prioritized experience sampling mechanism is introduced into the optimal-response network of the NFSP algorithm: priorities are set according to the learning value of the experience segments to filter the experience segments in the optimal-response memory bank; the experience segments with the added priority attribute are stored in a summation tree; in the sampling stage, sampling is performed with the priority as the measure, so that valuable experience segments are learned first; after learning, the priorities are recalculated and the summation tree is updated;
the learning degree of the experience segments is optimized by a priority-weighted learning-degree control mechanism, which controls how thoroughly experience segments of different value are learned by adjusting the number of times each experience segment is learned during training;
when multi-agent games are played, a Markov decision process (MDP) is used to re-model the extensive game so as to simulate the multi-agent game environment;
on the premise that each agent has a private memory bank, a common memory bank is further added to improve the learning effect; the private memory bank independently stores the experience segments related to a single agent, and the common memory bank stores all experience segments generated during the game.
As a preferred technical scheme, the agent plays against the virtual opponent in the game environment with a mixed action strategy, i.e., it plays against the virtual opponent's average strategy by selecting, with a probability given by a dynamic expectation parameter, either the optimal response or the average strategy.
As a preferred technical solution, the agent plays in the game environment and accumulates experience segments, and when the experience segments in the average-strategy memory bank reach a certain number, training and improvement of the agent's output strategy are started, specifically:
the parameters of the optimal-response network are updated by sampling experience segments from the optimal-response memory bank; when the optimal-response experience segments in the average-strategy memory bank have accumulated to a certain number, an update of the average-strategy network is triggered, and the optimal-response experience segments are sampled from the average-strategy memory bank for supervised learning, fitting the optimal behaviour of the game agent; the strategy is thus continuously trained and improved, and by the convergence guarantee the average strategy gradually converges to an approximate Nash equilibrium.
As a preferred technical solution, the priority of the experience segment is measured by the size of TD-error, and the larger TD-error, the more worth learning the experience segment, the higher the priority; the priorities are expressed as follows:
p(e_i) = (|δ_i| + ε)^α / Σ_j (|δ_j| + ε)^α
where p(e_i) denotes the priority, δ_i denotes the TD-error of experience segment e_i, and α ∈ [0,1] controls the influence of the TD-error (when α = 0 the scheme degenerates to simple random sampling); ε is a small positive number that avoids zero-priority experience segments and ensures that every experience segment has some probability of being sampled.
As a preferred technical scheme, the probability-distribution bias introduced by prioritized experience sampling is corrected with an annealing method: by attaching an importance-sampling weight (ISW) coefficient to the TD-error, sampling under the original distribution P_A is equivalently converted into sampling under the new distribution P_B, completing the correction. The corrected final weight is:
w(e_i) = (N·p(e_i))^(-β) / max_j (N·p(e_j))^(-β)
In the above formula, because p(e)^(-β) decreases monotonically as p(e) increases, the maximum of p(e)^(-β) corresponds to the minimum of p(e), i.e. max_j p(e_j)^(-β) = (min_j p(e_j))^(-β).
As a preferred technical solution, the summation tree is constructed by the following steps:
The values of the tree nodes are initialized to zero. When an experience segment is stored in the memory bank, a leaf node stores the experience segment and its priority, and the priority values stored in its ancestor nodes are then updated upward layer by layer. When sampling, to draw n experience samples, the total priority at the root is divided evenly into n intervals; one priority value is then drawn at random from each of the n intervals, denoted p_1, p_2, ..., p_n, and the corresponding experience samples are located in the summation tree according to these priorities.
As a preferred technical solution, the learning degree control mechanism with priority weighting is used to optimize the learning degree of the experience segment, specifically:
with the priority as the weight coefficient of the number of learning passes, the number of times LT that experience segment e_i is learned in one training step is defined as follows:
LT(e_i) = clip[ p(e_i)·N_ltmax , N_ltmin , N_ltmax ]
where N_ltmin and N_ltmax are the lower and upper limits of the number of learning passes for an experience segment, and clip clamps the rounded number of learning passes to the range [N_ltmin, N_ltmax].
As a preferred technical scheme, re-modeling the extensive game with the Markov decision process specifically comprises:
to model the multi-player game correctly, for each agent the N-1 opponent agents other than itself jointly determine an MDP; from the perspective of a single game agent, it simply interacts with the environment, continuously generating experience segments during the interaction, and by learning from these experience segments it obtains an approximate solution of the MDP, i.e. the optimal response to the N-1 opponent agents.
Preferably, the game environment includes a training environment and an evaluation environment, the training environment is used for a plurality of agents to play games to improve the game level, and the evaluation environment is used for evaluating the game level of the trained agents by playing games with other agents.
Another aspect of the present invention provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the above multi-person, large-scale incomplete information game method based on neural-network fictitious self-play.
In still another aspect, the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the multi-person, large-scale incomplete information game method based on neural-network fictitious self-play.
Compared with the prior art, the invention has the following advantages and beneficial effects:
On the basis of the conventional NFSP algorithm, the invention introduces a prioritized experience sampling mechanism and a priority-weighted learning-degree control mechanism, which reduce the cost of interaction with the environment during NFSP training and accelerate the solving speed. Meanwhile, a Markov decision process is used to re-model the extensive game, converting a multi-player game into an interaction process between a single agent and the environment, which can be regarded as a two-player game between the single agent and the environment; this extends the applicable scope of NFSP to multi-player games and enhances the generality of the algorithm.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flow chart of the algorithm of the DQN of the present invention;
FIG. 2 is a diagram of the NFSP algorithm framework of the present invention;
FIG. 3 is a schematic diagram of a summing tree holding experience segments and their priorities in accordance with the present invention;
FIG. 4 is a schematic diagram of the interaction pattern in the multi-agent gaming of the present invention;
FIG. 5 is a schematic diagram of the method for extracting and storing experience segments of multi-agent in the present invention;
FIGS. 6 (a), 6 (b) are the results of the evaluation of the present invention during the game training of Leduc and HUNL;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
And (4) machine game:
Machine gaming is also known as computer gaming (Computer Games), and the game process in the present invention can be described with an extensive game model:
Extensive Game (Extensive-Form Game): a model of sequential actions by multiple players, commonly used to describe sequential games. The model is typically represented by a game tree together with the six-tuple <N, H, P, σ, μ, I>, described as follows:
(1) N = {1, …, n}: the set of players;
(2) H: the set of node states of the finite game tree; for a state node h ∈ H, its outgoing edges represent the set of available actions A(h) in the current state;
(3) P: H → N ∪ {c}: designates the next player to act, where c denotes the chance node;
(4) strategy σ: selects an action for a state;
(5) μ: Z → R: the payoff of each player at the leaf (terminal) nodes;
(6) information set I: a set of states that are indistinguishable to the acting player.
The behaviour strategy (Behavior Strategy) σ_i of a player is a probability distribution over all legal actions in a given state. Σ_i is the set of player i's behaviour strategies. A strategy profile σ = (σ_1, σ_2, …, σ_n) is a combination of all players' strategies, and σ_{-i} is the combination of the strategies in the profile other than σ_i.
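For implementation purposes, the six-tuple maps naturally onto a small data structure. The following Python sketch is illustrative only; the field names are not taken from the patent:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

CHANCE = -1  # marker for the chance "player" c

@dataclass
class GameNode:
    """One state h in the set H of the extensive-form game tree."""
    player: int                              # P(h): index in N, or CHANCE
    actions: List[str]                       # A(h): legal actions (outgoing edges)
    children: Dict[str, "GameNode"] = field(default_factory=dict)
    payoffs: Optional[List[float]] = None    # mu(z): defined only at leaf nodes z
    info_set: Optional[str] = None           # I: key shared by indistinguishable states

# A behaviour strategy sigma_i maps an information set to a distribution over actions;
# a strategy profile collects one such mapping per player in N.
Strategy = Dict[str, Dict[str, float]]       # info_set -> {action: probability}
StrategyProfile = List[Strategy]
```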
Fictitious self-play:
Fictitious Self-Play (FSP) developed from Fictitious Play (FP). FP is a classic game-theoretic model that improves the strategy level by playing against a virtual opponent: in repeated play, the player takes the optimal response to the virtual opponent's average strategy. After a sufficient number of iterations, the player's average strategy can converge to a Nash equilibrium in a two-player zero-sum game.
FSP belongs to generalized weakened fictitious play and carries out training and improvement by sampling and learning experience segments. FP traverses all states in every iteration and inevitably runs into the curse of dimensionality; generalized weakened fictitious play solves this problem well. FSP introduces machine learning into the FP framework to approximately solve the optimal response and the average strategy. The FSP agent uses the mixed strategy σ = (1-η)π + ηβ to play against the opponent's average strategy, generating experience data D that is stored in the optimal-response memory bank M_B. If the optimal response β is selected according to the probability, the average-strategy memory bank M_A is updated synchronously. Reinforcement learning and supervised learning are then used to obtain the approximate optimal response and the average strategy, respectively. After sufficient iterations the average strategy π approaches a Nash equilibrium. The specific algorithm is as follows:
[Algorithm listing omitted: provided as an image in the original document.]
Fictitious self-play combined with neural networks:
Neural Fictitious Self-Play (NFSP) is based on the FSP framework and can be used to solve extensive game strategies. NFSP combines the FSP framework with neural networks, gaining the ability to accept complex inputs and to solve for the optimal response and the average strategy end to end. NFSP mainly comprises two modules: an optimal-response module, currently realized with the DQN algorithm, and an average-strategy module, currently realized with a supervised learning algorithm. NFSP solves the optimal response by means of DQN; compared with the CFR family of algorithms, which solve the optimal response by traversing the game tree, this greatly simplifies the strategy-solving process. Because constructing the game tree depends on game-specific prior knowledge, CFR-family algorithms are inevitably limited by such prior knowledge, whereas the NFSP algorithm can learn without it. In addition, the CFR algorithm iteratively searches the game tree and is not suitable for solving large-scale game strategies; it generally needs to abstract the game using complicated domain knowledge to reduce the problem size, which limits its generality. NFSP, being combined with neural networks, is more suitable for large-scale games. NFSP inherits the theoretical guarantee of fictitious self-play and can converge to an approximate Nash equilibrium in a two-player zero-sum incomplete information game. Therefore, NFSP mainly addresses three points: independence from local search, learning without prior knowledge, and convergence to the approximate Nash equilibrium of self-play, making it better suited to solving large-scale game strategies.
As shown in fig. 1 and fig. 2, the multi-person, large-scale incomplete information game method based on neural-network fictitious self-play includes the following steps:
S1: the agent plays against a virtual opponent in the game environment to generate experience segments, specifically:
the NFSP gaming agent adopts a mixed action Strategy (Behavior Strategy) sigma = (1-eta) pi + eta beta in a gaming environment, namelyAverage strategy pi against virtual adversaries by probabilistically selecting optimal responses β = ε -green (Q) and average strategy pi by dynamic expectation parameters -1 . The game agent and the virtual opponent play in the environment to generate experience segments (s, a, r, s'), and store the experience segments into the optimal response memory bank M B And the method is used for training and updating the optimal response network. If the game intelligent body selects the optimal reaction beta according to the probability, the optimal reaction experience segment (s, a) is stored in the average strategy memory bank M A For learning update of the average strategy. The modular composition of the algorithm and its relationship are shown in fig. 2. Corresponding to the optimal response beta and the average strategy pi, two networks, namely an optimal response network DQN network (actually comprising two networks, namely a current valuation network Q and a target valuation network Q', and expressed as a whole here) and an average strategy network pi, are configured in the NFSP algorithm. The optimal response network provides an optimal response and the average policy network provides an average policy.
S2: the agent plays in the game environment and accumulates experience segments; when the experience segments in the memory banks reach a certain number, training and improvement of the strategies are started, specifically:
Experience segments are sampled from M_B to update the parameters of the optimal-response network. When the optimal-response experience segments in the average-strategy memory bank M_A have accumulated to a certain number, the update of the average-strategy network is triggered: the optimal-response experience segments (s, a) in M_A are sampled for supervised learning, fitting the optimal behaviour of the game agent. The strategies are continuously trained and improved in this way, and by the convergence guarantee the average strategy gradually converges to an approximate Nash equilibrium. The specific algorithm is as follows:
[Algorithm listing omitted: provided as an image in the original document.]
The NFSP algorithm uses DQN to provide the optimal response and obtains the average strategy through supervised learning. The largest cost in the DQN reinforcement learning algorithm lies in the interaction between the agent and the environment, so whether experience can be used efficiently, and the amount of interaction thereby reduced, has a great influence on the training and improvement of the agent. By the same reasoning, the interaction cost between the agent and the environment also has a non-negligible effect on the game level reached by the NFSP algorithm.
For some large-scale incomplete information games, the chips in the pot are distributed only when the final win/lose state of the game is reached; this is called a sparse reward. The reward-sparsity problem is a core difficulty faced by deep reinforcement learning and is widespread in real-world tasks. One effect of reward sparsity is that a large number of the experience segments stored in the memory bank have very low, or even no, learning value. Learning from the experience segments in the experience pool without distinction makes the agent's learning inefficient. To raise the agent's level, the agent then has to keep interacting with the environment to supplement experience segments, which leads to a vicious circle of inefficient learning and ever higher interaction cost. Ultimately, the key to breaking this circle is to improve the learning efficiency of experience segments, and how to do so is the problem addressed below.
Reviewing the experience-learning scheme of the NFSP algorithm: the optimal-response part learns by randomly and uniformly sampling a batch of experience segments from the M_B memory bank, while the average-strategy part learns by sampling a batch of experience segments from the M_A memory bank via reservoir sampling. By analogy with how humans organize their learning, the ordering of the learning material and the control of the learning degree have a great influence on the learning effect. Therefore, the invention takes the learning order and the learning degree as the breakthrough points for improving the experience-learning efficiency of NFSP.
S3: a prioritized experience sampling mechanism is introduced into the optimal-response network of the NFSP algorithm. Priorities are set according to the learning value of the experience segments to filter the experience segments in the optimal-response memory bank; the experience segments with the added priority attribute are stored in a summation tree; in the sampling stage, sampling is performed with the priority as the measure, so that valuable experience segments are learned first; after learning, the priorities are recalculated and the summation tree is updated.
On the basis of the DQN experience replay mechanism, the invention introduces prioritized experience sampling: priorities are set according to the learning value of the experience segments to filter the experiences in the memory bank, and the experience segments with higher priority for the current agent are sampled and learned first, thereby optimizing the learning order.
The magnitude of the TD-error is chosen as the measure for rating the priority, because the TD-error reflects how surprising an experience segment is to the current agent, i.e., how far the newly generated experience exceeds the agent's prior knowledge. Therefore, the larger the TD-error, the more the experience segment is worth learning and the higher its priority should be. The priority of an experience segment measured by the TD-error is expressed as follows:
p(e_i) = (|δ_i| + ε)^α / Σ_j (|δ_j| + ε)^α
where δ_i denotes the TD-error of experience segment e_i, α ∈ [0,1] controls the influence of the TD-error (when α = 0 the scheme degenerates to simple random sampling), and ε is a small positive number that avoids zero-priority experience segments and ensures that every experience segment can be sampled. This expression balances simple random sampling and greedy priority sampling (i.e., sampling in descending order of the absolute TD-error): sampling follows the normalized probabilities, so the sampling probability is proportional to, and monotone in, the priority, while the exploration of sampling and the diversity of the sampled experience segments are improved.
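As a concrete illustration of the priority expression above, a small numpy sketch is given below; the function name and the default values of α and ε are assumptions for the example, not values fixed by the patent:

```python
import numpy as np

def priorities_from_td_errors(td_errors, alpha=0.6, eps=1e-6):
    """p(e_i) proportional to (|delta_i| + eps)^alpha, normalized over the given set.

    alpha = 0 degenerates to uniform random sampling; eps keeps every
    experience segment sampleable even when its TD-error is zero.
    """
    scaled = (np.abs(td_errors) + eps) ** alpha
    return scaled / scaled.sum()

# example: the segment with the largest TD-error gets the largest probability
print(priorities_from_td_errors(np.array([0.0, 0.5, 2.0])))
```

With α = 1 the probability is directly proportional to |δ| + ε, i.e. pure priority sampling.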
It should be noted that introducing priorities changes the probability distribution of the original experience segments and introduces a bias, which we correct using the PER method. The PER method performs bias annealing with weighted importance sampling (WIS), which has smaller variance than ordinary importance sampling (IS). During correction, an importance-sampling weight (ISW) is added to the loss function as a weight coefficient, and the update is then completed with the corrected gradient. The specific correction method is as follows:
Assume the experience random variable X obeys the probability distribution P_A, i.e. X ~ P_A, with experience samples {x_0, x_1, …, x_n}. The loss function is expressed as follows:
L = E_{X~P_A}[δ²] ≈ (1/n)·Σ_i δ_i²
where δ_i denotes the TD-error of experience sample x_i and E[·] denotes the expectation.
Assume that after introducing priorities the experience probability distribution becomes P_B and samples are now drawn according to P_B. Importance sampling gives the following formula:
E_{X~P_A}[δ²] = E_{X~P_B}[ (P_A(X)/P_B(X))·δ² ]
where
w(e_i) = P_A(e_i) / P_B(e_i)
is the ISW. By adding this ISW weight coefficient to the TD-error, sampling under the original distribution P_A is equivalently converted into sampling under the new distribution P_B, completing the correction. In the concrete implementation of prioritized experience sampling, the probability of experience segment e_i under P_A is P_A(e_i) = 1/N and under P_B it is P_B(e_i) = p(e_i) (i.e., the priority of e_i), so
w(e_i) = P_A(e_i) / P_B(e_i) = (N·p(e_i))^(-1)
Normalization is usually performed to enhance stability. In reinforcement learning, as training gradually converges, the weights can be adjusted elastically by annealing: an annealing coefficient β is introduced, starting from an initial value β_0 and increasing linearly to 1 as training proceeds, so that the bias introduced by prioritized sampling is fully corrected by the end of training. The final weight is determined as shown in the following formula:
w(e_i) = (N·p(e_i))^(-β) / max_j (N·p(e_j))^(-β)
In the above formula, because p(e)^(-β) decreases monotonically as p(e) increases, the maximum of p(e)^(-β) corresponds to the minimum of p(e), i.e. max_j p(e_j)^(-β) = (min_j p(e_j))^(-β).
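The annealed importance-sampling correction can be sketched as follows; the linear β schedule and the β_0 value are illustrative, and the normalization is taken over the sampled batch, a common simplification of the maximum over all j:

```python
import numpy as np

def is_weights(sampled_priorities, memory_size, beta):
    """ISW w_i = (N * p(e_i))^(-beta), normalized by the largest weight in the
    sampled batch, which corresponds to the smallest sampled priority."""
    w = (memory_size * np.asarray(sampled_priorities)) ** (-beta)
    return w / w.max()

def anneal_beta(step, total_steps, beta0=0.4):
    """Increase beta linearly from beta0 to 1 as training proceeds."""
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# the weighted TD loss for a sampled batch is then mean(w_i * delta_i**2)
```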
The problem of how to construct a prioritized memory bank also needs to be solved. The application adopts the summation-tree (SumTree) data structure, which realizes prioritized experience sampling in O(log n) time.
A summation tree is a special binary tree in which the value of a parent node is the sum of the values of its children. As shown in fig. 3, the values of the leaf nodes are accumulated upward layer by layer, and the value of the root node is the sum of all leaf values. The priority memory bank is constructed with a summation tree as follows: the node values are initialized to zero; when an experience is stored in the memory bank, a leaf node stores the experience and its priority, and the priority data stored in its ancestor nodes are then updated upward layer by layer. When sampling, say, n experience samples, the total priority is divided evenly into n intervals, and one priority value is drawn at random from each interval, denoted p_1, p_2, …, p_n; the corresponding experience samples are then located in the summation tree according to these priorities. Taking p_i = 9 as an example: 9 is compared with the priorities of the two children of the root; since it is not greater than the left child's value, the search enters the left subtree. Starting from the root of that subtree, 9 is greater than the left child, so the left child's priority is subtracted from 9, giving a new value of 6, and the search enters the right subtree. Starting from the root of that right subtree, the value is not greater than the left child, so the left leaf node with priority 3 is reached and its experience is extracted, completing one sample. The remaining n-1 experiences are sampled in the same way.
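A compact array-based summation-tree sketch consistent with the storage and interval-sampling procedure just described is given below; the class and method names are assumptions, not the patent's implementation:

```python
import random

class SumTree:
    """Binary tree stored in an array: each internal node holds the sum of its two
    children; the leaves hold the priorities of the stored experience segments."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)    # internal nodes + leaves
        self.data = [None] * capacity             # experience segments
        self.write = 0                            # next leaf slot to overwrite

    def add(self, priority, experience):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                          # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, s):
        """Walk down from the root: go left if s is not greater than the left child's
        sum, otherwise subtract the left sum and go right."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):       # until a leaf is reached
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    def sample(self, n):
        """Split the total priority into n equal intervals and draw one point per interval."""
        segment = self.tree[0] / n
        return [self.get(random.uniform(i * segment, (i + 1) * segment)) for i in range(n)]
```

Both update and get walk a single root-to-leaf path, so storing and drawing one prioritized sample each cost O(log n).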
The application introduces prioritized experience sampling into the optimal-response module of the NFSP algorithm, yielding an improved algorithm, NFSP-PER. On the basis of the original NFSP algorithm, the experience segments with the added priority attribute are stored in a summation tree; in the sampling stage, sampling is performed with the priority as the measure, so that valuable experience segments are learned first; after learning, the priorities are recalculated and the summation tree is updated. The specific algorithm is as follows:
[Algorithm listing omitted: provided as images in the original document.]
S4: the learning degree of the experience segments is optimized with a priority-weighted learning-degree control mechanism, which controls how thoroughly experience segments of different value are learned by adjusting the number of times each experience segment is learned during training.
After the learning order of the experience segments has been optimized, the optimization of their learning degree is considered further. The central idea is that experience segments with higher learning value should be learned more deeply. The learning degree of experience segments of different value is controlled by adjusting the number of times an experience segment is learned during training. Specifically, with the priority as the weight coefficient of the number of learning passes, the number of times LT that experience segment e_i is learned in one training step is defined as follows:
LT(e_i) = clip[ p(e_i)·N_ltmax , N_ltmin , N_ltmax ]
where N_ltmin and N_ltmax are the lower and upper limits of the number of learning passes for an experience segment, and clip clamps the rounded number of learning passes to the range [N_ltmin, N_ltmax]. Care should be taken to keep the learning degree moderate and avoid over-fitting. In addition, the optimal-response part of the NFSP algorithm learns in batches, and the priorities of the experience segments in a sampled batch may differ, so the average of the priorities of all experience segments in the batch is used, and the number of learning passes LT for one batch training is defined as shown in the following formula:
LT(e) = clip[ ((1/k)·Σ_i p(e_i))·N_ltmax , N_ltmin , N_ltmax ]
where k is the batch size and e = {e_0, e_1, …, e_k} is the sampled batch of experiences.
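A sketch of this priority-weighted learning-times rule for a sampled batch is given below; the default limits for N_ltmin and N_ltmax are illustrative, not values fixed by the patent:

```python
def learning_times(batch_priorities, lt_min=1, lt_max=4):
    """Number of learning passes LT over one sampled batch: the mean batch priority
    (the p(e_i) values used for sampling) scales lt_max, then the rounded result is
    clipped to [lt_min, lt_max]."""
    mean_p = sum(batch_priorities) / len(batch_priorities)
    lt = round(mean_p * lt_max)
    return max(lt_min, min(lt_max, lt))

# illustrative training-loop fragment:
# for _ in range(learning_times(priorities_of_sampled_batch)):
#     take_one_gradient_step_on(batch)
```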
Priority-weighted learning-degree control and prioritized experience sampling together form the prioritized experience replay mechanism. Combining the NFSP algorithm with this mechanism, the NFSP-PER-LT algorithm is further proposed on the basis of NFSP-PER: batches are sampled according to the priorities of the experience segments, giving a better organization of the learning order, while the number of learning passes during training is determined by the value of the sampled batch, giving better control of the learning degree.
S5: when multi-agent games are played, a Markov decision process (MDP) is used to re-model the extensive game so as to simulate the multi-agent game environment.
According to the correspondence between the extensive-form game (EFG) six-tuple <N, H, P, σ, μ, I> and the Markov decision process (MDP) five-tuple <S, A, P̂, R, γ>, an EFG can be modeled by an MDP. For the multi-player case, the player set N, the game state set H and the player function P of the EFG correspond to the state set S and the state transition function P̂ of the MDP; the legal action set A(h) of a state h in the EFG corresponds to the action set A of the MDP; a player strategy σ in the EFG corresponds to the agent strategy π of the MDP; and the payoff function of the EFG corresponds to the return G obtained from R and γ of the MDP. Among these correspondences, the correspondence of states is the most intricate, and extracting and storing the correct experience segments according to the state transitions is an important basis for realizing the NFSP multi-player game algorithm.
The NFSP agent achieves a gradual improvement of its game level from scratch by playing against virtual opponents in the environment. In the concrete implementation of the algorithm, a game environment suitable for multiple players is configured during training, and several NFSP agents play against each other in it. The agents start from randomly initialized strategies, generate experience segments while playing against each other, and improve their game level by learning from these experience segments. To model the multi-player game correctly, for each agent the N-1 opponent agents other than itself jointly determine an MDP; from the perspective of a single game agent, it simply interacts with the environment, continuously generating experience segments, and by learning from them it obtains an approximate solution of the MDP, i.e., the optimal response to the N-1 opponent agents. The concrete interaction pattern is shown in fig. 4: the strategy profile of agents 1, 2 and 3 (all agents other than agent 0) determines a Markov decision process; agent 0 perceives the environment state s and takes an action, after which the state transfers to s_mid0; agent 1 then perceives s_mid0 and acts, transferring the state to s_mid1; and so on until agent 3 acts and the state transfers from s_mid2 to s', which is then perceived by agent 0. Although three intermediate states are passed through between s and s', for agent 0 the state s' is simply the next state after s: all intermediate transitions are regarded as internal state transitions of the MDP, determined by its state transition function, equivalent to the state change that occurs after the environment reacts to the action. In this way a multi-player game is regarded as a two-player game between a single agent and the environment.
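The re-modelling described above can be realized as a thin environment wrapper: from one agent's point of view, every move made by the N-1 opponents between two of its own turns is folded into the environment's internal state transition. The sketch below assumes a generic turn-based game interface (reset, current_player, step, observe, is_over, payoff); these names are illustrative, not the patent's API:

```python
class SingleAgentView:
    """Wrap an N-player turn-based game so that agent `me` sees an ordinary MDP:
    opponents' actions become part of the environment's internal state transition."""

    def __init__(self, game, opponents, me):
        self.game = game            # assumed turn-based game environment
        self.opponents = opponents  # dict: player_id -> policy(observation) for the N-1 opponents
        self.me = me

    def _roll_to_my_turn(self):
        # let opponents act until it is `me`'s turn or the game ends (internal s_mid states)
        while not self.game.is_over() and self.game.current_player() != self.me:
            pid = self.game.current_player()
            self.game.step(self.opponents[pid](self.game.observe(pid)))

    def reset(self):
        self.game.reset()
        self._roll_to_my_turn()
        return self.game.observe(self.me)

    def step(self, action):
        self.game.step(action)
        self._roll_to_my_turn()
        done = self.game.is_over()
        reward = self.game.payoff(self.me) if done else 0.0   # sparse terminal reward
        return self.game.observe(self.me), reward, done
```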
S6: on the premise that each agent has a private memory bank, a common memory bank is further added to improve the learning effect; the private memory bank independently stores the experience segments related to a single agent, and the common memory bank stores all experience segments generated during the game.
Compared with the two-agent case, a game among several agents must ensure that the correct experience segments are extracted and stored from the interaction experience sequence of the MDP, so that every game agent can improve its strategy by learning from them. Meanwhile, considering that several NFSP agents may need to share experience during training, an experience-sharing mechanism is constructed. In the concrete implementation, two experience-segment storage schemes are set up: one places an independent experience-segment memory bank inside each agent to store only the experience segments related to that agent, and the other places a common experience-segment memory bank in the environment to store all experience segments generated during the game, from which all agents can selectively learn. In addition, a training environment and an evaluation environment are distinguished: the training environment is dedicated to several NFSP agents playing against each other to improve their game level, and the evaluation environment is dedicated to playing a trained NFSP agent against other agents to evaluate its game level. The extraction and storage of experience segments is shown in fig. 5.
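One possible arrangement of the private and common memory banks is sketched below; the class and method names, as well as the mirroring of every segment into the common bank, are illustrative assumptions:

```python
from collections import deque

class DualMemory:
    """Each agent keeps a private bank of its own experience segments, and every
    stored segment is also mirrored into one common bank shared by all agents."""

    def __init__(self, num_agents, private_size=30000, common_size=100000):
        self.private = [deque(maxlen=private_size) for _ in range(num_agents)]
        self.common = deque(maxlen=common_size)

    def store(self, agent_id, segment):
        self.private[agent_id].append(segment)
        self.common.append((agent_id, segment))

    def sample_private(self, agent_id, k, rng):
        return rng.sample(list(self.private[agent_id]), k)

    def sample_common(self, k, rng):
        return rng.sample(list(self.common), k)
```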
In addition, the game system itself needs appropriate adjustment, including the judging of game results and the rotation of turns. The result of a two-player zero-sum game is a pure win or a pure loss, so distributing the reward is relatively simple: only one player's reward needs to be determined, and the other player's reward is its opposite. The multi-player situation is quite different: there may be more than one winner, loser or tied player. Therefore, for the multi-player game, the judging of wins and losses is adjusted so that the numbers of winning, losing and tied players are determined separately and the rewards are distributed accordingly. Meanwhile, the turn rotation also needs to be changed from a simple two-state switch to multi-state switching, and the boundary conditions at switching are adjusted accordingly.
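The adjusted result judging and multi-state turn rotation can be sketched as follows; the chip and pot semantics are simplified, and the function names are assumptions:

```python
def distribute_pot(pot, scores, active_players):
    """Split the pot among the best-scoring active players; ties share equally,
    everyone else receives nothing (their loss is the stake already in the pot)."""
    best = max(scores[p] for p in active_players)
    winners = [p for p in active_players if scores[p] == best]
    share = pot / len(winners)
    return {p: (share if p in winners else 0.0) for p in active_players}

def next_player(current, folded, num_players):
    """Multi-state turn rotation: advance cyclically, skipping players who folded."""
    p = (current + 1) % num_players
    while p in folded:
        p = (p + 1) % num_players
    return p
```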
In a more specific embodiment of the application, the technique is applied to a real game scenario to realize a multi-agent game, as follows. The features of the environment in which the agent is located are extracted from the game scene and encoded into a feature vector that serves as the input of the Q-value network; the feature vector contains the private card information, the community card information and the historical action sequences of all game participants over the whole game. The network body adopts a Q-value convolutional neural network structure with 3 hidden layers of 64 neurons each; the reinforcement learning rate is 0.1 with mean squared error as the loss function, the supervised learning rate is 0.005 with cross-entropy as the loss function, and all activation functions are ReLU. When solving the strategy, the NFSP framework is used as a whole, and the optimal response of NFSP is computed with the DQN algorithm. When the memory banks are updated by sampling experience segments, the prioritized experience replay mechanism is used, with expectation parameter η = 0.1 and discount factor γ = 0.99. When experience segments are stored, a summation tree is constructed; the size of the optimal-response memory bank M_B is 30000 and the size of the average-strategy memory bank M_A is 1000000. The priority-weighted learning-degree control mechanism is used when learning the experience segments in the memory banks. When the multi-agent game is played, the MDP model is used to re-model the extensive game so as to simulate the multi-agent game environment. On the premise that each agent has a private memory bank, a common memory bank is further added to improve the learning effect. The final output strategy of the whole system is an N-dimensional vector, where N is the number of legal actions the agent can take; the vector represents the probability distribution over the agent's actions, and the agent samples from this distribution to obtain the actual action.
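Collecting the embodiment's stated hyper-parameters, a minimal sketch of the value-network body is given below. The embodiment calls the body a Q-value convolutional neural network, so the fully connected stack used here (three hidden layers of 64 ReLU units, as stated) should be read as an illustrative simplification rather than the patented architecture:

```python
import torch.nn as nn

def build_q_network(input_dim, num_actions):
    """Three hidden layers of 64 units with ReLU, as stated in the embodiment."""
    return nn.Sequential(
        nn.Linear(input_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, num_actions),
    )

# hyper-parameters taken from the embodiment
CONFIG = {
    "rl_learning_rate": 0.1,      # DQN (optimal response), MSE loss
    "sl_learning_rate": 0.005,    # average policy, cross-entropy loss
    "eta": 0.1,                   # expectation (anticipatory) parameter
    "gamma": 0.99,                # discount factor
    "m_b_size": 30_000,           # optimal-response memory bank M_B
    "m_a_size": 1_000_000,        # average-strategy memory bank M_A
}
```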
The beneficial effects of the present invention will be illustrated by the following experiments:
1. Experimental setup:
To verify the effectiveness of the invention, two-player and multi-player game agents were constructed on different evaluation platforms, and experiments were carried out from the two aspects of agent training and evaluation.
The experimental details and the main agent parameters are shown in tables 1, 2 and 3 below:
Table 1: Experimental design details
[Table omitted: provided as an image in the original document.]
Table 2: Key parameters of the game agents
[Table omitted: provided as an image in the original document.]
Table 3: Network parameters of the game agents
[Table omitted: provided as an image in the original document.]
2. Existing methods:
(1) NFSP-UCT: the latest NFSP algorithm implemented by Harbin Institute of Technology, which uses UCT to solve the optimal response and gains an advantage of 0.4667 kmbb/g over the original NFSP algorithm.
(2) Deep CFR: a CFR algorithm combined with a neural network, proposed by Carnegie Mellon University, which obtains counterfactual values directly from the network and then derives the game strategy by regret minimization.
3. Experimental results:
The two-player game evaluation platforms are Leduc and HUNL, and the evaluated agents are based on the improved algorithms NFSP-PER and NFSP-PER-LT. The win/loss record against opponent agents and the average return won per game are used as evaluation indices: the higher the win rate and the larger the return, the higher the game level. Three different opponent agents were configured in the experiments: the NFSP-UCT agent, constructed from the NFSP-UCT algorithm; the Random agent, the official two-player sample agent of the ACPC competition; and the Deep CFR agent, constructed from the CFR algorithm combined with a neural network proposed by Carnegie Mellon University.
(1) Evaluation by play during training: the NFSP-PER, NFSP-PER-LT and NFSP-UCT agents were each trained for 50 000 rounds on the Leduc and HUNL platforms. During training, a play-off evaluation against the Random agent was carried out every 1000 rounds, with 10 000 games per evaluation. The curves of the return values won by the three agents against the Random agent on the Leduc and HUNL platforms are shown in fig. 6(a) and fig. 6(b).
Qualitative analysis of fig. 6 shows that in Leduc the game levels of the three NFSP two-player game agents during training differ little, with NFSP-PER and NFSP-PER-LT slightly superior. In HUNL, the game levels of NFSP-PER and NFSP-PER-LT are greatly improved relative to NFSP-UCT, and in the later stage of training NFSP-PER-LT wins the most return and reaches the highest level.
(2) Evaluation after training: the NFSP-PER and NFSP-PER-LT agents trained for 50 000 rounds were played on the Leduc and HUNL platforms against the NFSP-UCT agent (trained for 50 000 rounds), the Random agent and the Deep CFR agent (trained for 100 000 iterations), and then against each other. Different random seeds were chosen to reflect the randomness of Texas hold'em, and both agent position settings (1,2) and (2,1) were covered to weaken the influence of seat position on the return won by an agent. The return values won by NFSP-PER and NFSP-PER-LT against NFSP-UCT, Random and Deep CFR in Leduc are shown in tables 4 and 5; those in HUNL are shown in tables 6 and 7.
Table 4: Return values (kmbb/g) won by NFSP-PER in Leduc
[Table omitted: provided as an image in the original document.]
Table 5: Return values (kmbb/g) won by NFSP-PER-LT in Leduc
[Table omitted: provided as an image in the original document.]
Table 6: Return values (kmbb/g) won by NFSP-PER in HUNL
[Table omitted: provided as an image in the original document.]
Table 7: Return values (kmbb/g) won by NFSP-PER-LT in HUNL
[Table omitted: provided as an image in the original document.]
Comparing tables 4, 5, 6 and 7 horizontally shows that:
1) NFSP-PER and NFSP-PER-LT beat Random and Deep CFR by a large margin, the advantage being more pronounced against Random: in Leduc, NFSP-PER and NFSP-PER-LT win 0.7569 kmbb/g and 0.7880 kmbb/g more against Random than against Deep CFR, respectively, and in HUNL they win 6.0781 kmbb/g and 6.0028 kmbb/g more, respectively.
2) The higher returns achieved in the large-scale HUNL than in the small-scale Leduc indicate that NFSP-PER and NFSP-PER-LT are better able to exercise their game level in large-scale games.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
As shown in fig. 7, in one embodiment, an electronic device for the multi-person, large-scale incomplete information game method based on neural-network fictitious self-play is provided. The electronic device 100 may include a first processor 101, a first memory 102 and a bus, and may further include a computer program stored in the first memory 102 and executable on the first processor 101, such as the multi-party privacy-preserving machine learning program 103.
The first memory 102 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The first memory 102 may in some embodiments be an internal storage unit of the electronic device 100, e.g. a removable hard disk of the electronic device 100. The first memory 102 may also be an external storage device of the electronic device 100 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 100. Further, the first memory 102 may also include both an internal storage unit and an external storage device of the electronic device 100. The first memory 102 may be used to store not only application software installed in the electronic device 100 and various types of data, such as codes of the multi-party privacy protecting machine learning program 103, but also temporarily store data that has been output or will be output.
The first processor 101 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The first processor 101 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 100 by running or executing programs or modules (e.g., federal learning defense programs, etc.) stored in the first memory 102 and calling data stored in the first memory 102.
Fig. 7 shows only an electronic device with components, and those skilled in the art will appreciate that the structure shown in fig. 7 does not constitute a limitation of the electronic device 100, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
The multiparty privacy preserving machine learning program 103 stored in the first memory 102 of the electronic device 100 is a combination of instructions that, when executed in the first processor 101, may implement:
the agent plays against a virtual opponent in the game environment to generate experience segments, which are stored in the best-response memory bank for training and updating the best-response network; if the agent selects the best response according to the probability, the best-response experience segment is also stored in the average-strategy memory bank for learning and updating the average strategy; the best response is realized by the reinforcement learning algorithm DQN within the neural network virtual self-game (NFSP) algorithm, and the average strategy is obtained by supervised learning within the NFSP algorithm (these steps are illustrated by a code sketch after the last step below);
the agent plays in the game environment and accumulates experience segments; when the experience segments in the average-strategy memory bank reach a certain number, training and improvement of the agent's output strategy begin;
a prioritized experience sampling mechanism is introduced into the best-response network of the NFSP algorithm: priorities are set according to the learning value of each experience segment to filter the experience segments in the best-response memory bank, the experience segments with their added priority attribute are stored in a summation tree, and in the sampling stage the priority is used as the criterion so that valuable experience segments are learned first; after learning, the priorities are recalculated and the summation tree is updated;
a priority-weighted learning-degree control mechanism is adopted to optimize how thoroughly each experience segment is learned, controlling the learning degree of experience segments of different value by adjusting the number of times each experience segment is learned during training;
when the multi-agent game is played, a Markov decision process (MDP) is used to re-model the extensive-form game so as to simulate the multi-agent game environment;
on the premise that each agent has a private memory bank, a common memory bank is further added to improve the learning effect; the private memory bank independently stores the experience segments related to a single agent, while the common memory bank stores all experience segments generated during the game.
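A minimal Python sketch of these steps is given below. The class and member names (NFSPAgent, memory_br, memory_avg) and the stub policy callables are illustrative assumptions, not the patented implementation; only the mixed selection between best response and average strategy, and the routing of experience segments into the two memory banks, follow the description above. The default values of η, MB and MA match the figures recited in claim 1.

```python
import random
from collections import deque

class NFSPAgent:
    """Sketch of the dual-memory agent: a best-response policy (DQN in the
    method) and an average strategy (supervised network in the method),
    mixed according to an anticipatory probability eta."""

    def __init__(self, best_response, average_strategy, eta=0.1,
                 mb_size=30000, ma_size=1000000):
        self.best_response = best_response        # stand-in for the DQN policy
        self.average_strategy = average_strategy  # stand-in for the supervised policy
        self.eta = eta                            # probability of playing the best response
        self.memory_br = deque(maxlen=mb_size)    # best-response memory bank (MB)
        self.memory_avg = deque(maxlen=ma_size)   # average-strategy memory bank (MA)

    def act(self, state):
        # With probability eta play the best response; its behaviour is also
        # recorded in MA for supervised learning of the average strategy.
        if random.random() < self.eta:
            action = self.best_response(state)
            self.memory_avg.append((state, action))
        else:
            action = self.average_strategy(state)
        return action

    def observe(self, state, action, reward, next_state, done):
        # Every transition is stored in MB for DQN training of the best response.
        self.memory_br.append((state, action, reward, next_state, done))

# Example with stub policies that ignore the state:
agent = NFSPAgent(best_response=lambda s: "raise", average_strategy=lambda s: "call")
print(agent.act(state=None))
```

With η = 0.1 the agent mostly follows its average strategy, only occasionally playing (and recording) the best response, which is what lets the average strategy converge toward the long-run behaviour.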
Further, the modules/units integrated in the electronic device 100 may be stored in a non-volatile computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination is not contradictory, it should be considered within the scope of this specification.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention should be regarded as an equivalent replacement and falls within the scope of the present invention.

Claims (10)

1. A multi-person, large-scale incomplete information game method based on neural network virtual self-game, characterized by comprising the following steps:
the agent plays against a virtual opponent in the game environment to generate experience segments, which are stored in the best-response memory bank for training and updating the best-response network; if the agent selects the best response according to the probability, the best-response experience segment is also stored in the average-strategy memory bank for learning and updating the average strategy; the best response is realized by the reinforcement learning algorithm DQN within the neural network virtual self-game (NFSP) algorithm, and the average strategy is obtained by supervised learning within the NFSP algorithm;
the agent plays in the game environment and accumulates experience segments; when the experience segments in the average-strategy memory bank reach a certain number, training and improvement of the agent's output strategy begin;
a prioritized experience sampling mechanism is introduced into the best-response network of the NFSP algorithm: priorities are set according to the learning value of each experience segment to filter the experience segments in the best-response memory bank, the experience segments with their added priority attribute are stored in a summation tree, and in the sampling stage the priority is used as the criterion so that valuable experience segments are learned first; after learning, the priorities are recalculated and the summation tree is updated;
a priority-weighted learning-degree control mechanism is adopted to optimize how thoroughly each experience segment is learned, controlling the learning degree of experience segments of different value by adjusting the number of times each experience segment is learned during training;
when the multi-agent game is played, a Markov decision process (MDP) is used to re-model the extensive-form game so as to simulate the multi-agent game environment;
on the premise that each agent has a private memory bank, a common memory bank is further added to improve the learning effect; the private memory bank independently stores the experience segments related to a single agent, while the common memory bank stores all experience segments generated during the game;
the multi-person, large-scale incomplete information game method based on neural network virtual self-game is applied to a multi-player game to realize a multi-agent game, implemented as follows: the features of the environment in which the agent is located are extracted according to the game scene and encoded into a feature vector used as the input of the Q-value network; the feature vector comprises the private card information, the public card information and the historical action sequences of all game participants over the whole game; the main body of the network adopts a Q-value convolutional neural network structure with 3 hidden layers of 64 neurons each, the reinforcement learning rate is 0.1 with mean squared error as the loss function, the supervised learning rate is 0.005 with cross entropy as the loss function, and all activation functions are ReLU; when solving the strategy, the NFSP framework is used overall and the best response of NFSP is computed with the DQN algorithm; when sampling experience segments to update the memory banks, a prioritized experience replay mechanism is used; the anticipatory parameter η is 0.1 and the discount factor γ is 0.99; when storing experience segments, a summation tree is constructed, the size of the best-response memory bank MB is 30000, and the size of the average-strategy memory bank MA is 1000000; a priority-weighted learning-degree control mechanism is used when learning the experience segments in the memory banks; when the multi-agent game is played, the MDP model is used to re-model the extensive-form game so as to simulate the multi-agent game environment; on the premise that each agent has a private memory bank, a common memory bank is further added to improve the learning effect; the final output strategy of the whole system is an L-dimensional vector, where L denotes the number of legal actions the agent can take; the L-dimensional vector represents the probability distribution over the agent's actions, and the agent samples from this probability distribution to obtain its actual action.
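For illustration, the hyperparameters recited in claim 1 can be collected into a configuration sketch. PyTorch is used here only as an example framework; the input dimension of 128, the use of fully connected layers in place of the convolutional structure, and the choice of SGD as optimiser are assumptions not taken from the claim.

```python
import torch
import torch.nn as nn

# Hyperparameters as recited in claim 1 (key names are illustrative).
CONFIG = {
    "hidden_layers": 3,
    "hidden_units": 64,
    "rl_learning_rate": 0.1,      # best-response (DQN) learning rate
    "sl_learning_rate": 0.005,    # average-strategy (supervised) learning rate
    "anticipatory_eta": 0.1,
    "discount_gamma": 0.99,
    "mb_size": 30_000,            # best-response memory bank MB
    "ma_size": 1_000_000,         # average-strategy memory bank MA
}

def build_network(input_dim, num_actions, cfg=CONFIG):
    """Stand-in for the Q-value / policy network: 3 hidden layers of 64 ReLU
    units; the claim mentions a convolutional structure, a plain MLP is used
    here purely for illustration."""
    layers, width = [], input_dim
    for _ in range(cfg["hidden_layers"]):
        layers += [nn.Linear(width, cfg["hidden_units"]), nn.ReLU()]
        width = cfg["hidden_units"]
    layers.append(nn.Linear(width, num_actions))  # L-dimensional output
    return nn.Sequential(*layers)

# Best response trained with MSE, average strategy with cross entropy,
# as stated in the claim; input_dim=128 and num_actions=4 are placeholders.
q_net = build_network(input_dim=128, num_actions=4)
policy_net = build_network(input_dim=128, num_actions=4)
rl_loss, sl_loss = nn.MSELoss(), nn.CrossEntropyLoss()
rl_opt = torch.optim.SGD(q_net.parameters(), lr=CONFIG["rl_learning_rate"])
sl_opt = torch.optim.SGD(policy_net.parameters(), lr=CONFIG["sl_learning_rate"])
```

The policy network's L-dimensional output, passed through a softmax, gives the action probability distribution from which the agent samples its actual action.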
2. The multi-person, large-scale incomplete information game method based on neural network virtual self-game as claimed in claim 1, wherein the agent adopts a mixed action strategy to play against virtual opponents in the game environment, that is, it selects between the best response and the average strategy according to a probability governed by a dynamic anticipatory parameter, so as to counter the average strategy of the virtual opponents.
3. The multi-person, large-scale incomplete information game method based on neural network virtual self-game as claimed in claim 1, wherein the agent plays in the game environment and accumulates experience segments, and when the experience segments in the average-strategy memory bank reach a certain number, training and improvement of the agent's output strategy begin, specifically:
the parameters of the best-response network are updated by sampling experience segments from the best-response memory bank; when the best-response experience segments accumulated in the average-strategy memory bank reach a certain number, an update of the average-strategy network is triggered, and the best-response experience segments are sampled for supervised learning to fit the best behaviour of the game agent; the strategy is continuously trained and improved, and according to the theoretical convergence guarantee the average strategy gradually converges to an approximate Nash equilibrium;
the priority of an experience segment is measured by the magnitude of its TD-error: the larger the TD-error, the more the experience segment is worth learning and the higher its priority; the priority is expressed as follows:
p(e_i) = (|δ_i| + ε)^α / Σ_k (|δ_k| + ε)^α
wherein p(e_i) denotes the priority, δ_i denotes the TD-error of experience segment e_i, α ∈ [0,1] controls the influence of the TD-error (when α = 0 the sampling degrades to simple uniform random sampling), k indexes the sampled batch whose size gives the range of the summation, and ε is a small positive number that avoids zero-priority experience segments and ensures that every experience segment can be sampled.
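A minimal sketch of this priority computation, assuming illustrative values α = 0.6 and ε = 1e-6 (the claim fixes neither):

```python
import numpy as np

def priorities_from_td_errors(td_errors, alpha=0.6, eps=1e-6):
    """p(e_i) = (|delta_i| + eps)^alpha / sum_k (|delta_k| + eps)^alpha.
    alpha = 0 recovers uniform random sampling; eps keeps every segment
    sampleable even when its TD-error is zero."""
    scaled = (np.abs(td_errors) + eps) ** alpha
    return scaled / scaled.sum()

# Larger |TD-error| -> larger sampling probability.
print(priorities_from_td_errors(np.array([0.5, 0.1, 2.0])))
```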
4. The multi-person, large-scale incomplete information game method based on neural network virtual self-game as claimed in claim 3, wherein the deviation of the probability distribution caused by prioritized experience sampling is corrected by an annealing algorithm: by attaching an importance-sampling weight (ISW) coefficient to the TD-error, sampling from the original distribution P_A is equivalently converted into sampling from the new distribution P_B, completing the correction; the corrected final weight is:
ω(e_i) = p(e_i)^(−β) / max_j p(e_j)^(−β)
In the above formula, since p(e)^(−β) decreases monotonically as p(e) increases, the maximum of p(e)^(−β) corresponds to the minimum of p(e); β denotes the annealing coefficient.
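A sketch of the importance-sampling weight, written in the usual prioritised-replay form that includes the number of segments N; because the weights are normalised by their maximum, the N factor cancels and the result agrees with the formula above (β = 0.4 is an illustrative value, and β is typically annealed towards 1 during training):

```python
import numpy as np

def importance_weights(probs, beta=0.4):
    """w_i = (N * p(e_i))^(-beta), normalised by the maximum weight so that
    all weights lie in (0, 1]; the maximum corresponds to the smallest p(e)."""
    n = len(probs)
    w = (n * np.asarray(probs)) ** (-beta)
    return w / w.max()

print(importance_weights([0.7, 0.2, 0.1], beta=0.4))
```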
5. The multi-person, large-scale incomplete information game method based on neural network virtual self-game as claimed in claim 1, wherein the summation tree is constructed by:
initializing the values of the tree nodes to zero; when an experience segment is stored in the memory bank, a leaf node stores the experience segment and its priority, and the priority data stored in its ancestor nodes are then updated upward layer by layer; when sampling n experience samples, the sum of the priorities is divided evenly into n intervals, one priority value is randomly drawn from each of the n intervals and denoted p_1, p_2, …, p_n, and the corresponding experience samples are then located in the summation tree according to these priorities.
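A minimal array-based summation tree along these lines, with illustrative method names (add, update, get, sample): leaves hold the experience segments and their priorities, internal nodes hold the sums of their children, and sampling splits the total priority into n equal intervals.

```python
import random

class SumTree:
    """Array-based summation tree: leaves store (priority, experience),
    internal nodes store the sum of their children's priorities."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)   # node values, initialised to zero
        self.data = [None] * capacity
        self.write = 0                            # next leaf slot to fill

    def add(self, priority, experience):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                          # propagate the change upwards
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, s):
        idx = 0
        while 2 * idx + 1 < len(self.tree):       # descend until a leaf is reached
            left = 2 * idx + 1
            idx = left if s <= self.tree[left] else left + 1
            if idx == left + 1:
                s -= self.tree[left]
        return self.tree[idx], self.data[idx - self.capacity + 1]

    def sample(self, n):
        segment = self.tree[0] / n                # split total priority into n intervals
        return [self.get(random.uniform(i * segment, (i + 1) * segment))
                for i in range(n)]

tree = SumTree(capacity=4)
for pr, exp in [(1.0, "e1"), (2.0, "e2"), (3.0, "e3")]:
    tree.add(pr, exp)
print(tree.sample(2))   # two (priority, experience) pairs drawn by priority
```

After a batch has been learned, calling update on the sampled leaves with their recomputed priorities refreshes the tree, as described in claim 1.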
6. The multi-person, large-scale incomplete information game method based on neural network virtual self-game as claimed in claim 1, wherein the priority-weighted learning-degree control mechanism used to optimize the learning degree of the experience segments is specifically:
the priority is adopted as the weighting coefficient of the number of learning passes, and the number of times LT that experience segment e_i is learned in one training iteration is defined by the following formula:
LT(e_i) = clip[p(e_i)·N_ltmax, N_ltmin, N_ltmax]
wherein N_ltmin and N_ltmax are, respectively, the lower and upper limits of the number of learning passes of an experience segment; clip clamps the rounded number of learning passes to the range [N_ltmin, N_ltmax]; p(e_i) denotes the priority.
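A sketch of this priority-weighted learning-times rule, assuming illustrative bounds N_ltmin = 1 and N_ltmax = 8 (the claim does not fix these values):

```python
def learning_times(priority, n_lt_min=1, n_lt_max=8):
    """LT(e_i) = clip(round(p(e_i) * N_ltmax), N_ltmin, N_ltmax):
    high-priority segments are replayed more often within one training step."""
    lt = round(priority * n_lt_max)
    return max(n_lt_min, min(n_lt_max, lt))

# A segment with priority 0.9 is learned 7 times, one with priority 0.05 once.
print(learning_times(0.9), learning_times(0.05))
```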
7. The multi-person, large-scale incomplete information game method based on neural network virtual self-game as claimed in claim 1, wherein the extensive-form game is re-modeled using a Markov decision process, specifically:
to model the multi-player game correctly, for each agent all the other agents together determine an MDP; from the perspective of a single game agent, the method simply interacts with this environment, continuously generating experience segments during the interaction, and then learns from these experience segments to obtain an approximate solution of the MDP, that is, the best response against all agents other than itself.
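A sketch of this per-agent MDP view: the opponents' policies are folded into an environment wrapper so that a single agent interacts with what looks like an ordinary MDP. The game interface used here (apply, to_act, observe, payoff, terminal) is hypothetical, standing in for whatever game engine the method is attached to.

```python
class SingleAgentView:
    """Wraps a multi-player game so that, from one agent's perspective, the
    other agents' (fixed) policies become part of the environment dynamics."""

    def __init__(self, game, agent_id, opponent_policies):
        self.game = game
        self.agent_id = agent_id
        self.opponents = opponent_policies    # mapping: player id -> policy function

    def step(self, action):
        # Apply our action, then let every opponent act until it is our turn
        # again (or the game ends); their moves are treated as environment noise.
        self.game.apply(self.agent_id, action)
        while not self.game.terminal() and self.game.to_act() != self.agent_id:
            pid = self.game.to_act()
            self.game.apply(pid, self.opponents[pid](self.game.observe(pid)))
        obs = self.game.observe(self.agent_id)
        reward = self.game.payoff(self.agent_id) if self.game.terminal() else 0.0
        return obs, reward, self.game.terminal()
```

Each (obs, action, reward, next_obs) transition produced by such a wrapper is an experience segment that can be stored in the agent's private memory bank, while the full game trajectory goes into the common memory bank.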
8. The multi-person, large-scale incomplete information game method based on neural network virtual self-game as claimed in claim 1, wherein the game environment comprises a training environment and an evaluation environment; the training environment is used for multiple agents to play against each other to improve their level of play, and the evaluation environment is used to play the trained agent against other agents to evaluate its level of play.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor, the computer program instructions being executable by the at least one processor to enable the at least one processor to perform the multi-person, large-scale incomplete information game method based on neural network virtual self-game of any one of claims 1-7.
10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the multi-person, large-scale incomplete information game method based on neural network virtual self-game of any one of claims 1-7.
CN202111303688.7A 2021-11-05 2021-11-05 Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game Active CN114048833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111303688.7A CN114048833B (en) 2021-11-05 2021-11-05 Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111303688.7A CN114048833B (en) 2021-11-05 2021-11-05 Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game

Publications (2)

Publication Number Publication Date
CN114048833A CN114048833A (en) 2022-02-15
CN114048833B true CN114048833B (en) 2023-01-17

Family

ID=80207175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111303688.7A Active CN114048833B (en) 2021-11-05 2021-11-05 Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game

Country Status (1)

Country Link
CN (1) CN114048833B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399920A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学(深圳) A kind of non-perfect information game method, apparatus, system and storage medium based on deeply study
CN110404264A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) It is a kind of based on the virtually non-perfect information game strategy method for solving of more people, device, system and the storage medium self played a game
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112487431A (en) * 2020-12-02 2021-03-12 浙江工业大学 Method for solving optimal steady-state strategy of intrusion detection system based on incomplete information
CN112734014A (en) * 2021-01-12 2021-04-30 山东大学 Experience playback sampling reinforcement learning method and system based on confidence upper bound thought
CN112926744A (en) * 2021-02-22 2021-06-08 中山大学 Incomplete information game method and system based on reinforcement learning and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11500762B2 (en) * 2017-01-11 2022-11-15 Smartlytics Llc System and method for automated intelligent mobile application testing
CN110404265B (en) * 2019-07-25 2022-11-01 哈尔滨工业大学(深圳) Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium
US11489928B2 (en) * 2020-01-28 2022-11-01 Dell Products L.P. Prioritization of processes based on user persona and reinforcement learning
CN112329348B (en) * 2020-11-06 2023-09-15 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399920A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学(深圳) A kind of non-perfect information game method, apparatus, system and storage medium based on deeply study
CN110404264A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) It is a kind of based on the virtually non-perfect information game strategy method for solving of more people, device, system and the storage medium self played a game
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112487431A (en) * 2020-12-02 2021-03-12 浙江工业大学 Method for solving optimal steady-state strategy of intrusion detection system based on incomplete information
CN112734014A (en) * 2021-01-12 2021-04-30 山东大学 Experience playback sampling reinforcement learning method and system based on confidence upper bound thought
CN112926744A (en) * 2021-02-22 2021-06-08 中山大学 Incomplete information game method and system based on reinforcement learning and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Scalable sub-game solving for imperfect-information games;Huale Li 等;《Knowledge-Based Systems》;20210826;第231卷;全文 *
Solving Six-Player Games via Online Situation Estimation;Huale Li 等;《2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI)》;20200213;全文 *
TLeague: A Framework for Competitive Self-Play based Distributed Multi-Agent Reinforcement Learning;Peng Sun 等;《https://arxiv.org/abs/2011.12895》;20201130;全文 *
Artificial Intelligence and "StarCraft": New Advances in Multi-Agent Game Research; Zhang Hongda (张宏达) et al.; Unmanned Systems Technology (无人系统技术); 20190731 (No. 1); full text *

Also Published As

Publication number Publication date
CN114048833A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
CN110404265B (en) Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium
CN111291890B (en) Game strategy optimization method, system and storage medium
CN110404264B (en) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
Billings et al. Opponent modeling in poker
Mizukami et al. Building a computer Mahjong player based on Monte Carlo simulation and opponent models
Ponsen et al. Integrating opponent models with monte-carlo tree search in poker
CN112274925B (en) AI model training method, calling method, server and storage medium
CN111729300A (en) Monte Carlo tree search and convolutional neural network based Doudizhu (Fight the Landlord) strategy research method
CN108970119A (en) The adaptive game system strategic planning method of difficulty
CN115033878A (en) Rapid self-game reinforcement learning method and device, computer equipment and storage medium
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
Wang et al. Warm-start AlphaZero self-play search enhancements
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
Rebstock et al. Learning policies from human data for skat
CN114048833B (en) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
Teófilo et al. Computing card probabilities in Texas Hold'em
CN111905373A (en) Artificial intelligence decision method and system based on game theory and Nash equilibrium
Zhao et al. Towards a Competitive 3-Player Mahjong AI using Deep Reinforcement Learning
Dockhorn et al. A decision heuristic for Monte Carlo tree search doppelkopf agents
Papahristou et al. Training neural networks to play backgammon variants using reinforcement learning
Wang et al. A model based ranking system for soccer teams
Hauk Search in trees with chance nodes
Yen et al. The art of the Chinese dark chess program DIABLE
Langenhoven et al. Swarm tetris: Applying particle swarm optimization to tetris
Aljaafreh et al. Development of a computer player for seejeh (aka seega, siga, kharbga) board game with deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant