CN110404265B - Multi-player imperfect-information machine game method, device, system, and storage medium based on online endgame solving - Google Patents

Multi-player imperfect-information machine game method, device, system, and storage medium based on online endgame solving

Info

Publication number
CN110404265B
CN110404265B (application CN201910676451.XA)
Authority
CN
China
Prior art keywords
game
player
card
strategy
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910676451.XA
Other languages
Chinese (zh)
Other versions
CN110404265A (en)
Inventor
王轩
漆舒汉
蒋琳
李化乐
李焜炽
廖清
张加佳
贾丰玮
刘洋
夏文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201910676451.XA
Publication of CN110404265A
Application granted
Publication of CN110404265B
Legal status: Active
Anticipated expiration

Classifications

    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 - Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67 - Generating or modifying game content before or while executing the game program, adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 - Methods for processing data by generating or executing the game program
    • A63F2300/6027 - Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment


Abstract

The invention provides a multi-player imperfect-information machine game method, device, and system based on online endgame solving, and a storage medium. The method comprises the following steps. Step 1: first perform card abstraction in real time according to a card abstraction algorithm. Step 2: if S is not a game state in which it is the agent's turn to act, update the strategy σ of each player. Step 3: wait for the player who must act in the current game state to take some action, after which the game proceeds; if S is a game state in which it is the agent's turn to act, first update the players' hand distributions, build the subgame tree, compute the strategy σ of the current game state, and then have the agent take an action a according to σ before the game continues. The beneficial effects of the invention are: compared with existing algorithms, the method is more flexible and more broadly applicable, suits real-world game scenarios, and can compute a corresponding strategy for each different real game state.

Description

Multi-player imperfect-information machine game method, device, system, and storage medium based on online endgame solving
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a multi-player imperfect-information machine game method, device, system, and storage medium based on online endgame solving.
Background
Machine game playing occupies an important place in artificial intelligence research and has served as a touchstone for verifying AI theory since the field's inception. The theory and techniques of machine game playing are widely applied to many practical problems, such as financial games, traffic dispersion control, and airport and network security. Machine game playing is a research field formed by the cross-fertilization of game theory and computer science, and can be divided into perfect-information and imperfect-information machine games according to whether all information about the game state is visible to every party in the game.
Compared with perfect-information games, games in the real economy and society more often involve imperfect information, and imperfect-information machine games have wider application scenarios, so research on them has received increasing attention. Texas Hold'em is a typical large-scale imperfect-information game: it exhibits all the characteristics of imperfect-information games, demands considerable skill, and is less influenced by luck than other poker variants, so it has long served as an important platform for imperfect-information machine game research. Variants are usually distinguished by the number of players and by whether betting is limited. By number of players, Texas Hold'em divides into two-player and multi-player games; by betting action, into limit and no-limit Texas Hold'em.
Research on two-player Texas Hold'em machine game playing has achieved many breakthroughs. In 2015, Bowling's team at the University of Alberta proposed an algorithm for computing approximate Nash equilibria based on state abstraction and counterfactual regret minimization, essentially solving two-player limit Texas Hold'em. In 2016 the University of Alberta proposed the DeepStack algorithm and developed an agent that defeated professional players at two-player no-limit Texas Hold'em. In 2017 Carnegie Mellon University developed Libratus, which defeated top human professionals at two-player no-limit Texas Hold'em. In real life, however, imperfect-information games usually involve more than two players, and research on multi-player no-limit Texas Hold'em is still in its infancy and remains a challenging task. Multi-player no-limit Texas Hold'em differs from the two-player game in two main characteristics: first, the game state space grows exponentially with the number of players; second, a Nash equilibrium is no longer the optimal solution of multi-player no-limit Texas Hold'em, and at present there is no exact definition of its optimal solution. Methods for solving two-player no-limit Texas Hold'em strategies therefore cannot be directly extended to the multi-player case.
Disclosure of Invention
The invention provides a multi-player imperfect-information machine game method based on online endgame solving, comprising the following steps:
step 1: when solving the strategy of the current game state S, if S is the beginning of a new betting round, first perform card abstraction in real time according to a card abstraction algorithm;
step 2: if S is not a game state in which it is the agent's turn to act, update the strategy σ of each player; first, if As is not empty, use the previous strategy σ and the previous player hand distribution D to update D via Bayes' rule, where As denotes the action sequence occurring between the beginning of the betting round, or the agent's last action, and the end of the betting round or the agent's next action; then compute and update each player's strategy σ with the counterfactual regret minimization (CFR) or UCT algorithm;
step 3: wait for the player who must act in the current game state to take some action, after which the game proceeds; if S is a game state in which it is the agent's turn to act, first update the players' hand distributions, build the subgame tree, compute the strategy σ of the current game state, and then have the agent take an action a according to σ before the game continues.
As a further improvement of the present invention, in step 1 real-time card abstraction is performed by a real-time card abstraction algorithm, which comprises: first, hand isomorphism classes and the strengths of the various hand combinations are computed offline and stored, so that they can be used directly to accelerate real-time card abstraction; when real-time card abstraction is performed, different abstractions are used according to the position of the current game state in the original game tree. If the current game state is in the preflop round, the offline-computed hand isomorphism is used directly; at this point there are C(52,2) = 1326 hand combinations in total, which isomorphism reduces to 169. If the current game state is in the flop, turn, or river round, the dealt community cards are removed, the strengths of all remaining possible hand combinations are read from the offline-computed hand strength file, and the hands are finally clustered with the k-means algorithm.
The invention also provides a multi-player imperfect-information machine game device based on online endgame solving, comprising:
a first processing module, configured to: when solving the strategy of the current game state S, if S is the beginning of a new betting round, first perform card abstraction in real time according to a card abstraction algorithm;
a second processing module, configured to: if S is not a game state in which it is the agent's turn to act, update the strategy σ of each player; first, if As is not empty, use the previous strategy σ and the previous player hand distribution D to update D via Bayes' rule, where As denotes the action sequence occurring between the beginning of the betting round, or the agent's last action, and the end of the betting round or the agent's next action; then compute and update each player's strategy σ with the CFR or UCT algorithm;
a third processing module, configured to: wait for the player who must act in the current game state to take some action, after which the game proceeds; if S is a game state in which it is the agent's turn to act, first update the players' hand distributions, build the subgame tree, compute the strategy σ of the current game state, and then have the agent take an action a according to σ before the game continues.
The invention also provides a multi-player imperfect-information machine game system based on online endgame solving, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the invention when invoked by the processor.
The invention also provides a computer-readable storage medium storing a computer program configured to implement the steps of the method of the invention when invoked by a processor.
The beneficial effects of the invention are: compared with existing algorithms, the method is more flexible and more broadly applicable, suits real-world game scenarios, and can compute a corresponding strategy for each different real game state.
Drawings
FIG. 1 is a schematic diagram of strategy solving based on state space abstraction;
FIG. 2 is a schematic diagram of clustering betting actions with similar chip amounts;
FIG. 3 is a diagram of mapping x to actions A and B according to the mapping probabilities;
FIG. 4 is a graph of the win-rate distribution of the six of clubs with the six of diamonds;
FIG. 5 is a graph of the win-rate distribution of the king of clubs with the queen of clubs;
FIG. 6 is a flow chart of the CFR algorithm;
FIG. 7 is a schematic illustration of endgame solving;
FIG. 8 is a schematic diagram of an original game tree;
FIG. 9 is a schematic diagram of an endgame solving subgame tree;
FIG. 10 is a schematic diagram of an augmented subgame tree;
FIG. 11 is an example of a three-player betting game;
FIG. 12 is a diagram of the online strategy solving framework;
FIG. 13 is a schematic diagram of the real-time card abstraction scheme;
FIG. 14 is an example of a two-player imperfect-information game;
FIG. 15 is a flow diagram of the MCTS algorithm;
FIG. 16 is a diagram of the ACPC competition framework;
FIG. 17 is a schematic diagram of an agent;
FIG. 18 is a schematic diagram of the improved leaf-node payoff estimation;
FIG. 19 is a schematic diagram of a warm-started agent;
FIG. 20 is a graph of experimental results of the LKC_2p_plus agent against the conventional PureCFR agent;
FIG. 21 is a graph of experimental results of the LKC_2p_WQD agent against the conventional PureCFR agent;
FIG. 22 is a graph of the results of three-player no-limit Texas Hold'em games between agents using different algorithms and a random agent.
Detailed Description
To address the problem that the state space of multi-player no-limit Texas Hold'em grows exponentially with the number of players, the invention proceeds in two ways. First, the offline abstraction method for the two-player Texas Hold'em state space is extended to the multi-player case and improved into an online abstraction method, reducing the size of the game tree. Second, building on the endgame solving algorithms of two-player Texas Hold'em, an online endgame solving algorithm for multi-player Texas Hold'em is proposed; it builds a subgame tree for the current endgame rather than the original full game tree, greatly reducing the state space.
To address the problem that a Nash equilibrium is no longer the optimal solution of multi-player no-limit Texas Hold'em, the invention computes and verifies multi-player no-limit Texas Hold'em strategies with two algorithms: first, the game-theoretic concept of strategies that are not strictly dominated is introduced, and such strategies are computed with the counterfactual regret minimization algorithm; second, strategies are computed with Monte Carlo tree search.
The invention first studies and analyzes the algorithms for two-player Texas Hold'em, investigating in depth the action abstraction algorithm, the card abstraction algorithm, CFR, Monte Carlo tree search, and endgame solving algorithms, and then studies the two main characteristics that distinguish multi-player no-limit Texas Hold'em from the two-player game: first, the game state space grows exponentially with the number of players; second, a Nash equilibrium is no longer the optimal solution. The research examines how the algorithms for solving two-player Texas Hold'em can feasibly be extended to the multi-player game and how the difficulties of multi-player no-limit Texas Hold'em can be overcome. For the exponential growth of the state space, the invention proceeds in two ways: the offline abstraction method for the two-player state space is extended to the multi-player case and improved into an online abstraction method, reducing the game tree size; and, building on the endgame solving algorithms of two-player Texas Hold'em, an online endgame solving algorithm for multi-player Texas Hold'em is proposed that builds a subgame tree for the current endgame rather than the original full game tree, greatly reducing the state space. For the second problem, strategies for multi-player no-limit Texas Hold'em are computed and verified with two algorithms: first, the concept of strategies that are not strictly dominated is introduced and such strategies are computed by counterfactual regret minimization; second, strategies are computed by Monte Carlo tree search.
1.1 state space abstraction:
In the Texas Hold'em game there are three main factors affecting the size of the game state space. The first is the variety of the players' private hands: the more kinds of hands there are, the more possible opponent hands a player must consider, and the larger the number of information sets. The second is the variety of betting actions: in no-limit Texas Hold'em the number of chips bet is nearly continuous, which expands the breadth of the action space and of the game tree. For example, under the ACPC competition rules each player has 20,000 chips; when the small blind begins to act, it can fold or put in any number of chips between 50 (a call) and 20,000, giving 19,952 actions to choose from. The state space of two-player limit Texas Hold'em has size about 10^17, while that of two-player no-limit Texas Hold'em is about 10^165. The third is the number of players: as it increases, the combinations of different players' hands and different players' actions must all be considered when solving a strategy, so the scale of the whole game tree grows explosively. Abstracting the state space therefore compresses it mainly from two directions. The first is card abstraction: a similarity function between pairs of hands is designed, and hands are clustered according to that similarity. The second is action abstraction and action mapping: raise actions are clustered according to the similarity of their chip amounts, opponents' actions are mapped into the abstract action set during play, and an action mapping function satisfying boundary constraint, monotonicity, scale invariance, robustness, and boundary robustness is designed.
The most effective algorithm for computing a Nash equilibrium of a two-player imperfect-information machine game is counterfactual regret minimization (CFR), but the problem scale CFR can handle under existing computing power is about 10^14 states, far below the scale of the real, complex Texas Hold'em game. Researchers therefore proposed solving complex Texas Hold'em games with state space abstraction plus CFR: in the first step, the original game model is abstracted into a smaller game model that remains strategically similar to the original; in the second step, an approximate Nash equilibrium of the small model is computed with the CFR algorithm; in the third step, the strategy of the smaller game model is mapped back to the original model. A schematic diagram of strategy solving based on state space abstraction is shown in FIG. 1.
1.1.1 action abstraction algorithm:
In limit Texas Hold'em a player can take at most three kinds of actions: fold, call, and raise, where the raise amount is fixed. In no-limit Texas Hold'em, however, the number of chips that may be bet is nearly continuous, producing a very large number of betting actions and a very wide game tree. Clustering betting actions with similar chip amounts significantly compresses the breadth of the game tree; bets are generally clustered by different multiples of the current pot, as shown in FIG. 2.
When the strategy computed on the action-abstracted game tree is applied to the original game, actions taken by opponents must be mapped into the abstract action set. The action mapping algorithm must satisfy the theoretical requirements of boundary constraint, monotonicity, scale invariance, robustness, and boundary robustness; the invention adopts the probability-based action mapping algorithm used in the Tartanian 5 agent developed by the Carnegie Mellon University research team. Assume the allowed bet sizes in a given information set lie in [T, T'], the opponent bets x chips, and A and B in the abstract action set satisfy A = max{A_i : A_i ≤ x} and B = min{A_i : A_i ≥ x}. The action mapping problem then reduces to mapping x to A or B at random according to a probability distribution function satisfying Lipschitz continuity:

f_{A,B}(x) = \frac{(B - x)(1 + A)}{(B - A)(1 + x)}    (3-1)

The probability of mapping x to action A is computed by equation 3-1, the probability of mapping to action B is 1 - f_{A,B}(x), and the actual betting action is mapped to A or B at random according to these probabilities, as shown in FIG. 3.
1.1.2 card abstraction algorithm:
The card abstraction algorithm clusters the combinations of players' private cards and community cards into a specified number of buckets according to some similarity feature. The main methods are the card isomorphism algorithm, clustering based on hand strength, and clustering based on the hand strength distribution.
In Texas Hold'em there is no ordering among suits, so hands of the same ranks whose suits differ only by a permutation have the same value, lead to the same choices in the game strategy, and should be regarded as the same card type. For example, preflop, the king of hearts with the nine of spades, the king of clubs with the nine of diamonds, and the nine of hearts with the king of spades are considered the same card type, i.e., the strategies corresponding to these hands are the same. This phenomenon is called hand isomorphism; hands of the same card type can therefore be merged and clustered into the same bucket, greatly reducing the scale of the game tree.
Hand-strength-based clustering algorithms cluster cards with algorithms such as k-means according to the combined strength of the player's private hand and the community cards. First the expected hand strength (EHS) of the player's private hand is computed: assuming the opponents' hands are uniformly distributed and the remaining community cards are dealt uniformly at random, the player's expected hand strength equals the probability that the hand wins plus one half the probability of a tie. Every hand thus has an expected strength, which is used as a feature value for a clustering algorithm such as k-means to group hands into a specified number of clusters. The size of the game tree can be further reduced by choosing the number of clusters according to the available storage and computing resources.
The expected hand strength, however, is only a reasonable one-dimensional approximation of the strength of a hand and does not reflect the overall distribution of the hand's win rate. As shown in the histograms of FIG. 4 and FIG. 5, the abscissa is the strength of a specific hand type, i.e., the combination of the player's private cards and the dealt community cards, under the different possible deals of the undealt community cards; it ranges from 0 to 1, and the bar width can be set as required (0.02 in the invention). The ordinate is the proportion, among all possible deals of the remaining community cards, of those for which the combined strength of the player's private hand and the community cards falls in the interval represented by the bar. For instance, the bar at abscissa 0.3 with ordinate 0.01 in FIG. 4 indicates that in 1% of all random deals of the community cards, the hand consisting of the six of clubs and the six of diamonds wins with probability between 0.30 and 0.32. It can be seen that the club king with club queen and the club six with diamond six are two hands with similar win rates but quite different win-rate distributions, so using the hand strength distribution as the hand similarity is more reasonable. The distribution-based clustering algorithm first builds histograms of the strengths of different hands over fixed intervals, computes the earth mover's distance between two histograms as the similarity feature of the two hands, and then clusters the hands into a specified number of buckets with k-means or another clustering algorithm.
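The following sketch shows how such strength histograms and their earth mover's distance might be computed (illustrative code; in practice the win rates are enumerated offline rather than passed in):

    import numpy as np
    from scipy.stats import wasserstein_distance

    WIDTH = 0.02
    BINS = np.arange(0.0, 1.0 + WIDTH, WIDTH)
    CENTERS = (BINS[:-1] + BINS[1:]) / 2.0

    def strength_histogram(win_rates):
        # win_rates: the hand's win probability under each possible deal of the
        # undealt community cards; the histogram is the distribution of FIG. 4/5.
        hist, _ = np.histogram(win_rates, bins=BINS)
        return hist / hist.sum()

    def hand_distance(h1, h2):
        # Earth mover's distance between two strength histograms.
        return wasserstein_distance(CENTERS, CENTERS, h1, h2)

The pairwise distances produced by hand_distance then drive the k-means (or k-medoids) step that groups the hands into the specified number of buckets.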
1.2 Counterfactual regret minimization algorithm:
currently, the best algorithm for approximating nash equilibrium for solving large, non-perfect information games is the virtual regret minimization algorithm (CFR). The CFR algorithm calculates and minimizes a virtual Regret (statistical factual Regret) of each information set, thereby minimizing a global average Regret, which proves that as the number of times of strategy iteration increases, the global average Regret decreases, and the iterated strategy approaches to a nash equilibrium solution. The algorithm is described as follows:
Let I be an information set of player i, and define σ_{I→a} as the strategy profile identical to σ except that in information set I player i always chooses action a. Let Z_I be the set of terminal action sequences having the action sequence of information set I as a prefix. For z ∈ Z_I, let z[I] be the action sequence from the root node to the information set, let u_i(z) denote the payoff of player i at leaf node z of the game tree, let \pi^{\sigma}_{-i}(z[I]) denote the probability that the opponents of player i reach information set I, and let \pi^{\sigma}(z[I], z) denote the product over all players of the probabilities of going from information set I to leaf node z of the game tree. The counterfactual value at information set I is defined as

v_i(\sigma, I) = \sum_{z \in Z_I} \pi^{\sigma}_{-i}(z[I]) \, \pi^{\sigma}(z[I], z) \, u_i(z)

The cumulative counterfactual regret of not taking action a in information set I after T iterations is

R^T(I, a) = \sum_{t=1}^{T} \big( v_i(\sigma^t_{I \to a}, I) - v_i(\sigma^t, I) \big)

For information set I, the strategy of the next iteration is computed by the following regret matching formula:

\sigma^{T+1}(I, a) = \frac{R^{T,+}(I, a)}{\sum_{a' \in A(I)} R^{T,+}(I, a')} \quad \text{if } \sum_{a' \in A(I)} R^{T,+}(I, a') > 0

\sigma^{T+1}(I, a) = \frac{1}{|A(I)|} \quad \text{otherwise}

where R^{T,+}(I, a) = \max(R^T(I, a), 0). The average strategy of the players in the CFR algorithm converges to a Nash equilibrium, where the average strategy of player i is computed as follows:

\bar{\sigma}^T_i(I, a) = \frac{\sum_{t=1}^{T} \pi^{\sigma^t}_i(I) \, \sigma^t(I, a)}{\sum_{t=1}^{T} \pi^{\sigma^t}_i(I)}
the CFR algorithm flow is represented in fig. 6: firstly, initializing information of each information set in a game tree before CFR strategy iteration, traversing the whole game tree from a root node, returning the income value of each player if the current node is a leaf node, updating the information of each information set in the game tree, increasing the iteration number by 1, judging whether the iteration is finished, returning a main program if the iteration is finished, or else performing the next iteration, and if the current leaf node is not a leaf node, traversing all actions of the node, including a random node and an action node of the player, and then performing CFR algorithm recursion.
1.3 Endgame solving algorithms:
1.3.1 Endgames:
the above-described imperfect information machine game algorithm based on state space abstraction and CFR has not been able to perfectly solve two-person unrestricted texas poker, and there are two main problems: firstly, because action abstraction and card abstraction are adopted, a large amount of useful information is lost, so that the calculated approximate Nash equilibrium solution error is relatively large, and the availability is relatively high; second, if the action taken by the adversary is not in the set of abstract actions, an action mapping algorithm is required to map it to the set of abstract actions. Assuming that the opponent wagers chips as s, s does not belong to the abstract action set, and the action map maps s to s ', in which case the agent assumes that the opponent wagers' instead of s; and uses s 'to update the state of the agent's interior, causing the agent to incorrectly estimate the current pot chip count. Generally, the action abstraction algorithm selects a multiple of the pool as a chip number of the filling action, and when the intelligent agent takes the filling action next time, the chip amount filled by the intelligent agent deviates from an expected filling chip amount due to the fact that the pool size is estimated by mistake, and the deviation becomes larger and larger as the game progresses. This problem is called the off-tree problem (off-tree problem), and is also a factor in the underperformance of Tiantianian 2. In order to solve the two problems, researchers provide methods for game tree decomposition and game residual solution, for example, game residual solution algorithms are adopted in deep stack and library, and great success is achieved.
Unlike in a perfect-information game, where an endgame is a subtree of the whole game tree, a state in an imperfect-information game spans information sets, so an endgame of an imperfect-information game actually has a forest structure satisfying closure over descendants and siblings. For convenience of processing, a virtual chance node is usually added to turn it into a tree structure. An endgame M of a game has the following properties:
(1) the nodes of the endgame M must be a subset of the entire game G;
(2) if states s and s' are in the same information set in the entire game G and state s is in the endgame M, then s' is also in M.
1.3.2 Endgame solving algorithm:
The endgame solving algorithm computes, in real time during play, an approximate Nash equilibrium of the subgame tree containing the current public game state, saving memory and effectively mitigating the off-tree problem.
Real-time strategy solving divides the game into two parts, the trunk and the subgame, as shown in FIG. 7. In the trunk part of the game tree, which is no different from the game tree of the offline strategy solving algorithm, the offline strategy is still used. As the game proceeds, the game state moves closer to the leaves of the game tree and enters the subgame tree; uncertainty in the game decreases and public information accumulates, so the state space of the game problem shrinks dramatically. At this point a more fine-grained subgame tree can be rebuilt, or even an unabstracted subgame tree can be solved.
The game final office resolving algorithm flow is as follows: firstly, solving approximate Nash equilibrium solutions of both game parties of the whole game tree through the introduced non-perfect information machine game algorithm based on state space abstraction and CFR; secondly, an off-line strategy obtained in advance is adopted in the main game tree, when the root node of the sub game tree is reached, the off-line strategy is generally the starting node of the river card circle, the original off-line strategy is abandoned, and the strategy of the sub game tree in which the current game state is located is calculated in real time; thirdly, when the strategy of the sub game tree in which the current game state is located is calculated in real time, the probability that two game parties reach the current game state is calculated by using a Bayesian formula according to an offline strategy; and finally, the probability that the two game parties in the game obtained by the third step reach the current game state is calculated, the current game state is read in real time, such as the size of a bottom pool, and then the CFR is operated in the sub game tree to solve the approximate Nash equilibrium solution of the sub game tree. The real-time computed strategies are used in subsequent gaming sessions.
1.3.3 Endgame re-solving:
Because private information exists in imperfect-information games, the trunk and the subgame trees influence each other, so computing a Nash equilibrium requires solving the whole game tree; decomposing the original game tree into a trunk and subgames and solving them separately does not necessarily yield a Nash equilibrium of the original game tree when the strategies are recombined. Consider rock-paper-scissors represented as an extensive game: the whole game tree has two layers; player one acts at the first node, which forms the trunk, and then player two acts, and that node forms the subgame tree. In the Nash equilibrium of the whole game tree, both players play rock, paper, and scissors each with probability 1/3. But if the trunk strategy is fixed and the subgame tree is considered alone, player two may just as well take some single action, say rock, with probability one: the expected payoffs are still all 0, yet the combined overall strategy is not a Nash equilibrium. In the endgame solving algorithm, the trunk strategy is fixed and only the subgame tree containing the current game state is considered, so there is no theoretical guarantee of robustness for the strategy formed by combining the re-solved subgame strategy with the original trunk strategy. Endgame re-solving can alleviate this problem to some extent.
The main steps of endgame re-solving differ from those of endgame solving. The key difference is that endgame re-solving does not assume the opposing player's trunk strategy is fixed: instead of considering only the probability that the opponent takes the action sequence leading to the current endgame, it uses the offline strategy to compute the value the opponent obtains by choosing to reach the current endgame, and uses that value as a constraint when solving the current endgame. In the two-player no-limit Texas Hold'em game shown in FIG. 8, when player P1 takes action T and the game enters the turn round, player P2's strategy is computed by endgame re-solving. The endgame solving algorithm uses opponent modeling to compute the probability of each player going from the root node to the root of the subgame tree and then solves the endgame in isolation. As shown in FIG. 9, "C" denotes a virtual chance node, and ci_1 and ci_2 denote the probabilities that players P1 and P2 choose to reach the current endgame, where ci_1 can be computed from the opponent model built for P1, and ci_2 is computed from the action sequence of player P2 under the offline strategy S using Bayes' rule. Endgame re-solving instead builds an augmented subgame tree, as shown in FIG. 10.
The main steps of building the augmented subgame tree are as follows. First, an information set node for the opposing player is added above the root of the original subgame tree. Second, a chance node is added as the root of the whole augmented subgame tree; the probability with which it leads to the opposing player's node in each root information set of the subgame tree is the probability that our player reaches that node of the original subgame tree from the root of the original game tree. Finally, two actions a'_T and a'_S are added at the opposing player's information set nodes: taking a'_S enters the original subgame tree, while taking a'_T leads to a leaf node that stores the counterfactual value of the opposing player in that root information set of the subgame tree, obtained when the offline strategy of the whole game trunk was computed in advance. CFR is then run on the augmented subgame tree to compute an approximate Nash equilibrium; the resulting strategy is adopted by the player in the current subgame tree, and the original precomputed strategy is abandoned. The exploitability of the strategy obtained by endgame re-solving is lower than that of the original trunk strategy.
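The structure of the augmented subgame tree can be sketched as follows (a minimal illustration of the construction just described; the node class and argument names are assumptions):

    from dataclasses import dataclass, field

    @dataclass
    class GadgetNode:
        kind: str                          # "chance", "opponent", or "terminal"
        children: dict = field(default_factory=dict)
        chance_prob: float = 0.0           # branch probability under the chance root
        cfv: float = 0.0                   # stored counterfactual value at a'_T leaves

    def build_augmented_subgame(root_infosets, our_reach, opp_cfv, subgame_roots):
        # Chance root: reaches the opponent's node of each root information set
        # with the probability that we reach it under the precomputed trunk strategy.
        root = GadgetNode("chance")
        for i, infoset in enumerate(root_infosets):
            opp = GadgetNode("opponent", chance_prob=our_reach[i])
            # a'_T: terminate and receive the trunk-time counterfactual value.
            opp.children["a_T"] = GadgetNode("terminal", cfv=opp_cfv[i])
            # a'_S: enter the original subgame tree at this information set.
            opp.children["a_S"] = subgame_roots[i]
            root.children[infoset] = opp
        return root

Running CFR on the tree returned by build_augmented_subgame yields the re-solved strategy that replaces the precomputed one.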
1.3.4 Endgame re-solving and heuristic search:
The two endgame solving algorithms introduced above can only compute in real time the strategies of subgame-tree nodes close to the leaves of the original game tree, and are generally used to compute river-round nodes in real time. For computing in real time the strategies of nodes far from the leaves, such as preflop and flop nodes, by endgame (re-)solving, there are currently two main methods. The first is the midgame solving algorithm proposed in 2017 by Huliang and Sam Ganzfried: similarly to endgame solving, it builds a subgame tree rooted at the current game state, treats the terminal nodes of the current betting round as virtual terminal nodes of the whole game, estimates each player's average payoff by comparing the players' expected hand strengths (EHS), and then computes an approximate Nash equilibrium of the subgame tree with CFR. The second method differs in that the terminal nodes of each betting round are not treated as terminal nodes of the whole game but as leaf nodes of the subgame tree, for which sufficiently accurate and feasible heuristic functions are designed to compute the counterfactual values of those leaf nodes for each player.
When the strategy of the current game state is computed in real time, two different approaches can be used. The first is ordinary endgame re-solving. The second is nested endgame re-solving, which differs in that every time a new game node is reached, a new subgame tree is built from the current game state node down to the leaf nodes of the betting round, endgame re-solving is run again, and the strategy of the current node is then computed. The DeepStack algorithm combines nested endgame re-solving with heuristic functions based on neural networks: neural networks are trained as heuristic functions to estimate the counterfactual values at the leaf nodes of intermediate subgame trees, with good results. The algorithm is briefly described as follows. First, the terminal nodes of the preflop, flop, and turn rounds are set as virtual leaf nodes of the subgame trees; for the leaf nodes of each layer a neural network structure is designed as the heuristic function, and training data are generated with a simplified action set by Monte Carlo sampling. The input of each network includes the probability vectors over both players' possible hands of reaching the leaf node (computed by Bayes' rule), the community cards, and the pot size; the output is a vector of counterfactual values for each player under each hand. Second, when computing the game strategy of the current game state in real time, the opponent's counterfactual value vector at the subgame-tree node and our probability of reaching the current game state, both stored when the previous game state's strategy was computed, are required; these two parameters are randomly initialized at the start of each betting round. An approximate Nash equilibrium of the current subgame tree is then computed by the endgame re-solving algorithm. After each computation of the strategy for the current game state, the opponent's counterfactual value vector at the subgame-tree node and our probability of reaching the current game state are updated in time.
1.4 Multi-player imperfect-information games and Nash equilibria:
In two-player zero-sum games, Nash equilibria have the following properties:
1. If one player keeps its own strategy in a Nash equilibrium profile unchanged, no other player can obtain a larger payoff by changing its own strategy.
2. Any combination of two Nash equilibrium solutions is still a Nash equilibrium.
3. A player's payoff in one Nash equilibrium strategy equals its payoff in any other Nash equilibrium strategy.
In multi-player imperfect-information games only the first property still holds; the other two do not. Consider the simple game of FIG. 11: the 3 players are numbered 1, 2, 3, and each player has two actions, h and t. Each player keeps its action secret after taking it, so the remaining players do not know which action it has taken. At the start of the game each player antes one chip; players then act in turn, and once all three players have acted, the winners and each player's payoff are determined and the game ends. The rule is: if all three players take the same action, the pot is split equally and nobody wins or loses; if one player takes an action different from the other two, that player wins the pot.
In this game it is easy to see that the strategy profiles σ1 = {h, t, t}, σ2 = {h, h, t}, and σ3 = {1/2, 1/2, 1/2} (each player mixing uniformly) are all Nash equilibria, with payoffs {2, -1, -1}, {-1, -1, 2}, and {0, 0, 0} respectively. It is easy to verify that these profiles all satisfy the first property, that the third property does not hold, and that a combination of two Nash equilibrium profiles, such as σ4 = {h, h, 1/2}, is not a Nash equilibrium. Furthermore, a player adopting a Nash equilibrium strategy is not guaranteed a good payoff: if all 3 players adopt the Nash equilibrium profile σ3 = {1/2, 1/2, 1/2}, the payoff of player 1 is 0, and it remains 0 if only one of the other players changes strategy; but if players 2 and 3 change strategy simultaneously so that the profile becomes σ5 = {1/2, h, t}, player 1 always loses one chip. In conclusion, for multi-player imperfect-information games a Nash equilibrium is no longer the optimal choice, and no polynomial-time algorithm is known that computes a Nash equilibrium of a large-scale multi-player imperfect-information machine game.
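The quoted payoffs are easy to verify by enumeration, as in the following sketch:

    from itertools import product

    def payoffs(actions):
        # All equal: the pot is split back, nobody wins or loses.
        if len(set(actions)) == 1:
            return (0, 0, 0)
        # Otherwise the lone deviator wins the 3-chip pot (net +2), others lose 1.
        for i in range(3):
            if actions.count(actions[i]) == 1:
                return tuple(2 if j == i else -1 for j in range(3))

    def expected(profile):
        # profile[i] is player i's probability of playing "h".
        exp = [0.0, 0.0, 0.0]
        for acts in product("ht", repeat=3):
            p = 1.0
            for i, a in enumerate(acts):
                p *= profile[i] if a == "h" else 1.0 - profile[i]
            for i, u in enumerate(payoffs(acts)):
                exp[i] += p * u
        return exp

    print(expected([1, 0, 0]))          # sigma1 = {h,t,t}    -> [2, -1, -1]
    print(expected([1, 1, 0]))          # sigma2 = {h,h,t}    -> [-1, -1, 2]
    print(expected([0.5, 0.5, 0.5]))    # sigma3              -> [0, 0, 0]
    print(expected([0.5, 1, 0]))        # sigma5 = {1/2,h,t}  -> player 1 loses 1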
1.5 Dominated strategies and CFR:
In multi-player imperfect-information machine games the Nash equilibrium solution is no longer the optimal solution. From another point of view, researchers use the CFR algorithm to compute suboptimal strategies for multi-player imperfect-information machine games. It can be proved that during strategy iteration the CFR algorithm removes strictly dominated strategies after a certain number of iterations, and the feasibility of CFR for computing multi-player imperfect-information machine game strategies has been verified experimentally. First, the definitions of dominated actions and dominated strategies are introduced:
Definition 6 (dominated action). In an extensive game, an action a ∈ A(I) at information set I of player i is weakly dominated if there exists another strategy σ'_i ∈ Σ_i of player i such that for every strategy profile σ_{-i} ∈ Σ_{-i} of the other players, v_i(I, σ_{I→a}) ≤ v_i(I, (σ'_i, σ_{-i})). A recursively dominated action is defined recursively as an action that is dominated, or that becomes dominated after strictly dominated actions are removed during strategy iteration.
Definition 7 (dominated strategy). A strategy σ_i of player i is weakly dominated if there exists another strategy σ'_i of player i satisfying the following conditions:
(1) for every strategy profile σ_{-i} ∈ Σ_{-i} of the other players, u_i(σ_i, σ_{-i}) ≤ u_i(σ'_i, σ_{-i});
(2) for some strategy profile σ_{-i} ∈ Σ_{-i} of the other players, u_i(σ_i, σ_{-i}) < u_i(σ'_i, σ_{-i}).
If instead u_i(σ_i, σ_{-i}) < u_i(σ'_i, σ_{-i}) holds for every σ_{-i} ∈ Σ_{-i}, then σ_i is a strictly dominated strategy. A recursively dominated strategy is defined recursively as a strategy that is dominated, or that becomes dominated after strictly dominated strategies are removed during strategy iteration.
TABLE 3-1. Example of a dominated strategy (payoff matrix reproduced as an image in the original document: player 1 chooses between actions A and B, player 2 among actions a, b, and c).
Consider the game represented in Table 3-1: there are two players; player 1 can take actions A and B, player 2 can take actions a, b, and c, and the payoffs are as shown. Under the definition of strict dominance, player 2's actions b and c are strictly dominated by action a, because whatever action player 1 takes, player 2's payoff from b or c is smaller than from a. For player 1, A and B do not dominate each other: when player 2 plays b, player 1's payoff from B is higher than from A, and when player 2 plays c, player 1's payoff from B is lower than from A. By the definition of recursive dominance, since b is strictly dominated and will be deleted during strategy iteration, player 1's action B then becomes weakly dominated, so strategy B is a recursively dominated strategy.
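Since the payoff matrix itself is not reproduced here, the following sketch uses hypothetical numbers consistent with the description to make the eliminations checkable:

    # Hypothetical payoffs (u1, u2): rows are player 1's actions A, B;
    # columns are player 2's actions a, b, c. Chosen to match the text.
    U = {
        ("A", "a"): (2, 3), ("A", "b"): (0, 1), ("A", "c"): (3, 0),
        ("B", "a"): (1, 3), ("B", "b"): (2, 1), ("B", "c"): (1, 0),
    }

    def col_strictly_dominated(col, by, rows):
        # Is player 2's action `col` strictly dominated by `by` over these rows?
        return all(U[r, col][1] < U[r, by][1] for r in rows)

    print(col_strictly_dominated("b", "a", ["A", "B"]))   # True
    print(col_strictly_dominated("c", "a", ["A", "B"]))   # True
    # With b and c removed, only column a remains, where u1(B,a) = 1 <= u1(A,a) = 2,
    # so player 1's action B becomes dominated: B is a recursively dominated strategy.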
The CFR algorithm can remove strictly dominated strategies from the strategy space; the proof rests mainly on the following three theorems, whose contents are briefly described below:
theorem 1: in an extended game, if player i is presentAction a of the information set I is a weakly dominated action and satisfies
Figure BDA0002143423410000142
Then the strategy sigmaiIs a weak dominated strategy;
theorem 2: assume policy group σ = { [ σ ]1,σ2,...,σnIs calculated by CFR, assuming action a is a recursive strictly dominant action for the information set I, assuming the recursive dominant action removed during the k iterations of the previous strategy is a1,a2,...,akA and ak+1= a, with I1,I2,...,IkRepresenting the set of information I during the first k iterations of the strategy. If there are two real numbers δl,γlAnd number of iterations TlWhen T > TlAt a time
Figure BDA0002143423410000143
Wherein l satisfies 1 < l < k +1, the following:
(1) There is an iteration number T0When the number of iterations T > T0At a time
Figure BDA0002143423410000144
(2) If limT→∞xTT =0, then limT→∞yT(I, a)/T =0, wherein xTExpressing that in an iterative process
Figure BDA0002143423410000145
Number of times of (a), yT(I, a) denotes making σ in an iterative processtThe number of times of (I, α) = 0;
(3) If limT→∞xTif/T =0, then there are
Figure BDA0002143423410000146
Theorem 3: assume policy group σ = { [ σ ]1,σ2,...,σnIs calculated by means of the CFR,assuming an action σiIs a play
A recursive strict dominating strategy for family i, then:
(1) There is one iteration number T0There is a set of information I for a player I and an action a in the set of information when the number of iterations T > T0At a time
Figure BDA0002143423410000147
And is provided with
Figure BDA0002143423410000148
(2) If limT→∞xTT =0, then limT→∞yTi) T =0, wherein yTi) Expressing that in an iterative process
Figure BDA0002143423410000151
The number of times.
1.6 Online solving algorithm for multi-player imperfect-information game strategies
1.6.1 Algorithm framework
Because the state space of Texas Hold'em is huge, state abstraction loses much information, the state space grows exponentially once the number of players increases, so treating the whole game as a single unit is unreasonable, and a Nash equilibrium strategy is no longer the optimal solution in multi-player imperfect-information machine games, the invention proposes an online strategy solving algorithm for these problems.
The overall framework is shown in FIG. 12. The large triangle represents the original game tree. Assuming the game has progressed to some game state A, a virtual subgame tree is built, shown as the trapezoid, in which the rectangular bars represent our own hand-distribution vector and the rounded rectangular bars represent the other players' hand distributions. The size of the subgame tree is determined by the current game state and the computing resources; if the subgame tree does not reach the end of the game, the payoff values of its leaf nodes are estimated by Monte Carlo sampling of the community cards, the CFR algorithm, or Monte Carlo tree search, as shown by the parallelogram bars in the figure. Once the subgame tree is built, the strategy of the current state is computed with a strategy solving algorithm.
Compared with the original scheme based on state abstraction and Nash equilibrium solving, the new scheme differs in that the whole game tree is no longer treated as a single unit: an abstract subgame tree is built by state abstraction for each current game state, the payoffs at its leaf nodes are estimated, and the strategy of the current game state is then computed in real time by counterfactual regret minimization or Monte Carlo tree search. As the game proceeds, players take more actions and more community cards are dealt, so the opponents' hand distributions can be predicted in real time from the actions taken and the dealt community cards, and feeding the agent the actual chip count of the current pot corrects the agent's misestimation of the pot. When solving the strategy of the current state, each player's hand distribution can be predicted from the actions that player has taken so far and the dealt community cards. The invention uses Bayes' rule together with the previously computed real-time strategies to compute each player's hand distribution, and the agent performs real-time card abstraction based on the dealt community cards to further compress the scale of the subgame tree. In summary, the real-time strategy solving procedure is shown as Algorithm 1:
(Algorithm 1: real-time strategy solving; the pseudocode is reproduced as an image in the original document.)
To solve the strategy of the current game state S: if S is the beginning of a new betting round, card abstraction is first performed in real time according to the card abstraction algorithm. If S is not a game state in which it is the agent's turn to act, the strategy σ of each player is updated: first, if As is not empty, the previous strategy σ (initially a random distribution) and the previous hand distribution D are used to update the player hand distribution D via Bayes' rule, where As denotes the action sequence occurring between the beginning of the betting round, or the agent's last action, and the end of the betting round or the agent's next action; then each player's strategy σ is computed and updated with the CFR or UCT algorithm; finally, after the player who must act in the current game state takes some action, the game proceeds. If S is a game state in which it is the agent's turn to act, the players' hand distributions are updated first, the strategy σ of the current game state is computed after the subgame tree is built, and the agent then takes an action a according to σ before the game continues.
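A compact sketch of one step of this loop follows; bayes_update, build_subgame, solve_subgame, and realtime_card_abstraction stand for the components described in the surrounding sections and are assumptions, not the patent's code:

    def online_step(S, sigma, D, As, agent):
        # Re-abstract the cards at the start of every new betting round.
        if S.begins_new_betting_round():
            realtime_card_abstraction(S)
        if not S.is_agent_to_act(agent):
            # Observing other players: refresh hand distributions and strategies.
            if As:
                D = bayes_update(D, sigma, As)
            sigma = solve_subgame(build_subgame(S, D))   # CFR or UCT
            return S.after_next_player_acts(), sigma, D
        # Agent to act: update distributions, solve, then act according to sigma.
        D = bayes_update(D, sigma, As)
        sigma = solve_subgame(build_subgame(S, D))
        a = sigma.sample_action(S)
        return S.after(a), sigma, D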
1.6.2 real-time card abstraction
As the game progresses, community cards are dealt and the number of possible private-hand and community-card combinations shrinks, so the dealt community cards can be used to remove impossible hand combinations and greatly reduce the game tree state space. For example, once the 3 flop cards have been dealt, the number of remaining possible hand combinations is C(49,2) × C(47,2) = 1,271,256, whereas counting all hand combinations regardless of the deal gives C(52,2) × C(50,3) × C(47,2), about 28,094,800,000; similarly, counting every combination on the turn gives C(52,2) × C(50,4) × C(46,1), about 14,047,000,000, with a comparably large total on the river. Removing the impossible card combinations in real time during play therefore yields a much smaller game state space for each betting round. To this end, the invention proposes a real-time card abstraction algorithm, described in detail below.
As shown in fig. 13, the hand isomorphism abstraction and the strengths of the various hand combinations are computed offline and stored, so that they can be used directly to accelerate real-time card abstraction. When abstracting cards in real time, different abstractions are performed according to the position of the current game situation in the original game tree. If the current situation lies in the preflop round, the offline hand isomorphism abstraction is used directly; at that point there are C(52, 2) = 1326 hand combinations in total, reduced to 169 after hand isomorphism. If the current situation lies in the flop, turn or river round, the dealt public cards are removed, the strengths of all possible hand combinations are read from the offline hand-strength file, and clustering is finally performed with the k-means algorithm.
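The following runnable Python sketch illustrates the real-time card abstraction step under stated assumptions: toy_strength is a stand-in for the offline hand-strength (EHS) table, and a plain 1-D k-means replaces a production clustering implementation.

import itertools
import random

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [r + s for r in RANKS for s in SUITS]

def possible_hands(public_cards):
    """Enumerate the private 2-card combinations consistent with the board."""
    remaining = [c for c in DECK if c not in public_cards]
    return list(itertools.combinations(remaining, 2))

def toy_strength(hand):
    """Stand-in for the offline hand-strength (EHS) lookup table."""
    return sum(RANKS.index(c[0]) for c in hand) / (2 * (len(RANKS) - 1))

def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means used to bucket hands by strength."""
    centers = random.sample(values, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(b) / len(b) if b else centers[i] for i, b in enumerate(buckets)]
    return centers

board = ["Ah", "Kd", "7c"]                   # a flop has been dealt
hands = possible_hands(board)                # C(49, 2) = 1176 combinations remain
strengths = [toy_strength(h) for h in hands]
clusters = kmeans_1d(strengths, k=8)         # the patent clusters into e.g. 300 buckets
print(len(hands), "possible hands bucketed into", len(clusters), "clusters")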
1.6.3 prediction of player hand distribution
As the game progresses, more and more information is exposed, because the actions a player takes are related to the strength of that player's hand: for example, a player who bets big in the preflop and flop rounds probably holds a strong hand, while a player who only calls probably holds a weak one. A given player's style stays basically stable, so after a number of games the player's style can be analyzed from historical game data, and the player's private hand can then be predicted more accurately from the actions taken; this is equivalent to simulating how human players read hands at the card table.
Predicting a player's hand distribution from the action sequence the player has taken requires modeling that particular player, but without historical game data for that player such predictive modeling is not effective. The invention therefore assumes that the opponent plays the strategy we have computed in real time, and then computes the player's hand distribution from the player's actions using the Bayesian formula. Assuming the strategy the agent computes in real time is σ, all possible hands of player i are C_1, C_2, …, C_n, and the player takes action A according to strategy σ, the conditional probability of each hand after action A is calculated as follows:
P(C_j | A) = P(A | C_j) · P(C_j) / Σ_{k=1..n} P(A | C_k) · P(C_k)
where P(A | C_j) is the probability that player i takes action A holding hand C_j; normalizing P(C_j | A) over all hands gives player i's hand distribution after action A has been taken. The whole algorithm flow is as follows:
[Algorithm 2: player hand-distribution update; rendered as an image in the original patent]
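A minimal, runnable sketch of this Bayesian update (the core of Algorithm 2) follows; the prior and the action probabilities under σ are toy numbers, not values from the patent.

def bayes_update(prior, p_action_given_hand):
    """prior: {hand: P(hand)}; p_action_given_hand: {hand: P(A | hand)} under sigma.
    Returns the normalized posterior {hand: P(hand | A)}."""
    post = {h: prior[h] * p_action_given_hand.get(h, 0.0) for h in prior}
    z = sum(post.values())
    if z == 0:                         # action impossible under sigma: keep the prior
        return dict(prior)
    return {h: p / z for h, p in post.items()}

prior = {"AA": 0.2, "KQ": 0.5, "72": 0.3}
p_raise = {"AA": 0.9, "KQ": 0.4, "72": 0.05}   # P(raise | hand) under sigma (toy values)
print(bayes_update(prior, p_raise))            # mass shifts toward the strong hands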
1.6.4 Intermediate-state payoff value estimation
When the strategy of the current game situation is computed in real time, the size of the subgame tree built is determined by the position of the current situation in the original game tree and by the available computing resources; if the leaf nodes of the subgame tree do not reach the end of the game, the payoff values of these intermediate-state leaf nodes must be estimated. There are various ways to estimate the leaf-node payoff values of a subgame tree, such as searching with Monte Carlo tree search or training a neural network to predict the virtual value of a leaf node.
When the subgame tree is built, each player's hand is determined by Monte Carlo sampling according to that player's hand distribution, and the leaf nodes of the subgame tree reach only the end of the current betting round; the winner is then determined from the strength of each player's hand, and the resulting payoff is taken as the leaf-node payoff value. To estimate the intermediate-state payoff more accurately, the invention introduces the UCT algorithm to estimate the leaf-node payoff value.
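The following runnable Python sketch illustrates this sampling-based leaf evaluation under simplifying assumptions: hands are sampled independently from each player's distribution (card removal between players is ignored), ties go to the first player, and toy_ehs stands in for the real hand-strength computation.

import random

def sample_hand(dist):
    """Draw a hand from a {hand: probability} distribution."""
    hands, probs = zip(*dist.items())
    return random.choices(hands, weights=probs)[0]

def leaf_value(dists, pot, contrib, trials=10000):
    """dists: one hand distribution per player; contrib: chips each player put in.
    Returns each player's average payoff at the leaf over Monte Carlo trials."""
    toy_ehs = {"AA": 0.85, "KQ": 0.60, "72": 0.35}   # assumed strength table
    n = len(dists)
    totals = [0.0] * n
    for _ in range(trials):
        hands = [sample_hand(d) for d in dists]
        winner = max(range(n), key=lambda i: toy_ehs[hands[i]])
        for i in range(n):
            totals[i] += (pot - contrib[i]) if i == winner else -contrib[i]
    return [t / trials for t in totals]

dists = [{"AA": 0.3, "KQ": 0.7}, {"KQ": 0.4, "72": 0.6}]
print(leaf_value(dists, pot=200, contrib=[100, 100]))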
1.6.5 game strategy solving algorithm
After the subgame tree for the current game situation is built, the strategy of the current situation can be solved. For two-player non-limiting Texas poker the invention directly uses the CFR algorithm to solve an approximate Nash equilibrium. For multi-player non-limiting Texas poker, the CFR algorithm can solve a suboptimal strategy (one that is not strictly dominated), and the invention extends CFR to solve such non-strictly-dominated strategies for multi-player non-limiting Texas poker endgames. To speed up strategy iteration, the Monte Carlo CFR algorithm is introduced to solve the strategy of the current game situation.
Monte Carlo tree search has made some progress in computing two-player and multi-player non-limiting Texas poker strategies, and the feasibility and effectiveness of Monte Carlo search in endgame strategy solving are studied further below.
1.6.5.1 Monte Carlo CFR Algorithm
For a large-scale non-complete information machine game, solving the Nash equilibrium of the game tree directly with the CFR algorithm requires traversing the whole game tree, so each iteration is slow; if each iteration traverses only part of the game tree via Monte Carlo sampling, the traversal can be greatly accelerated. The Monte Carlo CFR (MCCFR) algorithm comes mainly in three variants, Chance Sampling CFR (CS-CFR), External Sampling CFR (ES-CFR) and Outcome Sampling CFR (OS-CFR), besides other modified versions of MCCFR such as CFR+ and PureCFR. Let Q = {Q_1, Q_2, ..., Q_k}, where each Q_i is a subset of the set Z of leaf nodes of the original game tree and the union of the Q_i covers Z. In CS-CFR every chance node of the original game tree is sampled: at each chance node h, some action a is drawn according to the chance strategy σ_c(h, a). This corresponds to partitioning the leaf-node set Z so that the leaf nodes in one subset all lie in the subtree of the same chance action; for example, the partition Q generated by CS in fig. 14 has two subsets, Q_1 = {lal, lare, larf, lbl, lbr} and Q_2 = {rcl, rcrg, rcrh, rdl, rdr}. At each strategy iteration the probability of drawing subset Q_j is set to q_j, where q_j satisfies
q_j > 0 for all j and Σ_{j=1..k} q_j = 1.
In general, the sampled virtual (counterfactual) value of player i is calculated as

ṽ_i(σ, I) = Σ_{z ∈ Q_j ∩ Z_I} (1 / q(z)) · u_i(z) · π_{-i}^σ(z[I]) · π^σ(z[I], z),

where q(z) = Σ_{j: z ∈ Q_j} q_j is the probability that leaf node z is sampled. In the case of CS, q(z) is the chance reach probability π_c(z). Whenever q(z) > 0, ṽ_i(σ, I) is an unbiased estimate of v_i(I, σ).
The sampled regret of taking action a in information set I is defined as

r̃(I, a) = ṽ_i(σ_(I→a), I) − ṽ_i(σ, I),

where σ_(I→a) denotes the strategy σ modified to always take action a in I.
The strategy for iteration T+1 and the average strategy are still computed according to Equations 2-10 and 2-11. It can be shown that, iterating with MCCFR, the computed strategy approaches a Nash equilibrium solution as the number of iterations increases.
By sampling only part of the game tree, MCCFR makes each iteration much faster, but it needs more iterations to converge. By trading off the number of iterations against the time per iteration, the result obtained within the same total iteration time can be significantly improved. ES-CFR samples not only at the chance nodes but also at the opponents' action nodes; OS-CFR goes further and samples at every node, so that each iteration traverses only a single path. Experiments have verified that both ES-CFR and OS-CFR perform better than the original CFR algorithm, and that ES-CFR performs best among the MCCFR variants on two-player non-limiting Texas poker. The invention therefore adopts the ES-CFR algorithm to solve the strategy of the current game situation in two-player and multi-player non-limiting Texas poker. ES-CFR is shown as Algorithm 3.
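Since Algorithm 3 is rendered as an image in the original, the following is a compact, runnable external-sampling MCCFR on Kuhn poker, a standard toy game chosen only for its small size; it illustrates the sampling scheme (enumerate the traverser's actions, sample chance and opponent actions) rather than the patent's abstracted Texas poker implementation.

import random
from collections import defaultdict

ACTIONS = ["p", "b"]                           # pass / bet
regret = defaultdict(lambda: [0.0, 0.0])       # infoset -> cumulative regret per action
strat_sum = defaultdict(lambda: [0.0, 0.0])    # infoset -> cumulative strategy weights

def strategy(I):
    """Regret matching: normalize positive regrets; uniform if none."""
    pos = [max(r, 0.0) for r in regret[I]]
    s = sum(pos)
    return [p / s for p in pos] if s > 0 else [0.5, 0.5]

def is_terminal(h):
    return h in ("pp", "bp", "bb", "pbp", "pbb")

def u0(cards, h):
    """Terminal payoff of player 0 (each player antes 1 chip)."""
    if h == "bp":
        return 1                               # player 0 bet, player 1 folded
    if h == "pbp":
        return -1                              # player 1 bet, player 0 folded
    stake = 2 if h.endswith("bb") else 1
    return stake if cards[0] > cards[1] else -stake

def es_cfr(cards, h, traverser):
    if is_terminal(h):
        u = u0(cards, h)
        return u if traverser == 0 else -u
    player = len(h) % 2
    I = str(cards[player]) + h                 # infoset: own card + public history
    sigma = strategy(I)
    if player == traverser:
        vals = [es_cfr(cards, h + a, traverser) for a in ACTIONS]   # enumerate all actions
        node_v = sum(p * v for p, v in zip(sigma, vals))
        for i in range(len(ACTIONS)):
            regret[I][i] += vals[i] - node_v
        return node_v
    for i in range(len(ACTIONS)):              # average strategy accumulates here
        strat_sum[I][i] += sigma[i]
    a = random.choices(range(len(ACTIONS)), weights=sigma)[0]       # sample one action
    return es_cfr(cards, h + ACTIONS[a], traverser)

for _ in range(50000):
    deck = [1, 2, 3]
    random.shuffle(deck)                       # the chance node is always sampled
    for traverser in (0, 1):
        es_cfr(deck[:2], "", traverser)

for I in sorted(strat_sum):
    total = sum(strat_sum[I])
    print(I, [round(w / total, 3) for w in strat_sum[I]])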
1.6.5.2 Monte Carlo tree search algorithm
The Monte Carlo tree search algorithm (MCTS) was first used mainly in perfect-information games, where it achieved great success in games such as chess and checkers, most notably when combined with deep reinforcement learning in AlphaGo, released by Google in 2016. Researchers later extended MCTS to non-complete information machine gaming with some success.
In a perfect-information game the players hold no private information and there is no chance element (such as rolling dice or randomly drawing cards), so the game can be represented by a game tree: every state reachable during play corresponds to a node, and each leaf node represents the end of the game and gives each party's payoff; Connect Four, tic-tac-toe, Chinese checkers, chess, Go and the like are all of this kind. Since the complete game tree exactly represents every possible state of play, each party can in principle traverse the whole tree and select the action most favorable to itself when computing its strategy. To compute the optimal strategy of each party, von Neumann proposed the minimax algorithm: assuming all parties are rational players, the game tree is traversed recursively, choosing maxima and minima alternately at successive levels to match the conflicting objectives of the different players. The minimax algorithm must traverse the game tree, which is feasible for small perfect-information games such as Gomoku and tic-tac-toe; but for large perfect-information games the tree is so large that building and traversing it completely is infeasible in both space and time, and even improvements of minimax such as Alpha-beta pruning remain helpless against large perfect-information games such as chess and Go. For such games, traversing the tree to solve for an exact optimal solution is not feasible; in this case, searching part of the game tree by sampling to compute a suboptimal strategy can greatly speed up strategy computation.
The Monte Carlo tree search algorithm (MCTS) searches part of the game tree through Monte Carlo sampling instead of traversing the whole tree, balancing the randomness of simulated playouts against the accuracy of game-tree search. MCTS gives more exploration opportunities to the branches likely to contain the best moves while still sampling the branches that have not yet been explored, thereby improving the accuracy of game-tree search under limited time and resources. The Monte Carlo tree search algorithm is introduced below.
The MCTS algorithm is a search algorithm based on simulated playouts; as shown in fig. 15, each iteration mainly comprises the following 4 steps:
(1) Selection: at the current action node, if some legal actions have not yet been tried, the player selects one of them at random; once all legal actions of the node have been visited, an action is selected according to a specific selection strategy and the search moves down the tree.
(2) Expansion: when a leaf node of the search tree is reached and the game is not yet over, a new node is expanded.
(3) Simulation: after a new node is expanded, the game is simulated to its end with a specific simulation (playout) strategy, usually random play.
(4) Backpropagation: when the simulated game reaches the end of the game, each player's payoff is fed back up the tree to update the statistics of the corresponding nodes in the search tree.
To improve the accuracy of the Monte Carlo search, the improved algorithm introduces the UCB formula in the selection step, giving the upper confidence bound tree algorithm (UCT). The UCB formula was first proposed by Auer et al. to balance the exploitation-exploration trade-off that arises in the selection step: whether to choose the branch with the highest current average payoff, or a branch visited less often but possibly optimal, when selecting an action at a node whose actions have all been tried. In the UCT algorithm, the selection step always picks the action that maximizes the UCB value, computed by the following two formulas:
a* = argmax_a [ Q(s, a) + C · sqrt( ln N(s) / N(s, a) ) ]

Q(s, a) = (1 / N(s, a)) · Σ_{k=1..N(s, a)} R(s, a, k)
where N(s) is the number of visits to node s, N(s, a) is the number of times action a has been taken at node s, R(s, a, k) is the payoff obtained the k-th time action a was taken at s, and C is the exploration constant.
In non-complete information machine games there are chance nodes and private information, and a player cannot observe the whole game situation, so Monte Carlo tree search must be improved before it extends to non-complete information games. First, the action nodes of all players and the chance nodes are sampled by Monte Carlo according to the corresponding strategies. Second, the nodes of a non-complete information machine game tree belong to information sets that interleave one another; during MCTS, when updating the data of a game-tree node, the current node must first be mapped to its corresponding information set, and the data of that information set is then updated. The UCT algorithm for the non-complete information machine game is as follows:
[UCT algorithm for the non-complete information machine game; rendered as an image in the original patent]
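As a concrete illustration of the UCB selection rule and the node-to-information-set mapping just described, the following Python sketch keeps the statistics per information set; infoset_key and the action list are hypothetical stand-ins for game-specific code.

import math
from collections import defaultdict

C = math.sqrt(2)                       # exploration constant in the UCB formula
N = defaultdict(int)                   # N(s): visits per information set
Na = defaultdict(int)                  # N(s, a): times action a was taken in s
R = defaultdict(float)                 # cumulative payoff per (s, a)

def ucb_select(infoset, actions):
    """Pick the action maximizing Q(s, a) + C * sqrt(ln N(s) / N(s, a))."""
    def ucb(a):
        if Na[(infoset, a)] == 0:
            return float("inf")        # unvisited actions are tried first
        q = R[(infoset, a)] / Na[(infoset, a)]
        return q + C * math.sqrt(math.log(N[infoset]) / Na[(infoset, a)])
    return max(actions, key=ucb)

def backpropagate(path, payoff):
    """path: list of (infoset, action) pairs visited in one simulation. Concrete
    game nodes are first mapped to their information sets, e.g. via a hypothetical
    infoset_key(private_cards, public_cards, action_sequence), then updated here."""
    for infoset, a in path:
        N[infoset] += 1
        Na[(infoset, a)] += 1
        R[(infoset, a)] += payoff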
1.7 Texas poker real-time game system design and implementation
1.7.1 Gaming system framework
The Annual Computer Poker Competition (ACPC) is held in turn by the committees of the two top international artificial-intelligence conferences, AAAI and IJCAI. Running since 2006, it has attracted participants from research institutions all over the world, including Columbia University, Carnegie Mellon University, University College London and the University of Alberta, and it is the top competition and authoritative evaluation in international poker machine gaming, providing an important platform for researchers in non-complete information machine gaming to meet and exchange ideas. The non-limiting Texas poker gaming system of the invention is implemented according to the ACPC game rules and communication framework. In the tournament framework shown in fig. 16, when a game starts the agent establishes a TCP connection with the server, which acts as dealer, message forwarder and referee. The agent exchanges information only with the server: the server sends the agent its private hand, the public cards, the action sequence taken by each player, the pot size and so on, while the agent sends the server the action it takes. The server judges whether the action is legal; if so, it updates the game state according to the action and sends the new state to the agent, and at the end of the game it decides the outcome and the agents' payoffs. Messages are exchanged as strings. The ACPC tournament rules specify that at the start of each game every agent's stack is reset to 20000 chips, the small blind is 50 and the big blind is 100, and the game then proceeds according to the rules of the Texas poker game.
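A minimal sketch of the agent side of this framework follows, assuming a line-based TCP string protocol as described; the exact ACPC message grammar is not reproduced here, and our_turn, parse_state and choose_action are hypothetical stand-ins for the agent logic.

import socket

def run_agent(host, port):
    sock = socket.create_connection((host, port))
    f = sock.makefile("rw", newline="")            # keep \r\n line endings untranslated
    f.write("VERSION:2.0.0\r\n")                   # assumed protocol handshake line
    f.flush()
    for line in f:
        state = line.strip()
        if not state.startswith("MATCHSTATE"):
            continue                               # skip non-state messages
        if not our_turn(state):                    # hypothetical helper
            continue
        action = choose_action(parse_state(state)) # e.g. a fold/call/raise string
        f.write(state + ":" + action + "\r\n")     # reply: the state plus our action
        f.flush()

# run_agent("localhost", 18791)                    # port assigned by the server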
1.7.2 Texas poker intelligent agent framework
For the current game situation, the agent parses the game-state message sent by the server to obtain the public cards, the pot size, each player's action sequence and so on, performs real-time card abstraction, predicts every player's hand distribution (including its own), then builds the subgame tree, and finally computes the strategy of the current situation with the CFR or UCT algorithm. The size of the subgame tree is determined by the current game situation and the computing resources; the subgame tree built by the invention extends only to the end of the current betting round, i.e. its leaf nodes are the game states at which the current betting round ends, so when the current situation lies in the preflop, flop or turn round, the payoffs of the subgame-tree leaf nodes must be predicted. The invention first uses the EHS value to estimate the leaf-node payoff. The whole agent is shown in fig. 17.
1.7.3 Improved intermediate-state payoff value estimation
The payoff estimation described above for the leaf nodes of the betting-round subtree determines the winner directly from the current hand strength (EHS) of each player's hand combination, which is equivalent to assuming the game ends at the leaf of the current subtree: the players take no further actions and the dealer simply deals the remaining public cards. This causes a series of problems. First, the computed strategy has high exploitability: the real-time strategy strongly tends to bet big or go All-in whenever the agent's hand is strong, and it ignores the playing skills of professional players, who may for instance check a strong hand on the flop and only raise on the turn; if an opponent checked on the flop, the agent assumes the opponent's hand is weak. Second, the agent assumes that the opponent plays the strategy the agent itself computed and uses that real-time strategy to predict the opponent's hand; if the strategy is highly exploitable, the error of the predicted hand distribution is also large, and the error is gradually amplified as the game proceeds. These problems were indeed observed during the experiments. To address them, the invention uses the UCT algorithm to improve the estimation of the intermediate-state payoff value. The improved method is as follows: taking the current leaf node as root, a subgame tree is built through real-time card abstraction and action abstraction; if a leaf of this new subtree still does not reach the end of the game, its payoff is estimated with EHS; once the subtree is built, the virtual value of the current leaf node is computed with the UCT algorithm, in other words the payoff of the current leaf node is estimated by Monte Carlo tree search. The real-time card abstraction used to build this abstract subtree uses hand isomorphism and clusters by hand strength, like the real-time card abstraction introduced above; when the current leaf node lies at the end of the flop or turn round, i.e. the subtree lies in the turn or river round, there are 300 clusters after real-time card abstraction, and when the current leaf node is the end of the river round the game is over and no subtree needs to be built to predict its payoff.
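Schematically, the improved leaf evaluation can be sketched as follows; build_subtree, uct_simulate and ehs_value are hypothetical stand-ins for components described elsewhere in this document.

def leaf_virtual_value(leaf_state, hand_dists, simulations=1000):
    """Estimate a leaf's payoff by UCT on a subtree rooted at that leaf."""
    subtree = build_subtree(leaf_state, hand_dists)   # real-time card/action abstraction
    for _ in range(simulations):
        # playouts that reach a subtree leaf short of the game's end fall
        # back to the EHS estimate for their payoff
        uct_simulate(subtree, fallback=ehs_value)
    return subtree.root.mean_payoff()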
In fig. 18, the white dots indicate the current leaf nodes of the subgame tree; the middle subgame tree represents solving the strategy of the current game situation in real time, and the subgame tree on the right, rooted at the current leaf node, is the one built to predict the virtual value of that leaf node.
1.7.4 Warm Start of Agents
The improvement divides the agent's strategy into two parts: the first part is an offline strategy used in the preflop round, and the second part is the real-time strategy computed for the three subsequent betting rounds with the algorithm based on real-time strategy solving.
The offline strategy of the first part is either solved with a solving algorithm based on state abstraction and CFR, or designed from the rules of top professional players. In the preflop round the opponents have taken no or few actions and thus expose little information, and no public cards have been dealt, so the error in the predicted opponent hand distributions is large; at the beginning of a game the intermediate-state payoff is also harder to estimate; moreover, the off-tree action problem is not yet pronounced. Computing the strategy in real time right at the start of the game may therefore work poorly, while using an offline strategy may work well.
The strategy of the second part is computed online: as the game proceeds, the public cards are gradually dealt and the players take more actions, so once the game has moved forward the number of game states drops sharply, finer hand abstraction and action abstraction can be used, the players' hand distributions and the intermediate-state payoff can be estimated more accurately, and the benefit of computing the strategy online becomes more pronounced.
The agent that introduces the offline strategy in the preflop round no longer builds a subgame tree for every preflop situation; instead, a fixed abstract game tree is built for the preflop round through hand isomorphism and action abstraction, and during play the actual game state is mapped to the corresponding node of this abstract tree, from which the offline strategy is read and followed. When the game moves from the preflop round into the flop round, each player's hand distribution is computed with the Bayesian formula from the offline strategy and the action sequences taken, and the game then enters the second part, where the strategy of the current situation is solved with the online game-strategy solving method. The modified agent framework is shown in fig. 19.
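The two-part (warm-start) decision rule can be sketched as follows; all helper names are hypothetical stand-ins for the offline table lookup, the Bayesian seeding and the online solver described above.

PREFLOP = 0

def choose_action(state, dists):
    if state.betting_round == PREFLOP:
        node = map_to_abstract_node(state)       # hand isomorphism + action abstraction
        return sample_action(offline_strategy[node])
    if state.just_entered_flop:
        # seed the hand distributions from the offline strategy and the observed
        # preflop action sequence, via the Bayesian formula
        seed_distributions(dists, offline_strategy, state.action_sequence)
    sigma = solve_online(state, dists)           # real-time subgame solving
    return sample_action(sigma)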
The invention also discloses a multi-player non-complete information machine game device based on online endgame solving, which comprises:
a first processing module, configured to: when solving the strategy of the current game situation S, if S is the beginning of a new betting round, first perform card abstraction in real time according to the card abstraction algorithm;
a second processing module, configured to: if S is not a game situation where the agent acts, update the strategy σ of each player; first, if As is not empty, update the player hand distribution D from the previous distribution with the Bayesian formula using the last strategy σ, where As denotes the action sequence occurring from the beginning of the betting round, or from the agent's last action, up to the end of the betting round or until it is the agent's turn to act; then compute and update each player's strategy σ with the CFR or UCT algorithm;
a third processing module, configured to: wait for the player who must act in the current game situation to take some action, after which the game moves on; if S is a game situation where it is the agent's turn to act, first update the players' hand distributions, compute the strategy σ of the current situation after the subgame tree is built, and then let the agent take an action a according to σ, after which the game continues.
The invention also discloses a multi-player non-complete information machine game system based on online endgame solving, which comprises: a memory, a processor and a computer program stored on the memory, the computer program being configured to carry out the steps of the method of the invention when called by the processor.
The invention also discloses a computer-readable storage medium storing a computer program configured to, when invoked by a processor, implement the steps of the method of the invention.
The invention has the following beneficial effects:
(1) Existing hand abstraction algorithms are synthesized, and a real-time card abstraction algorithm is proposed for the real-time play characteristics of the Texas poker game. During real-time play the public cards are dealt street by street and the number of potential hand combinations drops sharply; real-time card abstraction therefore greatly reduces the game state space and accelerates the convergence of strategy iteration.
(2) Integrating real-time card abstraction, action abstraction and the endgame solving algorithm, a strategy online solving algorithm that can effectively solve multi-player non-limiting Texas poker is proposed, and the effectiveness and feasibility of the CFR and UCT algorithms in solving real-time strategies for non-limiting Texas poker are verified.
(3) Analyzing the shortcomings of the algorithm from the experimental results, and aiming at the inaccurate estimation of the intermediate-state payoff value, the UCT algorithm is used to effectively improve the intermediate-state payoff estimate, and the improvement is verified experimentally.
(4) Compared with previous algorithms, the method based on real-time strategy solving proposed by the invention is more flexible and widely applicable: it suits game scenarios in the real world and can compute a corresponding strategy for each different real game situation.
2.1 Experimental analysis
2.1.1 Experimental results and analysis of the agent based on CFR game-strategy online solving
Using the above algorithm, the invention develops two-player, three-player and six-player non-limiting Texas poker AI agents. In the implementation, preflop card abstraction uses hand isomorphism, giving 169 clusters in total; each of the three later betting rounds is clustered by hand strength into 200 clusters; and the action abstraction maps betting actions to 0.5, 1.0, 2.0 and 5.0 times the current pot and All-in according to the bet size, following the action mapping algorithm introduced above. To speed up the iterative computation of the real-time strategy, the agent uses multiple threads, 32 in number.
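The bet-size mapping used here can be illustrated with a small runnable sketch; nearest-size mapping is used for simplicity and stands in for the action mapping algorithm introduced earlier in the document.

def map_action(bet, pot, stack):
    """Map an observed bet (in chips) to the nearest abstract action."""
    candidates = {"%sx pot" % m: m * pot for m in (0.5, 1.0, 2.0, 5.0)}
    candidates["All-in"] = stack
    return min(candidates.items(), key=lambda kv: abs(kv[1] - bet))[0]

print(map_action(bet=260, pot=200, stack=20000))   # -> 1.0x pot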
The validity of the algorithm is verified by playing the agent against other agents. The two-player opponent is HITSZ_HKL, which took third place in the two-player division of the 2017 ACPC. That agent computes an approximate Nash equilibrium offline with PureCFR after traditional state abstraction; its card abstraction uses 169 preflop clusters, 300 flop clusters and 1000 clusters for each of the turn and river rounds, its action abstraction maps bet sizes to 0.75, 1.0, 2.0 and 3.0 times the current pot and All-in, and the offline strategy was obtained after one week of training. The experimental results are as follows:
Table 4-1: Game results of LKC_2p_CFR against the PureCFR-algorithm agent HITSZ_HKL (table rendered as an image in the original patent).
The experimental results of the three-player non-limiting Texas poker agent LKC_3p and the six-player non-limiting Texas poker agent LKC_6p developed by the invention are as follows:
Table 4-2: Game results of LKC_3p_CFR against the PureCFR-algorithm agents Gibson_3p_CFR (table rendered as an image in the original patent).
Table 4-3: Game results of LKC_6p_CFR against random-action agents (table rendered as an image in the original patent).
LKC_2p_CFR, LKC_3p_CFR and LKC_6p_CFR are the two-player, three-player and six-player non-limiting Texas poker AI agents developed on the basis of the game-strategy online solving algorithm; mbb/h denotes milli-big-blinds won per game (thousandths of a big blind). In the three-player game, two modified agents, Gibson_3p_CFR, built from agent code published by Gibson, a former member of the University of Alberta research team, are used as the two opponents. The action abstraction of Gibson_3p_CFR maps bet chips to 0.5 and 1.0 times the current pot and All-in; the original code has no card abstraction, which was added here, and the agent was trained with PureCFR for one week to obtain its offline strategy. The three-player limit Texas poker agents Gibson developed with this traditional method for the ACPC have led the three-player limit division. In the six-player game, the 5 random-action agents provided by ACPC are used as the other 5 opponents.
From the experimental results, the agents developed by the invention beat the random-action agents by large margins in the two-player, three-player and six-player games (only the six-player results are shown), which proves the algorithm effective to a certain extent. However, in two-player and three-player non-limiting Texas poker the agent does not outperform agents developed with the traditional algorithm that computes a Nash equilibrium with CFR after state abstraction. The likely reasons are that at the beginning of a game the agent knows little, so directly computing the strategy in real time and using it to model the opponent yields a large error in the predicted opponent hand distribution; and because no public cards have been dealt at the start, the game state space is still large and the intermediate-state payoff estimate has a large error, so the agent's performance is not ideal. The following sections improve on these two problems and then repeat the experiments.
2.1.2 Experimental results of the agent based on UCT game-strategy online solving
To verify the feasibility of the UCT algorithm in non-complete information machine gaming, the strategy-solving method in the algorithm is changed from CFR to UCT, with the card abstraction, action abstraction and everything else unchanged; two-player, three-player and six-player agents are again developed, with the following experimental results:
Table 4-4: Game results of LKC_2p_UCT against the PureCFR-algorithm agent HITSZ_HKL (table rendered as an image in the original patent).
Table 4-5: Game results of LKC_3p_UCT against the PureCFR-algorithm agents Gibson_3p_CFR (table rendered as an image in the original patent).
Table 4-6: Game results of LKC_6p_UCT against random-action agents (table rendered as an image in the original patent).
The two-player, three-player and six-player agents developed here each beat the randomly acting agents by large margins, showing that the UCT algorithm has a certain effectiveness. The results also show that after switching the strategy-solving algorithm to UCT the experimental effect is worse than with the original CFR algorithm, for which two reasons can be identified. First, Monte Carlo tree search needs a certain amount of search, and since real-time computation is limited in time, the number of searches is too small. Second, a suitable parameter C must be set in the UCB formula to balance whether the next search descends into a branch with a high payoff value or into a branch with few visits; this parameter is hard to set well, and finding the most suitable value requires many experiments.
2.1.3 Experimental results and analysis of the agent with improved intermediate-state payoff estimation
The previous section presented the experimental results of the algorithm that computes the current-situation strategy in real time and analyzed the reasons for its poor performance; one important reason is that the intermediate-state payoff value is estimated inaccurately. In the original algorithm, the subgame tree built by the invention is confined to the current betting round, and at a leaf node the winner is decided directly from the EHS of each player's hand. This is equivalent to dealing all remaining public cards at the leaf, comparing hand strengths to decide the winner, and then ending the game without entering the next betting round. As a result the agent tends to go All-in or bet big whenever its hand is strong, and assumes the opponent's hand is weak whenever the opponent calls or bets small, ignoring a tactic opponents actually use: checking in this round and raising in the next. For example, in the experiments the agent was found to go All-in very often in the preflop and flop rounds, frequently even without a particularly strong hand. The intermediate-state payoff estimation is therefore improved by the method described above, and experiments on two-player and three-player Texas poker are carried out. The results are as follows:
Table 4-7: Game results of LKC_2p_plus against the PureCFR-algorithm agent HITSZ_HKL (table rendered as an image in the original patent).
In this experiment the agent ran 4,000,000 iterations for each real-time strategy computation. During the experiments the results improved as the number of iterations increased; the relationship between the experimental results and the number of iterations is shown in fig. 20.
Table 4-8: Game results of LKC_3p_plus against the PureCFR-algorithm agents Gibson_3p_CFR (table rendered as an image in the original patent).
The experimental results show that after improving the intermediate-state payoff estimation the results are clearly better than before, and they keep improving as the number of iterations grows. Although, judging from the experiments, the improved algorithm is still not better than the traditional algorithm that computes a Nash equilibrium with CFR after state abstraction, it has its own advantages. Above all it adapts more flexibly to real game situations: the traditional method computes a fixed offline strategy under the ACPC rule that every player's stack is reset to 20000 chips at the start of each game, and its card and action abstractions cannot be changed once fixed, which does not match real play; the method based on real-time strategy computation can flexibly accommodate changing player stacks, and the players' betting actions are no longer fixed but can change as required, which is much closer to practical application.
2.1.4 Experimental results and analysis of the agent with improved intermediate-state payoff estimation and warm start
After improving the intermediate-state payoff estimation the experimental effect improves markedly, but it still does not match the traditional algorithm that computes a Nash equilibrium with CFR after state abstraction. The earlier analysis showed that the agent knows little when a game starts, so directly computing the current-situation strategy in real time is difficult and error-prone; if an offline strategy is used at the beginning of the game and the real-time method only once the game has reached a certain stage, the experimental effect can be improved. An agent is developed according to this warm-start method: an offline strategy is first computed by the traditional method and used in the preflop round, and once the flop round is entered the algorithm that computes the current-situation strategy in real time takes over. Two-player and three-player non-limiting Texas poker agents are developed in this experiment, with the following results against other agents:
Table 4-9: Game results of LKC_3p_WQD against the PureCFR-algorithm agents Gibson_3p_CFR (table rendered as an image in the original patent).
Table 4-10: Game results of LKC_2p_WQD against the PureCFR-algorithm agent HITSZ_HKL (table rendered as an image in the original patent).
The midgame endgame solving algorithm (Midgame Solving) performs poorly in two-player non-limiting Texas poker. That method uses an offline strategy in the preflop and flop rounds, the strategy adopted by HITSZ_HKL; if the game then enters the turn or river round, the strategy of that round is computed with the midgame endgame solving algorithm, and the intermediate state is estimated by comparing the EHS of the players' hands to decide the winner. In a 60000-game match against HITSZ_HKL, that agent lost on average about 280 mbb/h. The improved intermediate-state payoff estimation and the real-time computation of the current-situation strategy together also constitute an improvement on the midgame endgame solving algorithm; the experimental results show that LKC_2p_WQD beats HITSZ_HKL by about 200 mbb/h per game on average, so the improvement is clearly effective.
Finally the algorithm of the invention is compared with four other algorithms: ES-CFR, Q-learning, neural networks, and opponent modeling. In the experiments, three-player non-limiting Texas poker agents built with the five algorithms each play against the random agent; the results, shown in fig. 22, indicate that the algorithm of the invention performs better on multi-player non-limiting Texas poker than the other four.
During the research the invention investigated the poor experimental performance of the midgame endgame solving algorithm (Midgame Solving) in two-player non-limiting Texas poker, improved that algorithm, and clearly raised the experimental effect. The main work of the invention is as follows:
(1) Aiming at the huge state space of multi-player non-limiting Texas poker, a real-time card abstraction method is proposed on the basis of existing card abstraction algorithms. Integrating real-time card abstraction, action abstraction and the endgame solving algorithm, a strategy online solving algorithm that can effectively solve multi-player non-limiting Texas poker is proposed, and the effectiveness and feasibility of the CFR and UCT algorithms in solving real-time strategies for non-limiting Texas poker are verified.
(2) Aiming at the poor experimental performance of the midgame endgame solving algorithm (Midgame Solving) in two-player non-limiting Texas poker, the causes are analyzed, the estimation of the intermediate-state payoff value is improved, the midgame endgame solving algorithm is changed into the online strategy computation method, and experiments are carried out.
(3) Texas poker agents are developed according to the algorithms designed above and played against other poker agents that participated in the ACPC; the experimental results are analyzed, the experimental scheme is improved, and the effectiveness of the algorithms is verified.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (9)

1. A multi-player non-complete information machine game method based on online endgame solving, characterized by comprising the following steps:
Step 1: when solving the strategy of the current game situation S, if S is the beginning of a new betting round, first perform card abstraction in real time according to the card abstraction algorithm;
Step 2: if S is not a game situation where the agent acts, the strategy σ of each player needs to be updated; first, if As is not empty, the player hand distribution D is updated from the previous distribution with the Bayesian formula using the last strategy σ, where As denotes the action sequence occurring from the beginning of the betting round, or from the agent's last action, up to the end of the betting round or until it is the agent's turn to act; each player's strategy σ is then computed and updated with the CFR or UCT algorithm;
Step 3: after the player who must act in the current game situation takes some action, the game moves on; if S is a game situation where it is the agent's turn to act, the players' hand distributions are updated first, the strategy σ of the current situation is computed after the subgame tree is built, and the agent then takes an action a according to σ, after which the game continues;
in step 1, real-time card abstraction is performed by the real-time card abstraction algorithm, which comprises: first computing the hand isomorphism abstraction and the strengths of the various hand combinations offline and storing them, so that they can be used directly to accelerate real-time card abstraction; when abstracting cards in real time, performing different abstractions according to the position of the current game situation in the original game tree: if the current situation lies in the preflop round, the offline hand isomorphism abstraction is used directly, at which point there are C(52, 2) = 1326 hand combinations in total, reduced to 169 after hand isomorphism; if the current situation lies in the flop, turn or river round, the dealt public cards are removed, the strengths of all possible hand combinations are read from the offline hand-strength file, and clustering is finally performed with the k-means algorithm; the multi-player non-complete information machine game method is applied to the Texas poker game.
2. The method of claim 1, wherein updating the player hand distribution D with the Bayesian formula in step 2 comprises: assuming that the strategy computed by the agent in real time is σ, all possible hands of player i are C_1, C_2, …, C_n, and the player takes action A according to strategy σ, the conditional probability of each hand after the player takes action A is calculated as follows:

P(C_j | A) = P(A | C_j) · P(C_j) / Σ_{k=1..n} P(A | C_k) · P(C_k)    (3-7)

where P(A | C_j) is the probability that player i takes action A holding hand C_j; normalizing P(C_j | A) over all hands gives player i's hand distribution after action A has been taken.
3. The multi-player non-complete information machine game method of claim 1, wherein, when the strategy of the current game situation is computed in real time, the size of the subgame tree built is determined by the position of the current situation in the original game tree and the computing resources; if the leaf nodes of the subgame tree do not reach the end of the game, the payoff values of these intermediate-state leaf nodes need to be estimated, and the leaf-node payoff values of the subgame tree are estimated by Monte Carlo tree search and by training a neural network to predict the virtual value of the leaf nodes.
4. The method of claim 1, wherein, when building the subgame tree, each player's hand is determined by Monte Carlo sampling according to that player's hand distribution, and the leaf nodes of the subgame tree reach only the end of the current betting round; the winner is then determined from the strength of each player's hand and the resulting payoff is taken as the leaf-node payoff value; the payoff value of the leaf node is estimated by the UCT algorithm.
5. The multi-player non-complete information machine game method of claim 1, wherein the strategy of the current game situation in two-player and multi-player non-limiting Texas poker is solved by the ES-CFR algorithm.
6. The multi-player non-complete information machine game method of claim 3, wherein Monte Carlo tree search needs to be improved to extend to non-complete information games: first, the action nodes and chance nodes of all players are sampled by Monte Carlo according to the corresponding strategies; second, the nodes of the non-complete information machine game tree are information sets that interleave one another, so while the Monte Carlo tree search algorithm runs, when updating the data of a game-tree node, the current node must be mapped to its corresponding information set and the data of that information set then updated.
7. A multi-player non-complete information machine game device based on online endgame solving, characterized by comprising:
a first processing module, configured to: when solving the strategy of the current game situation S, if S is the beginning of a new betting round, first perform card abstraction in real time according to the card abstraction algorithm;
a second processing module, configured to: if S is not a game situation where the agent acts, update the strategy σ of each player; first, if As is not empty, update the player hand distribution D from the previous distribution with the Bayesian formula using the last strategy σ, where As denotes the action sequence occurring from the beginning of the betting round, or from the agent's last action, up to the end of the betting round or until it is the agent's turn to act; then compute and update each player's strategy σ with the CFR or UCT algorithm;
a third processing module, configured to: wait for the player who must act in the current game situation to take some action, after which the game moves on; if S is a game situation where it is the agent's turn to act, first update the players' hand distributions, compute the strategy σ of the current situation after the subgame tree is built, and then let the agent take an action a according to σ, after which the game continues;
in the first processing module, real-time card abstraction is performed by the real-time card abstraction algorithm, which comprises: first computing the hand isomorphism abstraction and the strengths of the various hand combinations offline and storing them, so that they can be used directly to accelerate real-time card abstraction; when abstracting cards in real time, performing different abstractions according to the position of the current game situation in the original game tree: if the current situation lies in the preflop round, the offline hand isomorphism abstraction is used directly, at which point there are C(52, 2) = 1326 hand combinations in total, reduced to 169 after hand isomorphism; if the current situation lies in the flop, turn or river round, the dealt public cards are removed, the strengths of all possible hand combinations are read from the offline hand-strength file, and clustering is finally performed with the k-means algorithm; the multi-player non-complete information machine game device is applied to the Texas poker game.
8. A multi-player non-complete information machine game system based on online endgame solving, characterized by comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to carry out the steps of the method of any one of claims 1-6 when invoked by the processor.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program configured to implement the steps of the method of any one of claims 1-6 when invoked by a processor.
CN201910676451.XA 2019-07-25 2019-07-25 Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium Active CN110404265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910676451.XA CN110404265B (en) 2019-07-25 2019-07-25 Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910676451.XA CN110404265B (en) 2019-07-25 2019-07-25 Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium

Publications (2)

Publication Number Publication Date
CN110404265A CN110404265A (en) 2019-11-05
CN110404265B true CN110404265B (en) 2022-11-01

Family

ID=68363136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910676451.XA Active CN110404265B (en) 2019-07-25 2019-07-25 Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium

Country Status (1)

Country Link
CN (1) CN110404265B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110841295B (en) * 2019-11-07 2022-04-26 腾讯科技(深圳)有限公司 Data processing method based on artificial intelligence and related device
CN112041811B (en) 2019-12-12 2022-09-16 支付宝(杭州)信息技术有限公司 Determining action selection guidelines for an execution device
SG11202010721QA (en) * 2019-12-12 2020-11-27 Alipay Hangzhou Inf Tech Co Ltd Determining action selection policies of execution device
CN112997198B (en) 2019-12-12 2022-07-15 支付宝(杭州)信息技术有限公司 Determining action selection guidelines for an execution device
CN111185010B (en) * 2019-12-25 2021-08-03 北京理工大学 System and method for constructing landlord card-playing program by using pulse neural network
CN111359213B (en) * 2020-03-24 2022-11-25 腾讯科技(深圳)有限公司 Method and apparatus for controlling virtual players in game play
CN111507475A (en) * 2020-04-14 2020-08-07 杭州浮云网络科技有限公司 Game behavior decision method, device and related equipment
CN111291890B (en) * 2020-05-13 2021-01-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111905373A (en) * 2020-07-23 2020-11-10 深圳艾文哲思科技有限公司 Artificial intelligence decision method and system based on game theory and Nash equilibrium
CN113779870A (en) * 2021-08-24 2021-12-10 清华大学 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium
CN113626720B (en) * 2021-10-12 2022-02-25 中国科学院自动化研究所 Recommendation method and device based on action pruning, electronic equipment and storage medium
CN114048833B (en) * 2021-11-05 2023-01-17 哈尔滨工业大学(深圳) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
CN114492749B (en) * 2022-01-24 2023-09-15 中国电子科技集团公司第五十四研究所 Game decision method for motion space decoupling of time-limited red-blue countermeasure problem
CN117258268B (en) * 2023-10-13 2024-04-02 杭州乐信圣文科技有限责任公司 Card gate arrangement design method and system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8545332B2 (en) * 2012-02-02 2013-10-01 International Business Machines Corporation Optimal policy determination using repeated stackelberg games with unknown player preferences

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296006A (en) * 2016-08-10 2017-01-04 哈尔滨工业大学深圳研究生院 The minimum sorry appraisal procedure of non-perfect information game risk and Revenue Reconciliation
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN106469317A (en) * 2016-09-20 2017-03-01 哈尔滨工业大学深圳研究生院 A kind of method based on carrying out Opponent Modeling in non-perfect information game

Also Published As

Publication number Publication date
CN110404265A (en) 2019-11-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant