CN111729300A - Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network - Google Patents

Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network

Info

Publication number
CN111729300A
CN111729300A (application CN202010589925.XA)
Authority
CN
China
Prior art keywords
monte carlo
node
game
algorithm
tree search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010589925.XA
Other languages
Chinese (zh)
Inventor
王以松
彭啟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202010589925.XA
Publication of CN111729300A
Legal status: Pending

Links

Images

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/45 Controlling the progress of the video game
    • A63F 13/46 Computing the game score
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/80 Special adaptations for executing a specific game genre or game mode
    • A63F 13/822 Strategy games; Role-playing games
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/60 Methods for processing data by generating or executing the game program
    • A63F 2300/61 Score computation
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/80 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F 2300/807 Role playing or strategy games

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and a convolutional neural network, belonging to the technical field of machine learning, which comprises the following steps: randomly starting a game and, whenever a player is to play cards, taking that player's current state as a root node, with the actions the player may take under the Dou Dizhu rules as direct child nodes of the root node; starting from the root node of the game tree, performing continuous simulation-sampling learning with a Monte Carlo tree search algorithm; when enough data have been obtained with the Monte Carlo tree search algorithm, continuously training a convolutional neural network (CNN) learning network with (state and candidate card play, payoff of that candidate play in the current state) as data samples until the network is stable; and further correcting and improving the CNN learning result with a policy improvement algorithm to address possible errors made by the CNN during learning.

Description

Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network
Technical Field
The invention relates to the technical field of machine learning, and in particular to a Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and a convolutional neural network.
Background
In recent years, with the development of machine learning, remarkable results have also been achieved in perfect-information games. A milestone among them: on 15 March 2016, AlphaGo, developed by Google using deep learning, reinforcement learning and other methods, defeated the world Go champion Lee Sedol, marking the first time a machine achieved superhuman performance in Go. Subsequently, AlphaGo Zero was trained entirely through self-play: starting from random games, without any supervision or use of human data, and using only the black and white stones on the board as raw input features, it continuously learned and finally surpassed, by a significant margin, the AlphaGo that had been trained with human data. Card games such as Texas hold'em and Dou Dizhu, however, have not yet matched the proud achievements that machine learning has attained in board games such as Go and chess.
Researchers' work on card games has focused on imperfect-information betting games such as Texas hold'em, with good results, for example Libratus, developed by Carnegie Mellon University, and DeepStack, developed by the University of Alberta, both of which play human players in a one-on-one mode. Libratus solves the two-player game at each decision using Monte Carlo Counterfactual Regret Minimization (MCCFR) so that its strategy approaches a Nash equilibrium; however, such game-theoretic knowledge, Nash equilibrium included, is largely built on non-cooperative games, and there is currently no good solution for the cooperative play embodied by the farmers in games such as Dou Dizhu. DeepStack, for its part, relies on deep learning. Moreover, both Libratus and DeepStack can only play one-on-one games.
There has been less research on the Dou Dizhu game than on Texas hold'em. Yang You et al. of Shanghai Jiao Tong University proposed a Combinational Q-Learning (CQL) solution, which achieved significant results in comparison with standard deep reinforcement learning methods such as DQN and A3C (their experiments showed that the DQN algorithm did not converge on Dou Dizhu). However, such methods often do not perform as well in actual games against human players.
Dou Dizhu is a poker game that is widely played and well loved in China; in Tencent's 2018 annual Dou Dizhu championship tournament, the number of participants reached 80 million. Nevertheless, Dou Dizhu has so far received comparatively little research attention, mainly because it is more difficult and less valued. Dou Dizhu is a three-player card game, and its rules belong to the prior art. The research difficulty is mainly reflected in the following aspects: (1) during the game, card information is partially hidden; (2) under the Dou Dizhu rules, the action space during play is large; (3) the game is a multi-player game in which cooperative play is also embodied.
On this basis, a Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network is designed to solve the above problems.
Disclosure of Invention
The invention aims to provide a Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and a convolutional neural network, so as to solve the problems raised in the background art above.
In order to achieve the above purpose, the invention provides the following technical scheme: a Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network, comprising the following steps:
randomly starting a game and, whenever a player is to play cards, taking that player's current state as a root node, with the actions the player may take under the Dou Dizhu rules as direct child nodes of the root node;
starting from the root node of the game tree, performing continuous simulation-sampling learning with a Monte Carlo tree search algorithm;
when enough data have been obtained with the Monte Carlo tree search algorithm, continuously training a convolutional neural network (CNN) learning network with (state and candidate card play, payoff of that candidate play in the current state) as data samples until the network is stable;
and further correcting and improving the CNN learning result with a policy improvement algorithm to address possible errors made by the CNN during learning.
Preferably, the state includes, but is not limited to, the number of cards already played by each player, the number of cards in each player's hand, and the current player's hand.
Preferably, the Monte Carlo tree search algorithm is a method that combines the Monte Carlo method with game tree search.
Preferably, the Monte Carlo method replaces the expectation of a random variable with an empirical average, specifically as follows:
a series of returns G_1(s), G_2(s), ……, G_n(s) is obtained by the Monte Carlo method; according to the law of large numbers, as n tends to infinity the mean of the sampled returns approaches the expected value, and ν(s) is defined as the mean of this series of returns, namely:

$$\nu(s) = \frac{1}{n}\sum_{i=1}^{n} G_i(s) \qquad (1)$$

$$\nu(s) \rightarrow \nu_{\pi}(s) \ \text{as} \ n \rightarrow \infty \qquad (2)$$

where s is a game state and ν_π(s) is the expected value in game state s.
Preferably, the game tree search algorithm comprises the following steps:
Selection of an expansion node: recursively apply a node selection function to choose, from all candidate nodes, one node as the root of the current expansion, and perform one simulation, starting from that node, of the game situation it represents;
Expansion step: add one or more nodes to the MCTS search tree, the common strategy being to add only one new node to the game tree in each iteration;
Simulation: simulate the actual playing process, running one game from the node where the simulation starts to a terminal state;
Feedback: the simulation result is fed back, layer by layer, from the terminal node of the simulation to its parent nodes and finally to the root node.
Preferably, the algorithm used for expansion-node selection is the UCT algorithm:

$$\gamma_i = \bar{X}_i + C\sqrt{\frac{2\ln N}{n_i}} \qquad (3)$$

where γ_i denotes the selection evaluation value of node i, X̄_i denotes the average benefit of node i, C is a constant that balances exploration and exploitation, n_i is the number of times the i-th node has served as the root node of a simulated search, and N is the total number of simulations of the parent node.
Preferably, after the Monte Carlo tree search algorithm has reached the maximum number of samples or its allotted time has been exhausted, a decision is selected as the best decision of this MCTS run according to the evaluation value of each node among the first-layer child nodes of the game tree.
Preferably, the CNN learning network consists of 4 convolutional layers and 3 fully connected layers, with 3 pooling layers added and ReLU used as the activation function; the network takes the current state and one candidate card play as input, the input size is 15 × 29, and the output is the payoff of that candidate play in the current state.
Preferably, the policy improvement method is essentially a Monte Carlo tree search algorithm: after a state and one candidate play in that state are input to the CNN learning network, the CNN learning network outputs the payoff value of that candidate play in the current state, and by looping over every candidate play the payoff of each candidate play in the current state can be obtained; these payoff values are then used as the initial values of the Monte Carlo tree search in the policy improvement module, which keeps sampling and simulating the current state within a certain time budget so as to correct errors that may exist in the CNN learning network.
Compared with the prior art, the invention has the following beneficial effects:
To address the problems caused by the large action space and the relationships among players, the invention uses Monte Carlo tree search to sample and simulate the action space, so that after continuous simulation the payoff value of every candidate card-playing action is obtained and the action with the largest payoff is selected as the best play of the round. The invention runs continuous games with MCTS and supplies the results, as raw data, to the convolutional neural network for learning; the game state can then be given to the CNN, the payoff of every candidate play in the current state is computed, and the play with the largest payoff is selected as the best play of the round, which greatly shortens decision time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a MCTS game tree search algorithm process according to the present invention;
FIG. 3 is a diagram of a CNN learning framework according to the present invention;
FIG. 4 is a graph of the loss variation obtained from the L1 loss function of the present invention;
FIG. 5 is a graph of the change in win rate between the algorithm of the invention and the RHCP algorithm when the invention plays as the landlord;
FIG. 6 is a graph of the change in win rate between the algorithm of the invention and the RHCP algorithm when the invention plays as a farmer.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the invention provides a technical solution: a Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network, comprising the following steps:
randomly starting a game and, whenever a player is to play cards (i.e. to make a decision), taking that player's current state (the number of cards each player has already played, the number of cards in each player's hand, the current player's hand, and so on) as a root node, with the actions the player may take under the Dou Dizhu rules as direct child nodes of the root node;
starting from the root node of the game tree, performing continuous simulation-sampling learning with the MCTS algorithm;
computing and saving the payoff value of each candidate play in the current state, and repeating this process continuously;
when enough data have been obtained with the MCTS algorithm, continuously training a CNN learning network with (state and candidate card play, payoff of that candidate play in the current state) as data samples until the network is stable;
and further correcting and improving the CNN learning result with a policy improvement algorithm to address possible errors made by the CNN during learning.
To cope with the large scale of the problem, researchers have combined the Monte Carlo method with game tree search, giving the Monte Carlo Tree Search (MCTS) algorithm.
The Monte Carlo method replaces the expectation of a random variable with an empirical average. For example, suppose the expected value of game state s is ν_π(s); this expected value is difficult to compute directly, but a series of returns G_1(s), G_2(s), ……, G_n(s) can be obtained by the Monte Carlo method. According to the law of large numbers, as n approaches infinity the mean of the sampled returns approaches the expected value. Define ν(s) as the mean of this series of returns, i.e.:

$$\nu(s) = \frac{1}{n}\sum_{i=1}^{n} G_i(s) \qquad (1)$$

$$\nu(s) \rightarrow \nu_{\pi}(s) \ \text{as} \ n \rightarrow \infty \qquad (2)$$
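As a concrete illustration of equations (1) and (2), a minimal Python sketch follows (this is not code from the patent; `simulate_return` is a hypothetical placeholder for one complete simulated play-out from state s under the current policy):

```python
def estimate_state_value(state, simulate_return, n_samples=1000):
    """Monte Carlo estimate of v(s) as in equations (1) and (2):
    average the returns G_1(s), ..., G_n(s) from n simulated play-outs.
    `simulate_return` is assumed to play one complete game from `state`
    under the current policy and return the final payoff."""
    total = 0.0
    for _ in range(n_samples):
        total += simulate_return(state)
    # By the law of large numbers this average approaches v_pi(s) as n grows.
    return total / n_samples
```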
the MCTS algorithm starts with initializing the search tree, generally abstracting the current game state as a single game tree root node, and once the game tree initialization is completed, the MCTS game tree search algorithm process shown in fig. 2 is repeated within a specified time, and the search process can be divided into 4 steps:
Selection of an expansion node: recursively apply a node selection function to choose, from all candidate nodes, one node as the root of the current expansion, and perform one simulation, starting from that node, of the game situation it represents.
Expansion step: add one or more nodes to the MCTS search tree; the common strategy is to add only one new node to the game tree in each iteration.
Simulation: simulate the actual playing process, running one game from the node where the simulation starts to a terminal state.
Feedback: the simulation result is fed back, layer by layer, from the terminal node of the simulation to its parent nodes and finally to the root node.
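The four steps above fit together into the usual MCTS loop. The sketch below is a generic, simplified illustration rather than the patent's implementation; the node structure and the `legal_moves`, `apply_move` and `rollout` callables are assumptions introduced here for illustration:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.visits = 0          # times this node has been simulated (n_i)
        self.total_return = 0.0  # sum of back-propagated returns

def uct(node, c=1.4):
    # Selection criterion; see equation (3) below for the UCT form.
    if node.visits == 0:
        return float("inf")
    return (node.total_return / node.visits
            + c * math.sqrt(2 * math.log(node.parent.visits) / node.visits))

def mcts(root_state, legal_moves, apply_move, rollout, n_iterations=10000):
    root = Node(root_state)
    for _ in range(n_iterations):
        # 1. Selection: descend the tree, taking the child with the best UCT value.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expansion: add one new child per iteration (the common strategy).
        tried = {child.move for child in node.children}
        untried = [m for m in legal_moves(node.state) if m not in tried]
        if untried:
            move = random.choice(untried)
            child = Node(apply_move(node.state, move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: play out to a terminal state and obtain a return.
        result = rollout(node.state)
        # 4. Feedback: propagate the result back up to the root.
        while node is not None:
            node.visits += 1
            node.total_return += result
            node = node.parent
    # Final decision: the first-layer child with the best average evaluation.
    best = max(root.children, key=lambda ch: ch.total_return / ch.visits)
    return best.move
```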
After the maximum number of samples has been reached or the allotted time has been exhausted, the MCTS algorithm selects one decision as its best decision according to the evaluation value of each node among the first-layer child nodes of the game tree.
The algorithms used for expansion-node selection differ considerably. Among them, the UCT algorithm, proposed by Kocsis and Szepesvari in 2006 on the basis of the UCB algorithm to address the expansion-node selection problem in MCTS, is widely used. The UCT algorithm is:
$$\gamma_i = \bar{X}_i + C\sqrt{\frac{2\ln N}{n_i}} \qquad (3)$$

In equation (3), γ_i denotes the selection evaluation value of node i, X̄_i denotes the average benefit of node i, C is a constant that balances exploration and exploitation, n_i is the number of times the i-th node has served as the root node of a simulated search, and N is the total number of simulations of the parent node. Practice has proved this algorithm to be very efficient.
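A small numerical illustration of equation (3) (the numbers are arbitrary, not taken from the patent) shows how the constant C trades off exploration against exploitation: a child with a slightly lower average benefit but far fewer visits can still obtain the higher γ value and be explored next:

```python
import math

def uct_value(mean_return, parent_visits, node_visits, c=1.4):
    # gamma_i = X_bar_i + C * sqrt(2 * ln(N) / n_i)   -- equation (3)
    return mean_return + c * math.sqrt(2 * math.log(parent_visits) / node_visits)

# Parent simulated 1000 times; compare two children (illustrative numbers only).
well_explored = uct_value(mean_return=0.60, parent_visits=1000, node_visits=400)
rarely_tried  = uct_value(mean_return=0.55, parent_visits=1000, node_visits=10)
print(round(well_explored, 3), round(rarely_tried, 3))
# The rarely tried node scores higher (about 2.20 vs 0.86), so UCT explores it next.
```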
When the Monte Carlo tree search module is used directly as the playing policy, the effect is good, but at the start of every round MCTS must be run once with the current state as the root node, and to guarantee accuracy the simulation sampling with MCTS in each round takes several minutes, which is clearly unacceptable during actual play. In addition, the results that MCTS spends a great deal of time computing in each round are not reused effectively. For this reason, the invention uses a convolutional neural network (CNN) to learn the results of the MCTS search, making full use of the output of the MCTS module so as to shorten game decision time.
The CNN consists of 4 convolutional layers and 3 fully connected layers, with 3 pooling layers added and ReLU used as the activation function. The network takes the current state and one candidate play as input, the input size is 15 × 29, and the output is the payoff of that candidate play in the current state. The input is specifically represented as follows:
X dimension: the positions of the X dimension represent, in order, the 15 card ranks {3, 4, 5, 6, 7, 8, 9, T, J, Q, K, A, 2, L, B} (L and B denoting the little and big jokers).
Y dimension: the position indices of the Y dimension indicate, for the corresponding rank in the X dimension, the number of cards held (counts 0-4 encoded as 0000, 0001, 0010, 0100, 1000 respectively); positions 4-13 of the Y dimension indicate the 10 play types, such as single, pair, trio_plus_one, trio_plus_two, sequence_one, sequence_two, sequence_three, bomb and king_bomb; and position 14 indicates whether the player making the play is the landlord.
Z dimension: the Z-dimension encoding is given as a table (reproduced only as an image in the original publication).
As shown in FIG. 3, the CNN learning framework includes 4 convolutional layers, 3 pooling layers, and 3 fully connected layers.
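One plausible realization of this network in PyTorch is sketched below. It matches the stated structure (4 convolutional layers, 3 pooling layers, 3 fully connected layers, ReLU activations, a 15 × 29 input and a single scalar payoff output), but the class name `DealValueNet`, channel counts, kernel sizes and padding are illustrative assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn

class DealValueNet(nn.Module):
    """Maps a 15 x 29 state/candidate-play encoding to a scalar payoff estimate."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 15x29 -> 7x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 7x14 -> 3x7
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 3x7 -> 1x3
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 1 * 3, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),                # predicted payoff of this candidate play
        )

    def forward(self, x):                    # x: (batch, 1, 15, 29)
        return self.head(self.features(x))

net = DealValueNet()
dummy = torch.zeros(8, 1, 15, 29)            # a batch of 8 encoded states
print(net(dummy).shape)                       # torch.Size([8, 1])
```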
Using the CNN learning model to learn from more than 5 million records saved by MCTS, with the L1 loss function, yields the loss curve shown in FIG. 4, where the x-axis represents the epoch and the y-axis the payoff loss. As can be seen from the figure, the loss drops sharply between epoch 0 and epoch 30; after epoch 30 it changes little and finally settles at around 0.033.
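Training on the records saved by the MCTS module can then follow a standard supervised loop with the L1 loss. A minimal sketch, assuming the `DealValueNet` from the previous sketch and a hypothetical `dataloader` yielding (encoded state/play, payoff) batches produced by MCTS:

```python
import torch
import torch.nn as nn

def train(net, dataloader, epochs=60, lr=1e-3):
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.L1Loss()                      # L1 loss on the predicted payoff
    for epoch in range(epochs):
        running = 0.0
        for states, payoffs in dataloader:       # payoffs come from MCTS sampling
            optimizer.zero_grad()
            pred = net(states).squeeze(1)        # (batch, 1) -> (batch,)
            loss = criterion(pred, payoffs)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch}: mean L1 loss {running / len(dataloader):.4f}")
```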
Once the CNN network has learned the policy discovered by the Monte Carlo tree search module, the CNN learning network can play Dou Dizhu directly. However, when the MCTS module learns, some decision branches cannot be explored sufficiently because of the time limit, so errors inevitably remain when the policy learned by the CNN network is used later. In view of this, a policy improvement method is introduced.
The policy improvement method is essentially Monte Carlo tree search. During play, after a state and one candidate play in that state are input to the CNN learning network, the network outputs the payoff value of that candidate play in the current state; by looping over every candidate play, the payoff of each candidate play in the current state can be obtained. These values are then used as the initial values of the Monte Carlo tree search in the policy improvement module, which keeps sampling and simulating the current state within the available time so as to correct errors that may exist in the CNN learning network. Moreover, since the CNN learning network is essentially the policy learned by the MCTS module, running MCTS again on top of the CNN network amounts to adding further samples, within a limited time, on top of the earlier sampling of the MCTS module, which brings the sampling result closer to the true result.
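A minimal sketch of this policy-improvement step is given below; `legal_deals`, `encode` and `mcts_refine` are hypothetical helpers standing in for the candidate-play generator, the 15 × 29 encoder and the time-limited Monte Carlo tree search described above:

```python
import torch

def choose_deal(net, state, legal_deals, encode, mcts_refine, time_budget=2.0):
    """Policy improvement sketch: query the CNN for every legal candidate play,
    then let a time-limited MCTS refine those estimates before choosing."""
    deals = legal_deals(state)                      # all candidate plays in this state
    with torch.no_grad():
        # CNN payoff estimate for each (state, candidate play) pair;
        # `encode` is assumed to return a (1, 1, 15, 29) tensor.
        priors = {d: net(encode(state, d)).item() for d in deals}
    # Use the CNN estimates as the initial node values of a fresh Monte Carlo
    # tree search and keep sampling within the time budget to correct errors
    # the CNN may have made; `mcts_refine` returns the updated payoff per play.
    refined = mcts_refine(state, priors, time_budget)
    return max(refined, key=refined.get)
```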
The experimental results are as follows:
First, the Dou Dizhu agents that can currently be tested are briefly introduced; they are then compared, directly or indirectly, with the Dou Dizhu agent implemented by the invention.
The RHCP algorithm: its general idea is to combine the hand according to the Dou Dizhu rules and to select, as the best play of the round, the play after which the remaining hand has the higher value.
Naive DQN and A3C: drawing on the experience of deep reinforcement learning in games, these apply that approach to the Dou Dizhu game.
CQL: the algorithm proposed by Yang You et al. of Shanghai Jiao Tong University.
Comparison with classical deep reinforcement learning algorithms:
Game setting: RHCP is used as the decision algorithm of the farmer players, and each of the other algorithms in turn plays as the landlord; the results are shown in the table below.
Algorithm                      Win rate
Naive DQN                      about 15%
A3C                            close to 20%
CQL                            close to 50%
Algorithm of the invention     65.6%
The DQN and A3C figures in the table are quoted directly from the results reported by Yang You et al. of Shanghai Jiao Tong University.
As can be seen from the table, when playing as the landlord against the RHCP algorithm, the algorithm of the invention outperforms classical deep reinforcement learning.
Comparison with the RHCP algorithm:
game setting: in the process of playing the game, 51 playing cards are randomly distributed to each player in an average mode through a program (in the process of dealing, 17 hands are distributed to each player), then one player is randomly assigned as a landowner, and the rest 3 landowners are highlighted and distributed to the landowner player. And starting from the landowner player, playing according to the landowner fighting game rule. In the gaming process: the decision algorithm of one role is MCM algorithm, and the decision algorithm of the other role is RHCP algorithm; wherein, the decision algorithms of the two farmers are both the decision algorithms of the role of the farmer. According to the game settings, the algorithm of the invention is used as a landowner and a farmer to respectively play 500 games with the RHCP algorithm, and the results are as follows:
1. When the algorithm of the invention plays as the landlord against the RHCP algorithm, the change in win rate is shown in FIG. 5; as can be seen from the graph, the invention's win rate is 65.6%, clearly higher than the RHCP algorithm's roughly 34.4%.
2. When the algorithm of the invention plays as a farmer against the RHCP algorithm, the change in win rate is shown in FIG. 6; as can be seen from the graph, the farmers' overall win rate is about 57%, while the RHCP algorithm, playing as the landlord, obtains a win rate of about 43%.
3. Without distinguishing roles, the overall win rate of the algorithm of the invention is 61.3%, and that of the RHCP algorithm is 38.7%.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (9)

1. A Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and a convolutional neural network, characterized by comprising the following steps:
randomly starting a game and, whenever a player is to play cards, taking that player's current state as a root node, with the actions the player may take under the Dou Dizhu rules as direct child nodes of the root node;
starting from the root node of the game tree, performing continuous simulation-sampling learning with a Monte Carlo tree search algorithm;
when enough data have been obtained with the Monte Carlo tree search algorithm, continuously training a convolutional neural network (CNN) learning network with (state and candidate card play, payoff of that candidate play in the current state) as data samples until the network is stable;
and further correcting and improving the CNN learning result with a policy improvement algorithm to address possible errors made by the CNN during learning.
2. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the state includes, but is not limited to, the number of cards already played by each player, the number of cards in each player's hand, and the current player's hand.
3. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the Monte Carlo tree search algorithm is a method that combines the Monte Carlo method with game tree search.
4. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 3, characterized in that: the Monte Carlo method replaces the expectation of a random variable with an empirical average, specifically as follows:
a series of returns G_1(s), G_2(s), ……, G_n(s) is obtained by the Monte Carlo method; according to the law of large numbers, as n tends to infinity the mean of the sampled returns approaches the expected value, and ν(s) is defined as the mean of this series of returns, namely:

$$\nu(s) = \frac{1}{n}\sum_{i=1}^{n} G_i(s) \qquad (1)$$

$$\nu(s) \rightarrow \nu_{\pi}(s) \ \text{as} \ n \rightarrow \infty \qquad (2)$$

where s is a game state and ν_π(s) is the expected value in game state s.
5. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 3, characterized in that: the game tree search algorithm comprises the following steps:
Selection of an expansion node: recursively apply a node selection function to choose, from all candidate nodes, one node as the root of the current expansion, and perform one simulation, starting from that node, of the game situation it represents;
Expansion step: add one or more nodes to the MCTS search tree, the common strategy being to add only one new node to the game tree in each iteration;
Simulation: simulate the actual playing process, running one game from the node where the simulation starts to a terminal state;
Feedback: the simulation result is fed back, layer by layer, from the terminal node of the simulation to its parent nodes and finally to the root node.
6. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the algorithm used for expansion-node selection is the UCT algorithm:

$$\gamma_i = \bar{X}_i + C\sqrt{\frac{2\ln N}{n_i}} \qquad (3)$$

where γ_i denotes the selection evaluation value of node i, X̄_i denotes the average benefit of node i, C is a constant that balances exploration and exploitation, n_i is the number of times the i-th node has served as the root node of a simulated search, and N is the total number of simulations of the parent node.
7. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: after the Monte Carlo tree search algorithm has reached the maximum number of samples or its allotted time has been exhausted, a decision is selected as the best decision of this MCTS run according to the evaluation value of each node among the first-layer child nodes of the game tree.
8. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the CNN learning network consists of 4 convolutional layers and 3 fully connected layers, with 3 pooling layers added and ReLU used as the activation function; the network takes the current state and one candidate card play as input, the input size is 15 × 29, and the output is the payoff of that candidate play in the current state.
9. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the policy improvement method is essentially a Monte Carlo tree search algorithm: after a state and one candidate play in that state are input to the CNN learning network, the CNN learning network outputs the payoff value of that candidate play in the current state, and by looping over every candidate play the payoff of each candidate play in the current state can be obtained; these payoff values are then used as the initial values of the Monte Carlo tree search in the policy improvement module, which keeps sampling and simulating the current state within a certain time budget so as to correct errors that may exist in the CNN learning network.
CN202010589925.XA 2020-06-24 2020-06-24 Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network Pending CN111729300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010589925.XA CN111729300A (en) 2020-06-24 2020-06-24 Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010589925.XA CN111729300A (en) 2020-06-24 2020-06-24 Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network

Publications (1)

Publication Number Publication Date
CN111729300A true CN111729300A (en) 2020-10-02

Family

ID=72651022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010589925.XA Pending CN111729300A (en) 2020-06-24 2020-06-24 Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network

Country Status (1)

Country Link
CN (1) CN111729300A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765242A (en) * 2021-04-07 2021-05-07 中至江西智能技术有限公司 Decision model data processing method and system based on game tree search algorithm
CN113159313A (en) * 2021-03-02 2021-07-23 北京达佳互联信息技术有限公司 Data processing method and device of game model, electronic equipment and storage medium
CN113628699A (en) * 2021-07-05 2021-11-09 武汉大学 Inverse synthetic problem solving method and device based on improved Monte Carlo reinforcement learning method
CN113673672A (en) * 2021-07-08 2021-11-19 哈尔滨工业大学 Curling game strategy generation method based on Monte Carlo reinforcement learning
WO2022247791A1 (en) * 2021-05-28 2022-12-01 南京邮电大学 Chess self-learning method and apparatus based on machine learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106422332A (en) * 2016-09-08 2017-02-22 腾讯科技(深圳)有限公司 Artificial intelligence operation method and device applied to game
CN108694440A (en) * 2018-05-14 2018-10-23 南京邮电大学 A kind of online extensive method of search in real time
CN109284812A (en) * 2018-09-19 2019-01-29 哈尔滨理工大学 A kind of video-game analogy method based on improvement DQN
CN109843401A (en) * 2017-10-17 2019-06-04 腾讯科技(深圳)有限公司 A kind of AI object behaviour model optimization method and device
CN110368690A (en) * 2019-07-31 2019-10-25 腾讯科技(深圳)有限公司 Gaming decision model training method, tactics of the game generation method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106422332A (en) * 2016-09-08 2017-02-22 腾讯科技(深圳)有限公司 Artificial intelligence operation method and device applied to game
CN109843401A (en) * 2017-10-17 2019-06-04 腾讯科技(深圳)有限公司 A kind of AI object behaviour model optimization method and device
CN108694440A (en) * 2018-05-14 2018-10-23 南京邮电大学 A kind of online extensive method of search in real time
CN109284812A (en) * 2018-09-19 2019-01-29 哈尔滨理工大学 A kind of video-game analogy method based on improvement DQN
CN110368690A (en) * 2019-07-31 2019-10-25 腾讯科技(深圳)有限公司 Gaming decision model training method, tactics of the game generation method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159313A (en) * 2021-03-02 2021-07-23 北京达佳互联信息技术有限公司 Data processing method and device of game model, electronic equipment and storage medium
CN112765242A (en) * 2021-04-07 2021-05-07 中至江西智能技术有限公司 Decision model data processing method and system based on game tree search algorithm
WO2022247791A1 (en) * 2021-05-28 2022-12-01 南京邮电大学 Chess self-learning method and apparatus based on machine learning
CN113628699A (en) * 2021-07-05 2021-11-09 武汉大学 Inverse synthetic problem solving method and device based on improved Monte Carlo reinforcement learning method
CN113673672A (en) * 2021-07-08 2021-11-19 哈尔滨工业大学 Curling game strategy generation method based on Monte Carlo reinforcement learning
CN113673672B (en) * 2021-07-08 2024-03-29 哈尔滨工业大学 Curling competition strategy generation method based on Monte Carlo reinforcement learning

Similar Documents

Publication Publication Date Title
CN111729300A (en) Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network
Torrado et al. Deep reinforcement learning for general video game ai
CN110404264B (en) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
Lee et al. The computational intelligence of MoGo revealed in Taiwan's computer Go tournaments
CN110404265B (en) Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium
Arabzad et al. Football match results prediction using artificial neural networks; the case of Iran Pro League
CN110119804A (en) A kind of Ai Ensitan chess game playing algorithm based on intensified learning
Zhao et al. Alphaholdem: High-performance artificial intelligence for heads-up no-limit poker via end-to-end reinforcement learning
CN110141867A (en) A kind of game intelligence body training method and device
Tak et al. Monte Carlo tree search variants for simultaneous move games
CN111506514B (en) Intelligent testing method and system applied to elimination game
Li et al. Research and implementation of Chinese chess game algorithm based on reinforcement learning
CN112691383A (en) Texas poker AI training method based on virtual regret minimization algorithm
Lu et al. Danzero: Mastering guandan game with reinforcement learning
Galván et al. On the evolution of the mcts upper confidence bounds for trees by means of evolutionary algorithms in the game of carcassonne
CN111617479B (en) Acceleration method and system of game artificial intelligence system
Vieira et al. Exploring Deep Reinforcement Learning for Battling in Collectible Card Games
Chia et al. Designing card game strategies with genetic programming and monte-carlo tree search: A case study of hearthstone
Cao et al. Research on the DouDiZhu's playing strategy based on XGBoost
Shen et al. Imperfect and cooperative guandan game system
CN114048833B (en) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
Conradie et al. Training bao game-playing agents using coevolutionary particle swarm optimization
Li et al. Study on the play strategy of dou dizhu poker based on convolution neural network
He A Review of the Application of Artificial Intelligence in Imperfect Information Games Represented by DouDiZhu
Gonzalez-Castro et al. Opponent models comparison for 2 players in GVGAI competitions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination