CN111729300A - Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network - Google Patents

Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network

Info

Publication number
CN111729300A
CN111729300A (application CN202010589925.XA)
Authority
CN
China
Prior art keywords
monte carlo
node
game
algorithm
tree search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010589925.XA
Other languages
Chinese (zh)
Inventor
王以松
彭啟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202010589925.XA
Publication of CN111729300A
Legal status: Pending

Links

Images

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/45 Controlling the progress of the video game
    • A63F 13/46 Computing the game score
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/80 Special adaptations for executing a specific game genre or game mode
    • A63F 13/822 Strategy games; Role-playing games
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/60 Methods for processing data by generating or executing the game program
    • A63F 2300/61 Score computation
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/80 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F 2300/807 Role playing or strategy games

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and a convolutional neural network, belonging to the technical field of machine learning, which comprises the following steps: randomly starting a game and, whenever a player is to play cards, taking that player's current state as a root node, with the actions the player may take under the Dou Dizhu rules as direct child nodes of the root node; starting from the root node of the game tree, performing continuous simulation-sampling learning with a Monte Carlo tree search algorithm; when enough data have been obtained with the Monte Carlo tree search algorithm, continuously training a convolutional neural network (CNN) learning network with (state and candidate card play, payoff of that candidate play in the current state) as data samples until the network is stable; and further correcting and improving the CNN learning result with a policy improvement algorithm to address possible errors made by the CNN during learning.

Description

Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network
Technical Field
The invention relates to the technical field of machine learning, and in particular to a Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and a convolutional neural network.
Background
In recent years, with the development of machine learning, remarkable results have also been achieved in perfect-information games. A milestone among them: on 15 March 2016, AlphaGo, developed by Google using deep learning, reinforcement learning and other methods, defeated the world Go champion Lee Sedol, marking the first time a machine achieved superhuman performance in Go. Subsequently, AlphaGo Zero was trained entirely through self-play: starting from random games, without any supervision or use of human data, and using only the black and white stones on the board as raw input features, it continuously learned and finally surpassed, by a significant margin, the AlphaGo that had been trained with human data. Card games such as Texas hold'em and Dou Dizhu, however, have not yet matched the proud achievements that machine learning has attained in board games such as Go and chess.
Researchers' work on card games has focused on imperfect-information betting games such as Texas hold'em, with good results, for example Libratus, developed by Carnegie Mellon University, and DeepStack, developed by the University of Alberta, both of which play human players in a one-on-one mode. Libratus solves the two-player game at each decision using Monte Carlo Counterfactual Regret Minimization (MCCFR) so that its strategy approaches a Nash equilibrium; however, such game-theoretic knowledge, Nash equilibrium included, is largely built on non-cooperative games, and there is currently no good solution for the cooperative play embodied by the farmers in games such as Dou Dizhu. DeepStack, for its part, relies on deep learning. Moreover, both Libratus and DeepStack can only play one-on-one games.
There has been less research on the Dou Dizhu game than on Texas hold'em. Yang You et al. of Shanghai Jiao Tong University proposed a Combinational Q-Learning (CQL) solution, which achieved significant results in comparison with standard deep reinforcement learning methods such as DQN and A3C (their experiments showed that the DQN algorithm did not converge on Dou Dizhu). However, such methods often do not perform as well in actual games against human players.
Dou Dizhu is a poker game that is widely played and well loved in China; in Tencent's 2018 annual Dou Dizhu championship tournament, the number of participants reached 80 million. Nevertheless, Dou Dizhu has so far received comparatively little research attention, mainly because it is more difficult and less valued. Dou Dizhu is a three-player card game, and its rules belong to the prior art. The research difficulty is mainly reflected in the following aspects: (1) during the game, card information is partially hidden; (2) under the Dou Dizhu rules, the action space during play is large; (3) the game is a multi-player game in which cooperative play is also embodied.
On this basis, a Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network is designed to solve the above problems.
Disclosure of Invention
The invention aims to provide a Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and a convolutional neural network, so as to solve the problems raised in the background art above.
In order to achieve the above purpose, the invention provides the following technical scheme: a Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network, comprising the following steps:
randomly starting a game and, whenever a player is to play cards, taking that player's current state as a root node, with the actions the player may take under the Dou Dizhu rules as direct child nodes of the root node;
starting from the root node of the game tree, performing continuous simulation-sampling learning with a Monte Carlo tree search algorithm;
when enough data have been obtained with the Monte Carlo tree search algorithm, continuously training a convolutional neural network (CNN) learning network with (state and candidate card play, payoff of that candidate play in the current state) as data samples until the network is stable;
and further correcting and improving the CNN learning result with a policy improvement algorithm to address possible errors made by the CNN during learning.
Preferably, the state includes, but is not limited to, the number of cards already played by each player, the number of cards in each player's hand, and the current player's hand.
Preferably, the Monte Carlo tree search algorithm is a method that combines the Monte Carlo method with game tree search.
Preferably, the Monte Carlo method replaces the expectation of a random variable with an empirical average, specifically as follows:
a series of returns G_1(s), G_2(s), ……, G_n(s) is obtained by the Monte Carlo method; according to the law of large numbers, as n tends to infinity the mean of the sampled returns approaches the expected value, and ν(s) is defined as the mean of this series of returns, namely:

$$\nu(s) = \frac{1}{n}\sum_{i=1}^{n} G_i(s) \qquad (1)$$

$$\nu(s) \rightarrow \nu_{\pi}(s) \ \text{as} \ n \rightarrow \infty \qquad (2)$$

where s is a game state and ν_π(s) is the expected value in game state s.
Preferably, the game tree search algorithm comprises the following steps:
Selection of an expansion node: recursively apply a node selection function to choose, from all candidate nodes, one node as the root of the current expansion, and perform one simulation, starting from that node, of the game situation it represents;
Expansion step: add one or more nodes to the MCTS search tree, the common strategy being to add only one new node to the game tree in each iteration;
Simulation: simulate the actual playing process, running one game from the node where the simulation starts to a terminal state;
Feedback: the simulation result is fed back, layer by layer, from the terminal node of the simulation to its parent nodes and finally to the root node.
Preferably, the algorithm used for expansion-node selection is the UCT algorithm:

$$\gamma_i = \bar{X}_i + C\sqrt{\frac{2\ln N}{n_i}} \qquad (3)$$

where γ_i denotes the selection evaluation value of node i, X̄_i denotes the average benefit of node i, C is a constant that balances exploration and exploitation, n_i is the number of times the i-th node has served as the root node of a simulated search, and N is the total number of simulations of the parent node.
Preferably, after the Monte Carlo tree search algorithm has reached the maximum number of samples or its allotted time has been exhausted, a decision is selected as the best decision of this MCTS run according to the evaluation value of each node among the first-layer child nodes of the game tree.
Preferably, the CNN learning network consists of 4 convolutional layers and 3 fully connected layers, with 3 pooling layers added and ReLU used as the activation function; the network takes the current state and one candidate card play as input, the input size is 15 × 29, and the output is the payoff of that candidate play in the current state.
Preferably, the policy improvement method is essentially a Monte Carlo tree search algorithm: after a state and one candidate play in that state are input to the CNN learning network, the CNN learning network outputs the payoff value of that candidate play in the current state, and by looping over every candidate play the payoff of each candidate play in the current state can be obtained; these payoff values are then used as the initial values of the Monte Carlo tree search in the policy improvement module, which keeps sampling and simulating the current state within a certain time budget so as to correct errors that may exist in the CNN learning network.
Compared with the prior art, the invention has the following beneficial effects:
To address the problems caused by the large action space and the relationships among players, the invention uses Monte Carlo tree search to sample and simulate the action space, so that after continuous simulation the payoff value of every candidate card-playing action is obtained and the action with the largest payoff is selected as the best play of the round. The invention runs continuous games with MCTS and supplies the results, as raw data, to the convolutional neural network for learning; the game state can then be given to the CNN, the payoff of every candidate play in the current state is computed, and the play with the largest payoff is selected as the best play of the round, which greatly shortens decision time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a MCTS game tree search algorithm process according to the present invention;
FIG. 3 is a diagram of a CNN learning framework according to the present invention;
FIG. 4 is a graph of the loss variation obtained from the L1 loss function of the present invention;
FIG. 5 is a graph of the change in win rate between the algorithm of the invention and the RHCP algorithm when the invention plays as the landlord;
FIG. 6 is a graph of the change in win rate between the algorithm of the invention and the RHCP algorithm when the invention plays as a farmer.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the invention provides a technical solution: a Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network, comprising the following steps:
randomly starting a game and, whenever a player is to play cards (i.e. to make a decision), taking that player's current state (the number of cards each player has already played, the number of cards in each player's hand, the current player's hand, and so on) as a root node, with the actions the player may take under the Dou Dizhu rules as direct child nodes of the root node;
starting from the root node of the game tree, performing continuous simulation-sampling learning with the MCTS algorithm;
computing and saving the payoff value of each candidate play in the current state, and repeating this process continuously;
when enough data have been obtained with the MCTS algorithm, continuously training a CNN learning network with (state and candidate card play, payoff of that candidate play in the current state) as data samples until the network is stable;
and further correcting and improving the CNN learning result with a policy improvement algorithm to address possible errors made by the CNN during learning.
To cope with the large scale of the problem, researchers have combined the Monte Carlo method with game tree search, giving the Monte Carlo Tree Search (MCTS) algorithm.
The Monte Carlo method replaces the expectation of a random variable with an empirical average. For example, suppose the expected value of game state s is ν_π(s); this expected value is difficult to compute directly, but a series of returns G_1(s), G_2(s), ……, G_n(s) can be obtained by the Monte Carlo method. According to the law of large numbers, as n approaches infinity the mean of the sampled returns approaches the expected value. Define ν(s) as the mean of this series of returns, i.e.:

$$\nu(s) = \frac{1}{n}\sum_{i=1}^{n} G_i(s) \qquad (1)$$

$$\nu(s) \rightarrow \nu_{\pi}(s) \ \text{as} \ n \rightarrow \infty \qquad (2)$$
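As a concrete illustration of equations (1) and (2), a minimal Python sketch follows (this is not code from the patent; `simulate_return` is a hypothetical placeholder for one complete simulated play-out from state s under the current policy):

```python
def estimate_state_value(state, simulate_return, n_samples=1000):
    """Monte Carlo estimate of v(s) as in equations (1) and (2):
    average the returns G_1(s), ..., G_n(s) from n simulated play-outs.
    `simulate_return` is assumed to play one complete game from `state`
    under the current policy and return the final payoff."""
    total = 0.0
    for _ in range(n_samples):
        total += simulate_return(state)
    # By the law of large numbers this average approaches v_pi(s) as n grows.
    return total / n_samples
```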
the MCTS algorithm starts with initializing the search tree, generally abstracting the current game state as a single game tree root node, and once the game tree initialization is completed, the MCTS game tree search algorithm process shown in fig. 2 is repeated within a specified time, and the search process can be divided into 4 steps:
Selection of an expansion node: recursively apply a node selection function to choose, from all candidate nodes, one node as the root of the current expansion, and perform one simulation, starting from that node, of the game situation it represents.
Expansion step: add one or more nodes to the MCTS search tree; the common strategy is to add only one new node to the game tree in each iteration.
Simulation: simulate the actual playing process, running one game from the node where the simulation starts to a terminal state.
Feedback: the simulation result is fed back, layer by layer, from the terminal node of the simulation to its parent nodes and finally to the root node.
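The four steps above fit together into the usual MCTS loop. The sketch below is a generic, simplified illustration rather than the patent's implementation; the node structure and the `legal_moves`, `apply_move` and `rollout` callables are assumptions introduced here for illustration:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.visits = 0          # times this node has been simulated (n_i)
        self.total_return = 0.0  # sum of back-propagated returns

def uct(node, c=1.4):
    # Selection criterion; see equation (3) below for the UCT form.
    if node.visits == 0:
        return float("inf")
    return (node.total_return / node.visits
            + c * math.sqrt(2 * math.log(node.parent.visits) / node.visits))

def mcts(root_state, legal_moves, apply_move, rollout, n_iterations=10000):
    root = Node(root_state)
    for _ in range(n_iterations):
        # 1. Selection: descend the tree, taking the child with the best UCT value.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expansion: add one new child per iteration (the common strategy).
        tried = {child.move for child in node.children}
        untried = [m for m in legal_moves(node.state) if m not in tried]
        if untried:
            move = random.choice(untried)
            child = Node(apply_move(node.state, move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: play out to a terminal state and obtain a return.
        result = rollout(node.state)
        # 4. Feedback: propagate the result back up to the root.
        while node is not None:
            node.visits += 1
            node.total_return += result
            node = node.parent
    # Final decision: the first-layer child with the best average evaluation.
    best = max(root.children, key=lambda ch: ch.total_return / ch.visits)
    return best.move
```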
After the maximum number of samples has been reached or the allotted time has been exhausted, the MCTS algorithm selects one decision as its best decision according to the evaluation value of each node among the first-layer child nodes of the game tree.
The algorithms used for expansion-node selection differ considerably. Among them, the UCT algorithm, proposed by Kocsis and Szepesvari in 2006 on the basis of the UCB algorithm to address the expansion-node selection problem in MCTS, is widely used. The UCT algorithm is:
$$\gamma_i = \bar{X}_i + C\sqrt{\frac{2\ln N}{n_i}} \qquad (3)$$

In equation (3), γ_i denotes the selection evaluation value of node i, X̄_i denotes the average benefit of node i, C is a constant that balances exploration and exploitation, n_i is the number of times the i-th node has served as the root node of a simulated search, and N is the total number of simulations of the parent node. Practice has proved this algorithm to be very efficient.
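A small numerical illustration of equation (3) (the numbers are arbitrary, not taken from the patent) shows how the constant C trades off exploration against exploitation: a child with a slightly lower average benefit but far fewer visits can still obtain the higher γ value and be explored next:

```python
import math

def uct_value(mean_return, parent_visits, node_visits, c=1.4):
    # gamma_i = X_bar_i + C * sqrt(2 * ln(N) / n_i)   -- equation (3)
    return mean_return + c * math.sqrt(2 * math.log(parent_visits) / node_visits)

# Parent simulated 1000 times; compare two children (illustrative numbers only).
well_explored = uct_value(mean_return=0.60, parent_visits=1000, node_visits=400)
rarely_tried  = uct_value(mean_return=0.55, parent_visits=1000, node_visits=10)
print(round(well_explored, 3), round(rarely_tried, 3))
# The rarely tried node scores higher (about 2.20 vs 0.86), so UCT explores it next.
```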
When the Monte Carlo tree search module is used directly as the playing policy, the effect is good, but at the start of every round MCTS must be run once with the current state as the root node, and to guarantee accuracy the simulation sampling with MCTS in each round takes several minutes, which is clearly unacceptable during actual play. In addition, the results that MCTS spends a great deal of time computing in each round are not reused effectively. For this reason, the invention uses a convolutional neural network (CNN) to learn the results of the MCTS search, making full use of the output of the MCTS module so as to shorten game decision time.
The CNN consists of 4 convolutional layers and 3 fully connected layers, with 3 pooling layers added and ReLU used as the activation function. The network takes the current state and one candidate play as input, the input size is 15 × 29, and the output is the payoff of that candidate play in the current state. The input is specifically represented as follows:
X dimension: the positions of the X dimension represent, in order, the 15 card ranks {3, 4, 5, 6, 7, 8, 9, T, J, Q, K, A, 2, L, B} (L and B denoting the little and big jokers).
Y dimension: the position indices of the Y dimension indicate, for the corresponding rank in the X dimension, the number of cards held (counts 0-4 encoded as 0000, 0001, 0010, 0100, 1000 respectively); positions 4-13 of the Y dimension indicate the 10 play types, such as single, pair, trio_plus_one, trio_plus_two, sequence_one, sequence_two, sequence_three, bomb and king_bomb; and position 14 indicates whether the player making the play is the landlord.
Z dimension: the Z-dimension encoding is given as a table (reproduced only as an image in the original publication).
As shown in FIG. 3, the CNN learning framework includes 4 convolutional layers, 3 pooling layers, and 3 fully connected layers.
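One plausible realization of this network in PyTorch is sketched below. It matches the stated structure (4 convolutional layers, 3 pooling layers, 3 fully connected layers, ReLU activations, a 15 × 29 input and a single scalar payoff output), but the class name `DealValueNet`, channel counts, kernel sizes and padding are illustrative assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn

class DealValueNet(nn.Module):
    """Maps a 15 x 29 state/candidate-play encoding to a scalar payoff estimate."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 15x29 -> 7x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 7x14 -> 3x7
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # 3x7 -> 1x3
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 1 * 3, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),                # predicted payoff of this candidate play
        )

    def forward(self, x):                    # x: (batch, 1, 15, 29)
        return self.head(self.features(x))

net = DealValueNet()
dummy = torch.zeros(8, 1, 15, 29)            # a batch of 8 encoded states
print(net(dummy).shape)                       # torch.Size([8, 1])
```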
Using the CNN learning model to learn from more than 5 million records saved by MCTS, with the L1 loss function, yields the loss curve shown in FIG. 4, where the x-axis represents the epoch and the y-axis the payoff loss. As can be seen from the figure, the loss drops sharply between epoch 0 and epoch 30; after epoch 30 it changes little and finally settles at around 0.033.
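Training on the records saved by the MCTS module can then follow a standard supervised loop with the L1 loss. A minimal sketch, assuming the `DealValueNet` from the previous sketch and a hypothetical `dataloader` yielding (encoded state/play, payoff) batches produced by MCTS:

```python
import torch
import torch.nn as nn

def train(net, dataloader, epochs=60, lr=1e-3):
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.L1Loss()                      # L1 loss on the predicted payoff
    for epoch in range(epochs):
        running = 0.0
        for states, payoffs in dataloader:       # payoffs come from MCTS sampling
            optimizer.zero_grad()
            pred = net(states).squeeze(1)        # (batch, 1) -> (batch,)
            loss = criterion(pred, payoffs)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch}: mean L1 loss {running / len(dataloader):.4f}")
```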
Once the CNN network has learned the policy discovered by the Monte Carlo tree search module, the CNN learning network can play Dou Dizhu directly. However, when the MCTS module learns, some decision branches cannot be explored sufficiently because of the time limit, so errors inevitably remain when the policy learned by the CNN network is used later. In view of this, a policy improvement method is introduced.
The policy improvement method is essentially Monte Carlo tree search. During play, after a state and one candidate play in that state are input to the CNN learning network, the network outputs the payoff value of that candidate play in the current state; by looping over every candidate play, the payoff of each candidate play in the current state can be obtained. These values are then used as the initial values of the Monte Carlo tree search in the policy improvement module, which keeps sampling and simulating the current state within the available time so as to correct errors that may exist in the CNN learning network. Moreover, since the CNN learning network is essentially the policy learned by the MCTS module, running MCTS again on top of the CNN network amounts to adding further samples, within a limited time, on top of the earlier sampling of the MCTS module, which brings the sampling result closer to the true result.
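A minimal sketch of this policy-improvement step is given below; `legal_deals`, `encode` and `mcts_refine` are hypothetical helpers standing in for the candidate-play generator, the 15 × 29 encoder and the time-limited Monte Carlo tree search described above:

```python
import torch

def choose_deal(net, state, legal_deals, encode, mcts_refine, time_budget=2.0):
    """Policy improvement sketch: query the CNN for every legal candidate play,
    then let a time-limited MCTS refine those estimates before choosing."""
    deals = legal_deals(state)                      # all candidate plays in this state
    with torch.no_grad():
        # CNN payoff estimate for each (state, candidate play) pair;
        # `encode` is assumed to return a (1, 1, 15, 29) tensor.
        priors = {d: net(encode(state, d)).item() for d in deals}
    # Use the CNN estimates as the initial node values of a fresh Monte Carlo
    # tree search and keep sampling within the time budget to correct errors
    # the CNN may have made; `mcts_refine` returns the updated payoff per play.
    refined = mcts_refine(state, priors, time_budget)
    return max(refined, key=refined.get)
```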
The experimental results are as follows:
First, the Dou Dizhu agents that can currently be tested are briefly introduced; they are then compared, directly or indirectly, with the Dou Dizhu agent implemented by the invention.
The RHCP algorithm: its general idea is to combine the hand according to the Dou Dizhu rules and to select, as the best play of the round, the play after which the remaining hand has the higher value.
Naive DQN and A3C: drawing on the experience of deep reinforcement learning in games, these apply that approach to the Dou Dizhu game.
CQL: the algorithm proposed by Yang You et al. of Shanghai Jiao Tong University.
Comparison with classical deep reinforcement learning algorithms:
Game setting: RHCP is used as the decision algorithm of the farmer players, and each of the other algorithms in turn plays as the landlord; the results are shown in the table below.
Algorithm                      Win rate
Naive DQN                      about 15%
A3C                            close to 20%
CQL                            close to 50%
Algorithm of the invention     65.6%
The DQN and A3C figures in the table are quoted directly from the results reported by Yang You et al. of Shanghai Jiao Tong University.
As can be seen from the table, when playing as the landlord against the RHCP algorithm, the algorithm of the invention outperforms classical deep reinforcement learning.
Comparison with the RHCP algorithm:
game setting: in the process of playing the game, 51 playing cards are randomly distributed to each player in an average mode through a program (in the process of dealing, 17 hands are distributed to each player), then one player is randomly assigned as a landowner, and the rest 3 landowners are highlighted and distributed to the landowner player. And starting from the landowner player, playing according to the landowner fighting game rule. In the gaming process: the decision algorithm of one role is MCM algorithm, and the decision algorithm of the other role is RHCP algorithm; wherein, the decision algorithms of the two farmers are both the decision algorithms of the role of the farmer. According to the game settings, the algorithm of the invention is used as a landowner and a farmer to respectively play 500 games with the RHCP algorithm, and the results are as follows:
1. When the algorithm of the invention plays as the landlord against the RHCP algorithm, the change in win rate is shown in FIG. 5; as can be seen from the graph, the invention's win rate is 65.6%, clearly higher than the RHCP algorithm's roughly 34.4%.
2. When the algorithm of the invention plays as a farmer against the RHCP algorithm, the change in win rate is shown in FIG. 6; as can be seen from the graph, the farmers' overall win rate is about 57%, while the RHCP algorithm, playing as the landlord, obtains a win rate of about 43%.
3. Without distinguishing roles, the overall win rate of the algorithm of the invention is 61.3%, and that of the RHCP algorithm is 38.7%.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (9)

1. A Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and a convolutional neural network, characterized by comprising the following steps:
randomly starting a game and, whenever a player is to play cards, taking that player's current state as a root node, with the actions the player may take under the Dou Dizhu rules as direct child nodes of the root node;
starting from the root node of the game tree, performing continuous simulation-sampling learning with a Monte Carlo tree search algorithm;
when enough data have been obtained with the Monte Carlo tree search algorithm, continuously training a convolutional neural network (CNN) learning network with (state and candidate card play, payoff of that candidate play in the current state) as data samples until the network is stable;
and further correcting and improving the CNN learning result with a policy improvement algorithm to address possible errors made by the CNN during learning.
2. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the state includes, but is not limited to, the number of cards already played by each player, the number of cards in each player's hand, and the current player's hand.
3. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the Monte Carlo tree search algorithm is a method that combines the Monte Carlo method with game tree search.
4. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 3, characterized in that: the Monte Carlo method replaces the expectation of a random variable with an empirical average, specifically as follows:
a series of returns G_1(s), G_2(s), ……, G_n(s) is obtained by the Monte Carlo method; according to the law of large numbers, as n tends to infinity the mean of the sampled returns approaches the expected value, and ν(s) is defined as the mean of this series of returns, namely:

$$\nu(s) = \frac{1}{n}\sum_{i=1}^{n} G_i(s) \qquad (1)$$

$$\nu(s) \rightarrow \nu_{\pi}(s) \ \text{as} \ n \rightarrow \infty \qquad (2)$$

where s is a game state and ν_π(s) is the expected value in game state s.
5. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 3, characterized in that: the game tree search algorithm comprises the following steps:
Selection of an expansion node: recursively apply a node selection function to choose, from all candidate nodes, one node as the root of the current expansion, and perform one simulation, starting from that node, of the game situation it represents;
Expansion step: add one or more nodes to the MCTS search tree, the common strategy being to add only one new node to the game tree in each iteration;
Simulation: simulate the actual playing process, running one game from the node where the simulation starts to a terminal state;
Feedback: the simulation result is fed back, layer by layer, from the terminal node of the simulation to its parent nodes and finally to the root node.
6. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the algorithm used for expansion-node selection is the UCT algorithm:

$$\gamma_i = \bar{X}_i + C\sqrt{\frac{2\ln N}{n_i}} \qquad (3)$$

where γ_i denotes the selection evaluation value of node i, X̄_i denotes the average benefit of node i, C is a constant that balances exploration and exploitation, n_i is the number of times the i-th node has served as the root node of a simulated search, and N is the total number of simulations of the parent node.
7. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: after the Monte Carlo tree search algorithm has reached the maximum number of samples or its allotted time has been exhausted, a decision is selected as the best decision of this MCTS run according to the evaluation value of each node among the first-layer child nodes of the game tree.
8. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the CNN learning network consists of 4 convolutional layers and 3 fully connected layers, with 3 pooling layers added and ReLU used as the activation function; the network takes the current state and one candidate card play as input, the input size is 15 × 29, and the output is the payoff of that candidate play in the current state.
9. The Dou Dizhu strategy research method based on Monte Carlo tree search and a convolutional neural network according to claim 1, characterized in that: the policy improvement method is essentially a Monte Carlo tree search algorithm: after a state and one candidate play in that state are input to the CNN learning network, the CNN learning network outputs the payoff value of that candidate play in the current state, and by looping over every candidate play the payoff of each candidate play in the current state can be obtained; these payoff values are then used as the initial values of the Monte Carlo tree search in the policy improvement module, which keeps sampling and simulating the current state within a certain time budget so as to correct errors that may exist in the CNN learning network.
CN202010589925.XA 2020-06-24 2020-06-24 Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network Pending CN111729300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010589925.XA CN111729300A (en) 2020-06-24 2020-06-24 Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010589925.XA CN111729300A (en) 2020-06-24 2020-06-24 Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network

Publications (1)

Publication Number Publication Date
CN111729300A true CN111729300A (en) 2020-10-02

Family

ID=72651022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010589925.XA Pending CN111729300A (en) 2020-06-24 2020-06-24 Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network

Country Status (1)

Country Link
CN (1) CN111729300A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765242A (en) * 2021-04-07 2021-05-07 中至江西智能技术有限公司 Decision model data processing method and system based on game tree search algorithm
CN113159313A (en) * 2021-03-02 2021-07-23 北京达佳互联信息技术有限公司 Data processing method and device of game model, electronic equipment and storage medium
CN113628699A (en) * 2021-07-05 2021-11-09 武汉大学 Inverse synthetic problem solving method and device based on improved Monte Carlo reinforcement learning method
CN113673672A (en) * 2021-07-08 2021-11-19 哈尔滨工业大学 Curling game strategy generation method based on Monte Carlo reinforcement learning
WO2022247791A1 (en) * 2021-05-28 2022-12-01 南京邮电大学 Chess self-learning method and apparatus based on machine learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106422332A (en) * 2016-09-08 2017-02-22 腾讯科技(深圳)有限公司 Artificial intelligence operation method and device applied to game
CN108694440A (en) * 2018-05-14 2018-10-23 南京邮电大学 A kind of online extensive method of search in real time
CN109284812A (en) * 2018-09-19 2019-01-29 哈尔滨理工大学 A kind of video-game analogy method based on improvement DQN
CN109843401A (en) * 2017-10-17 2019-06-04 腾讯科技(深圳)有限公司 A kind of AI object behaviour model optimization method and device
CN110368690A (en) * 2019-07-31 2019-10-25 腾讯科技(深圳)有限公司 Gaming decision model training method, tactics of the game generation method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106422332A (en) * 2016-09-08 2017-02-22 腾讯科技(深圳)有限公司 Artificial intelligence operation method and device applied to game
CN109843401A (en) * 2017-10-17 2019-06-04 腾讯科技(深圳)有限公司 A kind of AI object behaviour model optimization method and device
CN108694440A (en) * 2018-05-14 2018-10-23 南京邮电大学 A kind of online extensive method of search in real time
CN109284812A (en) * 2018-09-19 2019-01-29 哈尔滨理工大学 A kind of video-game analogy method based on improvement DQN
CN110368690A (en) * 2019-07-31 2019-10-25 腾讯科技(深圳)有限公司 Gaming decision model training method, tactics of the game generation method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159313A (en) * 2021-03-02 2021-07-23 北京达佳互联信息技术有限公司 Data processing method and device of game model, electronic equipment and storage medium
CN112765242A (en) * 2021-04-07 2021-05-07 中至江西智能技术有限公司 Decision model data processing method and system based on game tree search algorithm
WO2022247791A1 (en) * 2021-05-28 2022-12-01 南京邮电大学 Chess self-learning method and apparatus based on machine learning
CN113628699A (en) * 2021-07-05 2021-11-09 武汉大学 Inverse synthetic problem solving method and device based on improved Monte Carlo reinforcement learning method
CN113673672A (en) * 2021-07-08 2021-11-19 哈尔滨工业大学 Curling game strategy generation method based on Monte Carlo reinforcement learning
CN113673672B (en) * 2021-07-08 2024-03-29 哈尔滨工业大学 Curling competition strategy generation method based on Monte Carlo reinforcement learning

Similar Documents

Publication Publication Date Title
CN111729300A (en) Dou Dizhu (fighting-the-landlord) strategy research method based on Monte Carlo tree search and convolutional neural network
Torrado et al. Deep reinforcement learning for general video game ai
CN110404264B (en) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
Lee et al. The computational intelligence of MoGo revealed in Taiwan's computer Go tournaments
CN110404265B (en) Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium
Arabzad et al. Football match results prediction using artificial neural networks; the case of Iran Pro League
CN110119804A (en) A kind of Ai Ensitan chess game playing algorithm based on intensified learning
Zhao et al. Alphaholdem: High-performance artificial intelligence for heads-up no-limit poker via end-to-end reinforcement learning
CN110141867A (en) A kind of game intelligence body training method and device
Tak et al. Monte Carlo tree search variants for simultaneous move games
CN111506514B (en) Intelligent testing method and system applied to elimination game
Li et al. Research and implementation of Chinese chess game algorithm based on reinforcement learning
CN112691383A (en) Texas poker AI training method based on virtual regret minimization algorithm
Lu et al. Danzero: Mastering guandan game with reinforcement learning
Galván et al. On the evolution of the mcts upper confidence bounds for trees by means of evolutionary algorithms in the game of carcassonne
CN111617479B (en) Acceleration method and system of game artificial intelligence system
Vieira et al. Exploring Deep Reinforcement Learning for Battling in Collectible Card Games
Chia et al. Designing card game strategies with genetic programming and monte-carlo tree search: A case study of hearthstone
Cao et al. Research on the DouDiZhu's playing strategy based on XGBoost
Shen et al. Imperfect and cooperative guandan game system
CN114048833B (en) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
Conradie et al. Training bao game-playing agents using coevolutionary particle swarm optimization
Li et al. Study on the play strategy of dou dizhu poker based on convolution neural network
He A Review of the Application of Artificial Intelligence in Imperfect Information Games Represented by DouDiZhu
Gonzalez-Castro et al. Opponent models comparison for 2 players in GVGAI competitions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination