CN110717591A - Drop strategy and board evaluation method applicable to various chess types - Google Patents

Drop strategy and board evaluation method applicable to various chess types

Info

Publication number
CN110717591A
CN110717591A (application CN201910929174.9A)
Authority
CN
China
Prior art keywords
probability
neural network
falling
pos
chess
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910929174.9A
Other languages
Chinese (zh)
Other versions
CN110717591B (en)
Inventor
Lu Hong (路红)
Wang Lin (王琳)
Yang Bohong (杨博弘)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN201910929174.9A
Publication of CN110717591A
Application granted
Publication of CN110717591B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F3/00 Board games; Raffle games
    • A63F3/02 Chess; Similar board games
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions


Abstract

The invention belongs to the technical field of computer games, and specifically relates to a drop strategy and board evaluation method applicable to various chess types. The method comprises the following steps: predicting the move probability and the position value with a neural network; generating training data with the MCTS algorithm and the Update Board Value algorithm; iteratively training the neural network by reinforcement learning; and finally using the MCTS algorithm to output the move strategy and the board evaluation. The invention provides a situation-evaluation function and a move-strategy function that are human-friendly, require no prior knowledge of the first-move advantage, and are applicable to many game types (such as Go, Reversi, Chinese chess, and checkers).

Description

Drop strategy and board evaluation method applicable to various chess types
Technical Field
The invention belongs to the technical field of computer games, and specifically relates to a drop strategy and board evaluation method for various chess types.
Background
With the development of computer game technology, the chess world champion was defeated by Deep Blue, based on the alpha-beta algorithm, and the Go world champion was defeated by the AlphaGo Zero algorithm. However, situation evaluation in computer games remains an open problem. Drawing an analogy with computer vision, the situation-evaluation function of AlphaZero is comparable to image classification: it outputs the winning probability of the current position. The human approach is closer to pixel-level image segmentation: it estimates the occupation probability of every point in the current position.
Solving the situation-judgment function of games such as Go and Reversi from human knowledge alone is difficult. Likewise, the relative value of each piece in chess and Chinese chess is hard to assess. If the situation-judgment function could be solved, it would greatly help humans analyse territory games such as Go and Reversi. If the relative piece values of games with multiple piece types, such as chess and Chinese chess, could be solved, it would greatly help human understanding and analysis.
At present, situation judgment for territory games mostly relies on hand-crafted knowledge matching or on neural networks trained by supervised learning, and the relative piece values of games with multiple piece types are set from human experience. In addition, every such game has a first-move advantage, which must be balanced between the first and second players through human experience.
For territory games, the board size is variable and the first-move advantage differs across board sizes, so humans are currently limited to a few common board sizes (such as the 19-, 13-, and 9-line Go boards and the 8-line Reversi board), for which the first-move advantage is well understood. For unusual boards, only rough guesses about the value of the first-move advantage can be made. Meanwhile, because piece values are hard to judge, games such as chess and Chinese chess have only three outcomes: win, loss, and draw. If piece values were quantified, the outcome could be converted into a real number, so that the relative merits of different states could be distinguished more finely, as in Go and Reversi.
For the AlphaZero algorithm to work properly, an approximate value of the first-move advantage must be known in advance. To know the true value of the first-move advantage, one must already play at a high level on a board of the given size. This clearly violates the algorithm's idea of "learning from scratch, without human knowledge". An algorithm is therefore needed that does not require prior knowledge of the first-move advantage and can estimate its value gradually.
Disclosure of Invention
The invention aims to provide a move strategy and board evaluation method that is human-friendly, requires no prior knowledge of the first-move advantage, is based on deep neural networks and reinforcement learning, and is applicable to various chess types.
The invention provides a move strategy and board evaluation method applicable to various chess types. It predicts the move probability and position value with a neural network, generates training data with the MCTS algorithm and the Update Board Value algorithm, and iteratively trains the neural network by reinforcement learning; finally, the MCTS algorithm outputs the move strategy and the board evaluation. The invention is applicable to many game types (such as Go, Reversi, Chinese chess, and Chinese checkers) and comprises the following steps:
(1) implementing the situation-evaluation function and the move-strategy function with a residual neural network [1] and a pixel-level segmentation method [2][3];
(2) generating training data with the MCTS algorithm [4], an Early Stop algorithm, and the Update Board Value algorithm proposed in this patent;
(3) repeating step (1) and step (2), training iteratively to obtain a trained neural network;
(4) generating the final situation-evaluation function and move-strategy function with the neural network trained in step (3) and the MCTS algorithm.
Wherein:
the situation-evaluation function and move-strategy function of step (1) are implemented as follows:
(11) the input is the 8 most recent positions, each comprising C_K channels, which together form the input block;
(12) the input block passes through a residual tower, batch normalization, and a ReLU activation function in sequence; the residual tower contains K residual blocks, each with C channels;
(13) different head structures are used for placement-type games (such as Go and Reversi) and movement-type games (such as chess, Chinese chess, and Chinese checkers) to obtain the move-strategy function;
(14) with the pixel-level segmentation method, the output of step (12) passes through a 1x1 convolution with C_K channels and a channel-wise Softmax, yielding the situation-evaluation function.
The training data of step (2) is generated as follows:
(21) the MCTS algorithm, run with S_1 searches per move, generates the move probabilities and the situation evaluation of each step; the next move is sampled according to these probabilities and played;
(22) step (21) is repeated until the game ends; meanwhile the Early Stop algorithm is applied: if the evaluation is detected to be stable for 2 consecutive steps, or one side's advantage is detected to be excessive for 4 consecutive steps, the game is terminated early;
(23) from the resulting move probabilities and situation evaluations of length T, the Update Board Value algorithm synthesizes the training data.
The iterative training of step (3) proceeds as follows:
(31) the training data generated in step (2) is inserted into an experience pool of capacity R; if the pool holds more than R entries, the oldest data is evicted;
(32) after every G games, data is sampled at random from the experience pool and used to train the neural network;
(33) the neural network used by MCTS is replaced with the newly trained network parameters;
(34) the above steps are repeated for iterative training (a minimal sketch of this loop is given below).
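As an illustration of the loop in steps (31)-(34), the following Python sketch shows an experience pool of capacity R with oldest-first eviction and a periodic train-and-replace cycle. The pool size, batch size, and the stub self-play and training functions are hypothetical stand-ins, not the patent's implementation.

    import random
    from collections import deque

    R = 100_000   # experience-pool capacity (hypothetical value)
    G = 100       # games played between two training rounds (hypothetical)

    pool = deque(maxlen=R)   # step (31): oldest entries are evicted automatically

    def self_play_one_game(network):
        """Stub for step (2): returns synthesized (input, pi, BV) training tuples."""
        return [("input planes", "target pi", "target BV")]

    def train_step(network, batch):
        """Stub for step (32): forward pass, loss, back-propagation."""
        pass

    def training_iteration(network, mcts_network_slot, batch_size=512, steps=1000):
        for _ in range(G):                     # play G self-play games
            pool.extend(self_play_one_game(mcts_network_slot["net"]))
        for _ in range(steps):                 # step (32): sample and train
            batch = random.sample(list(pool), min(batch_size, len(pool)))
            train_step(network, batch)
        mcts_network_slot["net"] = network     # step (33): swap in new parameters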
The final situation-evaluation function and move-strategy function of step (4) are generated as follows:
(41) with the neural network produced in step (3), the MCTS algorithm performs S_2 searches for each move;
(42) after searching, the situation evaluation at the root node of the search tree is taken as the final situation-evaluation function, and the most-visited child node is selected as the move, which yields the move-strategy function.
In step (11) of the invention, the input block is composed as follows:
(111) the input is the 8 most recent positions, each with C_K channels; if fewer than 8 moves have been played, all channels of the missing steps are set to 0; besides the historical positions there are as many colour channels as there are players, and on input the plane of the side to move is filled with 1 while the planes of the other colour channels are filled with 0;
(112) C_K equals the number of players multiplied by the number of piece kinds per player; a position on a channel holds 1 if the corresponding player's piece of the corresponding kind stands there, and 0 otherwise (an encoding sketch follows this list).
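A minimal NumPy sketch of this encoding, under the assumption of a two-player game with one piece kind per player (C_K = 2, as in Go) on a 9x9 board; the function name and shapes are illustrative only.

    import numpy as np

    W, H, HISTORY, PLAYERS, KINDS = 9, 9, 8, 2, 1
    C_K = PLAYERS * KINDS   # step (112): players x piece kinds per player

    def encode_input(history, to_move):
        """history: up to 8 boards, newest first, each shaped (C_K, H, W) with a 1
        where the corresponding player's piece of the corresponding kind stands.
        to_move: index of the side to move."""
        planes = np.zeros((HISTORY * C_K + PLAYERS, H, W), dtype=np.float32)
        for t, board in enumerate(history[:HISTORY]):   # step (111): missing
            planes[t * C_K:(t + 1) * C_K] = board       # steps stay all-zero
        planes[HISTORY * C_K + to_move] = 1.0           # colour plane of the mover
        return planes

    x = encode_input([], to_move=0)   # empty board, first player to move
    print(x.shape)                    # (18, 9, 9): 8 steps x C_K + 2 colour planes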
The processing of the input block in step (12) by the residual tower, batch normalization, and the ReLU activation function proceeds as follows (a PyTorch sketch follows this list):
(122) the input of a residual block passes, in sequence, through batch normalization, a 3x3 convolution, a ReLU activation, batch normalization, a 3x3 convolution, and a ReLU activation;
(123) the input of the residual block and the output of step (122) are summed; the result is the output of the residual block;
(124) K stacked residual blocks yield the output of the residual tower;
(125) the output of the residual tower passes through batch normalization and a ReLU activation.
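The following PyTorch sketch puts steps (122)-(125) together with the evaluation head of step (14). The stem convolution that adapts the input planes to C channels, and the concrete values of K, C, and C_K, are assumptions not specified at this point in the text; the 18 input planes match the encoding sketch above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        def __init__(self, c):
            super().__init__()
            # step (122): BN -> 3x3 conv -> ReLU -> BN -> 3x3 conv -> ReLU
            self.bn1, self.conv1 = nn.BatchNorm2d(c), nn.Conv2d(c, c, 3, padding=1)
            self.bn2, self.conv2 = nn.BatchNorm2d(c), nn.Conv2d(c, c, 3, padding=1)

        def forward(self, x):
            y = F.relu(self.conv1(self.bn1(x)))
            y = F.relu(self.conv2(self.bn2(y)))
            return x + y                      # step (123): residual sum

    class TowerWithBVHead(nn.Module):
        def __init__(self, in_planes=18, c=64, k=6, c_k=2):   # hypothetical sizes
            super().__init__()
            self.stem = nn.Conv2d(in_planes, c, 3, padding=1)  # assumed adapter
            self.tower = nn.Sequential(*[ResidualBlock(c) for _ in range(k)])
            self.bn_out = nn.BatchNorm2d(c)   # step (125)
            self.bv = nn.Conv2d(c, c_k, 1)    # step (14): 1x1 conv, C_K channels

        def forward(self, x):
            h = F.relu(self.bn_out(self.tower(self.stem(x))))
            # channel Softmax: per board point, a distribution over C_K channels
            return F.softmax(self.bv(h), dim=1)

    bv = TowerWithBVHead()(torch.zeros(1, 18, 9, 9))
    print(bv.shape)   # torch.Size([1, 2, 9, 9])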
The move-strategy function of step (13) is obtained as follows (sketches of both heads follow this list):
(131) for placement-type games, the residual-tower output of step (12) passes through a 1x1 convolution with 2 channels; the plane of one channel is averaged, and Softmax is applied to that average together with all values of the other plane; the output is the move probability of every point plus the probability of passing;
(132) for movement-type games, the residual-tower output of step (12) passes, in sequence, through a 1x1 convolution with 2 channels, a fully connected layer whose output dimension is the number of legal actions C_A, and Softmax; the output is the probability of every legal action (including pass).
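A PyTorch sketch of the two head variants, with hypothetical channel count, board shape, and C_A; tower_out denotes the (batch, C, H, W) output of the residual tower.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PlacementPolicyHead(nn.Module):
        """Step (131): 2-channel 1x1 conv; one plane is averaged into a single
        pass logit, Softmax runs over the other plane's points plus that logit."""
        def __init__(self, c):
            super().__init__()
            self.conv = nn.Conv2d(c, 2, 1)

        def forward(self, tower_out):
            y = self.conv(tower_out)                    # (B, 2, H, W)
            pass_logit = y[:, 0].mean(dim=(1, 2))       # averaged plane -> (B,)
            point_logits = y[:, 1].flatten(1)           # (B, H*W)
            logits = torch.cat([point_logits, pass_logit.unsqueeze(1)], dim=1)
            return F.softmax(logits, dim=1)             # point probs + pass prob

    class MovementPolicyHead(nn.Module):
        """Step (132): 2-channel 1x1 conv -> fully connected layer of size C_A
        (legal-action slots, including pass) -> Softmax."""
        def __init__(self, c, h, w, c_a):
            super().__init__()
            self.conv = nn.Conv2d(c, 2, 1)
            self.fc = nn.Linear(2 * h * w, c_a)

        def forward(self, tower_out):
            return F.softmax(self.fc(self.conv(tower_out).flatten(1)), dim=1)

    print(PlacementPolicyHead(64)(torch.zeros(1, 64, 9, 9)).shape)        # (1, 82)
    print(MovementPolicyHead(64, 9, 9, 128)(torch.zeros(1, 64, 9, 9)).shape)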
In step (21) of the present invention, the MCTS algorithm specifically comprises the following steps:
(211) MCTS is divided into 4 stages: selection, evaluation, backup, and move selection;
(212) in the selection phase, child nodes are selected using the following formula until a leaf node is reached:
bound = c_r × W × H + c_sd × σ(s)
U(s,a) = c_puct × bound × P(a|s) × √(Σ_b N(s,b)) / (1 + N(s,a))
π_SC(s) = argmax_a ( Q_SC(s,a) + U(s,a) )
where s is the current state, Q_SC(s,a) is the mean action value of action a in state s, P(a|s) is the noise-added move probability, σ(s) is the standard deviation of the current state s, N(s,a) is the visit count of action a in state s, W is the board width, H is the board height, and c_r, c_sd, c_puct are constants.
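A Python sketch of this selection rule. The U(s,a) formula survives only as an image in the source, so the PUCT-style form scaled by bound shown above and used here is a reconstruction from the listed variables; the constants and node fields are hypothetical.

    import math
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        prior: dict                     # a -> P(a|s), noise-added probabilities
        children: dict = field(default_factory=dict)   # a -> child Node
        visits: int = 0                 # N(s, a) of the edge into this node
        q_sc: float = 0.0               # running mean Q_SC(s, a)
        value_std: float = 0.0          # sigma(s)

    C_R, C_SD, C_PUCT = 1.0, 1.0, 1.5   # c_r, c_sd, c_puct (hypothetical values)

    def select_child(node: Node, W: int, H: int) -> Node:
        bound = C_R * W * H + C_SD * node.value_std
        total_n = sum(ch.visits for ch in node.children.values())
        def score(item):
            a, ch = item
            u = (C_PUCT * bound * node.prior[a]
                 * math.sqrt(total_n) / (1 + ch.visits))
            return ch.q_sc + u          # pi_SC(s) = argmax_a (Q_SC + U)
        return max(node.children.items(), key=score)[1]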
(213) In the evaluation phase, if the game is not over, the neural-network input is prepared as in step (11); forward propagation yields the move probability P(a|s; θ) and the situation evaluation BV(s_L, pos; θ), and the situation evaluation is summed with the following formula to obtain the value function:
V_SC(s_L; θ) = Σ_pos BV(s_L, pos; θ)
where V_SC(s_L; θ) is the state value output by the neural network for leaf state s_L, and BV(s_L, pos; θ) is the occupation probability of point pos output by the neural network for leaf state s_L.
(214) In the evaluation phase, if the game is over: for placement-type games, BV(s_L, pos; θ) is obtained from the game rules; for movement-type games, the value at the positions of all the winner's pieces is set to 1, and BV(s_L, pos; θ) is derived by the corresponding transformation.
(215) In the backup phase, for placement-type games, the values on the nodes along the path from the leaf node to the root node are updated as follows:
N(s,a) = N(s,a) + 1
Q_SC(s,a) = Q_SC(s,a) + ( V_SC(s_L; θ) − Q_SC(s,a) ) / N(s,a)
BV_π(s,a,pos) = BV_π(s,a,pos) + ( BV(s_L, pos; θ) − BV_π(s,a,pos) ) / N(s,a)
where N(s,a) is the visit count of the node, Q_SC(s,a) is the running mean action value of the node, V_SC(s_L; θ) is the state value output by the neural network for leaf state s_L, BV_π(s,a,pos) is the running mean occupation probability of point pos for the node action, and BV(s_L, pos; θ) is the occupation probability of point pos output by the neural network for leaf state s_L;
in the backup phase, for movement-type games, the values on the nodes along the path from the leaf node to the root node are updated as follows:
N(s,a) = N(s,a) + 1
Q_SC(s,a) = Q_SC(s,a) + ( V_SC(s_L; θ) − Q_SC(s,a) ) / N(s,a)
BV_π(s,a,pos) = BV_π(s,a,pos) + ( BV(s_L, pos'; θ) − BV_π(s,a,pos) ) / N(s,a)
where N(s,a), Q_SC(s,a), V_SC(s_L; θ), and BV_π(s,a,pos) are as above, and BV(s_L, pos'; θ) is the occupation probability of point pos' output by the neural network for leaf state s_L;
Unlike the point-to-point update of placement-type games, movement-type games must accumulate the survival probability of a piece at the child node into the position that piece occupies at the parent node. Taking Chinese chess as an example: the white side moves from f2 to f3 and reaches leaf state s_L. After neural-network evaluation, BV(s_L, pos; θ) is obtained; the values of BV(s_L, pos; θ) at positions f2 and f3 are then interchanged, and the update formulas above are applied. That is, certain positions of pos are interchanged according to the move played, forming pos'.
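The following Python sketch illustrates this backup under assumptions: the edge statistics mirror the Node fields of the selection sketch above, the bv_pi field is an illustrative name, and swap_positions is a hypothetical helper implementing the f2/f3-style interchange for movement-type games.

    from dataclasses import dataclass, field

    @dataclass
    class EdgeStats:
        visits: int = 0                 # N(s, a)
        q_sc: float = 0.0               # running mean Q_SC(s, a)
        bv_pi: dict = field(default_factory=dict)   # pos -> BV_pi(s, a, pos)

    def swap_positions(bv, move):
        """Movement-type games: exchange the BV entries of the source and
        destination squares (e.g. f2 <-> f3), turning pos into pos'."""
        src, dst = move
        bv = dict(bv)
        bv[src], bv[dst] = bv.get(dst, 0.0), bv.get(src, 0.0)
        return bv

    def backup(path, v_leaf, bv_leaf, moves=None):
        """path: (EdgeStats, action) pairs from root to leaf;
        v_leaf = V_SC(s_L; theta); bv_leaf: pos -> BV(s_L, pos; theta);
        moves: action -> (src, dst), supplied only for movement-type games."""
        bv = dict(bv_leaf)
        for edge, action in reversed(path):            # leaf back to root
            if moves is not None:
                bv = swap_positions(bv, moves[action])  # pos -> pos'
            edge.visits += 1                                    # N(s,a) += 1
            edge.q_sc += (v_leaf - edge.q_sc) / edge.visits     # mean Q update
            for pos, v in bv.items():                           # mean BV update
                old = edge.bv_pi.get(pos, 0.0)
                edge.bv_pi[pos] = old + (v - old) / edge.visits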
(216) In the move-selection phase, the move probability of each child node is its visit count divided by the visit count of the root node; one child is sampled according to these probabilities and becomes the new root, and noise is added to the prior of the new root according to:
P(a|s) = (1 − ε) × P(a|s; θ) + ε × η
where η is Dirichlet noise whose concentration depends on the board area W × H, W is the board width, H is the board height, ε = 0.25, P(a|s; θ) is the move probability output by the neural network, and P(a|s) is the noise-added move probability used in step (212).
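A NumPy sketch of the root-noise step. Because the Dirichlet concentration formula is rendered only as an image in the source, alpha = 10/(W*H) is used here purely as an assumption in the spirit of board-size-dependent noise.

    import numpy as np

    def add_root_noise(p, W, H, eps=0.25, rng=None):
        """p: 1-D array of network move probabilities P(a|s; theta)."""
        rng = rng or np.random.default_rng()
        alpha = 10.0 / (W * H)                 # assumed concentration
        eta = rng.dirichlet(np.full(len(p), alpha))
        return (1 - eps) * p + eps * eta       # P(a|s) = (1-eps)*P + eps*eta

    noisy = add_root_noise(np.full(82, 1 / 82), W=9, H=9)
    print(noisy.sum())   # 1.0 up to floating-point error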
The Early Stop algorithm of step (22) proceeds as follows (a sketch follows this list):
(221) after each MCTS search, the stability of the situation evaluation is assessed: for placement-type games, the evaluation is considered stable if the probability value of every point is either greater than 0.95 or less than -0.95; for movement-type games, it is considered stable if the survival probability of every piece on the board is either greater than 0.95 or less than -0.95; if the evaluation is stable for 2 consecutive steps, the game is ended;
(222) at the first step of the game, an MCTS run with S_vt simulations evaluates the situation of the initial position, and the first-move advantage value K_vt is derived from that evaluation; thereafter, if at a step there is a 95% probability that the absolute deviation |MCTS-estimated first-move advantage − K_vt| exceeds 4, one side's advantage is considered excessive; if an excessive advantage is detected for 4 consecutive steps, the game is ended.
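A Python sketch of these two termination tests; the bookkeeping (histories of stability flags and advantage estimates) and function names are assumptions wrapped around the 0.95, 2-step, 4-step, and ±4 figures given in the text.

    def is_stable(bv_values, thresh=0.95):
        """Step (221): every occupation (or survival) value is near +/-1."""
        return all(v > thresh or v < -thresh for v in bv_values)

    def should_early_stop(stable_history, advantage_history, k_vt, margin=4.0):
        """stable_history: per-step booleans from is_stable();
        advantage_history: per-step MCTS estimates of the first-move advantage;
        k_vt: advantage estimated on the initial position (step 222)."""
        stable = len(stable_history) >= 2 and all(stable_history[-2:])
        lopsided = len(advantage_history) >= 4 and all(
            abs(a - k_vt) > margin for a in advantage_history[-4:])
        return stable or lopsided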
The Update Board Value algorithm of step (23) proceeds as follows (a sketch under explicit assumptions follows step (233) below):
(231) a smoothing function f is defined, mapping the MCTS move probability to a weight (its two defining formulas are given only as images in the original);
(232) the training target BV_π(s_t, pos) is synthesized with the following formulas:
δ_t = BV_π(s_{t+1}, pos) − BV_MC(s_t, pos)
BV_π(s_t, pos) = BV_MC(s_t, pos) + f(π(a_t|s_t)) × δ_t
where T is the length of the game, t is the time step, s_t is the state at time t, π(a_t|s_t) is the probability that MCTS assigned in state s_t to the selected action a_t, BV_MC(s_t, pos) is the MCTS-searched occupation probability of point pos, and BV_π(s_{t+1}, pos) is the synthesized occupation probability of point pos that will be used to train the neural network;
(233) the move probabilities π(a_t|s_t) found by the MCTS search are used directly, without synthesis, as the training targets for the neural network's move probability.
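To make the synthesis concrete, here is a Python sketch under explicit assumptions: the smoothing function f, whose definition survives only as images, is stood in for by a logistic squashing of the move probability, and the backward pass through the game follows from δ_t referencing BV_π at step t+1.

    import math

    def f(p):
        """Placeholder smoothing function: maps a move probability in [0, 1] to
        a weight in (0, 1). The patent's actual definition is not recoverable."""
        return 1.0 / (1.0 + math.exp(-10.0 * (p - 0.5)))

    def update_board_value(bv_mc, pi, final_bv):
        """bv_mc[t][pos]: MCTS occupation probability of pos at step t (0..T-1);
        pi[t]: pi(a_t | s_t) of the move actually played at step t;
        final_bv[pos]: terminal occupation values from step (214)."""
        T = len(bv_mc)
        bv_pi = [None] * (T + 1)
        bv_pi[T] = dict(final_bv)
        for t in range(T - 1, -1, -1):              # walk the game backwards
            bv_pi[t] = {}
            for pos, mc in bv_mc[t].items():
                delta = bv_pi[t + 1][pos] - mc      # delta_t
                bv_pi[t][pos] = mc + f(pi[t]) * delta
        return bv_pi[:T]                            # training targets BV_pi(s_t, .)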
In step (32), the neural network training method specifically comprises the following steps:
(321) inputting data randomly selected from the experience pool into a neural network for forward propagation;
(322) the neural-network loss L_total(θ) is calculated with the following formulas:
L_BV(θ) = ( 1 / (C_K × N) ) Σ_{i=1..C_K} Σ_pos ( BV̂_i(s, pos) − BV_i(s, pos; θ) )²
L_V(θ) = ( Σ_pos BV̂(s, pos) − Σ_pos BV(s, pos; θ) )²
L_policy(θ) = −Σ_a π(a|s) ln P(a|s; θ)
L_total(θ) = L_policy(θ) + c_1 L_BV(θ) + c_2 L_V(θ) + c_3 |θ|²
where C_K is the number of players multiplied by the number of piece kinds per player, N is the board width multiplied by the board height, s is the current state, π(a|s) is the target probability of action a in state s, P(a|s; θ) is the predicted probability of action a in state s output by the neural network, BV̂_i(s, pos) is the target occupation probability of point pos on the i-th channel of state s, BV_i(s, pos; θ) is the predicted occupation probability of point pos on the i-th channel of state s output by the neural network, and c_1, c_2, c_3 are constants related to the board size;
(323) back-propagation is performed using the neural-network loss.
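The following PyTorch sketch computes this loss under the same caveat: the L_BV and L_V formulas survive only as images in the source, so mean squared error, with the value taken as the per-position sum of BV as in step (213), is an assumed form; the policy cross-entropy and the c_3 weight penalty follow the text.

    import torch
    import torch.nn.functional as F

    def total_loss(p_pred, pi_target, bv_pred, bv_target, model,
                   c1=1.0, c2=1.0, c3=1e-4):
        """p_pred/pi_target: (B, A) move distributions;
        bv_pred/bv_target: (B, C_K, H, W) occupation probabilities."""
        # L_policy = -sum_a pi(a|s) ln P(a|s; theta)
        l_policy = -(pi_target * torch.log(p_pred + 1e-8)).sum(dim=1).mean()
        l_bv = F.mse_loss(bv_pred, bv_target)          # assumed squared-error form
        v_pred = bv_pred.sum(dim=(2, 3))               # V_SC = sum_pos BV
        v_target = bv_target.sum(dim=(2, 3))
        l_v = F.mse_loss(v_pred, v_target)             # assumed squared-error form
        l2 = sum((w ** 2).sum() for w in model.parameters())   # |theta|^2 term
        return l_policy + c1 * l_bv + c2 * l_v + c3 * l2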
Compared with the prior art, the invention has the following beneficial effects:
1. compared with methods that output only a scalar winning probability, the invention outputs a situation-evaluation function, which supports detailed positional judgment and is more meaningful to humans;
2. the algorithm needs no knowledge of the first player's first-move advantage to train the neural network, and it gradually learns the value of that advantage during training;
3. the neural networks of previous algorithms are tied to the board width and height, whereas this algorithm has no such limitation; for games whose board width and height can vary (such as Go, Reversi, and international draughts), the same neural network can therefore produce move strategies and board evaluations on boards of different widths and heights.
Drawings
Fig. 1 is the overall flow chart of the move strategy and board evaluation method for various chess types according to the invention.
Fig. 2 illustrates the structure of the neural network of step (1) in Fig. 1.
Fig. 3 is a supplementary illustration for Fig. 2.
Fig. 4 illustrates the selection, evaluation, and backup stages of the MCTS of step (2) in Fig. 1.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to Fig. 1, the move strategy and board evaluation method for various chess types proceeds as follows:
(1) Implement the situation-evaluation function and the move-strategy function, as shown in Fig. 2:
(11) the input is the 8 most recent positions, each comprising C_K channels, which together form the input block;
(12) the input block then passes through the residual tower, batch normalization, and a ReLU activation in sequence, as shown in Fig. 2(a); the residual tower contains K residual blocks, each with C channels;
(13) placement-type games (such as Go and Reversi) use the structure of Fig. 2(b), and movement-type games (such as chess, Chinese chess, and Chinese checkers) use the structure of Fig. 2(c), outputting the move strategy;
(14) with the pixel-level segmentation method, the output of step (12) passes through the 1x1 convolution with C_K channels and the channel-wise Softmax of Fig. 2(d), yielding the situation-evaluation function.
Taking a 6-line (6x6) Go board as an example, Fig. 3(a) is the position to be evaluated, Fig. 3(b) is the move strategy, Fig. 3(c) is the situation evaluation from Black's perspective (1.0 means the network predicts the point is fully occupied by Black, -1.0 fully occupied by White), and Fig. 3(d) is the situation evaluation from White's perspective (1.0 means the network predicts the point is fully occupied by White, -1.0 fully occupied by Black).
(2) Generate training data with the MCTS algorithm, the Early Stop algorithm, and the Update Board Value algorithm.
MCTS is divided into 4 stages: selection, evaluation, backup, and move selection. In the selection stage, shown in Fig. 4(a), child nodes are selected step by step from the root node until a leaf node is reached. In the evaluation stage, shown in Fig. 4(b), the triangle is the position value output by the neural network, the black node is the newly generated child node, and the prior probability of each child node is evaluated by the neural network. The backup stage, shown in Fig. 4(c), propagates the neural-network estimate (triangle) of Fig. 4(b) along the path back to the root node. As the 3 stages of Fig. 4 are repeated, the search tree grows gradually and the evaluation at the root node becomes more and more accurate. Finally, a child of the root is selected as the new root according to the visit probabilities of the root's children.
(3) Repeat steps (1) and (2) for iterative training.
(4) Generate the final situation-evaluation function and move strategy with the neural network trained in step (3) and the MCTS algorithm.
References
[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[2] T. Wu, I. Wu, G. Chen, T. Wei, H. Wu, T. Lai, and L. Lan, "Multi-Labelled Value Networks for Computer Go," IEEE Transactions on Games, pp. 1–1, 2018.
[3] jmgilmer, "GoDCNN," https://github.com/jmgilmer/GoCNN/, 2016.
[4] C. Browne, E. J. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A Survey of Monte Carlo Tree Search Methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.

Claims (8)

1. A drop strategy and board evaluation method applicable to various chess types, characterized by comprising the following specific steps:
(1) implementing the situation-evaluation function and the move-strategy function with a residual neural network and a pixel-level segmentation method;
(2) generating training data with the MCTS algorithm, an Early Stop algorithm, and the Update Board Value algorithm;
(3) repeating step (1) and step (2), training iteratively to obtain a neural network;
(4) generating the final situation-evaluation function and move-strategy function with the neural network trained in step (3) and the MCTS algorithm;
wherein:
the situation-evaluation function and move-strategy function of step (1) are implemented as follows:
(11) the input is the 8 most recent positions, each comprising C_K channels, which together form the input block;
(12) the input block passes through a residual tower, batch normalization, and a ReLU activation function in sequence, the residual tower containing K residual blocks, each with C channels;
(13) different head structures are used for placement-type games and movement-type games, outputting the move-strategy function;
(14) with the pixel-level segmentation method, the output of step (12) passes through a 1x1 convolution with C_K channels and a channel-wise Softmax, yielding the situation-evaluation function;
the training data of step (2) is generated as follows:
(21) the MCTS algorithm, run with S_1 searches per move, generates the move probabilities and the situation evaluation of each step; the next move is sampled according to these probabilities and played;
(22) step (21) is repeated until the game ends; meanwhile the Early Stop algorithm is applied: if the evaluation is detected to be stable for 2 consecutive steps, or one side's advantage is detected to be excessive for 4 consecutive steps, the game is terminated early;
(23) from the resulting move probabilities and situation evaluations of length T, the Update Board Value algorithm synthesizes the training data;
the iterative training of step (3) proceeds as follows:
(31) the training data generated in step (2) is inserted into an experience pool of capacity R; if the pool holds more than R entries, the oldest data is evicted;
(32) after every G games, data is sampled at random from the experience pool and used to train the neural network;
(33) the neural network used by MCTS is replaced with the newly trained network parameters;
(34) the above steps are repeated for iterative training;
the final situation-evaluation function and move-strategy function of step (4) are generated as follows:
(41) with the neural network produced in step (3), the MCTS algorithm performs S_2 searches for each move;
(42) after searching, the situation evaluation at the root node of the search tree is taken as the final situation-evaluation function, and the most-visited child node is selected as the move.
2. The method of claim 1, wherein the input block of step (11) is composed as follows:
(111) the input is the 8 most recent positions, each with C_K channels; if fewer than 8 moves have been played, all channels of the missing steps are set to 0; besides the historical positions there are as many colour channels as there are players; on input, the plane of the side to move is filled with 1 and the planes of the other colour channels are filled with 0;
(112) C_K equals the number of players multiplied by the number of piece kinds per player; a position on a channel holds 1 if the corresponding player's piece of the corresponding kind stands there, and 0 otherwise.
3. The method according to claim 2, wherein the input block of step (12) is processed by the residual tower, batch normalization, and the ReLU activation function as follows:
(121) the input of a residual block passes, in sequence, through batch normalization, a 3x3 convolution, a ReLU activation, batch normalization, a 3x3 convolution, and a ReLU activation;
(122) the input of the residual block and the output of step (121) are summed; the result is the output of the residual block;
(123) K stacked residual blocks yield the output of the residual tower;
(124) the output of the residual tower passes through batch normalization and a ReLU activation.
4. The method according to claim 3, wherein the move-strategy function of step (13) is obtained as follows:
(131) for placement-type games, the residual-tower output of step (12) passes through a 1x1 convolution with 2 channels; the plane of one channel is averaged, and Softmax is applied to that average together with all values of the other plane; the output is the move probability of every point plus the probability of passing;
(132) for movement-type games, the residual-tower output of step (12) passes, in sequence, through a 1x1 convolution with 2 channels, a fully connected layer whose output dimension is the number of legal actions C_A, and Softmax; the output is the probability of every legal action.
5. The method of claim 4, wherein the MCTS algorithm of step (21) is implemented as follows:
(211) MCTS is divided into 4 stages: selection, evaluation, backup, and move selection;
(212) in the selection stage, child nodes are selected with the following formulas until a leaf node is reached:
bound = c_r × W × H + c_sd × σ(s)
U(s,a) = c_puct × bound × P(a|s) × √(Σ_b N(s,b)) / (1 + N(s,a))
π_SC(s) = argmax_a ( Q_SC(s,a) + U(s,a) )
where s is the current state, Q_SC(s,a) is the mean action value of action a in state s, P(a|s) is the noise-added move probability, σ(s) is the standard deviation of the current state s, N(s,a) is the visit count of action a in state s, W is the board width, H is the board height, and c_r, c_sd, c_puct are constants;
(213) in the evaluation stage, if the game is not over, the neural-network input is prepared as in step (11); forward propagation yields the move probability P(a|s; θ) and the situation evaluation BV(s_L, pos; θ), and the situation evaluation is summed with the following formula to obtain the value function:
V_SC(s_L; θ) = Σ_pos BV(s_L, pos; θ)
where V_SC(s_L; θ) is the state value output by the neural network for leaf state s_L, and BV(s_L, pos; θ) is the occupation probability of point pos output by the neural network for leaf state s_L;
(214) in the evaluation stage, if the game is over: for placement-type games, BV(s_L, pos; θ) is obtained from the game rules; for movement-type games, the value at the positions of all the winner's pieces is set to 1, and BV(s_L, pos; θ) is derived by the corresponding transformation;
(215) in the backup stage, for placement-type games, the values on the nodes along the path from the leaf node to the root node are updated as follows:
N(s,a) = N(s,a) + 1
Q_SC(s,a) = Q_SC(s,a) + ( V_SC(s_L; θ) − Q_SC(s,a) ) / N(s,a)
BV_π(s,a,pos) = BV_π(s,a,pos) + ( BV(s_L, pos; θ) − BV_π(s,a,pos) ) / N(s,a)
where N(s,a) is the visit count of the node, Q_SC(s,a) is the running mean action value of the node, V_SC(s_L; θ) is the state value output by the neural network for leaf state s_L, BV_π(s,a,pos) is the running mean occupation probability of point pos for the node action, and BV(s_L, pos; θ) is the occupation probability of point pos output by the neural network for leaf state s_L;
in the backup stage, for movement-type games, the values on the nodes along the path from the leaf node to the root node are updated as follows:
N(s,a) = N(s,a) + 1
Q_SC(s,a) = Q_SC(s,a) + ( V_SC(s_L; θ) − Q_SC(s,a) ) / N(s,a)
BV_π(s,a,pos) = BV_π(s,a,pos) + ( BV(s_L, pos'; θ) − BV_π(s,a,pos) ) / N(s,a)
where pos' is obtained from pos by interchanging positions according to the move played, and BV(s_L, pos'; θ) is the occupation probability of point pos' output by the neural network for leaf state s_L;
(216) in the move-selection stage, the move probability of each child node is its visit count divided by the visit count of the root node; one child is sampled according to these probabilities and becomes the new root, and noise is added to the prior of the new root according to:
P(a|s) = (1 − ε) × P(a|s; θ) + ε × η
where η is Dirichlet noise whose concentration depends on the board area W × H, W is the board width, H is the board height, ε = 0.25, P(a|s; θ) is the move probability output by the neural network, and P(a|s) is the noise-added move probability used in step (212).
6. The method of claim 5, wherein the Early Stop algorithm of step (22) is as follows:
(221) after each MCTS search, the stability of the situation evaluation is assessed: for placement-type games, the evaluation is considered stable if the probability value of every point is either greater than 0.95 or less than -0.95; for movement-type games, it is considered stable if the survival probability of every piece on the board is either greater than 0.95 or less than -0.95; if the evaluation is stable for 2 consecutive steps, the game is ended;
(222) at the first step of the game, an MCTS run with S_vt simulations evaluates the situation of the initial position, and the first-move advantage value K_vt is derived from that evaluation; thereafter, if at a step there is a 95% probability that the absolute deviation |MCTS-estimated first-move advantage − K_vt| exceeds 4, one side's advantage is considered excessive; if an excessive advantage is detected for 4 consecutive steps, the game is ended.
7. The method according to claim 5, wherein the Update Board Value algorithm of step (23) is as follows:
(231) a smoothing function f is defined, mapping the MCTS move probability to a weight (its two defining formulas are given only as images in the original);
(232) the training target BV_π(s_t, pos) is synthesized with the following formulas:
δ_t = BV_π(s_{t+1}, pos) − BV_MC(s_t, pos)
BV_π(s_t, pos) = BV_MC(s_t, pos) + f(π(a_t|s_t)) × δ_t
where T is the length of the game, t is the time step, s_t is the state at time t, π(a_t|s_t) is the probability that MCTS assigned in state s_t to the selected action a_t, BV_MC(s_t, pos) is the MCTS-searched occupation probability of point pos, and BV_π(s_{t+1}, pos) is the synthesized occupation probability of point pos that will be used to train the neural network;
(233) the move probabilities π(a_t|s_t) found by the MCTS search are used directly, without synthesis, as the training targets for the neural network's move probability.
8. The method according to claim 7, wherein the neural network training method in step (32) comprises the following steps:
(321) inputting data randomly selected from the experience pool into a neural network for forward propagation;
(322) the neural-network loss L_total(θ) is calculated with the following formulas:
L_BV(θ) = ( 1 / (C_K × N) ) Σ_{i=1..C_K} Σ_pos ( BV̂_i(s, pos) − BV_i(s, pos; θ) )²
L_V(θ) = ( Σ_pos BV̂(s, pos) − Σ_pos BV(s, pos; θ) )²
L_policy(θ) = −Σ_a π(a|s) ln P(a|s; θ)
L_total(θ) = L_policy(θ) + c_1 L_BV(θ) + c_2 L_V(θ) + c_3 |θ|²
where C_K is the number of players multiplied by the number of piece kinds per player, N is the board width multiplied by the board height, s is the current state, π(a|s) is the target probability of action a in state s, P(a|s; θ) is the predicted probability of action a in state s output by the neural network, BV̂_i(s, pos) is the target occupation probability of point pos on the i-th channel of state s, BV_i(s, pos; θ) is the predicted occupation probability of point pos on the i-th channel of state s output by the neural network, and c_1, c_2, c_3 are constants related to the board size;
(323) back-propagation is performed using the neural-network loss.
CN201910929174.9A 2019-09-28 2019-09-28 Drop strategy and local assessment method applicable to various chess types Active CN110717591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910929174.9A CN110717591B (en) 2019-09-28 2019-09-28 Drop strategy and local assessment method applicable to various chess types


Publications (2)

Publication Number Publication Date
CN110717591A (en) 2020-01-21
CN110717591B (en) 2023-05-02

Family

ID=69211053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910929174.9A Active CN110717591B (en) 2019-09-28 2019-09-28 Drop strategy and local assessment method applicable to various chess types

Country Status (1)

Country Link
CN (1) CN110717591B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032863A1 (en) * 2016-07-27 2018-02-01 Google Inc. Training a policy neural network and a value neural network
CN106339582A (en) * 2016-08-19 2017-01-18 北京大学深圳研究生院 Method for automatically generating chess endgame based on machine game technology
CN107050839A (en) * 2017-04-14 2017-08-18 安徽大学 Amazon chess game playing by machine system based on UCT algorithms
CN107661622A (en) * 2017-09-18 2018-02-06 北京深度奇点科技有限公司 It is a kind of to generate method of the quintet game to office data
CN109032935A (en) * 2018-07-13 2018-12-18 东北大学 The prediction technique of non-perfect information game perfection software model based on phantom go
CN110119804A (en) * 2019-05-07 2019-08-13 安徽大学 A kind of Ai Ensitan chess game playing algorithm based on intensified learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Su Pan et al., "A classifier game model based on imbalanced learning and its application in Chinese chess" *
Gao Qiang; Xu Xinhe, "Application of the evidence counting method in piece-placement machine games" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667043A (en) * 2020-05-20 2020-09-15 季华实验室 Chess game playing method, system, terminal and storage medium
CN111667043B (en) * 2020-05-20 2023-09-19 季华实验室 Chess game playing method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN110717591B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
Schwartz Multi-agent machine learning: A reinforcement approach
Lucas et al. The n-tuple bandit evolutionary algorithm for game agent optimisation
Gaina et al. Population seeding techniques for rolling horizon evolution in general video game playing
Perick et al. Comparison of different selection strategies in monte-carlo tree search for the game of tron
Benbassat et al. EvoMCTS: Enhancing MCTS-based players through genetic programming
Barros et al. Balanced civilization map generation based on open data
CN111729300A (en) Monte Carlo tree search and convolutional neural network based Dou Dizhu (Fight the Landlord) strategy research method
Fu et al. “Reverse” nested lottery contests
CN110852436A (en) Data processing method, device and storage medium for electronic poker game
CN110717591A (en) Drop strategy and board evaluation method applicable to various chess types
Londoño et al. Graph Grammars for Super Mario Bros Levels.
Sismanis How I won the "Chess Ratings - Elo vs the Rest of the World" competition
CN110727870A (en) Novel single-tree Monte Carlo search method for sequential synchronous game
Scott et al. How does AI play football? An analysis of RL and real-world football strategies
CN112685921B (en) Mahjong intelligent decision method, system and equipment for efficient and accurate search
Xu et al. Elastic monte carlo tree search with state abstraction for strategy game playing
Takada et al. Reinforcement learning for creating evaluation function using convolutional neural network in hex
Risi et al. Automatically categorizing procedurally generated content for collecting games
Hu et al. Reinforcement learning with dual-observation for general video game playing
Kim Intelligent maze generation
Purmonen Predicting game level difficulty using deep neural networks
CN114146401A (en) Mahjong intelligent decision method, device, storage medium and equipment
Yee et al. Pattern Recognition and Monte-Carlo Tree Search for Go Gaming Better Automation
CN109646946B (en) Chess and card game hosting method and device
CN113377779A (en) Strategy improvement method for searching game tree on go

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant