CN114722998B - Method for constructing a CNN-PPO-based wargame deduction agent - Google Patents

Method for constructing a CNN-PPO-based wargame deduction agent

Info

Publication number
CN114722998B
CN114722998B (application CN202210232129.XA)
Authority
CN
China
Prior art keywords
network
actor
output
neural network
situation data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210232129.XA
Other languages
Chinese (zh)
Other versions
CN114722998A (en)
Inventor
张震
臧兆祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University (CTGU)
Priority to CN202210232129.XA
Publication of CN114722998A
Application granted
Publication of CN114722998B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for constructing a CNN-PPO-based wargame deduction agent, comprising the following steps: collecting initial situation data from a wargame deduction platform and preprocessing it to obtain target situation data; constructing an influence map module, inputting the target situation data into it, and outputting influence features; constructing a hybrid neural network model based on a convolutional neural network (CNN) and proximal policy optimization (PPO), concatenating the target situation data with the influence features, and inputting the result into the hybrid model for iterative training until the objective function is minimized and the network converges, thereby constructing the CNN-PPO agent. The invention deepens the agent's understanding of the battlefield situation and, to a certain extent, increases its playing strength.

Description

Method for constructing a CNN-PPO-based wargame deduction agent
Technical Field
The invention belongs to the technical field of computers, and in particular relates to a method for constructing a CNN-PPO-based wargame deduction agent.
Background
Wargame deduction applies the experience and rules distilled from combat practice to deductive analysis of the course of an engagement. With the rapid growth of computing power, new technologies have been brought into wargaming; computer wargaming has become a major branch of the field and is regarded in countries around the world as a means of improving military capability.
In a concrete wargame deduction, the problem is usually simplified as follows: under the constraints of given objective rules, achieve certain goals through the deployment, maneuver, attack, and other actions of one's forces, such as seizing control points or annihilating enemy forces. The purpose of constructing a wargame deduction agent is to obtain a commander that can autonomously make action decisions according to the current battlefield situation. Agents are classified as rule-based or learning-based according to whether they have learning ability. A rule-based agent is implemented by hard-coded programming: multiple branches and loops specify the action the agent takes at a given moment, and a commonly used technique is the behavior tree. A learning agent, typified by a machine learning model, has autonomous learning ability and can update its network parameters in the course of play, thereby obtaining a stronger model.
Existing agent construction methods divide mainly into rule-based models and neural network models. Because the state space of a wargame is enormous, rules formulated from expert experience can hardly cover every case and can only classify states coarsely, so a rule-based agent makes rigid decisions and cannot respond flexibly to unexpected situations. The main difficulties facing neural network models are that the sparse rewards given by the environment make it hard to update network parameters effectively, and that the state dimensionality explodes.
Disclosure of Invention
In order to solve the above problems, the invention provides the following scheme: a method for constructing a CNN-PPO-based wargame deduction agent, comprising the following steps:
collecting initial situation data from a wargame deduction platform and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and outputting the influence features;
constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, concatenating the target situation data with the influence features, inputting the result into the hybrid neural network model for iterative training until the objective function is minimized and the network converges, and thereby constructing the CNN-PPO agent.
Preferably, preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of friendly combat entities, attribute information of enemy combat entities, map-view attribute information, and scoreboard information;
the non-standard data comprises redundant data, data with missing fields, null values, and erroneous information.
Preferably, the overall architecture of the hybrid neural network model is a CNN-PPO architecture comprising a convolutional neural network, an actor_new network, an actor_old network, and a Critic network;
the convolutional neural network mines latent relations in the target situation data and extracts hidden features;
the actor_new network, the actor_old network, and the Critic network are all three-layer fully connected neural networks.
Preferably, before the hybrid neural network model is trained iteratively, the method further comprises inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the output of the Actor network; and concatenating the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the output of the Critic network.
Preferably, inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the output of the Actor network comprises inputting the output of the convolutional neural network into the actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution based on the two parameter values, where μ is the mean and σ is the standard deviation of the normal distribution; sampling an action from the normal distribution; and obtaining, through the interaction of the action with the environment, the reward value given by the environment and the state at the next time step.
Preferably, concatenating the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the output of the Critic network comprises inputting the situation data of the next time step into the Critic network to obtain the network output V_ and computing the discounted reward value; inputting the state values of T time steps into the Critic network to obtain T values of V_; computing the mean squared error between the discounted reward values R and V_; and updating the Critic network by back-propagation. Here V_ is the Critic's estimate of the return obtained by taking action a in state S.
Preferably, the iterative training of the hybrid neural network model consists of optimizing the network parameters N times with a mean squared error loss function and optimizing the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
Preferably, optimizing the network parameters N times with the mean squared error loss function and optimizing the Actor network and the convolutional neural network B times comprises inputting all state values in the experience pool into the actor_new network and the actor_old network respectively to obtain the action distributions N1 and N2; inputting all actions in the experience pool into N1 and N2 to obtain the probabilities p1 and p2, and computing the ratio p2/p1 from the probability values p1 and p2; and computing the error of the Actor network, updating the parameters by back-propagation, and training the model to convergence, thereby constructing the CNN-PPO agent.
The invention discloses the following technical effects:
the invention provides a method for constructing a soldier chess deduction intelligent body based on CNN-PPO, which is based on a convolutional neural network to perform potential association mining on initial situation data, obtain influence characteristic information, input the influence characteristic and the initial situation data into a PPO algorithm model together for learning, form a hybrid neural network model by adopting the Convolutional Neural Network (CNN) and a near-end strategy optimization (PPO), and artificially add characteristics formed by an influence map in terms of characteristic processing. This makes the convolutional neural network converge faster when processing the feature data, and the action choices given by the whole agent are also more careful. The understanding degree of the intelligent agent on the situation is increased, and the intensity of the intelligent agent fight is increased to a certain extent.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. Evidently, the drawings described below show only some embodiments of the invention; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art from the embodiments of the invention without inventive effort fall within the scope of the invention.
In order that the above objects, features, and advantages of the invention may become more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and the detailed description.
As shown in FIG. 1, the invention provides a method for constructing a CNN-PPO-based wargame deduction agent, comprising the following steps:
collecting initial situation data from a wargame deduction platform and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and outputting the influence features;
constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, concatenating the target situation data with the influence features, inputting the result into the hybrid neural network model for iterative training until the objective function is minimized and the network converges, and thereby constructing the CNN-PPO agent.
The initial situation data is preprocessed by screening it and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of friendly combat entities, attribute information of enemy combat entities, map-view attribute information, and scoreboard information;
the non-standard data comprises redundant data, data with missing fields, null values, and erroneous information.
The hybrid neural network model has a CNN-PPO architecture comprising a convolutional neural network, an actor_new network, an actor_old network, and a Critic network;
the convolutional neural network mines latent relations in the target situation data and extracts hidden features;
the actor_new network, the actor_old network, and the Critic network are all three-layer fully connected neural networks.
Before the hybrid neural network model is trained iteratively, the output of the convolutional neural network is input into the Actor network of the PPO architecture to obtain the output of the Actor network; the output of the Actor network is then concatenated with the output of the convolutional neural network and input into the Critic network to obtain the output of the Critic network.
Inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the output of the Actor network comprises inputting the output of the convolutional neural network into the actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution based on the two parameter values, where μ is the mean and σ is the standard deviation of the normal distribution; sampling an action from the normal distribution; and obtaining, through the interaction of the action with the environment, the reward value given by the environment and the state at the next time step.
Concatenating the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the output of the Critic network comprises inputting the situation data of the next time step into the Critic network to obtain the network output V_ and computing the discounted reward value; inputting the state values of T time steps into the Critic network to obtain T values of V_; computing the mean squared error between the discounted reward values R and V_; and updating the Critic network by back-propagation. Here V_ is the Critic's estimate of the return obtained by taking action a in state S.
Iterative training of the hybrid neural network model consists of optimizing the network parameters N times with a mean squared error loss function and optimizing the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
Optimizing the network parameters N times with the mean squared error loss function and optimizing the Actor network and the convolutional neural network B times comprises inputting all state values in the experience pool into the actor_new network and the actor_old network respectively to obtain the action distributions N1 and N2; inputting all actions in the experience pool into N1 and N2 to obtain the probabilities p1 and p2, and computing the ratio p2/p1 from the probability values p1 and p2; and computing the error of the Actor network, updating the parameters by back-propagation, and training the model to convergence, thereby constructing the CNN-PPO agent.
Example 1
As shown in FIG. 1, the method for constructing a CNN-PPO-based wargame deduction agent provided by the invention comprises the following steps:
Step 1: Run the wargame deduction platform, create a wargame scenario, and obtain the situation data returned by the platform. This situation data is generated by the randomly initialized actor_new network of the neural network model playing against robots built into the environment. Specifically:
1.1 A rule-based agent is built into the wargame deduction platform, so both human-versus-machine and machine-versus-machine training can be provided. The actor_new network plays against the built-in agent and generates situation data. The actor_new network is a three-layer fully connected neural network.
Step 2: Screen the situation data returned by the platform in step 1 and remove non-standard data. Non-standard data mainly means redundant data, data with missing fields, and the like; such data is removed. In the data generated against the built-in robot, a few rewards are positive and most are negative, so experiences with positive rewards are collected preferentially.
Step 2 specifically comprises the following steps:
2.1 The situation data mainly comprises the attributes of friendly entities, the attributes of enemy entities that have been discovered, map attributes, and scoreboard information.
2.2 Non-standard data mainly means null values, erroneous information, and the like.
The invention combines the ideas of reinforcement learning and the influence map. Reinforcement learning formulates the problem as a Markov decision process and solves it by iteration. The influence map divides situation features into primary and secondary features. The primary features comprise the attribute information of friendly combat entities and of enemy combat entities; the secondary features comprise map-view information, scoreboard information, and influence map information.
Step 3: Input the screened data into the influence map module. The input of the influence map module is situation information comprising friendly/enemy entity information and map information; the output is the influence feature of a given map point.
Step 3 specifically comprises the following steps:
3.1 Construct the influence map module, which further extracts features from the situation data. The influence within a certain range around a friendly entity is given by the following formula:
e = ine + high + da + di
where ine is the line-of-sight coefficient: line of sight is whether there is an obstruction between two coordinates; without obstruction a point is visible, with obstruction it is invisible. high is the elevation, i.e., the altitude in the everyday sense. da is a danger coefficient, and di is the distance to the contested control point.
3.2 The output map points are generally set to a certain area around a friendly entity. Taking hexagonal grids as an example, the module outputs the influence coefficient of every hex within n hexes of the friendly unit, as sketched below.
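The patent gives no reference code, so the following Python sketch only illustrates one plausible reading of the influence computation. The platform object and its helper calls (line_of_sight, elevation, danger, distance, hexes_within) are hypothetical names invented for the example, not interfaces of the source platform.

    def influence(platform, hex_pos, own_pos, control_point):
        # e = ine + high + da + di for a single hex (formula of step 3.1)
        ine = 1.0 if platform.line_of_sight(own_pos, hex_pos) else 0.0  # line-of-sight coefficient
        high = platform.elevation(hex_pos)                              # elevation of the hex
        da = platform.danger(hex_pos)                                   # danger coefficient
        di = platform.distance(hex_pos, control_point)                  # distance to the contested control point
        return ine + high + da + di

    def influence_features(platform, own_pos, control_point, n):
        # influence coefficients of all hexes within n hexes of the own unit (step 3.2)
        return [influence(platform, h, own_pos, control_point)
                for h in platform.hexes_within(own_pos, n)]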
When a friendly entity is in a negatively influenced area, the reward function returns a negative value as punishment for the agent; when it is in a positively influenced area, the reward function returns a positive value as a reward for the agent.
The reward function has the following form:
R = r_a + r_c + r_d + a
where r_a is the score of the currently surviving friendly units; r_c is the score of the occupied control points; r_d is the score of the annihilated enemy units; and a is the current situation score, i.e., the hit points lost to enemy fire at the previous moment, or the effective score for hits on the enemy.
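As an illustration, a minimal sketch of this reward, assuming the four terms are read from the platform's scoreboard; the dictionary keys are invented for the example and do not come from the source.

    def reward(scoreboard):
        # R = r_a + r_c + r_d + a (step 3 reward function; key names assumed)
        r_a = scoreboard["surviving_own_units_score"]   # score of surviving friendly units
        r_c = scoreboard["control_points_score"]        # score of occupied control points
        r_d = scoreboard["annihilated_enemy_score"]     # score of annihilated enemy units
        a = scoreboard["situation_score"]               # HP lost last step, or effective hit score
        return r_a + r_c + r_d + a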
Step 4: Construct the hybrid neural network, which is built on the proximal policy optimization (PPO) architecture.
Step 4 specifically comprises the following steps:
4.1 Construct a convolutional neural network to mine latent links in the situation data.
4.2 Construct the overall architecture of the hybrid neural network according to the PPO algorithm architecture. The overall architecture is a CNN-PPO architecture consisting of four neural networks: a convolutional neural network, an actor_new network, an actor_old network, and a Critic network.
The convolutional neural network extracts hidden features. The CNN uses 3 convolution kernels of different sizes, each attending to different latent features. The CNN model is computed as:
x_t = σ_cnn(w_cnn ⊙ x_t + b_cnn)
where x_t is the current state feature, w_cnn is the filter weight, b_cnn is the bias parameter, and σ_cnn is the activation function.
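For illustration, a PyTorch sketch of such a feature extractor over the 80-dimensional spliced vector of step 5, with three parallel one-dimensional convolutions standing in for the three kernels of different sizes. The kernel sizes (3, 5, 7), channel count, and ReLU activation are assumptions; only the 80-dimensional input and 42-dimensional output follow the text of step 5.

    import torch
    import torch.nn as nn

    class SituationCNN(nn.Module):
        # three parallel convolution kernels of different (assumed) sizes
        def __init__(self, in_dim=80, out_dim=42):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv1d(1, 4, kernel_size=k, padding=k // 2) for k in (3, 5, 7)])
            self.act = nn.ReLU()
            self.proj = nn.Linear(3 * 4 * in_dim, out_dim)

        def forward(self, x):                       # x: (batch, 80)
            x = x.unsqueeze(1)                      # (batch, 1, 80) for Conv1d
            feats = [self.act(b(x)).flatten(1) for b in self.branches]
            return self.proj(torch.cat(feats, dim=1))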
The Actor network obtains the values μ and σ from the current state s_t, establishes a normal distribution N from the μ and σ values, samples an action a from the distribution N, obtains the reward value r given by the environment, and observes the next state s_{t+1} after the environment changes. The Actor network is then updated with the gradient of the PPO objective, in which P_θ(a_t|s_t) is the sampling policy and P_θ'(a_t|s_t) is the sampling policy after the parameter update; a sketch of this update follows below.
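The surrogate objective itself is not reproduced in the source text; the sketch below uses the standard PPO clipped surrogate as a stand-in. It matches the ratio p2/p1 described in step 7, but the clipping form and the tuple-returning actor interface are assumptions, not the patent's exact formula.

    import torch
    from torch.distributions import Normal

    def actor_loss(actor_new, actor_old, states, actions, advantages, clip_eps=0.2):
        # ratio = P_theta'(a_t|s_t) / P_theta(a_t|s_t), i.e. p2 / p1 of step 7
        mu_new, sigma_new = actor_new(states)   # actors assumed to return (mu, sigma)
        mu_old, sigma_old = actor_old(states)
        p2 = Normal(mu_new, sigma_new).log_prob(actions).sum(-1)  # new policy log-prob
        p1 = Normal(mu_old, sigma_old).log_prob(actions).sum(-1)  # old (sampling) policy log-prob
        ratio = torch.exp(p2 - p1)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        # clipped surrogate (assumed form); minimize the negative objective
        return -torch.min(ratio * advantages, clipped * advantages).mean()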
The Critic network computes the action-value function Q(s_t, a_t) from the input state s_t and action a_t. The Critic network loss is computed as:
loss = (r + γ max Q(s', a') - Q(s, a))^2
where r is the reward value given by the environment, γ is the discount factor, and Q(s, a) is the action-value function, representing the return from taking action a in state s.
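A sketch of a Critic update under this loss; the Critic signature, the discount value, and the use of the next action in place of the maximization over a' are simplifying assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def critic_td_loss(critic, s, a, r, s_next, a_next, gamma=0.9):
        # loss = (r + gamma * max Q(s', a') - Q(s, a))^2, with a_next standing in for argmax a'
        q = critic(s, a)
        with torch.no_grad():
            target = r + gamma * critic(s_next, a_next)   # bootstrap target, no gradient
        return F.mse_loss(q, target)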
The actor_new network, actor_old network, and Critic network are constructed according to the PPO algorithm architecture. The actor_new network is a three-layer fully connected neural network with 42 neurons in the first layer, 128 in the second, and 15 in the third. The Critic network is a three-layer fully connected neural network with 57 neurons in the first layer, 64 in the second, and 1 in the third. The actor_old network has the same architecture as the actor_new network. After the model is built, the network parameters are initialized randomly.
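Sketching these layer widths in PyTorch: the Tanh activations and the initial copy of actor_old from actor_new are assumptions; only the widths 42-128-15 and 57-64-1 come from the source. Note that the 57-dimensional Critic input matches the concatenation of the 42-dimensional CNN output with the 15-dimensional Actor output described in step 7.

    import torch.nn as nn

    def mlp(widths):
        # three-layer fully connected network; Tanh activation assumed, not given in the source
        layers = []
        for i in range(len(widths) - 1):
            layers.append(nn.Linear(widths[i], widths[i + 1]))
            if i < len(widths) - 2:
                layers.append(nn.Tanh())
        return nn.Sequential(*layers)

    actor_new = mlp([42, 128, 15])   # widths per the embodiment
    actor_old = mlp([42, 128, 15])   # same architecture as actor_new
    critic = mlp([57, 64, 1])        # 57 = 42 (CNN output) + 15 (actor output)
    actor_old.load_state_dict(actor_new.state_dict())   # old net starts as a copy (assumed)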
Step 5: Concatenate the situation information with the influence features output by the influence module in step 3, and input the result into the convolutional neural network to obtain the output of the convolutional neural network. The input of the convolutional neural network is an 80-dimensional vector formed by concatenating the 26-dimensional initial situation with the 54-dimensional influence features; the output is a 42-dimensional vector.
Step 5 specifically comprises the following steps:
5.1 Concatenate the initial situation information with the feature information extracted by the influence map module and input the combined vector into the convolutional neural network. The combination is a direct concatenation of the two vectors.
5.2 The convolutional neural network uses several convolution kernels of different sizes to attend to different latent features.
Step 6: Input the output of the convolutional neural network into the Actor network of the PPO architecture and obtain the output of the Actor network.
Step 6 specifically comprises the following steps:
6.1 Build an experience pool and store each piece of experience information in it; a sketch follows below.
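The record format of the pool is not spelled out here; the sketch below assumes the conventional (s, a, r, s') transition tuple and implements the preference, noted in step 2, for collecting positive-reward experiences first.

    import random
    from collections import deque

    class ExperiencePool:
        # stores transitions; positive-reward experiences are drawn preferentially (per step 2)
        def __init__(self, capacity=10000):
            self.positive = deque(maxlen=capacity)
            self.other = deque(maxlen=capacity)

        def store(self, s, a, r, s_next):
            (self.positive if r > 0 else self.other).append((s, a, r, s_next))

        def sample(self, batch_size):
            # draw positive-reward experiences first, then fill the batch from the rest
            batch = random.sample(list(self.positive), min(batch_size, len(self.positive)))
            if len(batch) < batch_size:
                rest = min(batch_size - len(batch), len(self.other))
                batch += random.sample(list(self.other), rest)
            return batch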
Step 7: Concatenate the output of the Actor network with the output of the convolutional neural network and input the result into the Critic network to obtain the output of the Critic network. Optimize the network parameters N times with a mean squared error loss function, and optimize the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
The overall flow of data through the four networks is as follows: the initial situation data is input into the influence map module to obtain the secondary influence features; the initial situation data and the secondary influence features are concatenated and input into the convolutional neural network to obtain the output of the convolutional neural network; the output of the convolutional neural network is input into the actor_new network to obtain the two values μ and σ, from which a normal distribution representing the distribution of actions is established; an action is sampled from this normal distribution; the action interacts with the environment to yield the reward value given by the environment and the next state; the situation data of the next time step is input into the Critic network to obtain the network output V_, and the discounted reward value is computed. The state values of T time steps are input into the Critic network to obtain T values of V_; the mean squared error between the discounted reward value R and V_ is computed; the Critic network is then updated by back-propagation.
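A sketch of this T-step Critic update, assuming the usual backward recursion for the discounted reward R and bootstrapping from the Critic's V_ estimate of the next-moment situation; the discount factor value is assumed.

    import torch
    import torch.nn.functional as F

    def discounted_returns(rewards, v_last, gamma=0.9):
        # backward recursion R_t = r_t + gamma * R_{t+1}, seeded with the bootstrap value
        R, out = v_last, []
        for r in reversed(rewards):
            R = r + gamma * R
            out.insert(0, R)
        return torch.tensor(out)

    def critic_update(critic, optimizer, states, rewards, next_state):
        v_last = critic(next_state).detach().item()     # V_ of the next-moment situation data
        targets = discounted_returns(rewards, v_last)   # discounted reward values R
        v = critic(states).squeeze(-1)                  # T values of V_
        loss = F.mse_loss(v, targets)                   # mean squared error of R and V_
        optimizer.zero_grad()
        loss.backward()                                 # back-propagation mechanism
        optimizer.step()
        return loss.item()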
All state values in the experience pool are input into the actor_new network and the actor_old network respectively to obtain the action distributions N1 and N2; all action values a in the experience pool are input into N1 and N2 to obtain the probabilities p1 and p2, and the ratio p2/p1 is computed from p1 and p2; the error of the Actor network is then computed from this ratio, and the parameters are updated by back-propagation.
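Tying the pieces together, a sketch of one iteration of step 7 with N Critic optimizations and B Actor/CNN optimizations, reusing the hypothetical actor_loss and critic_update helpers sketched above; the loop structure, the default N and B, and the final synchronization of actor_old are assumptions consistent with the text.

    def train_iteration(cnn, actor_new, actor_old, critic, critic_opt, actor_opt,
                        batch, N=10, B=10):
        states, actions, rewards, next_state, advantages = batch
        for _ in range(N):                  # N optimizations with the MSE loss
            critic_update(critic, critic_opt, states, rewards, next_state)
        for _ in range(B):                  # B optimizations of the Actor and the CNN;
            # actor_opt is assumed to hold the parameters of both actor_new and the CNN
            loss = actor_loss(actor_new, actor_old, cnn(states), actions, advantages)
            actor_opt.zero_grad()
            loss.backward()
            actor_opt.step()
        actor_old.load_state_dict(actor_new.state_dict())  # sync old policy (assumed)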
The model is trained until convergence, at which point the construction of the CNN-PPO agent is complete.
The above embodiments merely describe preferred modes of the invention and do not limit its scope. Various modifications and improvements made to the technical solution of the invention by those skilled in the art without departing from the design spirit of the invention shall fall within the protection scope defined by the claims of the invention.

Claims (4)

1. A method for constructing a CNN-PPO-based wargame deduction agent, characterized by comprising the following steps:
collecting initial situation data from a wargame deduction platform and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and outputting the influence features;
constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, concatenating the target situation data with the influence features, inputting the result into the hybrid neural network model for iterative training until the objective function is minimized and the network converges, and thereby constructing a CNN-PPO agent;
the hybrid neural network model is of a CNN-PPO architecture and comprises a convolutional neural network, an actor_new network, an actor_old network and a Critic network;
the convolutional neural network is used for mining potential relations between target situation data, and extracting hidden features;
the actor_new network, the actor_old network and the Critic network are all three-layer fully-connected neural networks;
before the mixed neural network model is input for model iterative training, the output of the convolutional neural network is input into an actor_new network in the PPO architecture, and the output of the actor_new network is obtained; splicing the output of the actor_new network and the output of the convolutional neural network, and inputting the spliced output into a Critic network to obtain the output of the Critic network;
the output of the convolutional neural network is input into an actor_new network in the PPO architecture, and the obtaining of the output of the actor_new network comprises the steps of inputting the output of the convolutional neural network into the actor_new network to obtain two parameter values of mu and sigma; establishing normal distribution based on the two parameter values, wherein mu is the mean value of the normal distribution, and sigma is an equation of the normal distribution; obtaining an action according to the normal distribution sampling, and obtaining a rewarding value given by the environment and a next time state through interaction of the action and the environment;
concatenating the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the output of the Critic network comprises inputting the situation data of the next time step into the Critic network to obtain the network output V_ and computing the discounted reward value; inputting the state values of T time steps into the Critic network to obtain T values of V_; computing the mean squared error between the discounted reward values R and V_, and updating the Critic network by back-propagation; where V_ is the estimated return obtained by taking action a in state S.
2. The method for constructing a CNN-PPO-based wargame deduction agent according to claim 1, wherein
preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of friendly combat entities, attribute information of enemy combat entities, map-view attribute information, and scoreboard information;
the non-standard data comprises redundant data, data with missing fields, null values, and erroneous information.
3. The method for constructing a CNN-PPO-based wargame deduction agent according to claim 1, wherein
the iterative training of the hybrid neural network model consists of optimizing the network parameters N times with a mean squared error loss function and optimizing the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
4. The method for constructing a CNN-PPO-based wargame deduction agent according to claim 3, wherein
optimizing the network parameters N times with the mean squared error loss function and optimizing the Actor network and the convolutional neural network B times comprises inputting all state values in the experience pool into the actor_new network and the actor_old network respectively to obtain the action distributions N1 and N2; inputting all actions in the experience pool into N1 and N2 to obtain the probabilities p1 and p2, and computing the ratio p2/p1 from the probability values p1 and p2; and computing the error of the Actor network, updating the parameters by back-propagation, and training the model to convergence, thereby constructing the CNN-PPO agent.
CN202210232129.XA 2022-03-09 2022-03-09 Method for constructing a CNN-PPO-based wargame deduction agent Active CN114722998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210232129.XA CN114722998B (en) Method for constructing a CNN-PPO-based wargame deduction agent

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210232129.XA CN114722998B (en) Method for constructing a CNN-PPO-based wargame deduction agent

Publications (2)

Publication Number Publication Date
CN114722998A (en) 2022-07-08
CN114722998B (en) 2024-02-02

Family

ID=82238024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210232129.XA Active CN114722998B (en) Method for constructing a CNN-PPO-based wargame deduction agent

Country Status (1)

Country Link
CN (1) CN114722998B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115829034B (en) * 2023-01-09 2023-05-30 白杨时代(北京)科技有限公司 Method and device for constructing knowledge rule execution framework

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171796A (en) * 2017-12-25 2018-06-15 燕山大学 A kind of inspection machine human visual system and control method based on three-dimensional point cloud
CN109948642A (en) * 2019-01-18 2019-06-28 中山大学 Multiple agent cross-module state depth deterministic policy gradient training method based on image input
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113222106A (en) * 2021-02-10 2021-08-06 西北工业大学 Intelligent military chess deduction method based on distributed reinforcement learning
CN113947022A (en) * 2021-10-20 2022-01-18 哈尔滨工业大学(深圳) Near-end strategy optimization method based on model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325774A1 (en) * 2012-06-04 2013-12-05 Brain Corporation Learning stochastic apparatus and methods
US9679258B2 (en) * 2013-10-08 2017-06-13 Google Inc. Methods and apparatus for reinforcement learning
MX2018000942A (en) * 2015-07-24 2018-08-09 Deepmind Tech Ltd Continuous control with deep reinforcement learning.

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171796A (en) * 2017-12-25 2018-06-15 燕山大学 A kind of inspection machine human visual system and control method based on three-dimensional point cloud
CN109948642A (en) * 2019-01-18 2019-06-28 中山大学 Multiple agent cross-module state depth deterministic policy gradient training method based on image input
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN113222106A (en) * 2021-02-10 2021-08-06 西北工业大学 Intelligent military chess deduction method based on distributed reinforcement learning
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113947022A (en) * 2021-10-20 2022-01-18 哈尔滨工业大学(深圳) Near-end strategy optimization method based on model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Actor-Critic Reinforcement Learning and Application in Developing Computer-Vision-Based Interface Tracking; Oguzhan Dogru et al.; Engineering; Vol. 7, No. 9; 1248-1261 *
Research and Implementation of Intelligent Confrontation in Wargaming Based on Reinforcement Learning; Xue Ao; China Masters' Theses Full-text Database, Social Sciences I (No. 01); 6, 44 *
A Decision-Making Method Framework for Wargaming Based on Deep Reinforcement Learning; Cui Wenhua; Li Dong; Tang Yubo; Liu Shaojun; National Defense Technology (No. 02); 118-126 *

Also Published As

Publication number Publication date
CN114722998A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN112329348B (en) Intelligent decision-making method for military countermeasure game under incomplete information condition
CN110119773B (en) Global situation assessment method, system and device of strategic gaming system
CN113222106B (en) Intelligent soldier chess deduction method based on distributed reinforcement learning
CN114757351B (en) Defense method for resisting attack by deep reinforcement learning model
CN113723013B (en) Multi-agent decision-making method for continuous space soldier chess deduction
CN112364972A (en) Unmanned fighting vehicle team fire power distribution method based on deep reinforcement learning
CN114722998B (en) Construction method of soldier chess deduction intelligent body based on CNN-PPO
CN113625569B (en) Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN115328189B (en) Multi-unmanned plane cooperative game decision-making method and system
CN114638339A (en) Intelligent agent task allocation method based on deep reinforcement learning
CN113255893B (en) Self-evolution generation method of multi-agent action strategy
CN116596343A (en) Intelligent soldier chess deduction decision method based on deep reinforcement learning
Wu et al. Dynamic multitarget assignment based on deep reinforcement learning
CN113988301B (en) Tactical strategy generation method and device, electronic equipment and storage medium
CN115220458A (en) Distributed decision-making method for multi-robot multi-target enclosure based on reinforcement learning
Bian et al. Cooperative strike target assignment algorithm based on deep reinforcement learning
CN114611669B (en) Intelligent decision-making method for chess deduction based on double experience pool DDPG network
CN114662655B (en) Attention mechanism-based method and device for deriving AI layering decision by soldier chess
CN117252081A (en) Method for dynamically allocating air defense weapon-target to be driven
CN112926729B (en) Man-machine confrontation intelligent agent strategy making method
CN117973494A (en) Method, device and medium for enabling reinforcement learning oriented to multiple agents to be interpretable
CN114662655A (en) Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN117151224A (en) Strategy evolution training method, device, equipment and medium for strong random game of soldiers
CN117687436A (en) Unmanned aerial vehicle group attack and defense countermeasure game target distribution method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant