CN117474077B - Auxiliary decision making method and device based on OAR model and reinforcement learning - Google Patents

Auxiliary decision making method and device based on OAR model and reinforcement learning

Info

Publication number
CN117474077B
Authority
CN
China
Prior art keywords
player
target
attribute
agent
observed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311824731.3A
Other languages
Chinese (zh)
Other versions
CN117474077A (en)
Inventor
段一平
陶晓明
祖曰然
崔洲涓
李明哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202311824731.3A
Publication of CN117474077A
Application granted
Publication of CN117474077B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides an assistant decision-making method and device based on an OAR model and reinforcement learning, relates to the technical field of reinforcement learning, and aims to accurately predict the action probability distribution of an agent. The method comprises the following steps: acquiring attribute sets of targets observed by each agent; carrying out graph reasoning calculation on the attribute set of each target observed by each agent to obtain an attribute set matrix of each agent; acquiring a relation adjacency matrix; carrying out reasoning calculation on the attribute set matrix and the relation adjacency matrix of each agent to obtain a target attribute matrix fused with the whole-graph information; extracting background features from the environment background, and attaching the background features to a target attribute matrix to obtain OAR global features observed by each agent at each moment; processing the OAR global features by using a cyclic neural network to obtain target OAR global features which are corresponding to each agent and fused with history information; based on the target OAR global feature, an action of the agent is determined.

Description

Auxiliary decision making method and device based on OAR model and reinforcement learning
Technical Field
The disclosure relates to the technical field of reinforcement learning, in particular to an auxiliary decision-making method and device based on an OAR model and reinforcement learning.
Background
Reinforcement learning is a type of deep learning application that has been widely used in recent years, and a typical application scenario is solving decision problems. Its training paradigm differs from supervised learning in that reinforcement learning does not require labeled data, yet it also differs from completely unsupervised learning in that it requires minimal necessary feedback from the environment to judge whether a strategy is good or bad; such feedback is much simpler and more readily available than supervised labels.
The subject that learns, updates, iterates, and makes decision actions in reinforcement learning may be referred to as an agent. In a single-agent reinforcement learning task, the agent only needs to consider the influence of the environment, which acts as a relatively fixed condition: as long as the agent's action distribution is stable, the distribution of feedback information given by the environment, such as state observations and reward values, is also stable, so the agent reaches a stable convergence state relatively easily.
In addition to having a different mathematical form from the single-agent problem, the multi-agent reinforcement learning problem is harder to train in practice, and one of its biggest challenges is the environmental instability caused by multiple agents. In a multi-agent reinforcement learning task, from the perspective of each agent, the environment comprises not only the narrow objective environment of the single-agent task but also the other agents. When the action distribution of any other agent is unstable, the distribution of feedback information received by the agent is also unstable, so the agent cannot easily judge whether the current situation is good or bad, and in turn it is difficult to judge from the environment whether its strategy is good or bad.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide an OAR model and reinforcement learning-based decision-aiding method and apparatus to overcome or at least partially solve the above-described problems.
In a first aspect of the embodiments of the present disclosure, an auxiliary decision method based on OAR model and reinforcement learning is provided, and applied to a policy network, where the policy network includes a cyclic neural network and action networks corresponding to each type of agents, and the policy network is obtained by reinforcement learning; the method comprises the following steps:
acquiring attribute sets of targets observed by each agent, wherein the targets comprise the agents;
Carrying out graph reasoning calculation on the attribute set of each target observed by each agent to obtain an attribute set matrix of each agent;
Acquiring the relation among the targets, and acquiring a relation adjacency matrix according to the relation among the targets;
Carrying out reasoning calculation on the attribute set matrix of each agent and the relation adjacency matrix to obtain a target attribute matrix fused with the whole graph information;
Extracting background features from an environmental background, and attaching the background features to the target attribute matrix to obtain OAR global features observed by each agent at each moment;
processing the OAR global features by using the cyclic neural network to obtain target OAR global features fused with history information;
Inputting the target OAR global characteristic corresponding to each agent into an action network corresponding to the agent to obtain action probability distribution of each agent;
and determining the action of each agent according to the action probability distribution of the agent.
Optionally, the acquiring the attribute set of each target observed by each agent includes:
Acquiring dynamic properties of the targets observed by each agent;
Acquiring vector representations of static attributes of the targets;
mapping the dynamic attribute of each target observed by each agent to a vector space in which the vector representation of the static attribute is located, so as to obtain the vector representation of the dynamic attribute of each target observed by each agent;
and splicing the vector representation of the static attribute of each target with the vector representation of the dynamic attribute of each target observed by each agent to obtain an attribute set of each target observed by each agent.
Optionally, the target further comprises a non-agent; the acquiring the relation among the targets comprises the following steps:
Acquiring a first relation of each non-agent pointing to the agent and a second relation of each agent pointing to other agents;
and calculating respective distances of the first relation and the second relation, and screening a plurality of the first relation and the second relation according to the respective distances of the first relation and the second relation to obtain the relation among the targets.
Optionally, the adding the background feature to the target attribute matrix to obtain an OAR global feature observed by each agent at each moment includes:
Taking out the observed characteristics of each agent from the target attribute matrix;
And splicing the features observed by each agent with the background features to obtain the OAR global features observed by each agent at each moment.
Optionally, after the deriving the OAR global feature observed by each agent at each time, the method further comprises:
Inputting OAR global features observed by each agent at each moment into a value network corresponding to the agent to obtain the observation features of each agent;
Extracting the characteristics of the observed characteristics of each agent by using a second cyclic neural network to obtain the environmental RNN characteristics of each agent;
And predicting the environmental value corresponding to each agent according to the OAR global characteristic observed by each agent at each moment, wherein the environmental value is used for determining a reward base line value, and the reward base line value is used for performing reinforcement learning on the strategy network.
Optionally, the method further comprises:
for each agent, acquiring the corresponding environmental value of the agent in multi-step calculation;
Obtaining a standardized value estimation value according to the environment value corresponding to the agent in the multi-step calculation;
And calculating the standardized value estimation value by adopting a generalization advantage estimation method to obtain a reward base line value corresponding to each agent.
In a second aspect of the embodiments of the present disclosure, an auxiliary decision device based on OAR model and reinforcement learning is provided, and applied to a policy network, where the policy network includes a cyclic neural network and action networks corresponding to each type of agents, and the policy network is obtained by reinforcement learning; the device comprises:
the attribute acquisition module is used for acquiring attribute sets of targets observed by each agent, wherein the targets comprise the agents;
The first calculation module is used for carrying out graph reasoning calculation on the attribute set of each target observed by each agent to obtain an attribute set matrix of each agent;
The relationship acquisition module is used for acquiring the relationship among the targets and acquiring a relationship adjacency matrix according to the relationship among the targets;
The second calculation module is used for carrying out reasoning calculation on the attribute set matrix of each agent and the relation adjacency matrix to obtain a target attribute matrix fused with the whole-graph information;
the additional module is used for extracting background features from the environment background, and attaching the background features to the target attribute matrix to obtain OAR global features observed by each agent at each moment;
the processing module is used for processing the OAR global features by using the cyclic neural network to obtain target OAR global features which are corresponding to each agent and fused with history information;
The input module is used for inputting the target OAR global characteristics corresponding to each agent into the action network corresponding to the agent to obtain the action probability distribution of each agent;
and the action determining module is used for determining the actions of the agents according to the action probability distribution of each agent.
Optionally, the attribute obtaining module is specifically configured to perform:
Acquiring dynamic properties of the targets observed by each agent;
Acquiring vector representations of static attributes of the targets;
mapping the dynamic attribute of each target observed by each agent to a vector space in which the vector representation of the static attribute is located, so as to obtain the vector representation of the dynamic attribute of each target observed by each agent;
and splicing the vector representation of the static attribute of each target with the vector representation of the dynamic attribute of each target observed by each agent to obtain an attribute set of each target observed by each agent.
Optionally, the target further comprises a non-agent; the relationship acquisition module is specifically configured to perform:
Acquiring a first relation of each non-agent pointing to the agent and a second relation of each agent pointing to other agents;
and calculating respective distances of the first relation and the second relation, and screening a plurality of the first relation and the second relation according to the respective distances of the first relation and the second relation to obtain the relation among the targets.
Optionally, the additional module is specifically configured to perform:
Taking out the observed characteristics of each agent from the target attribute matrix;
And splicing the features observed by each agent with the background features to obtain the OAR global features observed by each agent at each moment.
In a third aspect of the disclosed embodiments, there is provided an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute instructions to implement the OAR model and reinforcement learning based decision-making aid method as in the first aspect.
In a fourth aspect of embodiments of the present disclosure, a computer-readable storage medium is provided; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the OAR model and reinforcement learning based auxiliary decision method as in the first aspect.
Embodiments of the present disclosure include the following advantages:
In the embodiment of the disclosure, acquiring an attribute set of each target observed by each agent, wherein the targets comprise the agents; carrying out graph reasoning calculation on the attribute set of each target observed by each agent to obtain an attribute set matrix of each agent; acquiring the relation among the targets, and acquiring a relation adjacency matrix according to the relation among the targets; carrying out reasoning calculation on the attribute set matrix of each agent and the relation adjacency matrix to obtain a target attribute matrix fused with the whole graph information; extracting background features from an environmental background, and adding the background features to the target attribute matrix to obtain OAR (object-attribute-relation) global features observed by each agent at each moment; processing the OAR global features by using the cyclic neural network to obtain target OAR global features, which are fused with history information, corresponding to each agent; inputting the target OAR global characteristics into an action network corresponding to the agents to obtain action probability distribution of each agent; and determining the action of each agent according to the action probability distribution of the agent. Thus, through the target OAR global feature which is corresponding to each agent and is fused with the history information, the environment observed by each agent can be determined, and further the action probability distribution of the agent can be accurately predicted.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are needed in the description of the embodiments of the present disclosure will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of interactions of an agent and an environment in an embodiment of the present disclosure;
FIG. 2 is a flow chart of steps of an auxiliary decision method based on an OAR model and reinforcement learning in an embodiment of the disclosure;
FIG. 3 is a graph of the OAR relationship of agents versus non-agents in an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an auxiliary decision device based on OAR model and reinforcement learning in the embodiment of the present disclosure.
Detailed Description
In order that the above-recited objects, features and advantages of the present disclosure will become more readily apparent, a more particular description of the disclosure will be rendered by reference to the appended drawings and appended detailed description.
In reinforcement learning decision problems, the module that learns, updates, iterates, and makes decision actions is called an agent, and the model that is affected by the agent's decisions and gives feedback information to the agent is called the environment. For example, in a Go AI (artificial intelligence) application, the board, the pieces, and the rules of Go are part of the environment; a football game that receives the player's operation instructions, simulates the physical process, and finally feeds the game picture and the win/loss settlement back to the player is also an environment if the player is regarded as an agent. The environment mainly feeds back two kinds of information to the agent: observations and rewards. The observation describes the current state of the environment and is the basis for the agent's decisions, while the reward describes, at some level, the effect of the strategy adopted by the agent. Rewards may be given at every step or only at specific times; for example, in Go a non-zero reward may be fed back only when the game ends and the winner is determined.
In theory, agent and environment interact continuously, but in computer simulation it is often necessary to set a sampling interval to simplify modeling and reduce computation, which has almost no adverse effect on reinforcement learning modeling. For example, sampling a video game at 24 frames per second is not very different from the human visual system in terms of observation, and making a decision every frame generally exceeds the standards of human play and game design in terms of interaction.
After discrete sampling, the environment simulation process forms a discrete time sequence. At time t, the agent obtains an observation $o_t$ of the environment state and accordingly selects an action $a_t$; the environment executes the action $a_t$, produces a new state observation $o_{t+1}$, and feeds back a reward $r_t$. As shown in FIG. 1, this process of the reinforcement learning system may be cycled several times until a stop condition is triggered. Such a process produces a trace:
$$\tau = (o_0, a_0, r_0,\; o_1, a_1, r_1,\; \ldots,\; o_T, a_T, r_T)$$
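As a concrete illustration of this sampled interaction loop, the following sketch accumulates such a trace; the `Env` and `Agent` interfaces (`reset`, `step`, `act`) are hypothetical placeholders used for illustration, not part of the disclosure:

```python
# Minimal sketch of the sampled agent-environment loop described above.
# `env` and `agent` are hypothetical objects, not the disclosure's API.
def rollout(env, agent, max_steps=1000):
    trace = []                                # will hold (o_t, a_t, r_t) tuples
    o_t = env.reset()                         # initial observation
    for t in range(max_steps):
        a_t = agent.act(o_t)                  # agent selects an action from its observation
        o_next, r_t, done = env.step(a_t)     # environment executes the action and feeds back
        trace.append((o_t, a_t, r_t))
        o_t = o_next
        if done:                              # stop condition triggered
            break
    return trace
```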
In reinforcement learning tasks, the final goal of the agent is to maximize the rewards earned. A single reward $r_t$ is only instantaneous; in practice it is usually desirable to make the sum of the reward sequence as large as possible, because that sum represents the long-term return. If only the reward $r_t$ at each individual moment is considered and optimized, the agent tends to fall into local optima and becomes "shortsighted". A return function $R_t$ is therefore defined to represent the long-term return, calculated as follows:
$$R_t = \sum_{t' > t} r_{t'}$$
where $R_t$ is the summation of the reward values after time t. Because events before time t have already occurred, no decision made by the agent can affect the earlier reward values; it only affects the rewards after time t, so it is reasonable to use $R_t$ to optimize the agent's policy. In a practical algorithm, in order to make the model easier to converge, a discount factor γ may be added to the calculation of the return function, namely:
$$R_t = \sum_{k \geq 0} \gamma^{k}\, r_{t+k}$$
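For instance, the discounted return can be computed from a recorded reward sequence by a simple backward recursion, as in the following sketch (illustrative only, not taken from the disclosure):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for every step of a finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: discounted_returns([0, 0, 1], gamma=0.9) -> [0.81, 0.9, 1.0]
```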
The policy of the agent is a function mapping from environmental observations to an action probability distribution, denoted π. If the policy π of the agent is fixed, a state value function $V^{\pi}(s)$ may be defined to assist the training of the reinforcement learning agent, where:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, R_t \mid s_t = s \,\right]$$
In deep policy gradient algorithms (such as DDPG) and the improved proximal policy optimization (PPO) algorithm, a state value function is commonly used to assist in training the policy function. The policy model of the present disclosure is based on the PPO algorithm, whose principle is briefly described below.
The trajectory of a single decision process is written as:
$$\tau = \{\, s_0, a_0,\; s_1, a_1,\; \ldots,\; s_T, a_T \,\}$$
Under the condition that the policy parameter is θ, the probability that this trajectory occurs is:
$$p_\theta(\tau) = p(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Let the length of the trajectory τ be T. The optimization target $\bar{R}_\theta$ is the expectation of the sum of rewards over trajectories (in practice, the return with a discount factor can also be used as the optimization target; the reasoning process is similar), i.e.:
$$\bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau) \right] = \sum_{\tau} R(\tau)\, p_\theta(\tau), \qquad R(\tau) = \sum_{t=1}^{T} r_t$$
$p(s_{t+1} \mid s_t, a_t)$, the probability of reaching state $s_{t+1}$ from state $s_t$ and action $a_t$, is independent of the policy parameter θ and depends only on the environment itself, so when differentiating $\bar{R}_\theta$ with respect to θ it is treated as a constant, namely:
$$\nabla_\theta \bar{R}_\theta = \sum_{\tau} R(\tau)\, \nabla_\theta p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right] = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$
If the policy gradient $\nabla_\theta \bar{R}_\theta$ could be computed exactly, the policy network could be updated directly along its direction. However, since the above formula involves an expectation, an analytical expression of $\nabla_\theta \bar{R}_\theta$ cannot be obtained; it can only be approximated by a Monte Carlo algorithm, which is the basis of reinforcement learning training. The more Monte Carlo simulations are performed, the more accurate the estimate of the expectation, and the closer the policy network gets to the optimal solution.
The derivative of the optimization target has a clear physical meaning: the direction of $\nabla_\theta \log p_\theta(\tau)$ is the direction that increases the probability of the actions in τ, and multiplying it by $R(\tau)$ is equivalent to weighting that direction by the trajectory's return. A larger $R(\tau)$ means that the policy made better actions in trajectory τ, so $R(\tau)\, \nabla_\theta \log p_\theta(\tau)$ is a direction that increases the probability of "good" actions.
In actual reinforcement learning training, the Monte Carlo approximation makes convergence difficult. To alleviate this problem, a baseline may be subtracted from the reward so that the weighting value has both positive and negative parts, creating a clearer contrast in the optimization direction: the probability of bad actions is reduced and the probability of good actions is increased. The policy gradient with a reward baseline is as follows:
$$\nabla_\theta \bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \left( R(\tau) - b \right) \nabla_\theta \log p_\theta(\tau) \right]$$
where b is the reward baseline. It can be seen that when the policy has converged the gradient term is 0 and the policy no longer changes, while before convergence $R(\tau) - b$ is likely to be either positive or negative, which achieves the goal of reducing the probability of bad actions and increasing the probability of good actions.
Since the baseline b is itself an expectation, it cannot be solved directly; existing reinforcement learning methods typically use a value network $V_\phi$ to fit the value of the reward baseline. Optimization of the value network is relatively straightforward: for a trajectory τ, the true return $R_t$ (with or without a discount factor) can be computed directly and used as the label for value network training, while the prediction of the value network is $V_\phi(s_t)$. The loss is usually measured with the MSE (mean square error), and the gradient of this loss is used to optimize the value network $V_\phi$.
If the policy network is trained only on data produced by the current policy, the training is on-policy; if it needs to be trained on historical data, it is called off-policy. To adapt the policy gradient algorithm to off-policy (offline) training, the policy gradient must be further modified.
If trajectories collected under parameters θ′ are used to train the parameters θ, i.e., the data distribution under θ′ replaces the data distribution under θ, the policy gradient expression becomes:
$$\nabla_\theta \bar{R}_\theta = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right]$$
During data collection it is therefore necessary to save the action $a_t$, the state $s_t$, and the action probability $\pi_{\theta'}(a_t \mid s_t)$ given by the action model under the parameters θ′; $\pi_\theta(a_t \mid s_t)$ can be computed on the fly when the model is updated.
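A minimal sketch of this bookkeeping, assuming a PyTorch-style categorical policy; the function names and interfaces are assumptions for illustration, not the disclosure's API:

```python
import torch

def collect_step(policy_old, obs):
    """During interaction: sample an action and store log pi_theta'(a|s) for later reuse."""
    with torch.no_grad():
        logits = policy_old(obs)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        old_log_prob = dist.log_prob(action)      # saved in the data buffer
    return action, old_log_prob

def importance_ratio(policy_new, obs, action, old_log_prob):
    """At update time: recompute log pi_theta(a|s) and form the off-policy ratio."""
    logits = policy_new(obs)
    new_log_prob = torch.distributions.Categorical(logits=logits).log_prob(action)
    return torch.exp(new_log_prob - old_log_prob)
```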
The above definitions and algorithms address the single-agent reinforcement learning task; the multi-agent reinforcement learning task is more complex. The structure of the OAR model is well suited to processing multi-target multimedia content, so the OAR model is combined with a multi-agent reinforcement learning algorithm to provide a multi-agent policy model based on the OAR model. The mathematical definition of the multi-agent reinforcement learning task is introduced first.
A multi-agent reinforcement learning problem involves two or more agents, so each variable is extended and changed to a different degree. The most common multi-agent reinforcement learning setting is the decentralized partially observable Markov decision process (DEC-POMDP). If there are n agents, the environment state at time t is $s_t$, and the agents' observation of the environment is an array $o_t = (o_t^1, \ldots, o_t^n)$, where $o_t^i$ denotes the i-th agent's observation of the environment in state $s_t$; the agents' actions $a_t = (a_t^1, \ldots, a_t^n)$ and rewards $r_t = (r_t^1, \ldots, r_t^n)$ are likewise arrays. In many practical scenarios, the agents in a multi-agent task are isomorphic; for example, the units manipulated by a football AI are players whose behavior logic and targets are basically consistent, which means their networks can share part of the parameters. By analogy with the backbone network of a neural network for visual tasks, a base network (BaseNet) can be set up for the multi-agent reinforcement learning model, whose main function is to extract features from the environment state, and an action network (ActNet) is set up for each agent to select that agent's action according to the features extracted by the base network, where BaseNet shares parameters across agents and ActNet does not. The action calculation process is as follows:
$$f = \mathrm{BaseNet}(s_t), \qquad a_t^i = \mathrm{ActNet}^i(f)$$
where f denotes the features extracted by the base network, and a denotes the action selected by the action network according to those features.
In particular, in a complete-information environment each agent can obtain all information about the environment and all local observations are identical, so the feature extraction of the base network only needs to be computed once, which greatly simplifies the computational complexity and training difficulty of the multi-agent network.
In addition to having a different mathematical form from the single-agent problem, the multi-agent reinforcement learning problem is harder to train in practice, and one of its biggest challenges is the environmental instability caused by multiple agents. Under a single-agent task, the agent only needs to consider the influence of the environment, which acts as a relatively fixed condition: as long as the agent's action distribution is stable, the distribution of feedback information such as state observations and reward values given by the environment is also stable, so the agent reaches a stable convergence state relatively easily. Under a multi-agent task, however, the "environment" seen from the perspective of each agent includes not only the narrow objective environment of the single-agent task but also the other agents; even if one agent's action distribution is stable, an unstable action distribution of another agent makes the distribution of feedback information received by the agent unstable, so the agent cannot easily judge whether the current situation is good or bad. To alleviate this problem, the present disclosure takes the environment state $s_t$ rather than the local observation $o_t^i$ as the input of the value network. This is because the value network needs to consider all global information to best predict the current situation: if only a single agent is considered, then when the behavior of other agents changes, the same observation of that single agent may correspond to different situation values, so the mapping from observation to value becomes one-to-many, which greatly harms the performance of the value network, since a neural network, being a deterministic mapping, cannot simulate a one-to-many mapping. It should be noted that when training is finished and the agent actually performs the decision task, it can only select actions based on its own partial observation.
Typically, because different agents receive different rewards, each agent requires a separate value network. In some scenarios, if multiple agents are in a collaborative relationship and share their rewards, their value networks may also be shared.
Referring to fig. 2, a flowchart illustrating steps of an assistant decision method based on OAR model and reinforcement learning in an embodiment of the present disclosure is shown, where the assistant decision method based on OAR model and reinforcement learning is applied to a policy network, where the policy network includes a cyclic neural network and action networks corresponding to each type of agent, and the policy network is obtained by reinforcement learning. As shown in FIG. 2, the auxiliary decision method based on the OAR model and reinforcement learning specifically includes steps S11 to S18.
Step S11: acquiring attribute sets of targets observed by each agent, wherein the targets comprise the agents;
Step S12: carrying out graph reasoning calculation on the attribute set of each target observed by each agent to obtain an attribute set matrix of each agent;
Step S13: acquiring the relation among the targets, and acquiring a relation adjacency matrix according to the relation among the targets;
step S14: carrying out reasoning calculation on the attribute set matrix of each agent and the relation adjacency matrix to obtain a target attribute matrix fused with the whole graph information;
step S15: extracting background features from an environmental background, and attaching the background features to the target attribute matrix to obtain OAR global features observed by each agent at each moment;
Step S16: processing the OAR global features by using the cyclic neural network to obtain target OAR global features which are corresponding to each agent and fused with history information;
step S17: inputting the target OAR global characteristic corresponding to each agent into an action network corresponding to the agent to obtain action probability distribution of each agent;
step S18: and determining the action of each agent according to the action probability distribution of the agent.
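Putting steps S11 to S18 together, the forward pass of the policy network can be sketched roughly as follows; every module name (`attribute_encoder`, `build_adjacency`, `graph_reasoner`, and so on) is an illustrative placeholder assumed for this sketch rather than the disclosure's actual interface:

```python
import torch

def policy_forward(observations, background, rnn_state, nets):
    """Schematic forward pass following steps S11-S18. `nets` bundles hypothetical
    sub-modules (attribute_encoder, build_adjacency, graph_reasoner, background_mlp,
    rnn, action_nets); none of these names come from the disclosure."""
    P = nets.attribute_encoder(observations)        # S11-S12: attribute set matrix per agent
    A = nets.build_adjacency(observations)          # S13: relation adjacency matrix
    G = nets.graph_reasoner(P, A)                   # S14: target attributes fused with whole-graph info
    ctx = nets.background_mlp(background)           # S15: background features from the environment context
    oar = torch.cat([G, ctx.expand(G.size(0), -1)], dim=-1)   # S15: OAR global features per agent
    feat, rnn_state = nets.rnn(oar, rnn_state)      # S16: fuse history information
    actions = []
    for i, act_net in enumerate(nets.action_nets):  # S17: one action network per agent (type)
        probs = torch.softmax(act_net(feat[i]), dim=-1)
        actions.append(torch.distributions.Categorical(probs=probs).sample())  # S18: sample action
    return actions, rnn_state
```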
A multi-agent reinforcement learning task contains multiple agents, and possibly also non-agents, which may be collectively referred to as targets. A target is a relatively independent entity unit divided according to semantics, and an agent is a target for which the policy network needs to select actions. If there are m targets and n agents, generally m > n.
Optionally, the acquiring the attribute set of each target observed by each agent may include: acquiring dynamic properties of the targets observed by each agent; acquiring vector representations of static attributes of the targets; mapping the dynamic attribute of each target observed by each agent to a vector space in which the vector representation of the static attribute is located, so as to obtain the vector representation of the dynamic attribute of each target observed by each agent; and splicing the vector representation of the static attribute of each target with the vector representation of the dynamic attribute of each target observed by each agent to obtain an attribute set of each target observed by each agent.
In a multi-agent reinforcement learning task, each agent's observation of the environment $o_t^i$ is a vector without special structure, and the policy network π used by the agent directly accepts $o_t^i$ as input, which limits the ability of the feature extraction modules in the policy network. The disclosed embodiments first express $o_t^i$ in the form of an OAR graph, because $o_t^i$ typically consists of the state information of the individual targets in the environment and the environment background itself; therefore:
$$o_t^i = \left( \{\, p_{t,j}^i \,\}_{j=1}^{m},\; e_t^i \right)$$
where $o_t^i$ denotes the i-th agent's observation of the environment at time t, $p_{t,j}^i$ is the attribute information of the j-th target observed by the i-th agent at time t (it carries the subscript t because the attribute information includes dynamic attribute information that can change over time), and $e_t^i$ denotes the environment background information observed by the i-th agent at time t.
Besides dynamic attributes, targets also have static attributes. The static attribute $u_j$ of the j-th target is a vector that does not change over time and includes at least any one or more of: category information, ranking information, and inherent capabilities. Static attributes help the network distinguish between different targets, because the dynamic attributes of targets usually have a similar structure. For example, in an environment containing only position information, positions change constantly and different targets may occupy the same position at different times, so their information would be indistinguishable; characterizing targets only by dynamic attributes therefore easily yields confusing data.
The static attributes are computed in a manner similar to word embedding in natural language processing: a learnable parameter matrix $U$ is set, this matrix is continuously optimized in the subsequent training process, and suitable vector representations of the static attributes are learned automatically. The static attribute of the j-th target is the j-th row of the matrix $U$, i.e.:
$$u_j = U_{j,:}$$
The dynamic attributes undergo a feature transformation that maps them into a semantic space similar to that of the static-attribute vector representations; a single linear neural network layer may be used for this transformation. Concatenating the transformed vector representation of the dynamic attributes with the vector representation of the static attributes gives the attribute information of each target:
$$p_{t,j}^i = \mathrm{concat}\left( u_j,\; \mathrm{relu}\left( W_d\, x_{t,j}^i + b_d \right) \right)$$
where concat denotes concatenation, $x_{t,j}^i$ denotes the dynamic attributes of the j-th target observed by the i-th agent at time t, $W_d$ is the parameter matrix of the linear layer, $b_d$ is its bias vector, and relu (the linear rectification function) is used as the activation function.
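A minimal PyTorch-style sketch of this attribute construction is given below; the module name, the dimensions, and the use of `nn.Embedding` for the learnable matrix U are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Builds p_j = concat(static embedding u_j, relu(W_d x_j + b_d)) for every target."""
    def __init__(self, num_targets, static_dim, dynamic_dim, hidden_dim):
        super().__init__()
        self.static = nn.Embedding(num_targets, static_dim)   # learnable matrix U, one row per target
        self.dynamic = nn.Linear(dynamic_dim, hidden_dim)      # maps dynamic attributes into a similar space

    def forward(self, dynamic_attrs):
        # dynamic_attrs: (num_targets, dynamic_dim) as observed by one agent at time t
        idx = torch.arange(dynamic_attrs.size(0))
        u = self.static(idx)                                   # static attribute vectors u_j
        d = torch.relu(self.dynamic(dynamic_attrs))            # transformed dynamic attributes
        return torch.cat([u, d], dim=-1)                       # attribute set matrix P_t
```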
After the attribute information of all the targets is constructed, the relationship between the targets needs to be further constructed. Optionally, obtaining a first relationship that each of the non-agents points to the agent, and a second relationship that each of the agents points to other agents; and calculating respective distances of the first relation and the second relation, and screening a plurality of the first relation and the second relation according to the respective distances of the first relation and the second relation to obtain the relation among the targets.
The relationships may be constructed in a specially designed manner according to the needs of the scene and task. The embodiment of the disclosure considers a reinforcement learning problem with m targets and n agents. Non-agent targets are not driven by dynamic decision strategies, that is, the behavior of other targets does not affect their next actions, so they are more like part of the environment; agent targets, by contrast, need to consider more information, and in particular the actions of other agents tend to affect their behavior. Non-agent targets should therefore have a simple relationship structure, while agent targets need a more comprehensive one. Based on this, as shown in FIG. 3, two kinds of relationships are set in the embodiment of the disclosure: one is the relationship from a non-agent to an agent, and the other is the relationship from an agent to an agent. In FIG. 3 there are 9 targets, of which 5 are agent targets and 4 are non-agent targets. To simplify the calculation, only the edges with the shortest distances are kept for each class of relationship; for example, if only the three nearest non-agent-to-agent relationships of each agent are retained, the three closest non-agent targets keep their edges to that agent, and if the nearest 4 agent-to-agent relationships of each agent are retained, a fully connected graph is formed among the 5 agents.
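One possible realization of this nearest-k relation screening is sketched below in NumPy; the Euclidean distance metric and the default values k_oa = 3 and k_aa = 4 are assumptions taken from the example above, and only the connection pattern follows the description:

```python
import numpy as np

def build_adjacency(positions, agent_idx, k_oa=3, k_aa=4):
    """positions: (m, 2) array of target coordinates; agent_idx: indices of the n agent targets.
    Keeps the k_oa nearest non-agent->agent edges and the k_aa nearest agent->agent edges."""
    m = positions.shape[0]
    A = np.zeros((m, m), dtype=np.float32)
    agents = set(int(i) for i in agent_idx)
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    for i in agent_idx:                                   # edges point towards each agent i
        non_agents = [j for j in range(m) if j not in agents]
        others = [j for j in agent_idx if j != i]
        for src_pool, k in ((non_agents, k_oa), (others, k_aa)):
            nearest = sorted(src_pool, key=lambda j: dist[j, i])[:k]
            for j in nearest:
                A[j, i] = 1.0                             # relation from target j to agent i
    return A
```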
After the relations between targets and the attributes of the targets are obtained, an OAR graph can be constructed from each agent's observation of the environment, the attributes of the targets, and the relations among them. The OAR graph is input into the base network, which extracts features from it to obtain the global OAR feature. At time t, the attributes of all targets under the observation of the i-th agent are assembled into a matrix
$$P_t^i = \left[\, p_{t,1}^i;\; p_{t,2}^i;\; \ldots;\; p_{t,m}^i \,\right]$$
The relations of the OAR graph constitute an adjacency matrix $A$, where $A_{pq} = 0$ indicates that there is no relationship from the p-th target to the q-th target and $A_{pq} = 1$ indicates that such a relationship exists.
The k-th layer of OAR attributes may be calculated from the (k-1)-th layer of OAR attributes using the following formula, finally yielding the global OAR feature:
$$P^{(k)} = \mathrm{relu}\left( A\, P^{(k-1)}\, W^{(k)} \right), \qquad P^{(0)} = P_t^i$$
where $W^{(k)}$ is the parameter matrix of the k-th layer of OAR graph reasoning. After K layers of reasoning calculation, the fused attribute matrix $P^{(K)}$ is obtained; $P^{(K)}$ is the global OAR feature. The attribute of each target contained in the global OAR feature fuses the whole-graph information of the OAR graph, so decisions made on the basis of the global OAR feature can produce more reasonable actions that take the global situation into account.
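The layer-wise reasoning P(k) = relu(A P(k-1) W(k)) can be sketched as follows; whether and how A is normalized is not specified in the text, so this sketch applies the raw adjacency matrix:

```python
import torch
import torch.nn as nn

class GraphReasoner(nn.Module):
    """K rounds of P <- relu(A @ P @ W_k), fusing whole-graph information into every target row."""
    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_layers)])

    def forward(self, P, A):
        # P: (m, dim) attribute matrix, A: (m, m) relation adjacency matrix (torch tensors)
        for lin in self.layers:
            P = torch.relu(A @ lin(P))    # A @ (P W_k), followed by relu
        return P                          # global OAR feature: every row carries whole-graph information
```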
Extracting OAR features observed by each agent from the global OAR features comprises the following steps: obtaining a background vector of the environmental background information; extracting observation features of each agent from the global OAR features; and splicing the observation features of each agent with the background vector to obtain the OAR features observed by each agent.
The targets are ordered so that the i-th agent is the i-th target. Taking the i-th row of the global OAR feature and concatenating it with the features extracted from the environment background information $e_t^i$ by a multi-layer linear neural network (MLP) yields the OAR feature $g_t^i$ observed by the i-th agent at time t.
In order to capture history information, in the embodiment of the disclosure a first recurrent neural network (RNN) is used to perform feature extraction on the OAR features observed by each agent, obtaining the RNN feature observed by each agent. Setting the number of RNN layers to L, the feature of the l-th layer (l ≠ 1) is
$$h_t^{(l)} = \mathrm{relu}\left( W^{(l)}\, h_t^{(l-1)} + U^{(l)}\, h_{t-1}^{(l)} + b^{(l)} \right)$$
where the feature of the first layer is calculated as:
$$h_t^{(1)} = \mathrm{relu}\left( W^{(1)}\, g_t^i + U^{(1)}\, h_{t-1}^{(1)} + b^{(1)} \right)$$
Here $W^{(l)}$ and $U^{(l)}$ are parameter matrices of the first recurrent neural network, and $b^{(l)}$ is the bias vector of the first recurrent neural network.
The calculation from the raw input to the RNN feature is carried out by the base network. The last-layer RNN feature of each agent is sent to the action network corresponding to that agent, which may be a multi-layer linear neural network. The output dimension of its last layer is fixed to the number of selectable actions $N_a$, and the output is normalized with a softmax activation function to obtain a vector $\pi_t^i$ of length $N_a$ whose elements sum to 1. This vector is the action probability distribution; its x-th element represents the probability that the agent takes the x-th action. Based on the probability of each action, the action finally chosen by the agent can be determined:
$$a_t^i \sim \pi_t^i = \mathrm{softmax}\left( \mathrm{ActNet}^i\!\left( h_t^{i,(L)} \right) \right)$$
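The action head described here can be sketched as follows; the hidden layer size and the two-layer structure are assumptions, while the softmax over the N_a outputs and the sampling step follow the description:

```python
import torch
import torch.nn as nn

class ActNet(nn.Module):
    """Per-agent action network: last-layer RNN feature -> action probability distribution."""
    def __init__(self, feat_dim, hidden_dim, num_actions):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, num_actions))

    def forward(self, rnn_feat):
        probs = torch.softmax(self.mlp(rnn_feat), dim=-1)            # length-N_a vector summing to 1
        action = torch.distributions.Categorical(probs=probs).sample()
        return action, probs
```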
Optionally, after obtaining the RNN feature observed by each of the agents, further comprising: inputting the observation of each agent to the environment into a value network corresponding to the agent to obtain the observation characteristic of each agent; extracting the characteristics of the observed characteristics of each agent by using a second cyclic neural network to obtain the environmental RNN characteristics of each agent; and predicting the environmental value corresponding to each agent according to the environmental RNN characteristics of each agent, wherein the environmental value is used for determining a reward base line value, and the reward base line value is used for performing reinforcement learning on the strategy network.
In addition to calculating the action probabilities, embodiments of the present disclosure also include value networks; the value may be estimated using a fully connected network as the value network. The fully connected network directly accepts the environment state $s_t$ as input and performs feature extraction. In order to capture history information, the features extracted by the multi-layer perceptron are processed by a second recurrent neural network, and the resulting feature $z_t^i$ is mapped to a real number by a fully connected layer as the value estimate, calculated as follows:
$$V_t^i = \mathrm{FC}\left( z_t^i \right), \qquad z_t^i = \mathrm{RNN}_2\!\left( \mathrm{MLP}(s_t),\; z_{t-1}^i \right)$$
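A corresponding sketch of the value branch (MLP feature extraction, a second recurrent network for history, and a scalar head); the use of a GRU here is an assumption, since the text only specifies a second recurrent neural network:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Accepts the global environment state, keeps history with a second RNN, outputs one value."""
    def __init__(self, state_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # second recurrent network
        self.head = nn.Linear(hidden_dim, 1)                          # maps the RNN feature to a real number

    def forward(self, state_seq, h0=None):
        # state_seq: (batch, time, state_dim) sequence of environment states
        z, h = self.rnn(self.mlp(state_seq), h0)
        return self.head(z).squeeze(-1), h                            # value estimates and new RNN state
```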
The reinforcement-learning-based policy network must go through a continuous cycle of interacting with the environment, simulating, training, and updating. Optionally, as one embodiment, training of the policy network and the value network is performed with stored environment-interaction information from previous simulation. In order to save the information required for training, a data caching module needs to be constructed. As shown in Table 1, the data caching module stores various kinds of information, including environment observations, actions, rewards, and state information.
TABLE 1 data cache Module construction
where n is the number of agents, $d_{\pi}$ is the dimension of the policy-network RNN state (the policy-network RNN state is the RNN feature observed by each agent), $L_{\pi}$ is the number of layers of the policy-network RNN, $d_{v}$ is the dimension of the value-network RNN state (the value-network RNN state is the environment RNN feature of each agent), $L_{v}$ is the number of layers of the value-network RNN, step is the maximum number of steps stored in the data buffer, $d_{o}$ is the dimension of the environment observation vector, and $N_a$ is the number of selectable actions.
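The buffer layout of Table 1 can be mirrored with pre-allocated arrays, as in the following sketch; the field names and exact shapes are assumptions based on the symbols above, since Table 1 itself is not reproduced here:

```python
import numpy as np

class RolloutBuffer:
    """Pre-allocated storage for one round of environment interaction (cf. Table 1)."""
    def __init__(self, step, n, d_o, n_actions, L_pi, d_pi, L_v, d_v):
        self.obs       = np.zeros((step, n, d_o), dtype=np.float32)         # environment observations
        self.actions   = np.zeros((step, n), dtype=np.int64)
        self.rewards   = np.zeros((step, n), dtype=np.float32)
        self.values    = np.zeros((step, n), dtype=np.float32)              # value network outputs
        self.log_probs = np.zeros((step, n), dtype=np.float32)              # log pi(a_t | o_t) at collection time
        self.pi_rnn    = np.zeros((step, n, L_pi, d_pi), dtype=np.float32)  # policy-network RNN states
        self.v_rnn     = np.zeros((step, n, L_v, d_v), dtype=np.float32)    # value-network RNN states
        self.ptr = 0

    def add(self, **fields):
        for name, value in fields.items():
            getattr(self, name)[self.ptr] = value
        self.ptr += 1
```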
The training of the strategy network follows the training mode of general reinforcement learning, and in the process of interaction between the agent and the environment for many times, data are continuously collected and network parameters are updated, and the training and the simulation are synchronously carried out.
The initialization of the policy network may include: constructing a base network for extracting OAR features, determining the types and numbers of agents, and constructing an action network ActNet for each type of agent; the action networks and the base network together form the policy network. The action network receives the OAR features extracted by the base network and outputs an action probability distribution. In addition, a value network is constructed for each class of agents. The parameters of all networks are initialized orthogonally. Assuming the number of agents is n, the learning rate α, the number of interaction steps per round $T_{\mathrm{step}}$, and the reward discount factor γ are set, and the data buffer is initialized as D = { }.
Epochs rounds of simulation and training are performed, and in each round $T_{\mathrm{step}}$ steps of simulation are carried out. In the first step of each round, the data cache D is emptied, but the RNN state of the last step of the previous round is preserved. In each simulation step, an OAR graph is constructed according to the environment observation $o_t$, the base network is used to obtain the OAR feature, and each agent uses its own action network, based on the current OAR feature and the policy-network RNN state of the previous step, to calculate the action probability distribution $\pi_t^i$ and the new policy-network RNN state $h_t^i$, where t denotes the current simulation step and i is the number of the agent. At the same time, each agent uses the value network, based on the current environment observation and the value-network RNN state of the previous step, to calculate the current value $V_t^i$ and the new value-network RNN state $z_t^i$.
According to the action probability distribution $\pi_t^i$ of each agent, the actually executed action $a_t^i$ is sampled and issued to the environment; the environment executes the action, performs one step of simulation, and then returns the new environment observation $o_{t+1}$ and the reward value $r_t$.
The data of the current step, including $o_t$, $a_t$, $r_t$, $V_t^i$, and the RNN states, are stored in the data buffer D; at the same time, the logarithm of the probability of action $a_t^i$ under the distribution $\pi_t^i$, i.e. $\log \pi_t^i(a_t^i)$, is calculated and also placed in the data buffer.
The reward baseline value is then calculated; it depends mainly on the reward value $r_t$ and the output $V_t^i$ of the value network. Similar to normalization methods in general machine learning, the PopArt algorithm is adopted to transform the output of the value network, with the transformation:
$$\hat{V}_t^i = \frac{V_t^i - \mu_i}{\sigma_i}$$
where $\sigma_i$ and $\mu_i$ are respectively the standard deviation and the mean of the value estimates output by the i-th agent's value network over the last several steps of calculation. The normalized value estimate $\hat{V}_t^i$ is obtained with this transformation.
The reward baseline value is then calculated from the normalized value estimates using generalized advantage estimation (GAE), with the formula:
$$\delta_t^i = r_t^i + \gamma\, \hat{V}_{t+1}^i - \hat{V}_t^i, \qquad \hat{A}_t^i = \sum_{k \geq 0} (\gamma \lambda)^k\, \delta_{t+k}^i$$
where λ is a parameter between 0 and 1; the closer λ is to 1, the closer the effect of GAE is to the Monte Carlo method, and the closer λ is to 0, the closer the effect is to the temporal-difference method.
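A sketch of this GAE computation over the normalized value estimates; the backward recursion below is the standard equivalent form of the summation above:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: (T,) per-step rewards; values: (T+1,) normalized value estimates V_hat_t
    (with a bootstrap value appended). Returns the reward baseline values A_hat_t."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        running = delta + gamma * lam * running                  # accumulate (gamma*lambda)^k * delta
        adv[t] = running
    return adv
```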
The network parameters of the policy network and the value network are then updated; the two updates are somewhat coupled but approximately parallel and independent. The update of the value network is relatively simple: the reward baseline value $\hat{A}_t^i$ calculated above is taken as the target value of the value network, and the value network is optimized towards it. Before optimizing the value network, this target also needs to be transformed, because its value is derived from the output of the transformed value network. The transformation relationship is as follows:
$$\tilde{A}_t^i = \sigma_i\, \hat{A}_t^i + \mu_i$$
where $\sigma_i$ and $\mu_i$ are respectively the standard deviation and the mean of the value estimates. This transformation is the inverse of the output transformation of the value network. During training, the value network is used again to obtain $V_t^i$ from $s_t$, and the mean square error between $V_t^i$ and $\tilde{A}_t^i$ is taken as the loss function, so the value network of the i-th agent can be optimized as follows:
$$L_V^i = \frac{1}{T} \sum_{t} \left( V_t^i - \tilde{A}_t^i \right)^2$$
The optimization update of the policy network depends on the output of the value network. The optimization objective with a reward baseline and off-policy training correction is as follows:
$$J^i(\theta) = \frac{1}{T} \sum_{t} \hat{A}_t^i\, \frac{\pi_\theta^i(a_t^i \mid o_t^i)}{\pi_{\theta'}^i(a_t^i \mid o_t^i)}, \qquad L_\pi^i = -J^i(\theta)$$
where $\hat{A}_t^i$ is the advantage function, obtained from the PopArt-transformed value estimates; note that $\pi_\theta^i(a_t^i \mid o_t^i)$ and the advantage weights are recalculated at training time, while $\log \pi_{\theta'}^i(a_t^i \mid o_t^i)$ is the logarithm of the action probability output by the original policy network during the environment interaction simulation. At training time, the new policy network outputs a new action probability distribution according to the previously recorded environment observation $o_t^i$, and the logarithm of the probability that this new distribution assigns to the actually executed action $a_t^i$ taken at interaction time is $\log \pi_\theta^i(a_t^i \mid o_t^i)$. Note that the loss $L_\pi^i$ is the negative of the optimization objective, because the physical meaning of the optimization objective is a return value, for which larger is better; the negative sign turns gradient descent into gradient ascent on the objective. With this loss function, optimization algorithms such as Adam can be used to update the parameters of the i-th agent's policy network.
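The policy loss described above, i.e. the negative of the baseline-weighted and importance-corrected objective, can be sketched as follows; note that no PPO clipping term is included, since none is mentioned in the text:

```python
import torch

def policy_loss(new_log_probs, old_log_probs, advantages):
    """new_log_probs: log pi_theta(a_t|o_t) recomputed at training time,
    old_log_probs: log pi_theta'(a_t|o_t) stored during interaction,
    advantages: PopArt/GAE reward baseline values A_hat_t."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # off-policy (importance sampling) correction
    objective = (ratio * advantages).mean()            # larger is better
    return -objective                                  # negate so gradient descent increases the objective
```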
By adopting the technical solution of the embodiment of the disclosure, an OAR graph is constructed from each agent's observation of the environment and the relations between the targets; the environment of the multi-agent reinforcement learning task is described by the OAR graph, a corresponding set of feature extraction methods is provided, and multiple reinforcement learning methods are adapted.
In order to evaluate the performance of the OAR model and reinforcement learning based auxiliary decision method provided by the embodiments of the present disclosure, three objective indicators, namely the average win rate, the average reward value, and the policy entropy, are used to evaluate the method provided by the embodiments of the present disclosure, the conventional MAPPO algorithm, and the MAPPO algorithm using a Transformer. The average win rate is the proportion of games won under the strategy used by the agents and directly reflects the quality of the strategy. The average reward value is the average of the sum of rewards harvested by the agents in each simulated game; since the sum of rewards is closely related to the optimization target of network training, this indicator reflects the effect of policy network training. The policy entropy is the entropy of the action probability distribution output by the policy network: the larger the value, the more random the strategy, and the smaller the value, the more definite the strategy. A completely random strategy is equivalent to no strategy, and a smaller policy entropy indicates that the strategy has likely learned something meaningful, so the policy entropy reflects the training effect from another angle.
In terms of average win rate, experiments show that the auxiliary decision method based on the OAR model and reinforcement learning provided by the embodiment of the disclosure differs little from the conventional MAPPO algorithm, but is clearly superior to the MAPPO algorithm using a Transformer (T-MAPPO). The main reason is that using a Transformer is equivalent to a graph neural network in which all nodes are connected to each other, so the model has relatively many parameters, which means more training data are required under the same conditions for it to converge; since the data from each round of simulated interaction are limited, the T-MAPPO algorithm is very likely to receive no positive feedback and thus remain in a chaotic state. The relation connection scheme designed in the auxiliary decision method based on the OAR model and reinforcement learning simplifies most of the connections between targets: the number of connections in the Transformer model is proportional to the square of the number of targets, while the number of connections in the OAR graph of the proposed method is only proportional to the number of targets, so the parameters are greatly reduced and convergence is easier.
In terms of average reward, experiments show that the auxiliary decision method based on the OAR model and reinforcement learning provided by the embodiment of the disclosure achieves the best results on three different tasks. Especially in the run-coordination task, it is noticed that for the conventional MAPPO algorithm the reward value decreases rather than increases as the number of training steps grows, which means that the MAPPO algorithm is insufficiently stable in some cases. This is probably because the conventional MAPPO algorithm does not model or compute the interrelations among multiple targets; its features are simply mapped directly from the targets' position information and the like, and deeper, more critical features are not discovered, so the decision quality is not stable enough. In the auxiliary decision method based on the OAR model and reinforcement learning provided by the embodiment of the disclosure, the attribute features on each target on which decisions depend are fused with global information while also taking the influence of surrounding targets into account, which suits the multi-target, highly adversarial characteristics of the football scene, so the effect is better.
In terms of policy entropy, experiments show that the policy entropy of the auxiliary decision method based on the OAR model and reinforcement learning provided by the embodiment of the disclosure is basically the lowest, which means that the decisions made by the method are relatively more definite; this advantage also derives from the design of the OAR method. The conventional MAPPO algorithm extracts features directly from each agent's environment observation and selects actions according to those features. However, the environment observation is expressed in the absolute reference frame of the pitch, so the observations obtained by different agents differ very little; in fact, in this scene the observations of different agents differ by only one bit, and the features of different agents are almost the same, making it difficult for the action network to distinguish between agents. It therefore tends to give similar action probability distributions, and in order to make the overall policy loss smaller, training optimization easily drives the action distribution of each agent toward the average of the overall action distribution; a more averaged action distribution means larger policy entropy, which is verified in the comparison of policy entropies. In the method provided by the embodiment of the disclosure, the feature on which each agent makes its decision is the attribute of the corresponding target node; the attributes of different targets all fuse global information, yet they also differ clearly because different targets have different surrounding topological structures in the OAR graph. This helps each target take more specialized, and generally more definite, actions, so the policy entropy is also clearly lower. The policy entropies of the three methods differ little in the counterattack task, mainly because that task is simpler: all targets tend to move in the same direction and their action distributions are similar, so the conventional MAPPO algorithm is not affected much. The other two tasks are relatively complex, and the actions of different agents should show a certain degree of difference, so the method proposed by the embodiment of the disclosure achieves clearly lower policy entropy there.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the disclosed embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the disclosed embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the disclosed embodiments.
Fig. 4 is a schematic structural diagram of an auxiliary decision-making device based on the OAR model and reinforcement learning in an embodiment of the present disclosure. The device is applied to a policy network, where the policy network includes a recurrent neural network and an action network corresponding to each type of agent, and the policy network is obtained through reinforcement learning. As shown in Fig. 4, the device includes an attribute acquisition module, a first calculation module, a relationship acquisition module, a second calculation module, an appending module, a processing module, an input module, and an action determination module, wherein:

the attribute acquisition module is configured to acquire the attribute set of each target observed by each agent, where the targets include the agents;

the first calculation module is configured to perform graph reasoning calculation on the attribute set of each target observed by each agent to obtain an attribute set matrix of each agent;

the relationship acquisition module is configured to acquire the relationships among the targets and obtain a relation adjacency matrix according to the relationships among the targets;

the second calculation module is configured to perform reasoning calculation on the attribute set matrix of each agent and the relation adjacency matrix to obtain a target attribute matrix fused with whole-graph information;

the appending module is configured to extract background features from the environment background and append the background features to the target attribute matrix to obtain the OAR global features observed by each agent at each moment;

the processing module is configured to process the OAR global features using the recurrent neural network to obtain, for each agent, target OAR global features fused with historical information;

the input module is configured to input the target OAR global features corresponding to each agent into the action network corresponding to that agent to obtain the action probability distribution of each agent; and

the action determination module is configured to determine the action of each agent according to the action probability distribution of each agent.
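To make the cooperation of these modules easier to follow, the data flow can be sketched roughly as below. This is a simplified illustration, not the claimed apparatus: the graph reasoning is reduced to a single linear projection plus an adjacency multiplication, a GRU stands in for the recurrent neural network, a mean over targets stands in for taking out the observing agent's features, and all class, shape, and variable names are assumptions.

```python
import torch
import torch.nn as nn

class OARPolicySketch(nn.Module):
    """Illustrative forward pass: attributes -> graph reasoning -> background
    concat -> GRU over time -> per-agent action probabilities."""

    def __init__(self, attr_dim, hidden_dim, num_actions):
        super().__init__()
        self.node_proj = nn.Linear(attr_dim, hidden_dim)        # simplified graph reasoning
        self.gru = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)   # one head per agent type in practice

    def forward(self, attrs, adj, background, h0=None):
        # attrs: (T, N, attr_dim) attribute sets of N targets over T steps
        # adj:   (N, N) relation adjacency matrix
        # background: (T, hidden_dim) environment background features
        x = self.node_proj(attrs)                                # attribute set matrix
        x = torch.matmul(adj, x)                                 # fuse whole-graph information
        agent_feat = x.mean(dim=1)                               # stand-in for the observing agent's features
        oar_global = torch.cat([agent_feat, background], dim=-1) # append background features
        out, hT = self.gru(oar_global.unsqueeze(0), h0)          # fuse historical information
        logits = self.action_head(out.squeeze(0))
        return torch.softmax(logits, dim=-1), hT                 # action probability distribution
```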
Optionally, the attribute acquisition module is specifically configured to perform:

acquiring the dynamic attributes of each target observed by each agent;

acquiring the vector representation of the static attributes of each target;

mapping the dynamic attributes of each target observed by each agent into the vector space in which the vector representation of the static attributes is located, to obtain the vector representation of the dynamic attributes of each target observed by each agent; and

splicing the vector representation of the static attributes of each target with the vector representation of the dynamic attributes of each target observed by each agent to obtain the attribute set of each target observed by each agent.
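A minimal sketch of this attribute-set construction is given below, assuming an nn.Embedding for the static-attribute vectors and an nn.Linear mapping for the dynamic attributes; the dimensions (e.g. 22 players plus the ball) are illustrative only:

```python
import torch
import torch.nn as nn

num_targets, dyn_dim, emb_dim = 23, 4, 32            # e.g. 22 players + ball; position/velocity as dynamics

static_embed = nn.Embedding(num_targets, emb_dim)    # vector representation of static attributes
dyn_proj = nn.Linear(dyn_dim, emb_dim)               # map dynamic attributes into the same vector space

target_ids = torch.arange(num_targets)               # static category/identity information as indices
dyn_attrs = torch.randn(num_targets, dyn_dim)        # dynamic attributes observed by one agent

static_vec = static_embed(target_ids)                # (num_targets, emb_dim)
dyn_vec = dyn_proj(dyn_attrs)                        # (num_targets, emb_dim)
attr_set = torch.cat([static_vec, dyn_vec], dim=-1)  # attribute set of each target, (num_targets, 2*emb_dim)
```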
Optionally, the targets further include non-agents; the relationship acquisition module is specifically configured to perform:

acquiring a first relationship in which each non-agent points to an agent and a second relationship in which each agent points to other agents; and

calculating the distance corresponding to each first relationship and each second relationship, and screening the first relationships and second relationships according to these distances to obtain the relationships among the targets.
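The distance-based screening could, for example, keep only the k closest candidate relations for each target, as in the sketch below; the cutoff k and the use of Euclidean distance are assumptions, not details of the embodiment:

```python
import numpy as np

def relation_adjacency(positions, agent_ids, k=3):
    """Build a directed relation adjacency matrix by distance screening.

    positions: (N, 2) target positions; agent_ids: indices of controllable agents.
    Each source target keeps edges only to its k nearest agents, which covers
    both non-agent -> agent and agent -> other-agent relations.
    """
    n = len(positions)
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    adj = np.zeros((n, n))
    for src in range(n):
        candidates = [dst for dst in agent_ids if dst != src]
        nearest = sorted(candidates, key=lambda dst: dist[src, dst])[:k]
        for dst in nearest:
            adj[src, dst] = 1.0
    return adj
```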
Optionally, the appending module is specifically configured to perform:

retrieving, from the target attribute matrix, the features observed by each agent; and

splicing the features observed by each agent with the background features to obtain the OAR global features observed by each agent at each moment.
It should be noted that, since the device embodiment is similar to the method embodiment, its description is relatively brief; for relevant details, reference may be made to the description of the method embodiment.
In this specification, each embodiment is described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to one another.
It will be apparent to those skilled in the art that embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the disclosed embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present disclosure may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present disclosure are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices, and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present disclosure have been described, additional variations and modifications to these embodiments may occur to those skilled in the art once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the disclosed embodiments.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.
The auxiliary decision-making method and device based on the OAR model and reinforcement learning provided by the present disclosure have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present disclosure, and the description of the above embodiments is only intended to help understand the method of the present disclosure and its core ideas. Meanwhile, those of ordinary skill in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (10)

1. An auxiliary decision-making method based on an OAR model and reinforcement learning, characterized in that the method is applied to a football scene and a policy network, wherein the policy network includes a recurrent neural network and an action network corresponding to each type of player, the policy network is obtained through reinforcement learning, and the OAR model is a target-attribute-relation model; the method comprises:
setting a data buffer;

performing simulation and training on the policy network, wherein in each round of simulation performed on the policy network, the data buffer is emptied, the last-step RNN state of the policy network from the previous round is retained, the simulation is performed based on the last-step RNN state of the policy network from the previous round, and the data of the current step is stored into the data buffer; the data of the current step includes an environmental observation, and the environmental observation is obtained with reference to the absolute reference frame of the pitch;
acquiring the attribute set of each target observed by each player, wherein the targets include the players; the attributes of a target include dynamic attributes and static attributes, and the static attributes include at least any one or more of the following: category information, ranking information, and inherent capability;

performing graph reasoning calculation on the attribute set of each target observed by each player to obtain an attribute set matrix of each player;

acquiring the relationships among the targets, and obtaining a relation adjacency matrix according to the relationships among the targets;

performing reasoning calculation on the attribute set matrix of each player and the relation adjacency matrix to obtain a target attribute matrix fused with whole-graph information;

extracting background features from an environmental background, and appending the background features to the target attribute matrix to obtain the OAR global features observed by each player at each moment;

processing the OAR global features using the recurrent neural network to obtain, for each player, target OAR global features fused with historical information;

inputting the target OAR global features corresponding to each player into the action network corresponding to that player to obtain the action probability distribution of each player; and

determining the action of each player according to the action probability distribution of each player.
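For illustration, the per-round handling of the data buffer and the RNN state described in the preamble of claim 1 might look like the following sketch, assuming a gym-style environment interface, a list-like buffer, and a policy callable that returns action probabilities together with its updated RNN state; all names are hypothetical:

```python
import torch

def run_round(env, policy, rnn_state, buffer):
    """One simulation round: the buffer is cleared, while the RNN state from the
    end of the previous round is kept and used as the initial state."""
    buffer.clear()                          # empty the data buffer each round
    obs = env.reset()                       # environmental observation in the pitch's absolute frame
    done = False
    while not done:
        with torch.no_grad():
            action_probs, rnn_state = policy(obs, rnn_state)
        action = torch.distributions.Categorical(action_probs).sample()
        next_obs, reward, done, _ = env.step(action.item())
        buffer.append((obs, action, reward, done))   # store the data of the current step
        obs = next_obs
    return rnn_state                        # carried into the next round
```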
2. The method of claim 1, wherein said acquiring the attribute set of each target observed by each player comprises:

acquiring the dynamic attributes of each target observed by each player;

acquiring the vector representation of the static attributes of each target;

mapping the dynamic attributes of each target observed by each player into the vector space in which the vector representation of the static attributes is located, to obtain the vector representation of the dynamic attributes of each target observed by each player; and

splicing the vector representation of the static attributes of each target with the vector representation of the dynamic attributes of each target observed by each player to obtain the attribute set of each target observed by each player.
3. The method of claim 1, wherein the targets further comprise non-players; said acquiring the relationships among the targets comprises:

acquiring a first relationship in which each non-player points to a player and a second relationship in which each player points to other players; and

calculating the distance corresponding to each first relationship and each second relationship, and screening the first relationships and second relationships according to these distances to obtain the relationships among the targets.
4. The method of claim 1, wherein said appending the background features to the target attribute matrix to obtain the OAR global features observed by each player at each moment comprises:

retrieving, from the target attribute matrix, the features observed by each player; and

splicing the features observed by each player with the background features to obtain the OAR global features observed by each player at each moment.
5. The method of claim 1, wherein after said obtaining the OAR global features observed by each player at each moment, the method further comprises:

inputting the OAR global features observed by each player at each moment into a value network corresponding to the player to obtain the observation features of each player;

performing feature extraction on the observation features of each player using a recurrent neural network to obtain the environmental RNN features of each player; and

predicting the environmental value corresponding to each player according to the OAR global features observed by each player at each moment, wherein the environmental value is used to determine a reward baseline value, and the reward baseline value is used to perform reinforcement learning on the policy network.
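A simplified picture of such a value branch is sketched below: it consumes the OAR global features, extracts RNN features over time, and outputs one scalar environmental value per step. The concrete layer choices (linear projection, GRU, linear head) and the use of the RNN features for the value head are assumptions of this sketch, not details of the claim:

```python
import torch
import torch.nn as nn

class ValueNetSketch(nn.Module):
    """Illustrative critic: OAR global features -> observation features ->
    GRU over time -> scalar environmental value per player per step."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.obs_proj = nn.Linear(feat_dim, hidden_dim)   # observation features
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, oar_global, h0=None):
        # oar_global: (T, feat_dim) OAR global features of one player over T steps
        obs_feat = torch.relu(self.obs_proj(oar_global))
        rnn_feat, hT = self.gru(obs_feat.unsqueeze(0), h0)            # environmental RNN features
        values = self.value_head(rnn_feat.squeeze(0)).squeeze(-1)     # (T,) environmental values
        return values, hT
```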
6. The method of claim 5, wherein the method further comprises:

for each player, acquiring the environmental values corresponding to the player in a multi-step calculation;

obtaining a normalized value estimate according to the environmental values corresponding to the player in the multi-step calculation; and

performing calculation on the normalized value estimate using a generalized advantage estimation method to obtain the reward baseline value corresponding to each player.
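Generalized advantage estimation (GAE) over the value estimates could be computed as in the sketch below; the discount factor gamma, the GAE parameter lambda, and the assumption that values have already been de-normalized are illustrative choices, not values fixed by the embodiment:

```python
import numpy as np

def gae_baseline(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one player's trajectory.

    rewards: (T,) per-step rewards; values: (T+1,) value estimates (including
    the bootstrap value for the final state). Returns the advantages and the
    return targets that can serve as the reward baseline values.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    returns = adv + values[:-1]
    return adv, returns
```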
7. An auxiliary decision-making device based on an OAR model and reinforcement learning, characterized in that the device is applied to a policy network, wherein the policy network includes a recurrent neural network and an action network corresponding to each type of player, the policy network is obtained through reinforcement learning, and the OAR model is a target-attribute-relation model; the device comprises:

a module configured to set a data buffer;

a module configured to perform simulation and training on the policy network, wherein in each round of simulation performed on the policy network, the data buffer is emptied, the last-step RNN state of the policy network from the previous round is retained, the simulation is performed based on the last-step RNN state of the policy network from the previous round, and the data of the current step is stored into the data buffer; the data of the current step includes an environmental observation, and the environmental observation is obtained with reference to the absolute reference frame of the pitch;

an attribute acquisition module, configured to acquire the attribute set of each target observed by each player, wherein the targets include the players; the attributes of a target include dynamic attributes and static attributes, and the static attributes include at least any one or more of the following: category information, ranking information, and inherent capability;

a first calculation module, configured to perform graph reasoning calculation on the attribute set of each target observed by each player to obtain an attribute set matrix of each player;

a relationship acquisition module, configured to acquire the relationships among the targets and obtain a relation adjacency matrix according to the relationships among the targets;

a second calculation module, configured to perform reasoning calculation on the attribute set matrix of each player and the relation adjacency matrix to obtain a target attribute matrix fused with whole-graph information;

an appending module, configured to extract background features from the environment background and append the background features to the target attribute matrix to obtain the OAR global features observed by each player at each moment;

a processing module, configured to process the OAR global features using the recurrent neural network to obtain, for each player, target OAR global features fused with historical information;

an input module, configured to input the target OAR global features corresponding to each player into the action network corresponding to that player to obtain the action probability distribution of each player; and

an action determination module, configured to determine the action of each player according to the action probability distribution of each player.
8. The device of claim 7, wherein the attribute acquisition module is specifically configured to perform:

acquiring the dynamic attributes of each target observed by each player;

acquiring the vector representation of the static attributes of each target;

mapping the dynamic attributes of each target observed by each player into the vector space in which the vector representation of the static attributes is located, to obtain the vector representation of the dynamic attributes of each target observed by each player; and

splicing the vector representation of the static attributes of each target with the vector representation of the dynamic attributes of each target observed by each player to obtain the attribute set of each target observed by each player.
9. The device of claim 7, wherein the targets further comprise non-players; the relationship acquisition module is specifically configured to perform:

acquiring a first relationship in which each non-player points to a player and a second relationship in which each player points to other players; and

calculating the distance corresponding to each first relationship and each second relationship, and screening the first relationships and second relationships according to these distances to obtain the relationships among the targets.
10. The device of claim 7, wherein the appending module is specifically configured to perform:

retrieving, from the target attribute matrix, the features observed by each player; and

splicing the features observed by each player with the background features to obtain the OAR global features observed by each player at each moment.
CN202311824731.3A 2023-12-28 2023-12-28 Auxiliary decision making method and device based on OAR model and reinforcement learning Active CN117474077B (en)

Priority Applications (1)

Application Number: CN202311824731.3A (CN117474077B)
Priority Date: 2023-12-28
Filing Date: 2023-12-28
Title: Auxiliary decision making method and device based on OAR model and reinforcement learning


Publications (2)

Publication Number Publication Date
CN117474077A (en) 2024-01-30
CN117474077B (en) 2024-04-23

Family

ID=89624195


Country Status (1)

Country Link
CN (1) CN117474077B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11358004B2 (en) * 2019-05-10 2022-06-14 Duke University Systems and methods for radiation treatment planning based on a model of planning strategies knowledge including treatment planning states and actions

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178496A (en) * 2019-11-30 2020-05-19 浙江大学 Method for exchanging knowledge among agents under multi-agent reinforcement learning cooperative task scene
CN111967199A (en) * 2020-09-23 2020-11-20 浙江大学 Agent contribution distribution method under reinforcement learning multi-agent cooperation task
CN112084721A (en) * 2020-09-23 2020-12-15 浙江大学 Reward function modeling method under multi-agent reinforcement learning cooperative task

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mai Xu et al.; "Saliency Prediction on Omnidirectional Image With Generative Adversarial Imitation Learning"; IEEE; 2021-12-31; full text *


Similar Documents

Publication Publication Date Title
CN111291890B (en) Game strategy optimization method, system and storage medium
Zahavy et al. Learn what not to learn: Action elimination with deep reinforcement learning
CN112329948B (en) Multi-agent strategy prediction method and device
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN105637540A (en) Methods and apparatus for reinforcement learning
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
Busemeyer et al. Theoretical tools for understanding and aiding dynamic decision making
CN114896899B (en) Multi-agent distributed decision method and system based on information interaction
Knegt et al. Opponent modelling in the game of Tron using reinforcement learning
CN112052947A (en) Hierarchical reinforcement learning method and device based on strategy options
Mousavi et al. Applying q (λ)-learning in deep reinforcement learning to play atari games
CN113947022B (en) Near-end strategy optimization method based on model
CN117474077B (en) Auxiliary decision making method and device based on OAR model and reinforcement learning
JP2019079227A (en) State transition rule acquisition device, action selection learning device, action selection device, state transition rule acquisition method, action selection method, and program
CN115659054B (en) Game level recommendation method and device based on reinforcement learning
CN116245009A (en) Man-machine strategy generation method
CN116510302A (en) Analysis method and device for abnormal behavior of virtual object and electronic equipment
US20200364555A1 (en) Machine learning system
CN115964898A (en) Bignty game confrontation-oriented BC-QMIX on-line multi-agent behavior decision modeling method
Guo Deep learning and reward design for reinforcement learning
Sapora et al. EvIL: Evolution Strategies for Generalisable Imitation Learning
Paduraru et al. Using Deep Reinforcement Learning to Build Intelligent Tutoring Systems.
Ju et al. Moor: Model-based offline reinforcement learning for sustainable fishery management
CN112215333B (en) Multi-agent collaborative exploration method and device based on low-order Gaussian distribution
CN117556681B (en) Intelligent air combat decision method, system and electronic equipment

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant