CN116966573A - Interaction model processing method, device, computer equipment and storage medium - Google Patents

Interaction model processing method, device, computer equipment and storage medium

Info

Publication number
CN116966573A
CN116966573A
Authority
CN
China
Prior art keywords
interaction
model
benefits
data
virtual object
Prior art date
Legal status
Pending
Application number
CN202310070091.5A
Other languages
Chinese (zh)
Inventor
杨阳
邱福浩
付强
文荟俨
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310070091.5A
Publication of CN116966573A
Legal status: Pending


Classifications

    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 - Controlling game characters or game objects based on the game progress
    • A63F13/56 - Computing the motion of game characters with respect to other game characters, game objects or elements of the game scene, e.g. for simulating the behaviour of a group of virtual soldiers or for path finding
    • A63F13/50 - Controlling the output signals based on the game progress
    • A63F13/52 - Controlling the output signals based on the game progress involving aspects of the displayed game scene
    • A63F2300/00 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/30 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by output arrangements for receiving control signals generated by the game device
    • A63F2300/308 - Details of the user interface

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to an interaction model processing method and apparatus, a computer device, a storage medium and a computer program product. The method includes the following steps: acquiring state features of a virtual interaction scene where a virtual object is located; inputting the state features into a movement policy model to obtain a target position to which the virtual object is to move from its current position; inputting the state features and the target position into an interaction model to be trained for interactive operation mapping, to obtain an interaction action to be executed by the virtual object at its current position; acquiring an interaction benefit obtained by the virtual object executing the interaction action, and acquiring a movement guidance benefit obtained when the virtual object moves from its current position to the target position; and updating the interaction model to be trained based on the state features, the target position, the interaction action, the interaction benefit and the movement guidance benefit, and continuing training until a trained interaction model is obtained. The method can improve the interaction capability of the interaction model.

Description

Interaction model processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to an interaction model processing method, an interaction model processing apparatus, a computer device, a storage medium, and a computer program product.
Background
With the continuous development of computer technology, games have become a form of entertainment and interaction for more and more people. In a multiplayer online battle arena (MOBA) game, for example, users control virtual objects to compete against each other in a virtual scene provided by a computer; in a first-person shooter (FPS) game, users engage in shooting combat primarily from a first-person perspective. When two players play against each other, both sides are user players; in man-machine combat or game hosting, however, an artificial intelligence model is needed to carry out the countermeasure interaction, for example to automatically control the virtual object corresponding to a computer player or to a hosted account.
At present, most artificial intelligence models for game countermeasure interaction evolve iteratively through continuous adversarial training, and their interaction strategies and behaviors in game countermeasures easily become monotonous, so that the countermeasure interaction capability of such models in games is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an interaction model processing method, apparatus, computer device, computer readable storage medium, and computer program product that can improve the interaction capability of an interaction model.
In a first aspect, the present application provides an interaction model processing method. The method includes the following steps:
acquiring state features of a virtual interaction scene where a virtual object is located;
inputting the state features into a movement policy model to obtain a target position to which the virtual object is to move from its current position; the movement policy model is trained based on historical interaction data obtained from interactions in the virtual interaction scene;
inputting the state features and the target position into an interaction model to be trained for interactive operation mapping, to obtain an interaction action to be executed by the virtual object at its current position;
acquiring an interaction benefit obtained by the virtual object executing the interaction action, and acquiring a movement guidance benefit obtained when the virtual object moves from its current position to the target position;
updating the interaction model to be trained based on the state features, the target position, the interaction action, the interaction benefit and the movement guidance benefit, and continuing training until a trained interaction model is obtained.
In a second aspect, the present application further provides an interaction model processing apparatus. The apparatus includes:
a state feature acquisition module, configured to acquire state features of a virtual interaction scene where a virtual object is located;
a target position obtaining module, configured to input the state features into a movement policy model to obtain a target position to which the virtual object is to move from its current position, the movement policy model being trained based on historical interaction data obtained from interactions in the virtual interaction scene;
an interaction action obtaining module, configured to input the state features and the target position into an interaction model to be trained for interactive operation mapping, to obtain an interaction action to be executed by the virtual object at its current position;
a benefit obtaining module, configured to acquire an interaction benefit obtained by the virtual object executing the interaction action, and to acquire a movement guidance benefit obtained when the virtual object moves from its current position to the target position;
and a model updating module, configured to update the interaction model to be trained based on the state features, the target position, the interaction action, the interaction benefit and the movement guidance benefit, and to continue training until a trained interaction model is obtained.
In a third aspect, the present application further provides a computer device. The computer device includes a memory storing a computer program and a processor which, when executing the computer program, performs the following steps:
acquiring state features of a virtual interaction scene where a virtual object is located;
inputting the state features into a movement policy model to obtain a target position to which the virtual object is to move from its current position; the movement policy model is trained based on historical interaction data obtained from interactions in the virtual interaction scene;
inputting the state features and the target position into an interaction model to be trained for interactive operation mapping, to obtain an interaction action to be executed by the virtual object at its current position;
acquiring an interaction benefit obtained by the virtual object executing the interaction action, and acquiring a movement guidance benefit obtained when the virtual object moves from its current position to the target position;
updating the interaction model to be trained based on the state features, the target position, the interaction action, the interaction benefit and the movement guidance benefit, and continuing training until a trained interaction model is obtained.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, performs the following steps:
acquiring state features of a virtual interaction scene where a virtual object is located;
inputting the state features into a movement policy model to obtain a target position to which the virtual object is to move from its current position; the movement policy model is trained based on historical interaction data obtained from interactions in the virtual interaction scene;
inputting the state features and the target position into an interaction model to be trained for interactive operation mapping, to obtain an interaction action to be executed by the virtual object at its current position;
acquiring an interaction benefit obtained by the virtual object executing the interaction action, and acquiring a movement guidance benefit obtained when the virtual object moves from its current position to the target position;
updating the interaction model to be trained based on the state features, the target position, the interaction action, the interaction benefit and the movement guidance benefit, and continuing training until a trained interaction model is obtained.
In a fifth aspect, the present application further provides a computer program product. The computer program product includes a computer program which, when executed by a processor, implements the following steps:
acquiring state features of a virtual interaction scene where a virtual object is located;
inputting the state features into a movement policy model to obtain a target position to which the virtual object is to move from its current position; the movement policy model is trained based on historical interaction data obtained from interactions in the virtual interaction scene;
inputting the state features and the target position into an interaction model to be trained for interactive operation mapping, to obtain an interaction action to be executed by the virtual object at its current position;
acquiring an interaction benefit obtained by the virtual object executing the interaction action, and acquiring a movement guidance benefit obtained when the virtual object moves from its current position to the target position;
updating the interaction model to be trained based on the state features, the target position, the interaction action, the interaction benefit and the movement guidance benefit, and continuing training until a trained interaction model is obtained.
According to the interaction model processing method and apparatus, computer device, storage medium and computer program product, the state features of the virtual interaction scene where the virtual object is located are input into a movement policy model trained on historical interaction data, the resulting target position and the state features are input into the interaction model to be trained for interactive operation mapping to obtain the interaction action to be executed by the virtual object, and the interaction model to be trained is updated and then further trained, based on the interaction benefit obtained by the virtual object executing the interaction action, the movement guidance benefit obtained when the virtual object moves from its current position to the target position, the state features, the target position and the interaction action, until a trained interaction model is obtained. Because the target position to which the virtual object is to move is output by a movement policy model trained on historical interaction data, and the interaction action to be executed is obtained by interactive operation mapping based on the state features and that target position, the historical interaction data can be used to effectively control the movement of the virtual object and to guide the interaction model to learn diversified interaction strategies during training, so that the interaction capability of the interaction model can be effectively improved.
Drawings
FIG. 1 is an application environment diagram of an interaction model processing method in one embodiment;
FIG. 2 is a flow diagram of an interaction model processing method in one embodiment;
FIG. 3 is a schematic diagram of an interface for displaying status of a game screen according to one embodiment;
FIG. 4 is a schematic diagram of determining a target location in one embodiment;
FIG. 5 is a schematic diagram of determining mobile guidance benefits in one embodiment;
FIG. 6 is a flow diagram of acquiring status features in one embodiment;
FIG. 7 is a schematic diagram of a game interface for the latent player camp in one embodiment;
FIG. 8 is a schematic diagram of a game interface for the defender camp in one embodiment;
FIG. 9 is a schematic diagram of a game interface when the defender camp fails in one embodiment;
FIG. 10 is a schematic diagram of the architecture of upper and lower models in one embodiment;
FIG. 11 is a schematic diagram of a reinforcement learning training framework in one embodiment;
FIG. 12 is a block diagram of an interaction model processing device in one embodiment;
FIG. 13 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The interaction model processing method provided by the embodiments of the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process, and may be integrated on the server 104 or placed on the cloud or another server. The server 104 inputs the state features of the virtual interaction scene where the virtual object is located into a movement policy model trained on historical interaction data, inputs the obtained target position together with the state features into the interaction model to be trained for interactive operation mapping to obtain the interaction action to be executed by the virtual object, and then, based on the interaction benefit obtained by the virtual object executing the interaction action, the movement guidance benefit obtained when the virtual object moves from its current position to the target position, the state features, the target position and the interaction action, updates the interaction model to be trained and continues training until a trained interaction model is obtained. When a user plays through the terminal 102, the user controls a first virtual object in the virtual interaction scene to carry out countermeasure interaction via the terminal 102, while the server 104 controls a second virtual object in the virtual interaction scene through the interaction model, thereby realizing man-machine countermeasure between the user player and a computer player. In addition, the server 104 may control at least one virtual object through the interaction model to play in the virtual interaction scene, thereby realizing game countermeasures between computer players. In a specific application, the interaction model processing method may be implemented by the terminal 102 alone, or jointly by a system formed by the terminal 102 and the server 104.
The terminal 102 may be, but is not limited to, a desktop computer, a notebook computer, a smart phone, a tablet computer, an Internet of Things device or a portable wearable device; the Internet of Things device may be a smart speaker, a smart television, a smart air conditioner, a smart vehicle-mounted device, or the like, and the portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, an interaction model processing method is provided. The method is executed by a computer device; specifically, it may be executed by a terminal or a server alone, or by the terminal and the server together. In the embodiments of the present application, the method is described as applied to the server in FIG. 1, and includes the following steps:
step 202, obtaining state characteristics of a virtual interaction scene where the virtual object is located.
A virtual object is an active entity that is controlled to interact, and may be controlled by an intelligent system or by a user through a computer device. In different application scenarios, virtual objects take different forms to implement different interactive operations. In a game application, for example, the virtual object may be a game character participating in the match; it may be three-dimensional or two-dimensional, and may be a human or animal character. For instance, the virtual object may be a hero character or a soldier in a MOBA game, or a character belonging to one of the camps in an FPS game. In a competitive game, virtual objects are often divided into different camps and carry out countermeasure interaction according to their camp identities, competing for the victory of the camp they belong to. For example, a MOBA game may be divided into a red camp and a blue camp, each containing different virtual objects, and the virtual objects of the two camps play against each other; in an FPS game, the two camps may be latent players and defenders, whose virtual objects have different capabilities and must complete different tasks to play against each other.
The virtual interaction scene is the environment in which virtual objects interact, and may be a two-dimensional or three-dimensional interaction environment. For example, the virtual interaction environment may be displayed when the computer device runs an application in which virtual objects interact. Specifically, when a terminal device runs a game application, the terminal may display a game screen showing the environment where the in-game virtual objects are located, and game players can carry out game countermeasures in this virtual interaction scene. Different applications may construct different virtual interaction scenes, so that different virtual objects interact in their corresponding scenes.
The state features represent the corresponding state and can be obtained by feature extraction on interaction-related data. The interaction-related data may be data in the virtual interaction scene that is related to the interaction of the virtual object, including but not limited to environment perception data of the virtual object in the virtual interaction scene, situation data of the virtual interaction scene, and inter-object interaction data of the virtual object. The environment perception data is the data that the virtual object can perceive in the virtual interaction scene; for example, it may include terrain information of the virtual interaction scene, such as the distribution of buildings and of virtual objects. The situation data may include attribute data of the virtual object, such as one or more of the class of the virtual object, its equipment, a vital value attribute such as the blood volume of a hero in a game, its skill information, or its attack power. The inter-object interaction data may include data generated by interactions between different virtual objects, such as the distance and angle between virtual objects, or the skills and items applied between them.
Specifically, when the interaction model is trained, the interaction model is used to control a virtual object in the virtual interaction scene to interact. An interaction application can therefore be run on the computer device to construct a virtual interaction scene containing the virtual object, and the server can acquire the state features of the virtual interaction scene where the virtual object is located, so as to control the virtual object based on those state features. In a specific application, a player in a virtual interaction scene often triggers specific controls on a virtual object according to the state of the scene, for example controls for the virtual object's skills, movement, defense, or equipment switching. When the interaction model is trained, the server can capture an image of the interaction screen of the virtual interaction scene on the computer device, treat that screen image as the state data of the virtual interaction scene where the virtual object is located, and perform feature extraction on the state data, that is, extract the state features from the interaction screen image.
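As an illustration of this step, the following Python sketch shows one possible way (an assumption, not taken from the patent) of turning a captured interaction screen image plus a few scalar situation values into a state feature vector; the field names and downsampling factor are hypothetical.

```python
# A minimal sketch (not the patent's implementation) of assembling state features
# from a captured screen frame plus scalar situation data. All names are hypothetical.
import numpy as np

def extract_state_features(frame: np.ndarray, situation: dict) -> np.ndarray:
    """Flatten a downsampled screen capture and append scalar situation data."""
    # Downsample the captured interaction screen image (H, W, 3) to a coarse grayscale grid.
    small = frame[::8, ::8].mean(axis=2) / 255.0            # environment perception data
    scalars = np.array([
        situation.get("health", 0.0),                        # vital value attribute (e.g. blood volume)
        situation.get("enemy_distance", 0.0),                # inter-object interaction data
        situation.get("match_time", 0.0),                    # situation data of the scene
    ], dtype=np.float32)
    return np.concatenate([small.ravel().astype(np.float32), scalars])
```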
In one specific application, as shown in FIG. 3, when a terminal device runs a client of an FPS combat game, the terminal interface may display a game screen 30 of the combat game. The game screen 30 includes a map display area 302 for displaying a mini-map, a situation data area 304 for displaying situation data of the match, and a status display area 306 for displaying the virtual object. The map display area 302 shows the mini-map situation around the latent players' base; the situation data area 304 shows that the two camps are the latent player camp and the defender camp, that the match time is 1 minute 30 seconds, and that the score is 0:1; the status display area 306 shows the state of the virtual object controlled by the player, for example the bomb held in its left hand and the equipment held in its right hand, and may also display description information of the virtual object. For this game screen 30, the virtual object can be considered to be in a certain state, characterized by its state features, in which corresponding operations need to be performed to win the match.
Step 204, inputting the state features into a movement policy model to obtain a target position to which the virtual object is to move from its current position; the movement policy model is trained based on historical interaction data obtained from interactions in the virtual interaction scene.
The movement policy model is a pre-trained network model that can process the input state features and output the target position to which the virtual object is to move from its current position. The movement policy model may be implemented based on artificial intelligence (AI) technology. Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning. Machine learning (ML) is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
The movement policy model can be pre-trained by machine learning, specifically based on historical interaction data obtained from interactions in the virtual interaction scene. The historical interaction data is recorded historical data, and can be obtained by recording the process in which user players control virtual objects to interact in the virtual interaction scene. The historical interaction data may include interaction data generated by real user players controlling virtual objects; for a game application, for example, it may be generated by user players controlling virtual objects to play against each other. The target position is the next position to which the virtual object needs to be controlled to move from its current position. For example, if the current position of the virtual object is point A in the virtual interaction scene, the target position may be point B, that is, the virtual object needs to be controlled to move from point A to point B, thereby realizing control of the virtual object at the movement level.
Specifically, the server may obtain a pre-trained movement policy model, which is trained based on historical interaction data obtained from interactions in the virtual interaction scene and which can process the input state features and output the position point to which the virtual object needs to be moved under those state features. The server can input the obtained state features into the movement policy model, which outputs the target position to which the virtual object is to move from its current position, thereby realizing control of the virtual object's movement strategy.
In one specific application, as shown in FIG. 4, the map of the virtual interaction scene may be divided into different map blocks, where the black-filled map blocks represent unreachable areas, such as buildings or background elements that virtual objects cannot reach. If the virtual object is currently located in map block A and the target position output by the movement policy model based on the state features is map block B, this indicates that, according to the user movement strategy learned by the movement policy model through training, the virtual object should be controlled to move from map block A to map block B under these state features.
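The following sketch illustrates the idea of a map divided into blocks with unreachable blocks masked out; it is only an assumed illustration of how a target block might be selected, with the per-block scores standing in for the movement policy model's output for the current state features.

```python
# Illustrative sketch only: a grid of map blocks with unreachable blocks masked,
# and the highest-scoring reachable block chosen as the target position.
import numpy as np

GRID = np.array([
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])  # 1 = unreachable (e.g. buildings), 0 = walkable map block

def choose_target_block(block_scores: np.ndarray) -> tuple:
    """block_scores stands in for the movement policy model's per-block output."""
    masked = np.where(GRID == 1, -np.inf, block_scores)   # never pick an unreachable block
    row, col = np.unravel_index(np.argmax(masked), GRID.shape)
    return int(row), int(col)

# e.g. choose_target_block(np.random.rand(*GRID.shape)) -> (row, col) of map block B
```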
In a specific application, the movement policy model is trained based on the historical interaction data of individual users, and different users have different interaction levels, so the users' interaction operation levels can be divided, for example into 5 tiers, and for each tier a corresponding movement policy model can be trained using the historical interaction data generated by users of that tier. When determining the target position, the server can determine the interaction operation level that currently needs to be simulated and input the state features into the movement policy model of that level, thereby obtaining the target position to which the virtual object is to move from its current position. By constructing a corresponding movement policy model for each interaction operation level, users of different levels can be simulated through the interaction model, which improves the interaction experience of users at different levels. For example, if user A has a high interaction operation level and can perform various high-difficulty operations, then when user A plays against the computer, the target position can be determined by a movement policy model of a high interaction operation level, so that the target position matches user A's interaction operation level and user A's man-machine interaction experience is ensured.
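A minimal sketch of the tiered setup described above, assuming five interaction operation levels; the dummy policy factory merely stands in for models trained on the historical data of players at each level.

```python
# Hypothetical sketch: one movement policy per interaction operation level,
# selected to match the skill level currently being simulated.
from typing import Callable, Dict
import numpy as np

MovePolicy = Callable[[np.ndarray], tuple]

def make_dummy_policy(level: int) -> MovePolicy:
    # Stand-in for a model trained on historical data of players at this level.
    return lambda state_features: (level, level)   # returns a (row, col) map block

movement_policy_models: Dict[int, MovePolicy] = {lvl: make_dummy_policy(lvl) for lvl in range(1, 6)}

def predict_target_position(state_features: np.ndarray, level: int) -> tuple:
    return movement_policy_models[level](state_features)
```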
Step 206, inputting the state features and the target position into the interaction model to be trained for interactive operation mapping, to obtain the interaction action to be executed by the virtual object at its current position.
The interaction model to be trained is an interaction model that has not yet completed training; it performs interactive operation mapping on the input data and outputs the interaction action to be executed by the virtual object at its current position. An interaction action is an action performed by a virtual object when interacting, which may act on the virtual object itself or on other virtual objects, such as enemy heroes. Interaction actions may include, but are not limited to, movement, attack or evasion actions, applying skills, changing or switching equipment, jumping, squatting, or deploying a device.
Specifically, the server can obtain the interaction model to be trained, which can be constructed with a machine learning algorithm chosen according to actual needs. The server inputs the state features and the target position into the interaction model to be trained, so that interactive operation mapping is performed within the model, and the interaction action to be executed by the virtual object at its current position is output.
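As a hedged illustration, the following PyTorch sketch shows one possible shape of such an interaction model: the state features and the target position are concatenated and mapped to logits over interaction actions. The layer sizes and the two-dimensional encoding of the target position are assumptions, not the patent's architecture.

```python
# A minimal PyTorch sketch of interactive operation mapping: state features and
# target position in, a sampled interaction action out. Dimensions are illustrative.
import torch
import torch.nn as nn

class InteractionModel(nn.Module):
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, 256), nn.ReLU(),   # +2 for a (row, col) target position
            nn.Linear(256, num_actions),
        )

    def forward(self, state_features: torch.Tensor, target_pos: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state_features, target_pos], dim=-1)
        return self.net(x)                               # logits over interaction actions

model = InteractionModel(state_dim=128, num_actions=16)
logits = model(torch.zeros(1, 128), torch.zeros(1, 2))
action = torch.distributions.Categorical(logits=logits).sample()
```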
Step 208, acquiring the interaction benefit obtained by the virtual object executing the interaction action, and acquiring the movement guidance benefit obtained when the virtual object moves from its current position to the target position.
A benefit (reward) is feedback on the gain of executing a corresponding event in a corresponding state; it can be positive or negative. In other words, the benefit evaluates the effect of executing the corresponding event, or how good executing that event is for the interaction, and it is the environment's feedback to the virtual object for executing the event. Different events executed by the virtual object may correspond to different benefits. The interaction benefit is the benefit obtained by the virtual object executing the interaction action, so it evaluates the effect of executing that action; the movement guidance benefit is the benefit obtained when the virtual object moves from its current position to the target position, so it evaluates how well the virtual object moves toward the target position.
The calculation of the interaction benefit and the movement guidance benefit can be set flexibly according to actual needs. The interaction benefit, for example, can be configured according to the state change of the virtual interaction environment after the virtual object executes the interaction action, that is, the change between the state before and after the action is executed. The state change may include, for example, one or more of a change in money, a change in blood volume, a change in life state, a change in game outcome, skill consumption, or item consumption. Benefit weights can be set as needed for the various state changes, that is, different types of state changes can correspond to different benefits, and the interaction benefit obtained by executing the interaction action can be obtained as the sum of the benefits of the various state changes produced by the action. For example, in an FPS game, if the virtual object successfully defeats an enemy virtual object after executing an interaction action, the benefit obtained may be +10; if the virtual object successfully deploys a device after executing the interaction action, the benefit may be +60. The movement guidance benefit may be determined according to the distance difference between the position the virtual object has moved to and the target position: the larger the distance difference, that is, the farther the virtual object is from the target position, the smaller the movement guidance benefit, so the movement guidance benefit is negatively correlated with the distance difference.
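The following sketch illustrates summing weighted state changes into an interaction benefit; the weight values are example figures echoing the ones above and are not prescribed by the patent.

```python
# Hedged illustration of mapping state changes to an interaction benefit with
# per-change weights; weights and change names are example assumptions.
STATE_CHANGE_WEIGHTS = {
    "enemy_defeated": +10.0,
    "device_deployed": +60.0,
    "blood_volume_delta": 0.1,    # per point of health gained (negative if lost)
    "skill_consumed": -0.5,
}

def interaction_benefit(state_changes: dict) -> float:
    """Sum the weighted benefits of every state change caused by the interaction action."""
    return sum(STATE_CHANGE_WEIGHTS.get(name, 0.0) * value
               for name, value in state_changes.items())

# e.g. interaction_benefit({"enemy_defeated": 1, "blood_volume_delta": -20})
```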
Specifically, the server can control the virtual object to execute the interaction action and obtain the interaction benefit fed back by the virtual interaction scene, so that the effect of the virtual object executing the interaction action can be accurately evaluated. The server may also control the virtual object to move from its current position to the target position, and, when a benefit determination is triggered, determine the position the virtual object has reached and compute the movement guidance benefit based on that position and the target position.
Step 210, updating the interaction model to be trained based on the state features, the target position, the interaction action, the interaction benefit and the movement guidance benefit, and continuing training until a trained interaction model is obtained.
Specifically, the server may update the interaction model to be trained based on the obtained state features, target position, interaction action, interaction benefit and movement guidance benefit; in particular, the server may determine adjustment parameters from these quantities and update the model parameters of the interaction model to be trained based on the adjustment parameters. The way the model parameters are updated can be chosen according to actual needs; for example, the model may be updated with a proximal policy optimization (PPO) algorithm, an A3C (Asynchronous Advantage Actor-Critic) algorithm, a DDPG (Deep Deterministic Policy Gradient) algorithm, or the like. After the interaction model to be trained has been updated, training continues with the updated model until a training end condition is met, for example when a preset number of training iterations is reached or a convergence condition is satisfied, and the trained interaction model is obtained. The trained interaction model can perform interactive operation mapping on the input state features and target position and output the interaction action that the virtual object needs to execute.
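Putting the steps together, a high-level training-loop sketch might look as follows; the environment interface, the two models and the ppo_update function are all stand-ins supplied by the caller, not the patent's implementation.

```python
# High-level sketch of the training loop described above; every callable here is
# a stand-in (environment, models and PPO update are assumptions, not patent code).
def train_interaction_model(env, movement_policy, interaction_model, ppo_update,
                            num_iterations: int = 1000) -> None:
    for _ in range(num_iterations):
        trajectory = []
        state_features = env.reset()
        done = False
        while not done:
            target_pos = movement_policy(state_features)                  # upper-level movement decision
            action = interaction_model.act(state_features, target_pos)    # interactive operation mapping
            next_features, interaction_benefit, guidance_benefit, done = env.step(action, target_pos)
            reward = interaction_benefit + guidance_benefit                # combined training signal
            trajectory.append((state_features, target_pos, action, reward))
            state_features = next_features
        ppo_update(interaction_model, trajectory)                          # e.g. PPO-style policy update
```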
In a specific application, the interaction model can be applied to a combat game. The server can obtain in advance the historical interaction data of user players during matches of the combat game and train a movement policy model based on this historical interaction data, so that the movement strategies of real user players are learned by the movement policy model. When the computer device runs the combat game, it constructs a virtual interaction scene containing the virtual object that needs to be controlled for the match. The server can acquire the state data of the virtual interaction scene where that virtual object is located and perform feature extraction on the state data to obtain the state features. The server inputs the state features into the pre-trained movement policy model, which outputs the target position to which the virtual object is to move from its current position, that is, the next position point to which the virtual object should be controlled to move under these state features according to the movement strategy of real user players. The server then inputs the state features and the target position into the interaction model to be trained for interactive operation mapping, and the interaction model to be trained outputs the interaction action to be executed by the virtual object at its current position. The server may control the virtual object to execute the interaction action and to move from its current position to the target position, and obtain the interaction benefit from executing the interaction action as well as the movement guidance benefit from moving from the current position to the target position. The server updates the interaction model to be trained based on the state features, the target position, the interaction action, the interaction benefit and the movement guidance benefit, and continues training until a trained interaction model is obtained. In this embodiment, the movement policy model is trained on the historical interaction data of real user players while interactive operation mapping is performed by the interaction model, so that the movement strategy and the determination of interaction actions are decoupled; the movement strategy of real user players can be learned effectively, and performing interactive operation mapping under that movement strategy guides the interaction model to learn various interaction strategies, thereby improving the interaction capability of the interaction model.
According to the interaction model processing method above, the state features of the virtual interaction scene where the virtual object is located are input into a movement policy model trained on historical interaction data, the obtained target position and the state features are input into the interaction model to be trained for interactive operation mapping to obtain the interaction action to be executed by the virtual object, and the interaction model to be trained is updated and then further trained, based on the interaction benefit obtained by the virtual object executing the interaction action, the movement guidance benefit obtained when the virtual object moves from its current position to the target position, the state features, the target position and the interaction action, until a trained interaction model is obtained. Because the target position is output by a movement policy model trained on historical interaction data, and the interaction action to be executed is obtained by interactive operation mapping based on the state features and that target position, the historical interaction data can be used to effectively control the movement of the virtual object and to guide the interaction model to learn diversified interaction strategies during training, so that the interaction capability of the interaction model can be effectively improved.
In one embodiment, the interaction model processing method further includes: controlling the virtual object to move from its current position to the target position. Obtaining the movement guidance benefit obtained when the virtual object moves from its current position to the target position includes: when a movement determination condition is satisfied, determining the intermediate position reached by the virtual object while moving from its current position to the target position; determining the distance difference between the intermediate position and the target position; and obtaining the movement guidance benefit by mapping the distance difference.
The movement determination condition is used to decide whether to trigger the determination of the benefit of the controlled movement of the virtual object; specifically, it may be that a movement determination period has elapsed. For example, the movement determination period may be set to 3 seconds, meaning the period elapses every 3 seconds, and each time it elapses the movement determination condition is considered satisfied and the benefit determination may be triggered. The movement determination condition may also be that the movement policy model is detected to have output a new target position, that is, the movement policy model has updated the target position; in that case the movement determination condition may likewise be considered satisfied. The intermediate position is the position point at which the virtual object is located at the moment the movement determination condition is satisfied; since the virtual object is moving from its current position to the target position, the intermediate position lies between the two. The distance difference is the distance between the intermediate position and the target position, and it represents how far the virtual object still is from the target position.
Specifically, the server controls the virtual object to move from its current position to the target position; in particular, the server may generate a movement control instruction according to the target position and send it to the virtual object, which is then controlled to move from its current position to the target position. The movement control instruction may include the position information of the target position, so that the virtual object can navigate to the target position based on that position; it may also include a planned movement path from the current position to the target position, in which case the virtual object can move to the target position directly along the planned path. The server monitors whether the movement determination condition is satisfied, for example whether the movement determination period has elapsed or whether the movement policy model has output a new target position. If the condition is satisfied, indicating that the movement guidance benefit needs to be determined, the server determines the intermediate position reached by the virtual object while moving toward the target position, that is, the position where the virtual object is located at the moment the condition is satisfied. The server compares the intermediate position with the target position to obtain the distance difference between them, and obtains the movement guidance benefit by mapping this distance difference. For example, the server may obtain a preset movement benefit mapping relationship and map the determined distance difference through it to obtain the movement guidance benefit for the virtual object moving from its current position to the target position. In a specific application, the movement guidance benefit is negatively correlated with the distance difference: the larger the distance difference, meaning the farther the virtual object is from the target position, the smaller the movement guidance benefit, indicating a poorer movement effect, that is, the virtual object has not completed the movement to the target position. Updating the interaction model with the movement guidance benefit therefore guides the interaction model to complete the movement strategy given by the movement policy model, thereby learning the user's movement strategy.
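A minimal sketch of the distance-difference mapping described above, assuming map-block coordinates and a simple linear penalty; the scale factor is an arbitrary example, not a value from the patent.

```python
# Assumed form of the movement guidance benefit: it decreases as the intermediate
# position gets farther from the target position (negative correlation).
def movement_guidance_benefit(intermediate: tuple, target: tuple, scale: float = 0.1) -> float:
    distance_diff = abs(intermediate[0] - target[0]) + abs(intermediate[1] - target[1])  # block distance
    return -scale * distance_diff   # farther from the target block, smaller the benefit

# e.g. movement_guidance_benefit((2, 1), (0, 3)) -> -0.4
```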
In one specific application, as shown in FIG. 5, the map of the virtual interaction scene may be divided into different map blocks, where the black-filled map blocks represent unreachable areas, such as buildings or background elements that virtual objects cannot reach. If the virtual object is currently located in map block A and the target position output by the movement policy model based on the state features is map block B, the virtual object should be controlled to move from map block A to map block B according to the learned user movement strategy. When the virtual object moves from map block A toward map block B along the path of dot-filled blocks, if the movement determination condition is satisfied while the virtual object has moved to map block C, then map block C is the intermediate position reached by the virtual object, and the server can determine the movement guidance benefit from the distance difference between map block C and map block B.
In this embodiment, the server controls the virtual object to move toward the target position, determines the intermediate position it has reached when the movement determination condition is satisfied, and maps the distance difference between the intermediate position and the target position to a movement guidance benefit. This guides the interaction model to control the virtual object to complete the movement decision output by the movement policy model, thereby learning the user movement strategies reflected in the historical interaction data; the historical interaction data is thus used to control the movement of the virtual object and to guide the interaction model to learn diversified interaction strategies during training, which effectively improves the interaction capability of the interaction model.
In one embodiment, controlling the virtual object to move from its current position to the target position includes: determining a movement path of the virtual object according to the target position and its current position; performing feature extraction on the movement path to obtain path features of the movement path; and controlling the virtual object to move from its current position to the target position according to the path features.
The movement path is the path planned for the virtual object to move from its current position to the target position; different virtual objects in different virtual interaction scenes, with different current and target positions, may have different movement paths. The path features are obtained by feature extraction on the movement path and characterize it; they may include various path parameters such as path category, path direction and distance. The path category may refer to the type of terrain along the movement path, such as rivers or mountains, and virtual objects may have different movement parameters, such as different movement times, on different terrain. The path direction may be the direction of the movement path relative to the virtual object; given a movement distance N, the corresponding target position can be determined accurately by combining it with the path direction, and the movement distance can be determined according to how the scene map of the virtual interaction scene is divided, as actual needs require.
Specifically, the server may perform path planning for the virtual object according to the target position and its current position to obtain the movement path. In one specific application, as shown in FIG. 5, the dot-filled map blocks connected by solid black arrows form the movement path of the virtual object from map block A to map block B. The server may perform feature extraction on the movement path to extract its various path parameters and obtain the path features, which describe the movement path and may include, but are not limited to, parameters such as path category, path direction and distance. The server may then control the virtual object to move from its current position to the target position according to the path features. For example, for different path categories the virtual object may use different movement modes, such as sprinting, walking or crawling. Furthermore, movement may take different amounts of time on different categories of path: on flat terrain the movement takes less time, that is, the virtual object can move faster; on terrain such as mountains or ice and snow the movement may take longer, that is, the virtual object may move more slowly.
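The following sketch illustrates path features of the kind described above, as per-segment (terrain category, direction, distance) tuples with terrain-dependent time costs; the categories and speeds are example assumptions rather than values from the patent.

```python
# Illustrative only: path features per segment, with terrain-dependent movement speed.
TERRAIN_SPEED = {"flat": 5.0, "mountain": 2.0, "snow": 1.5, "river": 1.0}   # blocks per second

def path_features(segments):
    """segments: list of (terrain, direction, distance) tuples describing the planned path."""
    return [
        {"terrain": terrain, "direction": direction, "distance": distance,
         "time_cost": distance / TERRAIN_SPEED[terrain]}
        for terrain, direction, distance in segments
    ]

# e.g. path_features([("flat", "north", 3), ("mountain", "east", 2)])
```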
In this embodiment, the server performs feature extraction on the movement path between the current position and the target position and controls the virtual object to move to the target position according to the extracted path features, so the virtual object can be moved to the target position accurately using the path features without having to plan the path itself, ensuring both the accuracy of movement control and the responsiveness of the movement.
In one embodiment, the interaction model processing method further includes: controlling the virtual object to execute the interaction action. Obtaining the interaction benefit obtained by the virtual object executing the interaction action includes: obtaining a local benefit and a global benefit obtained by the virtual object executing the interaction action; and obtaining the interaction benefit from the local benefit and the global benefit.
The local benefit characterizes the local effect of the virtual object's interaction action on the interaction situation in the virtual interaction scene, while the global benefit characterizes its overall effect on the interaction situation. For example, in a combat game application, after the virtual object executes an interaction action, an opposing virtual object may be damaged or knocked out, or the virtual object may restore blood volume or obtain equipment; the corresponding benefit characterizes this local effect, that is, the local benefit. The effect the interaction action has on the final outcome of the match is an overall effect, and the corresponding benefit is the global benefit. In different situations, executing different interaction actions yields different local and global benefits. The interaction benefit obtained by the virtual object executing the interaction action is based on the local benefit and the global benefit, for example their sum.
Specifically, the server may control the virtual object to perform the interaction action; for example, the server may generate an action control instruction based on the interaction action and issue it to the virtual object to instruct the virtual object to perform the interaction action. After the virtual object executes the interaction action, the state of the virtual interaction scene changes; for example, in a fight game, the picture of the fight changes, such as a character being defeated or the attribution of a contested resource changing. The server obtains the local benefits and global benefits resulting from the virtual object executing the interaction action; the determination rules of the local benefits and the global benefits can be set according to actual needs. Specifically, the corresponding local benefits and global benefits can be determined according to the state change of the virtual interaction scene after the virtual object executes the interaction action. For example, in a fight game, if the virtual object successfully defeats a character of an enemy camp after performing the interaction action, both the local benefit and the global benefit may be positive, indicating that the interaction action has a positive effect and that the virtual object should be encouraged to perform it; if the virtual object fails to seize a resource after executing the interaction action and the resource is obtained by the adversary, both the local benefit and the global benefit may be negative, indicating that the interaction action has a negative effect and that the virtual object should be encouraged to switch to other interaction actions. The server obtains the interactive benefits according to the obtained local benefits and global benefits; for example, the local benefits and the global benefits can be summed, or weighted and summed, to obtain the interactive benefits.
In this embodiment, the server controls the virtual object to execute the interactive action, and determines the interactive benefit according to the local benefit and the global benefit obtained by the virtual object executing the interactive action, so as to perform multidimensional feedback on the interactive action execution of the virtual object, accurately determine the effect of the interactive action executed by the virtual object, and ensure the accuracy of the interactive benefit.
In one embodiment, obtaining interactive benefits from local benefits and global benefits includes: calculating local weighted benefits according to the local benefits and the local benefit weights; calculating according to the global benefit and the global benefit weight to obtain global weighted benefit; an interactive benefit is derived based on the local weighted benefit and the global weighted benefit.
Corresponding benefit weights can be set for different types of benefits, so that the different types of benefits can be weighted and each type can be emphasized in a targeted manner, improving the accuracy of the benefits. The local benefit weight represents the degree of influence of the local benefits on the interactive benefits, and the global benefit weight represents the degree of influence of the global benefits on the interactive benefits. The local benefit weight and the global benefit weight can be set according to actual needs; different virtual interaction scenes, virtual objects and interaction actions may correspond to different local benefit weights and global benefit weights, and therefore yield different interactive benefits.
Specifically, the server may obtain a local benefit weight and a global benefit weight, which may be related to one or more of the virtual interaction scene, the virtual object and the interaction action; that is, the server may obtain the corresponding local benefit weight and global benefit weight according to one or more of the virtual interaction scene, the virtual object and the interaction action. The server performs weighted calculation on the local benefits with the local benefit weight to obtain the local weighted benefits, and performs weighted calculation on the global benefits with the global benefit weight to obtain the global weighted benefits. The server obtains the interactive benefits based on the local weighted benefits and the global weighted benefits, e.g., the server may sum the local weighted benefits and the global weighted benefits to obtain the interactive benefits.
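A minimal sketch of this weighted combination is given below; the weight values are assumptions chosen purely for illustration:

def interaction_benefit(local_benefit, global_benefit,
                        local_weight=0.6, global_weight=0.4):
    # Assumed sketch: weight the two benefit types and sum them.
    # With both weights set to 1.0 this reduces to the plain sum described
    # in the previous embodiment.
    local_weighted = local_weight * local_benefit
    global_weighted = global_weight * global_benefit
    return local_weighted + global_weighted

# Example: defeating an enemy character (positive local benefit) that also
# improves the expected outcome of the match (positive global benefit).
r = interaction_benefit(local_benefit=1.0, global_benefit=0.5)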
In this embodiment, the server calculates the local weighted benefit according to the local benefit and the local benefit weight, calculates the global weighted benefit according to the global benefit and the global benefit weight, and obtains the interactive benefit based on the local weighted benefit and the global weighted benefit, so that the influence degree of the local weighted benefit and the global weighted benefit can be distinguished through the local benefit weight and the global benefit weight, so as to ensure the accuracy of the interactive benefit.
In one embodiment, based on the status feature, the target location, the interaction action, the interaction benefit, and the mobile guidance benefit, the training is continued after updating the interaction model to be trained until a trained interaction model is obtained, including: determining a target loss value based on the status feature, the target location, the interaction benefit, and the mobile guidance benefit; updating model parameters of the interaction model to be trained according to the target loss value to obtain an updated interaction model; and continuing training through the updated interaction model until the training ending condition is met, and obtaining the interaction model after training is completed.
The target loss value can be obtained based on a target loss function, the target loss function can be specifically built in advance, the target loss function is used for guiding model training of the interaction model, and the specific form of the target loss function can be flexibly set according to actual needs. The parameters involved in the objective loss function include state characteristics, target positions, interaction actions, interaction benefits and movement guidance benefits, namely, the objective loss value can be calculated by substituting specific numerical values of the state characteristics, the target positions, the interaction actions, the interaction benefits and the movement guidance benefits into the objective loss function. The training ending condition is a condition for judging that the training of the model is ended, and can comprise the training times reaching a training times threshold value, the training meeting a convergence condition and the like.
Specifically, the server may determine a target loss value based on the state feature, the target location, the interaction action, the interaction benefit, and the movement guidance benefit, and specifically may obtain a pre-constructed target loss function from the server, and substitute the obtained state feature, the target location, the interaction action, the interaction benefit, and the movement guidance benefit into the target loss function, to calculate the target loss value. The server updates model parameters of the interaction model to be trained based on the target loss value, and specifically can update weight parameters and mapping parameters in the interaction model to be trained, so that an updated interaction model is obtained. And the server continues training through the updated interaction model, and finishes training when the training ending condition is met, so as to obtain the interaction model after training is finished.
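The embodiment does not fix a concrete form for the target loss function. As one hedged illustration, a PPO-style actor-critic step (PPO is the algorithm mentioned in the application scenario further below) could look roughly as follows; the tensor names, the clipping range and the assumption that the model returns action logits plus a scalar value estimate from the state features and the path features of the target position are all illustrative:

import torch

def update_interaction_model(model, optimizer, batch):
    # Assumed PPO-style sketch of a single update step; the concrete target
    # loss used in the embodiment may differ. `batch` is assumed to hold
    # tensors for the state features, the path features of the target
    # position, the sampled interaction actions, their old log-probabilities
    # and the combined returns (interaction benefit plus movement guidance
    # benefit).
    logits, value = model(batch["state_feat"], batch["path_feat"])
    dist = torch.distributions.Categorical(logits=logits)
    log_prob = dist.log_prob(batch["action"])

    advantage = batch["return"] - value.squeeze(-1).detach()
    ratio = torch.exp(log_prob - batch["old_log_prob"])
    policy_loss = -torch.min(ratio * advantage,
                             torch.clamp(ratio, 0.8, 1.2) * advantage).mean()
    value_loss = (value.squeeze(-1) - batch["return"]).pow(2).mean()
    target_loss = policy_loss + 0.5 * value_loss

    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.item()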
In this embodiment, the server determines the target loss value based on the state features, the target position, the interaction action, the interaction benefits and the movement guidance benefits, updates the model parameters of the interaction model to be trained according to the target loss value, and continues training until the training end condition is satisfied, obtaining the trained interaction model. Updating the interaction model to be trained using such multidimensional data ensures the effectiveness of the model update, improves the effect of the iterative update of the interaction model, and is beneficial to improving the training efficiency of the interaction model.
In one embodiment, the interaction model processing method further includes: acquiring historical interaction data obtained by controlling a virtual object to interact by a historical account in a virtual interaction scene; extracting state characteristics aiming at the historical interaction data to obtain historical state characteristic data carrying the target position label; training is carried out based on the historical state characteristic data, and a movement strategy model is obtained.
The movement strategy model is obtained by training on historical interaction data obtained from interactions in the virtual interaction scene, where the historical interaction data are generated by users controlling virtual objects to interact in the virtual interaction scene through their application accounts. A historical account is an account owned by a user; in the application implementing the virtual interaction scene, the user can participate in interactions through the owned account. For example, for a fight game application, a user may register a game account and log into the client of the fight game application through the game account to participate in game fights. The historical accounts may include the accounts owned by the respective users. The historical state feature data carry target position labels, and a target position label describes the target position to which the virtual object was controlled to move under the state features characterized by the historical state feature data.
Specifically, the server acquires the historical interaction data of the historical accounts, i.e. the data obtained when users interact in the virtual interaction scene through their respective historical accounts. The server performs state feature extraction on the historical interaction data to extract the historical state feature data. The server may add labels to the historical state feature data, specifically target position labels describing the target position to which the user actually controlled the virtual object to move under the corresponding state features. The server then trains on the historical state feature data to obtain the movement strategy model. The movement strategy model, being trained on historical interaction data, learns the movement strategies adopted under different state feature conditions when users control virtual objects to interact in the virtual interaction scene. By inputting the state features into the movement strategy model, the movement strategy model can output the target position to which the virtual object is to be moved from the located position.
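As a hedged sketch of such supervised training, assuming (as in the application scenario described later) that the target position is discretized into direction categories around the current position, the movement strategy model could be trained roughly as follows; the network sizes and names are assumptions:

import torch
import torch.nn as nn

class MovePolicyModel(nn.Module):
    # Assumed sketch: classify the target point (discretized into direction
    # classes around the current position) from the state features.
    def __init__(self, state_dim, num_directions):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                      nn.Linear(256, 256), nn.ReLU())
        self.head = nn.Linear(256, num_directions)

    def forward(self, state_feat):
        return self.head(self.backbone(state_feat))

def supervised_train_step(model, optimizer, state_feat, target_label):
    # target_label: index of the target-position class actually reached by
    # the human-controlled virtual object (the target position label).
    logits = model(state_feat)
    loss = nn.functional.cross_entropy(logits, target_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()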
In this embodiment, the server performs state feature extraction on the historical interaction data obtained by controlling the interaction of the virtual object with respect to the historical account, and trains to obtain the movement strategy model based on the obtained historical state feature data carrying the target position tag, so that the movement strategy model learns the movement strategies corresponding to different state feature conditions when the user controls the virtual object to interact in the virtual interaction scene, the movement of the virtual object can be effectively controlled by using the historical interaction data, and the interaction model is guided to learn diversified interaction strategies in the training process, thereby effectively improving the interaction capability of the interaction model.
In one embodiment, the interaction model processing method further includes: acquiring a latest data category discrimination model; the data category discrimination model is obtained by training based on interactive sample data carrying data category labels; obtaining target interaction data based on the state characteristics and the interaction actions, and inputting the target interaction data into the latest data category discrimination model to obtain a data discrimination category; and obtaining the data category simulation benefits of the target interaction data according to the data judgment categories.
The data category discrimination model is obtained by training on interaction sample data carrying data category labels, and is used for classifying the obtained target interaction data and determining the data category to which the target interaction data belong. The target interaction data may be various data related to controlling the interaction of the virtual object in the virtual interaction scene, and may specifically include, but are not limited to, the state features and the interaction actions. The types of data included in the target interaction data may be set according to actual needs; for example, the target position may be included in addition to the state features and the interaction actions. The data category to which the target interaction data belong may be the source of the data, such as whether the data were generated by a user controlling the virtual object or by controlling the virtual object through the movement strategy model and the interaction model. To ensure that the movement strategy model and the interaction model effectively learn how users control virtual object interactions, the data category of the generated target interaction data can be discriminated, i.e. it is judged whether the generated target interaction data belong to data produced by a real user or data produced by the artificial intelligence model, so that the target interaction data can be made to fit the real interaction data produced by real users more closely.
The data discrimination category is the result output by the data category discrimination model when performing data category discrimination on the target interaction data; for example, the result may indicate that the data belong to the AI-generated category or to the user-generated category. The data category simulation benefits are used to feed back the anthropomorphic effect of the control process, i.e. the degree to which controlling the virtual object through the movement strategy model and the interaction model fits the control of a real user. The magnitude of the data category simulation benefits is positively correlated with that degree of fit: the more closely the movement strategy model and the interaction model control the virtual object in the way a real user would, the larger the data category simulation benefits.
Specifically, the server may acquire a pre-trained data type discrimination model, where the data type discrimination model is obtained by training based on interactive sample data carrying a data type tag, and the interactive sample data may include interactive data generated by the interaction of the virtual object under AI control, and interactive data generated by the interaction of the virtual object under real user control. The data category discrimination model may be in a dynamically updated state, and the server may acquire the latest data category discrimination model. The server obtains target interaction data based on the state characteristics and the interaction actions, specifically, the state characteristics and the interaction actions can be combined to obtain the target interaction data, the target interaction data is input into the latest data type judging model, the latest data type judging model is used for judging the data type, and the data judging type is output. The server obtains data category simulation benefits of the target interaction data based on the data determination categories, and can map the data determination categories according to a preset mapping relation to obtain the data category simulation benefits. The data category simulation benefits can characterize the control processing of the virtual object through the mobile strategy model and the interaction model, and the fitting degree of the control processing of the virtual object relative to the real user, namely the anthropomorphic degree of the mobile strategy model and the interaction model.
Further, based on the status feature, the target position, the interaction action, the interaction benefit and the mobile guidance benefit, the training is continued after updating the interaction model to be trained until obtaining the interaction model after training, including: and according to the state characteristics, the target positions, the interaction actions, the interaction benefits, the mobile guidance benefits and the data category simulation benefits, continuously training after updating the interaction model to be trained until the interaction model after training is completed is obtained.
Specifically, the server may further simulate benefits in combination with the data category of the target interaction data, and perform model update on the interaction model to be trained. The server can update the interactive model to be trained according to the state characteristics, the target positions, the interactive actions, the interactive benefits, the mobile guiding benefits and the data category simulation benefits, and then continue training until the interactive model which is trained is obtained. In a specific application, the server may obtain a pre-constructed target loss function, and substitute the obtained state characteristics, the target position, the interaction actions, the interaction benefits, the mobile guidance benefits and the data category simulation benefits into the target loss function, so as to calculate and obtain the target loss value. The server updates model parameters of the interaction model to be trained based on the target loss value, and specifically can update weight parameters and mapping parameters in the interaction model to be trained, so that an updated interaction model is obtained. And the server continues training through the updated interaction model, and finishes training when the training ending condition is met, so as to obtain the interaction model after training is finished.
In this embodiment, the server updates the interaction model to be trained according to the state features, the target position, the interaction action, the interaction benefits, the movement guidance benefits and the data category simulation benefits, and then continues training until the trained interaction model is obtained. Updating the interaction model to be trained with multidimensional data such as the state features, the target position, the interaction action, the interaction benefits and the movement guidance benefits ensures the effectiveness of the model update, improving the effect of the iterative update and the training efficiency of the interaction model. Additionally updating the model with the data category simulation benefits of the target interaction data guides the interaction model to learn the control strategy of real users, reduces unnecessary exploration, accelerates the convergence of the interaction model, and makes the virtual object controlled by the interaction model better fit the operations of real users, thereby improving the training effect of the interaction model.
In one embodiment, the interaction model processing method further includes: according to the target interaction data and the historical interaction data, constructing interaction sample data carrying data category labels; based on the interactive sample data, the model update is performed on the data category discrimination model.
The interactive sample data comprises interactive data generated by the interaction of the AI control virtual object and interactive data generated by the interaction of the real user control virtual object. The target interaction data are generated by controlling the virtual object to interact through the mobile strategy model and the interaction model, and belong to interaction data generated by controlling the virtual object to interact through the AI; the historical interaction data are generated by controlling virtual objects to interact in a virtual interaction scene by a historical user through a historical account, and belong to interaction data generated by controlling the virtual objects to interact by a real user.
Specifically, the server may construct interaction sample data carrying data category labels based on the target interaction data and the history interaction data, for example, corresponding data category labels may be added for each target interaction data and each history interaction data, including belonging to AI generation category labels and belonging to real user generation category labels. And the server performs model updating on the data type discrimination model based on the interactive sample data, so that the dynamic updating of the data type discrimination model is realized.
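One hedged sketch of such an update step follows; it assumes the samples have already been vectorized into tensors, and the label convention (0 for AI-generated target interaction data, 1 for human-generated historical interaction data) is an illustrative choice rather than part of the embodiment:

import torch
import torch.nn as nn

def update_discriminator(disc, optimizer, ai_samples, human_samples):
    # Assumed sketch: label target interaction data (AI-generated) as 0 and
    # historical interaction data (human-generated) as 1, then take one
    # binary-classification step on the mixed batch.
    x = torch.cat([ai_samples, human_samples], dim=0)
    y = torch.cat([torch.zeros(len(ai_samples)),
                   torch.ones(len(human_samples))])
    logits = disc(x).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()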
In this embodiment, the model update is performed on the data type discrimination model through the interactive sample data constructed by the target interactive data and the historical interactive data, so that the data type discrimination model can be updated while the interactive model is trained, which is beneficial to ensuring the accuracy of data type discrimination of the data type discrimination model, thereby being beneficial to improving the training effect of the interactive model.
In one embodiment, the interaction model processing method further includes: constructing a virtual interaction scene through a first processor, and controlling a virtual object to perform self-playing interaction in the virtual interaction scene; in the self-playing interaction process, the state characteristics, the target positions, the interaction actions, the interaction benefits and the movement guiding benefits are acquired through the first processor.
The first processor is used for constructing the virtual interaction scene and controlling the virtual objects to perform self-play interaction. Self-Play is an unsupervised learning method, a reinforcement learning approach in which the machine learns and explores by playing against itself. In MOBA and FPS games, players are divided into two hostile camps and win by competing to achieve certain objectives; during self-play interaction, an artificial intelligence model can play both sides of the MOBA or FPS game against itself so as to explore diverse interaction data. The interaction model may then be trained based on the obtained interaction data.
Specifically, a virtual interaction scene is constructed through the first processor, and the virtual objects are controlled to perform self-playing interaction in the virtual interaction scene. For example, for a fight game application, the fight game application may be run by the first processor, thereby building a virtual interaction scenario, and controlling at least one virtual object to perform a self-play interaction in the virtual interaction scenario. In the self-playing interaction process, the state characteristics, the target positions, the interaction actions, the interaction benefits and the movement guiding benefits are acquired through the first processor. In the interaction model processing method, the state characteristics of a virtual interaction scene where a virtual object is located are obtained through a first processor; inputting the state characteristics into a movement strategy model through a first processor to obtain a target position to which the virtual object is to be moved from the position; inputting the state characteristics and the target positions into an interaction model to be trained through a first processor to perform interaction operation mapping, and obtaining interaction actions to be executed by the virtual object at the position; and obtaining interactive benefits obtained by the virtual object executing the interactive action through the first processor, and obtaining movement guiding benefits obtained when the virtual object moves from the located position to the target position.
Further, based on the status feature, the target position, the interaction action, the interaction benefit and the mobile guidance benefit, the training is continued after updating the interaction model to be trained until obtaining the interaction model after training, including: and updating the interactive model to be trained based on the state characteristics, the target positions, the interactive actions, the interactive benefits and the mobile guiding benefits by the second processor, and continuing training until the interactive model to be trained is obtained.
Specifically, the state features, the target position, the interaction action, the interaction benefits and the movement guidance benefits obtained by the first processor may be sent to the second processor, and the second processor updates the interaction model to be trained based on them and then continues training until the trained interaction model is obtained. In a specific application, the first processor and the second processor may be integrated in the same computer device, for example in the same server; the first processor may be a CPU (Central Processing Unit) of the server, and the second processor may be a GPU (Graphics Processing Unit) of the server.
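A minimal single-machine sketch of this division of labor is given below. It uses an in-process queue to stand in for the data pool; the environment interface, the collate and apply_update helpers, and the termination condition are all hypothetical placeholders, and the distributed multi-machine setup described later in the application scenario is not reproduced here:

import queue

sample_pool = queue.Queue(maxsize=10000)  # stands in for the data pool

def first_processor_loop(env, move_policy, interaction_model, num_steps):
    # Assumed sketch of the first-processor (CPU) side: run self-play in the
    # virtual interaction scene and push training samples into the pool.
    state_feat = env.reset()
    for _ in range(num_steps):
        target_pos = move_policy(state_feat)
        action = interaction_model.act(state_feat, target_pos)
        state_feat, interaction_benefit, guidance_benefit = env.step(action)
        sample_pool.put((state_feat, target_pos, action,
                         interaction_benefit, guidance_benefit))

def second_processor_loop(interaction_model, optimizer, num_updates, batch_size=256):
    # Assumed sketch of the second-processor (GPU) side: consume samples from
    # the pool and update the interaction model for a fixed number of updates
    # (standing in for the training end condition).
    for _ in range(num_updates):
        samples = [sample_pool.get() for _ in range(batch_size)]
        batch = collate(samples)  # hypothetical helper stacking samples into tensors
        apply_update(interaction_model, optimizer, batch)  # e.g. the PPO-style step sketched earlier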
In this embodiment, the first processor controls the virtual object to perform self-playing interaction in the virtual interaction scene, and obtains status features, target positions, interaction actions, interaction benefits and mobile guiding benefits, and the second processor updates the interaction model to be trained based on the status features, the target positions, the interaction actions, the interaction benefits and the mobile guiding benefits, and then continues training until the interaction model to be trained is obtained, so that self-playing interaction and model updating processing are realized by different processors, and orderly operation of the self-playing interaction and model updating processing is facilitated.
In one embodiment, the interaction model processing method further includes: and when the state characteristics are input into the movement strategy model, obtaining the prediction benefits output by the movement strategy model.
The movement strategy model may also predict and output the benefits expected to be obtained, and the predicted benefits may include various types of benefits, such as interaction benefits, movement guidance benefits or data category simulation benefits.
Specifically, under the condition that the status feature is input into the movement policy model, the movement policy model outputs predicted benefits in addition to the target position to which the virtual object is to be moved from the position, and the predicted benefits may include benefits that are predicted to be obtained in the control process of the virtual object at this time. In a specific application, when the mobile strategy model is trained based on the historical interaction data obtained by interaction in the virtual interaction scene, the historical interaction data can carry a benefit label besides a target position label, so that the mobile strategy model obtained by training based on the historical interaction data can output the target position to which the virtual object is to be moved from the position and the predicted benefit according to the input state characteristics.
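As a hedged illustration of such a model, the supervised movement strategy model sketched earlier could be extended with a second output head for the predicted benefit; the structure and dimensions below are assumptions:

import torch
import torch.nn as nn

class MovePolicyWithValue(nn.Module):
    # Assumed sketch: in addition to the target-position class, a second head
    # outputs the predicted benefit for the current control step.
    def __init__(self, state_dim, num_directions):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        self.target_head = nn.Linear(256, num_directions)  # target position class
        self.benefit_head = nn.Linear(256, 1)               # predicted benefit

    def forward(self, state_feat):
        h = self.backbone(state_feat)
        return self.target_head(h), self.benefit_head(h).squeeze(-1)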
Further, based on the status feature, the target position, the interaction action, the interaction benefit and the mobile guidance benefit, the training is continued after updating the interaction model to be trained until obtaining the interaction model after training, including: based on the state characteristics, the target positions, the interaction actions, the interaction benefits, the mobile guiding benefits and the prediction benefits, the interaction model to be trained is updated and then continuously trained until the interaction model with the training completed is obtained.
Specifically, the server may update the model of the interaction model to be trained in combination with the predicted benefits output by the mobile policy model. The server can update the interactive model to be trained according to the state characteristics, the target positions, the interactive actions, the interactive benefits, the mobile guiding benefits and the predicted benefits, and then continue training until the interactive model which is trained is obtained. In a specific application, the server may obtain a pre-constructed target loss function, and substitute the obtained state feature, the target position, the interaction action, the interaction benefit, the movement guiding benefit and the prediction benefit into the target loss function, so as to calculate and obtain the target loss value. The server updates model parameters of the interaction model to be trained based on the target loss value, and specifically can update weight parameters and mapping parameters in the interaction model to be trained, so that an updated interaction model is obtained. And the server continues training through the updated interaction model, and finishes training when the training ending condition is met, so as to obtain the interaction model after training is finished.
In this embodiment, the server updates the interaction model to be trained according to the state features, the target position, the interaction action, the interaction benefits, the movement guidance benefits and the predicted benefits output by the movement strategy model, and then continues training until the trained interaction model is obtained. Updating the interaction model to be trained with multidimensional data such as the state features, the target position, the interaction action, the interaction benefits and the movement guidance benefits ensures the effectiveness of the model update, improving the effect of the iterative update and the training efficiency of the interaction model; updating the model in combination with the predicted benefits further guides the interaction model to learn the control strategy with the largest benefits, reduces unnecessary exploration, accelerates the convergence of the interaction model, and enables the virtual object controlled by the interaction model to obtain the largest benefits, thereby improving the training effect of the interaction model.
In one embodiment, as shown in fig. 6, the process of obtaining the status feature, that is, obtaining the status feature of the virtual interaction scene where the virtual object is located, includes:
step 602, obtaining environment perception data of the virtual object in the virtual interaction scene, situation data of the virtual interaction scene and inter-object interaction data of the virtual object.
The environment perception data refers to data which can be perceived when the virtual object is in the virtual interaction scene, for example, the environment perception data can comprise terrain information of the virtual interaction scene, and particularly can comprise various information such as building distribution, virtual object distribution and the like; the situation data may include attribute data of the virtual object, and may specifically include one or more of a class of the virtual object, equipment of the virtual object, a vital value attribute of the virtual object such as blood volume of hero in a game, skill information of the virtual object, or an offensiveness of the virtual object; the inter-object interaction data between virtual objects may include data generated by interactions between different virtual objects, such as distance, included angle, skill or object application between virtual objects.
The server obtains environment perception data of the virtual object in the virtual interaction scene, situation data of the virtual interaction scene and inter-object interaction data of the virtual object, and can obtain the environment perception data, the situation data and the inter-object interaction data by intercepting pictures of the virtual interaction scene and extracting data based on the intercepted pictures.
And step 604, extracting the characteristics of the environment sensing data, the situation data and the interaction data among the objects respectively to obtain the state characteristics of the virtual interaction scene where the virtual object is located.
The environment perception data, the situation data and the interaction data among the objects are combined to form state data, and the state characteristics of the virtual interaction scene where the virtual object is can be obtained by extracting the characteristics of the state data. Specifically, the server may perform feature extraction on the context awareness data, the situation data, and the inter-object interaction data, for example, may obtain the state features of the virtual interaction scene where the virtual object is located through mapping processing of data vectorization.
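One hedged sketch of this vectorization is given below; every field name is an assumption introduced for illustration, not part of the embodiment:

import numpy as np

def build_state_features(env_perception, situation, inter_object):
    # Assumed sketch: vectorize the three kinds of data and concatenate them
    # into a single state feature vector. Field names are illustrative only.
    env_vec = np.asarray(env_perception["ray_depths"], dtype=np.float32)
    situation_vec = np.array([situation["hp"],
                              situation["attack"],
                              situation["skill_ready"]], dtype=np.float32)
    inter_vec = np.array([inter_object["enemy_distance"],
                          inter_object["enemy_angle"]], dtype=np.float32)
    return np.concatenate([env_vec, situation_vec, inter_vec])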
In this embodiment, the server obtains the state features of the virtual interaction scene where the virtual object is located by extracting features of the environment sensing data, the situation data and the interaction data between the objects, so that the state features of multiple dimensions can be obtained, the state of the virtual interaction scene can be accurately described, the training effect of the interaction model is guaranteed, and the interaction capability of the interaction model is improved.
The application also provides an application scene, which applies the interactive model processing method.
Specifically, the application of the interaction model processing method in the application scene is as follows:
A first-person shooting game, i.e. an FPS game, is a general term for shooting-type electronic games played mainly from the first-person view of the player, and the blasting mode is a highly competitive play style in FPS games. Among the various FPS games, the blasting mode is a core play style that divides players into two hostile camps, the attacking (latent) side and the defending side. The goal of the attacking side is for the player carrying the bomb to plant it at one of the bomb sites on the map, which usually has two fixed bomb sites, with the other teammates assisting in completing the final blasting task; the defending side prevents the attackers from planting the bomb, or defuses it after it has been planted. Throughout the game, both sides can use their equipment to eliminate enemies so as to reduce the opponent's probability of winning. As shown in fig. 7, in the blasting mode of an FPS game, a player controlling a virtual character of the attacking camp can make the character carry the bomb to a bomb site and plant it; as shown in fig. 8, a player controlling a virtual character of the defending camp can make the character defuse a bomb planted at a bomb site; as shown in fig. 9, if a planted bomb is not successfully defused by the defending camp, the bomb will detonate when its condition is satisfied and the defending camp loses the round. Players therefore mainly think and operate on two levels throughout the game: one is macroscopic scheduling, i.e. how to schedule and cooperate tactically to achieve the objective of their own camp; the other is microscopic operation, i.e. individual operation in specific scenarios, mainly how to eliminate enemies during combat using the carried equipment and the surrounding terrain.
For complex electronic games such as the FPS blasting mode, there are currently two mainstream learning methods for building game AI agents: one is supervised learning, i.e. behavior cloning; the other is training the AI with an RL (Reinforcement Learning) algorithm. The former extracts features and labels from human player data and then trains with deep learning; the latter relies on well-designed rewards so that the game agent, through continuous exploration, maximizes its expected return and finally wins the game. Reinforcement learning algorithms have the following advantages: they do not depend on existing human player data, so the final model capability can exceed human performance; data can be generated for learning through self-play, which facilitates parallelization of the algorithm and greatly improves the training efficiency of the algorithm model; and deep neural networks can be used to model the high-dimensional, complex continuous state space and the action space of the agent in the game.
At present, for 2D MOBA games, deep neural networks have been designed to model attribute features such as heroes, map features such as obstacles, and global features such as time, and to output hero actions and targets together with corresponding offset and delay parameters. Model training adopts the PPO (Proximal Policy Optimization) reinforcement learning algorithm and uses hand-designed rewards to strengthen the game agent through repeated self-play simulation. However, MOBA games are 2D games, lower than FPS games in the complexity of the environment and of the action space, and their rewards and penalties are denser than in FPS games. Compared with 2D MOBA games, FPS games have more complex game environments: the 3D environments involved have more complex spatial structures, and game characters are not points but 3D shapes. FPS games also have a more complex action space: compared with 2D games, the viewing angle must be changed for perception, so the action space is larger, and the operations are richer, including steering, moving, squatting, jumping, firing, using various auxiliary equipment, and so on. Feedback rewards in FPS games are sparse: a successful plant/defend occurs only once per round, the feedback is long-sequence, and there is no feedback signal during most stages of the game. In addition, the FPS blasting-mode game has multiple ways to win, with multiple feedback signals superimposed, so the model needs to be compatible with multiple objectives.
Based on this, the interaction model processing method provided in this embodiment mainly introduces human expert data into the process of training the FPS blasting-game AI agent with a reinforcement learning algorithm, using a hierarchical reinforcement learning (HRL, Hierarchical Reinforcement Learning) algorithm that learns multiple strategies and anthropomorphic micro-operation. Compared with 2D games such as Go and MOBA games, an FPS game has a 3D simulation environment and a richer action space, including steering, moving, squatting, jumping, firing, using various auxiliary equipment and other actions, so a reinforcement learning algorithm converges slowly, and the learned behaviors and strategies tend to be monotonous and inconsistent with the play of real players. The interaction model processing method provided by this embodiment can help the FPS AI learn multiple strategies by constructing an upper-and-lower network architecture, so that the blasting game is completed using different strategies. The method can be implemented in an FPS game, training a multi-strategy game AI agent with a hierarchical reinforcement learning algorithm based on human expert data. Specifically, the upper layer uses a deep neural network to abstract and express the strategies in the human expert data, so that the AI agent can perform reinforcement learning exploration based on human play; meanwhile, a dense reward expressing whether micro-operations are anthropomorphic is introduced into the lower layer, reducing unnecessary exploration in the lower-layer learning process, accelerating convergence, and making the micro-operation performance anthropomorphic. In this way, the AI agent explores macroscopic strategy play under the guidance of the upper layer while the lower layer continuously improves its micro-operation capability; at the same time, while learning to follow the play specified by the upper layer, the AI agent can also decide to choose a strategy that does not follow the upper layer's specified play when coping with emergencies, so that the AI agent can continuously improve, for different situations, counter-strategies that maximize the win rate.
Specifically, the interaction model processing method provided by this embodiment relates to a multi-strategy FPS blasting-mode game AI learning method based on expert data, which is generally divided into two large modules: first, an upper-layer strategy learning module for learning the movement strategy model; second, a lower-layer reinforcement learning module for training the interaction model. The upper-layer strategy learning module learns a movement strategy model that can provide a future target point in the current state and pass the target point information to the lower-layer interaction model; if the lower layer reaches the target point given by the upper layer, a reward is given, and through this reward the lower-layer interaction model is driven to follow the upper layer's guidance as much as possible, so that the macroscopic strategy distribution stays consistent with human experts to a certain degree. The lower-layer reinforcement learning module models the multi-class action space of the game, including steering, movement, attack-related and state-related actions, based on ray, image and vector processing of the 3D game environment perception. In addition, the lower layer can include a dedicated sub-module that uses the idea of adversarial learning to build a classifier distinguishing AI data from expert data; the probability score of this discrimination (the higher the score, the more the AI behaves like a human) can be used as a dense reward for the lower layer. The lower layer then continuously trains and improves the AI's micro-operation and strategy capability through self-play and a reinforcement learning algorithm, using the instant returns in the game.
As shown in fig. 10, in the upper-layer policy model training process, human data is used for supervised learning to learn the human scheduling strategy, obtaining an upper-layer policy generator pi_high(g|s). According to the input state features s, the policy generator pi_high(g|s) can output a moving path (Target Path) toward a target point g. Path features (Path Feat) can be obtained by feature extraction of the Target Path, for example through embedding processing, and are fed into the lower-layer interaction model through feature concatenation, so that the interaction model performs interaction operation mapping based on the state features and the target position of the target point to obtain the interaction actions. For the lower-layer interaction model, the Game State state features can be input, and through the interaction operation mapping processing in the interaction model, the policy network Policy-net outputs the interaction Actions. In addition, predicted benefits, including combat benefits, outcome benefits, anthropomorphic benefits and movement guidance benefits, may also be output through the value network Value-net in the interaction model.
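Fig. 10 describes the lower-layer model only at a high level. As one hedged sketch, it could be structured as below; the simple MLP encoder, the layer sizes and the choice of four value outputs (corresponding to the combat, outcome, anthropomorphic and guidance benefits mentioned above) are assumptions:

import torch
import torch.nn as nn

class InteractionModel(nn.Module):
    # Assumed sketch of the lower-layer model in fig. 10: Policy-net maps the
    # concatenated game state features and path features to interaction
    # actions, and Value-net outputs the predicted benefits.
    def __init__(self, state_dim, path_dim, num_actions):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim + path_dim, 512), nn.ReLU())
        self.policy_net = nn.Linear(512, num_actions)
        self.value_net = nn.Linear(512, 4)  # combat, outcome, anthropomorphic, guidance

    def forward(self, state_feat, path_feat):
        h = self.encoder(torch.cat([state_feat, path_feat], dim=-1))
        return self.policy_net(h), self.value_net(h)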
Specifically, the upper-layer expert data strategy learning module performs supervised learning with human data and learns the human scheduling strategy. Based on the current state, the module outputs future target points for the AI at intervals, or whenever the lower-layer model completes the previous target. Rewards are given to guide the lower-layer learning strategy according to how well the AI completes the target points, which helps the AI quickly learn the human scheduling strategy and improve its robustness, while also improving the completion capability of the lower-layer model. The upper-layer expert data strategy learning module may include a target point generation module, which generates target points within a certain distance around the AI's current position and prunes them with a game navigation tool, such as a navmesh tool, so that only reachable target points are selected. The module may also include a strategy model supervised-training module, which extracts input features from each frame of expert data, takes the target point reached within a certain period after the current frame as the training label, and then builds a strategy learning model with a deep neural network for supervised training to obtain the strategy model. The strategy model is essentially a multi-classification model: each target point can be assigned to a direction relative to the current position, and each direction is one category. The module may further include a strategy guidance module: when the upper-layer model generates a target point, the lower-layer model needs to follow the upper layer's guidance, so whenever a new target point is generated, the completion of the previous target point is judged, specifically according to the distance from that target point; if it was not completed, a penalty is given according to the distance from the target point, so that the lower-layer model is incentivized to complete the target assigned by the upper layer. The movement guidance benefit reward_guiding can be calculated as follows:

reward_guiding = min(0, (m - Dist(pos_current, pos_target)) * c)

where min() takes the minimum value, Dist() computes the distance, pos_current is the current position point, pos_target is the target position point to be moved to, m is a distance threshold set according to actual needs, and c is a scaling coefficient.
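Under this reading of the formula, the guidance benefit could be computed as follows; the values of m and c are illustrative assumptions:

def guidance_reward(pos_current, pos_target, m=5.0, c=0.1):
    # Assumed sketch of the movement guidance benefit: zero when the virtual
    # object ends within the distance threshold m of the target point, and a
    # penalty growing with the remaining distance otherwise.
    dist = ((pos_current[0] - pos_target[0]) ** 2 +
            (pos_current[1] - pos_target[1]) ** 2) ** 0.5
    return min(0.0, (m - dist) * c)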
The lower-layer reinforcement learning training module is responsible for generating the self-play data required by the model and for the iterative reinforcement-learning training of the neural network model. The lower-layer reinforcement learning training module includes a feature extraction module for extracting the state features. In an FPS game, the feature types can be divided into three kinds: environment perception information, basic in-match situation information, and information about the various players and their interactions. First, environment perception information is the most important information distinguishing a 3D game from a 2D game and is used to perceive the terrain in the 3D environment; with reference to environment perception features in related fields such as autonomous driving, these are generally radial-ray, depth-map and height-map features. Then, based on the information a real player can obtain in a match, the information obtained by the AI is designed and vectorized: in-match information, including game time and bomb package status; basic information of the controlled player, such as health, carried equipment and camp; and information about visible players and interactions with them, such as distance and angle. Corresponding features are extracted and vectorized by reading the game log, and are finally used as the model input. For the Action Space, actions can be divided into 6 categories based on game key play: 1) movement (i.e. controlling character movement); 2) left-right steering (i.e. controlling the character to turn left or right); 3) up-down steering (i.e. controlling the character to look up or down); 4) state actions (i.e. controlling the character to squat, jump, etc.); 5) equipment-related actions (i.e. controlling the character to use or switch equipment); 6) package-related actions (i.e. the character's operations on the bomb package, such as planting and defusing); each category of actions may be discretized.
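A hedged sketch of such a discretized action space is shown below; the specific options listed for each category are assumptions chosen for illustration only:

# Assumed sketch of the discretized action space; the number of options per
# category is illustrative, not taken from the embodiment.
ACTION_SPACE = {
    "move":      ["none", "forward", "back", "left", "right"],
    "turn_lr":   ["none", "left_small", "left_large", "right_small", "right_large"],
    "turn_ud":   ["none", "up", "down"],
    "state":     ["none", "squat", "jump"],
    "equipment": ["none", "fire", "switch_weapon", "use_item"],
    "package":   ["none", "plant", "defuse"],
}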
Further, the lower-layer reinforcement learning training module determines the benefits through a return (reward) extraction module. During the AI's continuous interaction with the game environment, reinforcement learning needs returns to evaluate the AI's trajectory in the game and thus the actions the AI takes after observing the game environment. Based on an understanding of the game, the rewards can be divided into battle rewards, i.e. combat benefits, including damage dealt and kills, and win-type rewards, i.e. rewards related to the game outcome, namely whether the game is won. The lower-layer reinforcement learning training module performs anthropomorphic learning through an adversarial learning module. Because of the particularity of FPS games, the game only has rewards related to victory and kills, and these rewards are sparse. At the same time, experts have certain point-position preferences and diversity in FPS games, and it is difficult to define such rewards by hand. Therefore, based on the idea of inverse reinforcement learning, a discriminator network is used to learn a human-likeness reward. Specifically, a discriminator is introduced and trained on a classification task using AI-generated data and expert data, so as to judge whether a piece of data was generated by an expert. At inference time, the discriminator scores the data generated by the AI at every frame, and the score serves as a dense anthropomorphic reward that helps the AI reduce exploration in reinforcement learning and improves the AI's micro-operation. The anthropomorphic benefit r(s, a) can be calculated as follows:
r(s, a) = -log(1 - D(s, a)), where D is the probability output of the discriminator, s is the state feature, and a is the interaction action.
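A direct transcription of this formula into code is given below; the small epsilon is an implementation assumption added only to keep the logarithm finite when the discriminator outputs exactly 1:

import math

def personify_reward(discriminator_prob):
    # r(s, a) = -log(1 - D(s, a)): the more the discriminator judges the
    # (state, action) pair to be human-generated, the larger the dense reward.
    eps = 1e-6
    return -math.log(max(1.0 - discriminator_prob, eps))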
Further, for the neural network training module: through feature extraction, AI action execution and return calculation, a large amount of data can be obtained as input to the deep learning network model; that is, the model's input is the relevant features and its output is the action type, and then, based on the returns given by the game environment, the network parameters can be learned with the reinforcement learning algorithm PPO so as to maximize the reward.
As for the reinforcement learning training framework, similar to the reinforcement learning frameworks of games such as card games and MOBA, reinforcement learning training with self-play can be adopted for AI decision-making. The entire training framework can be divided into a CPU side and a GPU side. The CPU side mainly scales out the environment conveniently through multi-container Docker images: game rooms are opened on multiple machines, and the model AI plays games in those rooms, thereby generating a large amount of data, which is sent to the data pool module Memboost; the GPU side then takes data from Memboost for deep learning and updates the model with the reinforcement learning PPO algorithm. As shown in fig. 11, on the CPU (central processing unit) side, game cores run on multiple machines to open game rooms, where the model AI plays games against the game Server, generating a large amount of data including actions, states and returns; the data is sent to the data pool, and the GPU (graphics processor) side acquires the data from the data pool to perform model learning, for example based on the PPO algorithm.
Specifically, in the interaction model processing method provided by this embodiment, data extraction is performed on the historical fight data of high-level players: in units of each minute of a match, feature and target-point label information can be extracted from the situations appearing after the game starts, obtaining an expert fight data sample set. The expert strategy neural network model is loaded, its network parameters are randomly initialized, and the network is trained with the obtained expert fight data sample set to obtain the expert strategy model. The neural network model for lower-layer reinforcement learning is loaded and its parameters randomly initialized; the discriminator model is loaded and its parameters randomly initialized; and the trained expert strategy neural network model is read in at the same time. The game environment is loaded, opponent models are selected from the opponent model pool, and self-play scripts are started in parallel on multiple machines to obtain <state, target, action> sample data and to compute the corresponding in-game return benefits. The target is obtained at intervals by inputting the game situation information, i.e. the state features, into the expert strategy model; the AI fight data is input into the discriminator for classification to obtain the anthropomorphic reward; it is judged whether the current AI has reached the vicinity of the target point, and if so, an arrival reward is given. The obtained <state, target, action>, the game return reward, the anthropomorphic reward and the guidance reward are combined, and the parameters of the neural network model are updated according to the PPO algorithm. The fight data generated by the AI and the expert data are combined and sampled in a certain proportion to form batches that are input into the discriminator model for network training, where a supervision signal classifying whether the features extracted at a frame (moment) are AI data or expert data is used to update the network parameters. After the model has iterated for a certain number of steps, the updated model is added to the opponent model pool for subsequent fights. The capability of the AI model is evaluated; if the capability upper limit or the maximum number of iteration time steps is reached, training stops and the final model is saved; otherwise, training continues.
Further, the number of rays in the game (3D perception accuracy), the imaging of the minimap information, and the discretization dimension of the global features can be adjusted according to actual needs (model accuracy, resource size). The target-point division of the strategy can adjust the path length according to actual needs. The input features of the discriminator can be added, deleted or changed depending on which dimensions of human behavior are to be fitted, and more discriminators can be added to provide anthropomorphic rewards for different scenes and dimensions. When calculating the return benefits, the dimensions are not limited to kills, win/loss, anthropomorphism and guidance; dimensions can be added or removed according to actual needs, and the weights of the different benefits can also be adjusted as required. The specific structures of the various neural network models involved can be adjusted for different scenes and requirements. The iterative training described above uses the PPO reinforcement learning algorithm, but other reinforcement learning algorithms can also be used for iterative training, such as A3C (Asynchronous Advantage Actor-Critic) and DDPG (Deep Deterministic Policy Gradient). In addition, after supervised learning, the upper-layer strategy model can also be extended to perform further iterative optimization in a reinforcement learning manner using upper-layer benefits while guiding the lower-layer strategy learning.
In the interaction model processing method provided by this embodiment, expert data is used to assist the training of a hierarchical reinforcement learning FPS game AI. Game information of the FPS blasting mode is extracted, and the rays of 3D environment perception together with situation-scene imaging and vectorization are abstracted; at the same time the multi-class action space of the game (steering, movement, attack-related, state-related, and so on) is modeled, so that the game input and action output of the FPS game AI are modeled effectively. Based on the idea of adversarial learning, a discriminator is introduced to score the data generated by the AI; the score serves as a dense anthropomorphic reward that helps the AI reduce exploration in reinforcement learning and, at the same time, helps improve the AI's micro-operations. Parallel self-play on multiple machines can generate a large amount of training data without depending on human players; using the abstracted game features and the defined benefits, the model is continuously optimized with a reinforcement learning algorithm, improving the AI's capability from scratch. Expert data is used to learn the experts' macro scheduling, and the upper-layer policy model then guides the training of the lower-layer reinforcement learning, so that the AI learns the experts' counter-strategies more quickly and keeps improving in strength on that basis, giving it better robustness and adaptability when playing against diverse human players.
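The multi-class action space mentioned above can, for example, be modeled with one categorical output head per action group. The following PyTorch sketch is an assumption-laden illustration of that idea; the head names, class counts and layer sizes are hypothetical rather than taken from the embodiment.

```python
# Illustrative multi-head action model: one categorical head per action group.
import torch
import torch.nn as nn

class MultiHeadPolicy(nn.Module):
    def __init__(self, state_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        # One categorical head per action group: steering, movement, attack, state.
        self.heads = nn.ModuleDict({
            "turn":   nn.Linear(128, 12),   # discretized steering angles
            "move":   nn.Linear(128, 9),    # 8 directions + stay
            "attack": nn.Linear(128, 3),    # fire / aim / hold
            "state":  nn.Linear(128, 4),    # crouch / jump / reload / none
        })

    def forward(self, state):
        h = self.trunk(state)
        return {name: torch.distributions.Categorical(logits=head(h))
                for name, head in self.heads.items()}

policy = MultiHeadPolicy()
dists = policy(torch.randn(1, 256))
action = {name: dist.sample() for name, dist in dists.items()}  # one sub-action per head
```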
The interaction model processing method provided by this embodiment offers a solution for the FPS blasting mode: a multi-strategy FPS blasting-mode game AI learning method based on expert data, in which an upper-layer and lower-layer architecture decouples the AI's strategy from its micro-operation capability. The upper layer of the method can learn the experts' point-location targets from the match experience contained in the expert data, so that scheduling strategies corresponding to different human expert play styles can be generated. Guided by the targets produced by the upper layer, the lower layer learns to accomplish the upper-layer targets while learning and maintaining basic micro-operation capability, thereby learning the play of different strategies in FPS blasting-mode games. The method can therefore effectively solve the problem of multi-strategy learning for game AI in the FPS blasting mode, and in turn effectively improve the AI's strategic capability.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include several sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; nor is their execution order necessarily sequential, as they may be performed in turn or alternately with at least part of the other steps, or with at least part of the sub-steps or stages of the other steps.
Based on the same inventive concept, an embodiment of the application also provides an interaction model processing apparatus for implementing the interaction model processing method described above. The solution implemented by the apparatus is similar to that described in the above method; therefore, for the specific limitations in the embodiments of the interaction model processing apparatus provided below, reference may be made to the limitations of the interaction model processing method above, and details are not repeated here.
In one embodiment, as shown in fig. 12, there is provided an interaction model processing apparatus 1200, comprising: a state feature acquiring module 1202, a target position obtaining module 1204, an interaction action obtaining module 1206, a profit obtaining module 1208 and a model updating module 1210, wherein:
a state feature acquiring module 1202, configured to acquire a state feature of a virtual interaction scene where a virtual object is located;
the target position obtaining module 1204 is configured to input the status feature into the movement policy model, to obtain a target position to which the virtual object is to be moved from the located position; the mobile strategy model is obtained by training based on historical interaction data obtained by interaction in the virtual interaction scene;
the interaction action obtaining module 1206 is configured to input the state feature and the target position into an interaction model to be trained to perform interaction operation mapping, so as to obtain an interaction action to be executed by the virtual object at the position;
the profit obtaining module 1208 is configured to obtain an interactive profit obtained by the virtual object performing the interactive action, and obtain a movement guiding profit obtained when the virtual object moves from the located position to the target position;
the model updating module 1210 is configured to update the interaction model to be trained based on the status feature, the target location, the interaction action, the interaction benefit and the mobile guidance benefit, and then continue training until a trained interaction model is obtained.
In one embodiment, the apparatus further comprises a movement control module configured to control the virtual object to move from the located position to the target position; the profit obtaining module 1208 is further configured to determine, when a movement determination condition is satisfied, an intermediate position reached as the virtual object moves from the located position to the target position; determine a distance difference between the intermediate position and the target position; and obtain the movement guidance benefit according to the distance difference mapping.
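A minimal sketch of such a distance-difference mapping is given below; the arrival radius, the linear falloff and the reward magnitudes are illustrative assumptions, not values from the embodiment.

```python
# Hypothetical mapping from the distance difference to a movement guidance benefit.
import math

def movement_guide_reward(intermediate_pos, target_pos, arrive_radius=1.0, scale=0.1):
    dx = intermediate_pos[0] - target_pos[0]
    dy = intermediate_pos[1] - target_pos[1]
    dist = math.hypot(dx, dy)
    if dist <= arrive_radius:        # reached the vicinity of the target point
        return 1.0
    return -scale * dist             # the farther away, the smaller (more negative) the benefit

print(movement_guide_reward((3.0, 4.0), (0.0, 0.0)))   # -0.5
print(movement_guide_reward((0.5, 0.0), (0.0, 0.0)))   # 1.0
```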
In one embodiment, the movement control module is further configured to determine a movement path of the virtual object according to the target position and the located position; extract features of the movement path to obtain path features of the movement path; and control the virtual object to move from the located position to the target position according to the path features.
In one embodiment, the apparatus further comprises an action execution control module configured to control the virtual object to execute the interaction action; the profit obtaining module 1208 is further configured to obtain the local benefit and the global benefit obtained by the virtual object executing the interaction action, and to obtain the interaction benefit according to the local benefit and the global benefit.
In one embodiment, the profit obtaining module 1208 is further configured to calculate a local weighted benefit according to the local benefit and the local benefit weight; calculate a global weighted benefit according to the global benefit and the global benefit weight; and derive the interaction benefit based on the local weighted benefit and the global weighted benefit.
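For example, the weighted combination can be expressed as a single function; the weights used below are assumptions chosen only to make the example concrete.

```python
# Hypothetical weighted combination of local and global benefits.
def interaction_benefit(local_benefit, global_benefit,
                        local_weight=0.25, global_weight=0.75):
    return local_weight * local_benefit + global_weight * global_benefit

# e.g. a kill (local benefit) during a lost round (global benefit)
print(interaction_benefit(local_benefit=1.0, global_benefit=-1.0))   # -0.5
```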
In one embodiment, the model update module 1210 is further configured to determine a target loss value based on the status feature, the target location, the interaction benefit, and the movement guidance benefit; updating model parameters of the interaction model to be trained according to the target loss value to obtain an updated interaction model; and continuing training through the updated interaction model until the training ending condition is met, and obtaining the interaction model after training is completed.
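By way of illustration, the parameter update according to the target loss value could take the form of a clipped PPO policy loss, as in the sketch below; the advantage values and the clip range are assumptions and are not parameters disclosed in this embodiment.

```python
# Illustrative PPO-style clipped policy loss built from the combined benefits.
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

advantages = torch.tensor([0.5, -0.2, 1.0])          # e.g. benefits minus a value baseline
loss = ppo_policy_loss(torch.tensor([-1.0, -0.7, -1.2]),
                       torch.tensor([-1.1, -0.9, -1.0]), advantages)
loss_value = loss.item()                             # would be backpropagated through the policy
```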
In one embodiment, the apparatus further comprises a movement policy model training module configured to obtain historical interaction data generated by historical accounts controlling virtual objects to interact in the virtual interaction scene; perform state feature extraction on the historical interaction data to obtain historical state feature data carrying target position labels; and train the movement policy model based on the historical state feature data.
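A possible shape of this supervised training step is sketched below, assuming the target positions are discretized into a fixed set of points; the feature dimension, the number of target points and the network structure are hypothetical.

```python
# Illustrative supervised training of the movement policy model on labeled history.
import torch
import torch.nn as nn

states = torch.randn(512, 128)                  # historical state features
target_labels = torch.randint(0, 16, (512,))    # target-point labels (16 hypothetical points)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):                          # a few illustrative epochs
    optimizer.zero_grad()
    loss = loss_fn(model(states), target_labels)
    loss.backward()
    optimizer.step()
```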
In one embodiment, the apparatus further comprises a simulation benefit determination module configured to obtain the latest data category discrimination model, the data category discrimination model being obtained by training on interaction sample data carrying data category labels; obtain target interaction data based on the state feature and the interaction action, and input the target interaction data into the latest data category discrimination model to obtain a data discrimination category; and obtain the data category simulation benefit of the target interaction data according to the data discrimination category. The model update module 1210 is further configured to continue training after updating the interaction model to be trained according to the state feature, the target position, the interaction action, the interaction benefit, the movement guidance benefit and the data category simulation benefit, until a trained interaction model is obtained.
In one embodiment, the apparatus further comprises a discrimination model updating module configured to construct interaction sample data carrying data category labels according to the target interaction data and the historical interaction data, and to update the data category discrimination model based on the interaction sample data.
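A minimal sketch of such a discriminator update is given below; the 1:1 mixing ratio, the feature dimension and the network size are assumptions made for illustration.

```python
# Illustrative discriminator update on mixed AI/expert samples with category labels.
import torch
import torch.nn as nn

ai_samples = torch.randn(256, 64)        # target interaction data produced by the AI
expert_samples = torch.randn(256, 64)    # historical (expert) interaction data

x = torch.cat([ai_samples, expert_samples])
y = torch.cat([torch.zeros(256), torch.ones(256)])   # data category labels: 0 = AI, 1 = expert

disc = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
loss = nn.BCEWithLogitsLoss()(disc(x).squeeze(-1), y)
opt.zero_grad(); loss.backward(); opt.step()
```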
In one embodiment, the apparatus further comprises a self-play control module configured to construct the virtual interaction scene through a first processor and control the virtual object to perform self-play interaction in the virtual interaction scene, and to acquire the state feature, the target position, the interaction action, the interaction benefit and the movement guidance benefit through the first processor during the self-play interaction. The model updating module 1210 is further configured to continue training, through a second processor, after updating the interaction model to be trained based on the state feature, the target position, the interaction action, the interaction benefit and the movement guidance benefit, until a trained interaction model is obtained.
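The split between the first processor (sample generation) and the second processor (model update) can be illustrated, under strong simplifications, with a producer/consumer pattern; everything in this sketch, including the queue-based hand-off, is an assumption rather than the disclosed implementation.

```python
# Toy producer/consumer split: self-play samples are produced in one process
# and consumed by a separate update process.
import multiprocessing as mp

def actor(queue):
    for step in range(10):                       # self-play rollout on the first processor
        queue.put({"state": step, "target": step % 4, "action": step % 3,
                   "benefit": 1.0, "guide_benefit": 0.1})
    queue.put(None)                              # sentinel marking the end of the rollout

def learner(queue):
    while (sample := queue.get()) is not None:   # model update loop on the second processor
        pass                                     # e.g. accumulate a batch and run a PPO step

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=actor, args=(q,))
    p.start(); learner(q); p.join()
```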
In one embodiment, the apparatus further comprises a prediction benefit obtaining module configured to obtain the prediction benefit output by the movement policy model when the state feature is input into the movement policy model. The model update module 1210 is further configured to continue training after updating the interaction model to be trained based on the state feature, the target position, the interaction action, the interaction benefit, the movement guidance benefit and the prediction benefit, until a trained interaction model is obtained.
In one embodiment, the status feature obtaining module 1202 is further configured to obtain context awareness data of the virtual object in the virtual interaction scene, situation data of the virtual interaction scene, and inter-object interaction data of the virtual object; and respectively extracting the characteristics of the environment perception data, the situation data and the interaction data among the objects to obtain the state characteristics of the virtual interaction scene where the virtual object is located.
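One way to realize this feature extraction and fusion is sketched below; the encoder structures and feature dimensions are hypothetical and serve only to illustrate concatenating the three feature groups into one state feature.

```python
# Illustrative fusion of environment-perception, situation and inter-object data.
import torch
import torch.nn as nn

ray_data = torch.randn(1, 64)                 # environment perception (e.g. ray distances)
situation_map = torch.randn(1, 1, 32, 32)     # minimap-style situation image
inter_object = torch.randn(1, 16)             # interaction data between virtual objects

ray_enc = nn.Linear(64, 32)
map_enc = nn.Sequential(nn.Conv2d(1, 4, 3, stride=2), nn.ReLU(), nn.Flatten(),
                        nn.Linear(4 * 15 * 15, 32))
obj_enc = nn.Linear(16, 16)

state_feature = torch.cat([ray_enc(ray_data),
                           map_enc(situation_map),
                           obj_enc(inter_object)], dim=-1)   # shape (1, 80)
```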
Each module in the above interaction model processing apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server or a terminal, and the internal structure of which may be as shown in fig. 13. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store data in the interaction model training. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an interaction model processing method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 13 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that all or part of the processes in the above method embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, database or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory and the like. The volatile memory may include random access memory (RAM), external cache memory and the like. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The foregoing embodiments merely illustrate several implementations of the application and are described in relative detail, but they are not therefore to be construed as limiting the scope of the application. It should be noted that several variations and modifications may be made by those of ordinary skill in the art without departing from the concept of the application, all of which fall within the protection scope of the application. Therefore, the protection scope of the application shall be subject to the appended claims.

Claims (16)

1. A method of interaction model processing, the method comprising:
acquiring state characteristics of a virtual interaction scene where a virtual object is located;
inputting the state characteristics into a movement strategy model to obtain a target position to which the virtual object is to be moved from the position; the mobile strategy model is obtained by training based on historical interaction data obtained by interaction in the virtual interaction scene;
inputting the state characteristics and the target position into an interaction model to be trained for interactive operation mapping, and obtaining an interaction action to be executed by the virtual object at the position;
obtaining interactive benefits obtained by the virtual object executing the interactive action, and obtaining movement guiding benefits obtained when the virtual object moves from the located position to the target position;
and based on the state characteristics, the target positions, the interaction actions, the interaction benefits and the mobile guidance benefits, continuing training after updating the interaction model to be trained, until a trained interaction model is obtained.
2. The method according to claim 1, wherein the method further comprises:
controlling the virtual object to move from the located position to the target position;
the obtaining the movement guiding benefit obtained when the virtual object moves from the located position to the target position comprises the following steps:
when a movement judgment condition is met, determining an intermediate position reached when the virtual object moves from the located position to the target position;
determining a distance difference between the intermediate position and the target position;
and obtaining movement guiding benefits according to the distance difference mapping.
3. The method of claim 2, wherein the controlling the virtual object to move from the located position to the target position comprises:
determining a moving path of the virtual object according to the target position and the position;
extracting features of the moving path to obtain path features of the moving path;
and controlling the virtual object to move from the located position to the target position according to the path characteristics.
4. The method according to claim 1, wherein the method further comprises:
controlling the virtual object to execute the interaction action;
the obtaining the interactive benefits obtained by executing the interactive actions by the virtual objects comprises the following steps:
obtaining local benefits and global benefits obtained by the virtual object executing the interaction action;
and obtaining interactive benefits according to the local benefits and the global benefits.
5. The method of claim 4, wherein the obtaining an interactive benefit from the local benefit and the global benefit comprises:
calculating to obtain local weighted benefits according to the local benefits and the local benefit weights;
calculating according to the global benefit and the global benefit weight to obtain global weighted benefit;
and obtaining interactive benefits based on the local weighted benefits and the global weighted benefits.
6. The method of claim 1, wherein the updating the interaction model to be trained based on the status feature, the target location, the interaction action, the interaction benefit, and the movement guidance benefit continues training until a trained interaction model is obtained, comprising:
determining a target loss value based on the status feature, the target location, the interaction benefit, and the mobile guidance benefit;
updating the model parameters of the interaction model to be trained according to the target loss value to obtain an updated interaction model;
and continuing training through the updated interaction model until the training ending condition is met, and obtaining the interaction model after training is completed.
7. The method according to claim 1, wherein the method further comprises:
acquiring historical interaction data obtained by controlling a virtual object to interact by a historical account in the virtual interaction scene;
extracting state characteristics of the historical interaction data to obtain historical state characteristic data carrying a target position label;
training based on the historical state characteristic data to obtain the movement strategy model.
8. The method according to claim 1, wherein the method further comprises:
acquiring a latest data category discrimination model; the data category judging model is obtained by training based on interactive sample data carrying data category labels;
obtaining target interaction data based on the state characteristics and the interaction actions, and inputting the target interaction data into a latest data category discrimination model to obtain a data discrimination category;
obtaining data category simulation benefits of the target interaction data according to the data judgment categories;
and continuing training after updating the interaction model to be trained based on the state characteristics, the target position, the interaction action, the interaction benefits and the mobile guidance benefits until a trained interaction model is obtained, wherein the method comprises the following steps of:
and according to the state characteristics, the target positions, the interaction actions, the interaction benefits, the mobile guiding benefits and the data category simulation benefits, continuing training after updating the interaction model to be trained, until a trained interaction model is obtained.
9. The method of claim 8, wherein the method further comprises:
according to the target interaction data and the historical interaction data, constructing interaction sample data carrying data category labels;
and based on the interaction sample data, carrying out model updating on the data category discrimination model.
10. The method according to claim 1, wherein the method further comprises:
constructing the virtual interaction scene through a first processor, and controlling the virtual object to perform self-playing interaction in the virtual interaction scene;
acquiring the state characteristics, the target positions, the interaction actions, the interaction benefits and the mobile guiding benefits through the first processor in the self-playing interaction process;
and continuing training after updating the interaction model to be trained based on the state characteristics, the target position, the interaction action, the interaction benefits and the mobile guidance benefits until a trained interaction model is obtained, wherein the method comprises the following steps of:
and through a second processor, based on the state characteristics, the target positions, the interaction actions, the interaction benefits and the movement guiding benefits, continuing training after updating the interaction model to be trained until a trained interaction model is obtained.
11. The method according to claim 1, wherein the method further comprises:
when the state characteristics are input into the movement strategy model, obtaining the prediction benefits output by the movement strategy model;
and continuing training after updating the interaction model to be trained based on the state characteristics, the target position, the interaction action, the interaction benefits and the mobile guidance benefits until a trained interaction model is obtained, wherein the method comprises the following steps of:
and based on the state characteristics, the target positions, the interaction actions, the interaction benefits, the mobile guiding benefits and the prediction benefits, continuing training after updating the interaction model to be trained, until a trained interaction model is obtained.
12. The method according to any one of claims 1 to 11, wherein the obtaining the state characteristics of the virtual interaction scene in which the virtual object is located includes:
acquiring environment perception data of a virtual object in a virtual interaction scene, situation data of the virtual interaction scene and inter-object interaction data of the virtual object;
and respectively extracting the characteristics of the environment perception data, the situation data and the interaction data among the objects to obtain the state characteristics of the virtual interaction scene where the virtual object is located.
13. An interaction model processing apparatus, the apparatus comprising:
the state characteristic acquisition module is used for acquiring state characteristics of a virtual interaction scene where the virtual object is located;
the target position obtaining module is used for inputting the state characteristics into a movement strategy model to obtain a target position to which the virtual object is to be moved from the position; the mobile strategy model is obtained by training based on historical interaction data obtained by interaction in the virtual interaction scene;
the interactive action obtaining module is used for inputting the state characteristics and the target position into an interactive model to be trained for interactive operation mapping, and obtaining the interactive action to be executed by the virtual object at the position;
the profit acquisition module is used for acquiring interactive profit obtained by the virtual object executing the interactive action and acquiring movement guiding profit obtained when the virtual object moves from the located position to the target position;
and the model updating module is used for continuing training after updating the interaction model to be trained based on the state characteristics, the target positions, the interaction actions, the interaction benefits and the mobile guiding benefits, until a trained interaction model is obtained.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 12 when the computer program is executed.
15. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 12.
16. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 12.
CN202310070091.5A (filed 2023-01-11, priority date 2023-01-11): Interaction model processing method, device, computer equipment and storage medium. Status: Pending. Publication: CN116966573A.

Priority Applications (1)

Application Number: CN202310070091.5A; Priority Date: 2023-01-11; Filing Date: 2023-01-11; Title: Interaction model processing method, device, computer equipment and storage medium

Publications (1)

Publication Number: CN116966573A; Publication Date: 2023-10-31

Family

ID=88475506

Family Applications (1)

Application Number: CN202310070091.5A; Title: Interaction model processing method, device, computer equipment and storage medium; Priority Date: 2023-01-11; Filing Date: 2023-01-11; Status: Pending (CN116966573A)

Country Status (1)

Country: CN; Document: CN116966573A (en)


Legal Events

PB01: Publication

REG: Reference to a national code; Ref country code: HK; Ref legal event code: DE; Ref document number: 40098089; Country of ref document: HK