CN118097082A - Virtual object image generation method, device, computer equipment and storage medium

Virtual object image generation method, device, computer equipment and storage medium

Info

Publication number
CN118097082A
Authority
CN
China
Prior art keywords: noise, virtual object, scale, downsampling, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410509837.2A
Other languages
Chinese (zh)
Other versions
CN118097082B (en)
Inventor
冯鑫 (Feng Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410509837.2A
Priority claimed from CN202410509837.2A
Publication of CN118097082A
Application granted
Publication of CN118097082B
Legal status: Active

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application relates to a virtual object image generation method, apparatus, computer device, and storage medium. The method comprises the following steps: acquiring an action gesture image that specifies the action gesture with which at least two virtual objects interact, and a first virtual object image comprising the at least two virtual objects; extracting virtual object features from the first virtual object image to obtain multi-scale virtual object features, and extracting action gesture features from the action gesture image to obtain multi-scale action gesture features; acquiring sampling noise features for generating a virtual object image, and performing multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain noise-reduced virtual object image features; and decoding the noise-reduced virtual object image features to obtain a second virtual object image comprising the at least two virtual objects interacting according to the action gesture. This method improves the quality of generated virtual object images.

Description

Virtual object image generation method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to a virtual object image generating method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of computer technology, virtual object image generation techniques have emerged that can generate an image containing multiple virtual objects, such as a poster featuring several virtual objects.
In the conventional approach, a virtual object image is commonly generated as follows: a separate image generation model is obtained for each virtual object by learning from images of that object; each model generates an image of its own virtual object according to an input action gesture; and the per-object images are then matted and stitched together to obtain the final virtual object image.
However, this approach depends on how faithfully each per-object generation model reproduces its virtual object and on the performance of the matting model. Preserving the action gesture while restoring each virtual object also creates compatibility problems during generation, so the resulting virtual object images are often of poor quality.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a virtual object image generation method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve the quality of generated virtual object images.
In a first aspect, the present application provides a virtual object image generation method. The method comprises the following steps:
Acquiring an action gesture image and a first virtual object image comprising at least two virtual objects; the action gesture image specifies the action gesture with which the at least two virtual objects interact;
extracting virtual object features from the first virtual object image to obtain multi-scale virtual object features, and extracting action gesture features from the action gesture image to obtain multi-scale action gesture features;
acquiring sampling noise features for generating a virtual object image, and performing multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain noise-reduced virtual object image features;
and decoding the noise-reduced virtual object image features to obtain a second virtual object image; the second virtual object image comprises the at least two virtual objects interacting according to the action gesture.
In one embodiment, the action gesture image is obtained by:
acquiring an interaction object image that specifies the action gesture with which the at least two virtual objects interact, the number of interacting objects in the interaction object image being the same as the number of virtual objects;
and extracting the action gestures of the interacting objects from the interaction object image to obtain the action gesture image.
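By way of illustration, a minimal sketch of this step in Python. The pose_estimator object, its detect method, and the draw_skeleton helper are hypothetical stand-ins for whatever keypoint detector and skeleton renderer an implementation uses; the patent does not name one:

```python
import numpy as np

def build_action_gesture_image(interaction_image: np.ndarray, pose_estimator) -> np.ndarray:
    """Render the interacting objects' detected poses onto a blank canvas."""
    # `pose_estimator.detect` is a hypothetical call returning one set of 2D
    # joint coordinates per interacting object found in the image.
    keypoint_sets = pose_estimator.detect(interaction_image)
    canvas = np.zeros_like(interaction_image)  # same size as the interaction object image
    for keypoints in keypoint_sets:
        draw_skeleton(canvas, keypoints)       # hypothetical helper: draw joints and limbs
    return canvas                              # this rendering is the action gesture image

def draw_skeleton(canvas: np.ndarray, keypoints) -> None:
    """Hypothetical placeholder: connect joint coordinates with line segments."""
    ...
```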
In one embodiment, the second virtual object image is determined by a pre-trained virtual object image inference model; the virtual object image inference model comprises a virtual object reference control network, an encoder, a noise reduction network, an action gesture reference control network, and a decoder;
the virtual object reference control network extracts virtual object features from the first virtual object image to obtain the multi-scale virtual object features;
the action gesture reference control network extracts action gesture features from the action gesture image to obtain the multi-scale action gesture features;
the encoder acquires the sampling noise features for generating a virtual object image;
the noise reduction network performs multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain the noise-reduced virtual object image features;
and the decoder decodes the noise-reduced virtual object image features to obtain the second virtual object image.
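For orientation, a compact sketch of how these five components could be wired together at inference time. This is a latent-diffusion-style arrangement assumed for illustration, not the patent's definitive implementation; every class, shape, and the deliberately simplified per-step update are assumptions (a standard sampler update appears in a later section):

```python
import torch
import torch.nn as nn

class VirtualObjectImageInferenceModel(nn.Module):
    """Illustrative composition of the five components named above."""

    def __init__(self, object_control_net, pose_control_net, denoiser, decoder,
                 latent_shape=(4, 64, 64)):
        super().__init__()
        self.object_control_net = object_control_net  # -> multi-scale virtual object features
        self.pose_control_net = pose_control_net      # -> multi-scale action gesture features
        self.denoiser = denoiser                      # noise reduction network
        self.decoder = decoder                        # maps denoised features back to pixels
        self.latent_shape = latent_shape

    @torch.no_grad()
    def generate(self, first_object_image, gesture_image, num_steps=50):
        obj_feats = self.object_control_net(first_object_image)  # one feature per scale
        pose_feats = self.pose_control_net(gesture_image)        # one feature per scale
        # Encoder role: obtain a sampling noise feature for the image to be generated.
        x = torch.randn(1, *self.latent_shape)
        for t in reversed(range(num_steps)):                     # multi-step noise reduction
            predicted_noise = self.denoiser(x, t, obj_feats, pose_feats)
            x = x - predicted_noise / num_steps                  # deliberately simplified update
        return self.decoder(x)                                   # second virtual object image
```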
In one embodiment, the virtual object image inference model is obtained by a training step comprising:
acquiring a plurality of training samples;
and, for each training sample, training an initial image inference model on the sample object image and the sample gesture image in that training sample to obtain the virtual object image inference model; the sample object image comprises at least two sample objects, and the sample gesture image is obtained by extracting the action gestures from the sample object image.
In one embodiment, acquiring the plurality of training samples comprises:
acquiring initial avatar images of a plurality of sample objects, and, for each sample object, extracting a target avatar image of the sample object from its initial avatar image;
combining the target avatar images of the plurality of sample objects into multiple position layouts to obtain a plurality of sample object images (see the sketch after this list);
extracting action gestures from each of the plurality of sample object images to obtain a sample gesture image corresponding to each sample object image;
and, for each sample object image, taking the sample object image and its corresponding sample gesture image as one training sample.
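A sketch of this sample-construction pipeline, reusing the hypothetical pose-extraction helper from the earlier sketch; the matting_model and compose_layout helpers are likewise hypothetical stand-ins:

```python
def compose_layout(avatars):
    """Hypothetical placeholder: paste the avatars into one image in a chosen position layout."""
    ...

def build_training_samples(initial_avatar_images, num_layouts, matting_model, pose_estimator):
    """Construct (sample object image, sample gesture image) training pairs."""
    # 1. Extract each sample object's target avatar from its initial avatar image.
    target_avatars = [matting_model.extract_subject(img) for img in initial_avatar_images]
    samples = []
    for _ in range(num_layouts):
        # 2. Compose the avatars into one image under a (e.g. randomly chosen) position layout.
        sample_object_image = compose_layout(target_avatars)
        # 3. Extract the action gestures to obtain the paired sample gesture image.
        sample_gesture_image = build_action_gesture_image(sample_object_image, pose_estimator)
        # 4. The pair forms one training sample.
        samples.append((sample_object_image, sample_gesture_image))
    return samples
```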
In one embodiment, before training the initial image inference model, the method further comprises:
pre-training the initial image inference model based on at least one of the initial avatar images of the plurality of sample objects and the plurality of sample object images.
In a second aspect, the application further provides a virtual object image generation device. The device comprises:
an image acquisition module, configured to acquire an action gesture image and a first virtual object image comprising at least two virtual objects; the action gesture image specifies the action gesture with which the at least two virtual objects interact;
a feature extraction module, configured to extract virtual object features from the first virtual object image to obtain multi-scale virtual object features, and to extract action gesture features from the action gesture image to obtain multi-scale action gesture features;
a noise reduction module, configured to acquire sampling noise features for generating a virtual object image, and to perform multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain noise-reduced virtual object image features;
and a decoding module, configured to decode the noise-reduced virtual object image features to obtain a second virtual object image; the second virtual object image comprises the at least two virtual objects interacting according to the action gesture.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the following steps:
acquiring an action gesture image and a first virtual object image comprising at least two virtual objects; the action gesture image specifies the action gesture with which the at least two virtual objects interact;
extracting virtual object features from the first virtual object image to obtain multi-scale virtual object features, and extracting action gesture features from the action gesture image to obtain multi-scale action gesture features;
acquiring sampling noise features for generating a virtual object image, and performing multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain noise-reduced virtual object image features;
and decoding the noise-reduced virtual object image features to obtain a second virtual object image; the second virtual object image comprises the at least two virtual objects interacting according to the action gesture.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the following steps:
acquiring an action gesture image and a first virtual object image comprising at least two virtual objects; the action gesture image specifies the action gesture with which the at least two virtual objects interact;
extracting virtual object features from the first virtual object image to obtain multi-scale virtual object features, and extracting action gesture features from the action gesture image to obtain multi-scale action gesture features;
acquiring sampling noise features for generating a virtual object image, and performing multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain noise-reduced virtual object image features;
and decoding the noise-reduced virtual object image features to obtain a second virtual object image; the second virtual object image comprises the at least two virtual objects interacting according to the action gesture.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the following steps:
acquiring an action gesture image and a first virtual object image comprising at least two virtual objects; the action gesture image specifies the action gesture with which the at least two virtual objects interact;
extracting virtual object features from the first virtual object image to obtain multi-scale virtual object features, and extracting action gesture features from the action gesture image to obtain multi-scale action gesture features;
acquiring sampling noise features for generating a virtual object image, and performing multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain noise-reduced virtual object image features;
and decoding the noise-reduced virtual object image features to obtain a second virtual object image; the second virtual object image comprises the at least two virtual objects interacting according to the action gesture.
According to the virtual object image generation method, apparatus, computer device, storage medium, and computer program product, after the action gesture image and the first virtual object image comprising at least two virtual objects are acquired, extracting virtual object features from the first virtual object image yields multi-scale virtual object features, and extracting action gesture features from the action gesture image yields multi-scale action gesture features. After the sampling noise features for generating a virtual object image are acquired, multi-step noise reduction is performed on them based on the multi-scale virtual object features and the multi-scale action gesture features: the multi-scale virtual object features provide reference information for the virtual objects' appearance, and the multi-scale action gesture features provide reference information for the action gesture of their interaction, so the resulting noise-reduced virtual object image features accurately represent both the virtual objects and the action gesture of their interaction. Decoding these features then yields the second virtual object image. Throughout the process, the action gesture image and the first virtual object image serve as control signals, and the multi-scale virtual object features and multi-scale action gesture features serve as reference information for refining the generated image, thereby improving the quality of the generated virtual object image.
Drawings
FIG. 1 is an application environment diagram of a virtual object image generation method in one embodiment;
FIG. 2 is a flow diagram of a virtual object image generation method in one embodiment;
FIG. 3 is a schematic diagram of a first virtual object image in one embodiment;
FIG. 4 is a flow diagram of noise prediction in one embodiment;
FIG. 5 is a schematic diagram of a multi-level downsampling process and upsampling process in one embodiment;
FIG. 6 is a schematic diagram of a multi-level downsampling process in one embodiment;
FIG. 7 is a schematic diagram of feature fusion in one embodiment;
FIG. 8 is a schematic diagram of feature fusion in another embodiment;
FIG. 9 is a schematic diagram of feature fusion in yet another embodiment;
FIG. 10 is a schematic diagram of a multi-level upsampling process in one embodiment;
FIG. 11 is a schematic diagram of virtual object feature extraction in one embodiment;
FIG. 12 is a schematic illustration of an initial avatar image in one embodiment;
FIG. 13 is a schematic illustration of an object representation in one embodiment;
FIG. 14 is a schematic view of a first virtual object image according to another embodiment;
FIG. 15 is a schematic view of a first virtual object image in yet another embodiment;
FIG. 16 is a schematic diagram of motion gesture extraction in one embodiment;
FIG. 17 is a schematic diagram of a virtual object image inference model in one embodiment;
FIG. 18 is a schematic diagram of a noise reducer in one embodiment;
FIG. 19 is a schematic diagram of a virtual object reference control network in one embodiment;
FIG. 20 is a schematic diagram of virtual object reference control network interactions with a noise reducer in one embodiment;
FIG. 21 is a schematic diagram of virtual object reference control network, noise reducer, and motion gesture reference control network interactions in one embodiment;
FIG. 22 is a flow diagram of training an initial image inference model in one embodiment;
FIG. 23 is a flow diagram of constructing training samples in one embodiment;
FIG. 24 is an overall frame diagram of a virtual object image generation method in one embodiment;
FIG. 25 is a schematic flow chart of constructing training samples in another embodiment;
FIG. 26 is a diagram of a model architecture of a stable diffusion model in one embodiment;
FIG. 27 is a schematic diagram of the structure and feature combination of an initial object control network in one embodiment;
FIG. 28 is a schematic diagram of the structure and feature combination of an initial attitude control network in one embodiment;
FIG. 29 is a flow diagram of the inference phase generating a second virtual object image in one embodiment;
FIG. 30 is an exemplary diagram of an interactive object image and a second virtual object image in one embodiment;
FIG. 31 is a block diagram showing the configuration of a virtual object image generating apparatus in one embodiment;
FIG. 32 is an internal structural view of a computer device in one embodiment.
Detailed Description
The application relates to the technical field of artificial intelligence. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the ability to perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training models, operation/interaction systems, mechatronics, and the like. A pre-training model, also called a large model or foundation model, can be fine-tuned and then widely applied to downstream tasks in all major directions of artificial intelligence. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The scheme provided by the embodiments of the application relates to artificial intelligence technologies such as machine learning/deep learning. Machine Learning (ML) is a multidisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. Pre-training models are the latest development of deep learning and integrate these techniques.
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The virtual object image generation method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be provided separately, may be integrated on the server 104, or may be located on a cloud or other server.
The server 104 acquires an action gesture image and a first virtual object image comprising at least two virtual objects, where the action gesture image specifies the action gesture with which the at least two virtual objects interact. The server extracts virtual object features from the first virtual object image to obtain multi-scale virtual object features, and extracts action gesture features from the action gesture image to obtain multi-scale action gesture features. It then acquires sampling noise features for generating a virtual object image, performs multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain noise-reduced virtual object image features, and decodes the noise-reduced virtual object image features to obtain a second virtual object image comprising the at least two virtual objects interacting according to the action gesture. Finally, the server pushes the second virtual object image to the terminal 102 for display.
The terminal 102 may be, but not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers, or may be a cloud server.
In one embodiment, as shown in fig. 2, a virtual object image generation method is provided, which may be performed by a terminal or a server alone or in conjunction with the terminal and the server. In the embodiment of the application, the method is applied to a server for illustration, and comprises the following steps:
Step 202, acquiring an action gesture image and a first virtual object image comprising at least two virtual objects; the action gesture image is used to specify the action gesture with which the at least two virtual objects interact.
The action gesture image specifies the action gesture with which at least two virtual objects interact. An action gesture refers to the posture of a virtual object in a virtual object image and may include basic postures (such as standing, sitting, squatting, and kneeling), arm and hand movements (such as elbow flexion, elbow extension, finger flexion, finger extension, and making a fist), leg movements (such as knee flexion, knee extension, abduction, and adduction), trunk movements (such as flexion, extension, and lateral flexion), head movements (such as tilting the head sideways, looking up, and lowering the head), and so on.
Wherein the virtual object refers to a movable object in the virtual environment, and the movable object may be a virtual character, a virtual animal, or the like. For example, when the virtual environment is a three-dimensional virtual environment, the virtual object is a virtual character, a virtual animal, or the like displayed in the three-dimensional virtual environment, and the virtual object has its own shape and volume in the three-dimensional virtual environment and occupies a part of the space in the three-dimensional virtual environment. A virtual environment is an environment provided by a client when running on a terminal. The virtual environment may be a simulation environment for the real world, a semi-simulation and semi-imaginary environment, or a pure imaginary environment. For example, the virtual environment may be a three-dimensional virtual environment.
The first virtual object image refers to an image including at least two virtual objects, and it can be understood that in the first virtual object image, the at least two virtual objects may be arranged in any position, that is, the action gesture of the at least two virtual objects is not limited in the first virtual object image, and the first virtual object image is mainly used as reference information for generating the image of the virtual object. For example, taking at least two virtual objects as two virtual objects, as shown in fig. 3, in the first virtual object image, the two virtual objects (virtual object a and virtual object B respectively as shown in fig. 3) may be arranged in a left-right position, where the virtual object a may be on the left side of the virtual object B or on the right side of the virtual object B.
Specifically, when the virtual object image generation is required, the server acquires an action gesture image and a first virtual object image including at least two virtual objects, so as to generate a second virtual object image according to at least two virtual objects specified in the first virtual object image and an action gesture of interaction of at least two virtual objects specified by the action gesture image.
Step 204, performing virtual object feature extraction on the first virtual object image to obtain multi-scale virtual object features, and performing motion gesture feature extraction on the motion gesture graph to obtain multi-scale motion gesture features.
The virtual object feature refers to a feature capable of representing at least two virtual objects in the first virtual object image, and the at least two virtual objects in the first virtual object image can be distinguished from other virtual objects by the virtual object feature. For example, the virtual object feature may specifically refer to a feature vector capable of representing at least two virtual objects in the first virtual object image. The multi-scale virtual object features refer to features obtained by feature extraction of the first virtual object image from a plurality of scales, and the granularity level of each scale virtual object feature in the multi-scale virtual object features is different.
The motion gesture feature is a feature that can represent a motion gesture in the motion gesture map, and the motion gesture in the motion gesture map can be distinguished from other motion gestures by the motion gesture feature. For example, the motion gesture feature may specifically be a feature vector that can represent a motion gesture in the motion gesture graph. The multi-scale motion gesture features refer to features obtained by extracting features from motion gesture graphs from multiple scales, and the granularity level of each scale motion gesture feature in the multi-scale motion gesture features is different.
Specifically, after the first virtual object image and the motion gesture image are obtained, the server performs multi-scale virtual object feature extraction on the first virtual object image to obtain multi-scale virtual object features, and performs multi-scale motion gesture feature extraction on the motion gesture image to obtain multi-scale motion gesture features, so that the multi-scale virtual object features and the multi-scale motion gesture features are used as references to generate a second virtual object image.
In a specific application, the multi-scale virtual object feature extraction and the multi-scale action gesture feature extraction can be realized through multi-level sampling processing, wherein the number of levels and the sampling mode of the sampling processing can be configured according to an actual application scene when the multi-level sampling processing is performed. For example, the sampling mode may specifically be at least one of upsampling and downsampling.
In a specific application, the multi-scale virtual object features include multi-scale downsampled object features and multi-scale upsampled object features, the multi-scale downsampled object features are obtained by performing multi-level downsampling processing on the first virtual object image, and the multi-scale upsampled object features are obtained by performing multi-level upsampling processing based on the minimum-scale downsampled object features. The multi-scale motion gesture feature can be obtained by performing multi-level downsampling processing on the motion gesture graph.
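As a minimal sketch of one route to such features, assuming a PyTorch implementation and illustrative channel widths, a control network can collect one feature map per downsampling level, each at half the previous resolution:

```python
import torch.nn as nn

class MultiScaleFeatureExtractor(nn.Module):
    """Collect one feature map per downsampling level (coarser at each level)."""

    def __init__(self, in_channels=3, channels=(64, 128, 256, 512)):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.SiLU())
            for c_in, c_out in zip((in_channels,) + channels[:-1], channels))

    def forward(self, image):
        features = []
        x = image
        for level in self.levels:
            x = level(x)          # halve the spatial resolution
            features.append(x)    # one scale per level: 1/2, 1/4, 1/8, 1/16
        return features           # multi-scale features, finest to coarsest
```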
Step 206, obtaining sampling noise characteristics for generating the virtual object image, and performing multi-step noise reduction processing on the sampling noise characteristics based on the multi-scale virtual object characteristics and the multi-scale action gesture characteristics to obtain the noise-reduced virtual object image characteristics.
The sampling noise characteristics refer to characteristics of sampling noise signals obtained by a random sampling method when a virtual object image is to be generated. For example, the sampling noise signal may be specifically a gaussian noise signal, and the sampling noise characteristic may be obtained by encoding the sampling noise signal obtained by a random sampling manner. The noise reduction process refers to removing noise in the sampled noise features. The virtual object image feature refers to a feature capable of representing a virtual object image that needs to be generated, i.e., a feature of a second virtual object image.
Specifically, the server obtains a sampling noise signal for generating the virtual object image in a random sampling manner, encodes the sampling noise signal to obtain sampling noise characteristics, and performs multi-step noise reduction processing on the sampling noise characteristics based on the multi-scale virtual object characteristics and the multi-scale action posture characteristics to obtain the noise-reduced virtual object image characteristics. In a specific application, the server takes the sampled noise feature as a noise feature subjected to multi-step noise addition, predicts a noise signal added in each step of multi-step noise addition based on the multi-scale virtual object feature and the multi-scale action gesture feature, and gradually carries out noise reduction processing on the sampled noise feature based on the noise signal added in each step, so as to obtain a noise-reduced virtual object image feature from the sampled noise feature.
It should be noted that the multi-scale virtual object features and the multi-scale motion gesture features exist as control signals for generating the noise-reduced virtual object image features, where the multi-scale virtual object features are used to provide reference information for the virtual object image, and the multi-scale motion gesture features are used to provide reference information for the motion gesture of the virtual object interaction, so that the noise-reduced virtual object image features that accurately represent the virtual object image and the motion gesture of the virtual object interaction can be obtained.
Step 208, decoding the noise-reduced virtual object image features to obtain a second virtual object image; the second virtual object image includes the at least two virtual objects interacting according to the action gesture.
Specifically, the server decodes the noise-reduced image features of the virtual object to obtain a second virtual object image, where the second virtual object image includes at least two virtual objects that interact according to the motion gesture. In a specific application, the image features of the virtual object after noise reduction are decoded, that is, the image features of the virtual object after noise reduction are converted back into the image space of the virtual object in a mapping mode.
According to the virtual object image generation method, after the action gesture image and the first virtual object image comprising at least two virtual objects are acquired, extracting virtual object features from the first virtual object image yields multi-scale virtual object features, and extracting action gesture features from the action gesture image yields multi-scale action gesture features. After the sampling noise features for generating a virtual object image are acquired, multi-step noise reduction is performed on them based on the multi-scale virtual object features and the multi-scale action gesture features: the multi-scale virtual object features provide reference information for the virtual objects' appearance, and the multi-scale action gesture features provide reference information for the action gesture of their interaction, so the resulting noise-reduced virtual object image features accurately represent both. Decoding these features then yields the second virtual object image. Throughout the process, the action gesture image and the first virtual object image serve as control signals, and the multi-scale virtual object features and multi-scale action gesture features serve as reference information for refining the generated image, thereby improving the quality of the generated virtual object image.
In one embodiment, performing multi-step noise reduction on the sampling noise features based on the multi-scale virtual object features and the multi-scale action gesture features to obtain the noise-reduced virtual object image features comprises:
treating the sampling noise features as noise features that have undergone multi-step noise addition;
starting from the last noise-adding step, denoising the noise features input at each step in reverse order based on the multi-scale virtual object features and the multi-scale action gesture features;
and taking the noise-reduction feature obtained by denoising the noise feature input at the first step as the noise-reduced virtual object image features.
Specifically, when performing the multi-step noise reduction, the server treats the sampling noise features as noise features that have undergone multi-step noise addition and, starting from the last noise-adding step, uses the multi-scale virtual object features and the multi-scale action gesture features as reference information to denoise the noise features input at each step in reverse order; the noise-reduction feature obtained by denoising the noise feature input at the first step is taken as the noise-reduced virtual object image features.
In a specific application, the noise feature input at the last noise-adding step is the sampling noise feature; at each earlier step, the noise feature input is the noise-reduction feature output by the preceding denoising step. For each of the noise-adding steps, the multi-scale virtual object features and the multi-scale action gesture features serve as reference information: the noise added at the targeted noise-adding step is predicted from the multi-scale virtual object features, the multi-scale action gesture features, and the noise feature input at that step, and the input noise feature is then denoised according to the predicted added noise to obtain the noise-reduction feature.
In a specific application, the denoising performed at each step can be implemented with a pre-trained noise reducer, which can be configured and trained according to the actual application scenario. Assuming the sampling noise features are noise features that have undergone T noise-adding steps, then for each of the T steps the pre-trained noise reducer predicts the noise from the multi-scale virtual object features, the multi-scale action gesture features, and the noise feature input at the targeted noise-adding step, and the input noise feature is denoised using the predicted noise. It will be appreciated that, because the sampling noise features are treated as T-step-noised features, the pre-trained noise reducer is invoked T times during the multi-step noise reduction.
For example, the pre-trained noise reducer may be a noise reduction network based on the U-Net architecture. U-Net was one of the earlier algorithms to use a fully convolutional network for semantic segmentation; its symmetrical U-shaped structure, comprising a contracting path and an expanding path, was highly innovative at the time and influenced the design of several later segmentation networks, and its name derives from this U shape.
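To make the T-step procedure concrete, here is a DDPM-style reverse loop, one common way to realize it. The beta noise schedule and the exact update rule below are standard diffusion-model conventions assumed for illustration, not details stated in the patent:

```python
import torch

@torch.no_grad()
def multi_step_denoise(denoiser, x_T, obj_feats, pose_feats, betas):
    """Invoke the pre-trained noise reducer T times, from noise-adding step T down to step 1.

    x_T   : sampling noise feature, treated as the result of T noise-adding steps
    betas : per-step noise schedule of shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = x_T
    for t in reversed(range(len(betas))):            # last noise-adding step first
        eps = denoiser(x, t, obj_feats, pose_feats)  # predicted added noise for step t
        # Remove the predicted step-t noise (standard DDPM mean update).
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                    # re-inject schedule noise except at the first step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                         # noise-reduced virtual object image feature
```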
In this embodiment, the sampling noise features are treated as noise features that have undergone multi-step noise addition, and, starting from the last noise-adding step, the noise features input at each step are denoised in reverse order based on the multi-scale virtual object features and the multi-scale action gesture features. Gradual, accurate noise reduction can thus be achieved with the multi-scale virtual object features and multi-scale action gesture features as reference information, yielding the noise-reduced virtual object image features.
In one embodiment, for each step in the multi-step noise addition, denoising the noise feature input at that step comprises:
predicting the noise added at the targeted noise-adding step based on the multi-scale virtual object features, the multi-scale action gesture features, and the noise feature input at the targeted noise-adding step, to obtain the predicted added noise corresponding to the targeted noise-adding step;
and denoising the noise feature input at the targeted noise-adding step according to the predicted added noise to obtain the noise-reduction feature.
Specifically, for each step in the multi-step noise addition, the server predicts the noise added at the targeted noise-adding step based on the multi-scale virtual object features, the multi-scale action gesture features, and the noise feature input at that step, obtaining the predicted added noise for that step, and subtracts the predicted added noise from the input noise feature to obtain the noise-reduction feature.
In a specific application, when predicting the added noise for the targeted noise-adding step, the server first performs multi-level downsampling on the noise feature input at that step based on the multi-scale virtual object features and the multi-scale action gesture features, and then performs multi-level upsampling to obtain the predicted added noise for the targeted noise-adding step.
In this embodiment, predicting the added noise from the multi-scale virtual object features, the multi-scale action gesture features, and the noise feature input at the targeted noise-adding step yields the predicted added noise for that step; the input noise feature can then be denoised directly according to the predicted added noise to obtain the noise-reduction feature, realizing noise reduction by means of noise prediction.
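Written out, with $x_t$ denoting the noise feature input at noise-adding step $t$ and $\hat{\epsilon}_t$ the predicted added noise, the subtraction described above corresponds to one reverse-diffusion update. In the standard DDPM parameterization (an external convention shown for reference, not a formula stated in the patent), with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}_t\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),\quad \sigma_t^2 = \beta_t,$$

where the $\sigma_t z$ term is omitted at the first noise-adding step ($t = 1$), so that the final output is the noise-reduced virtual object image feature.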
In one embodiment, the multi-scale virtual object features include multi-scale downsampled object features and multi-scale upsampled object features; predicting the noise added at the targeted noise-adding step based on the multi-scale virtual object features, the multi-scale action gesture features, and the noise feature input at that step, to obtain the predicted added noise corresponding to the targeted noise-adding step, comprises:
performing multi-level downsampling on the noise feature input at the targeted noise-adding step based on the multi-scale downsampled object features and the multi-scale action gesture features, to obtain a target downsampled noise feature;
and performing multi-level upsampling on the target downsampled noise feature based on the multi-scale upsampled object features, to obtain the predicted added noise corresponding to the targeted noise-adding step.
The multi-scale virtual object features comprise multi-scale downsampled object features and multi-scale upsampled object features, the multi-scale downsampled object features are obtained by performing multi-level downsampling processing on the first virtual object image, and the multi-scale upsampled object features are obtained by performing multi-level upsampling processing on the basis of the minimum-scale downsampled object features.
Specifically, in the case of obtaining a multi-scale downsampling object feature and a multi-scale upsampling object feature, the process of performing noise prediction may be as shown in fig. 4, where the server may perform multi-level downsampling processing on the noise feature input by the targeted noise adding step based on the multi-scale downsampling object feature and the multi-scale action gesture feature to obtain a target downsampling noise feature, and then perform multi-level upsampling processing on the target downsampling noise feature based on the multi-scale upsampling object feature to obtain the corresponding predicted additive noise of the targeted noise adding step.
In a specific application, when the multi-level downsampling process and the multi-level upsampling process are performed, the number of layers of the downsampling process and the number of layers of the upsampling process can be configured according to an actual application scene, the number of layers of the downsampling process is the same as the number of layers of the upsampling process, and for each level of downsampling process, the feature scale of the downsampling noise feature output by the downsampling process is the same as the feature scale of the upsampling noise feature output by the corresponding upsampling process. For example, when the downsampled noise feature and the upsampled noise feature are both noise feature graphs, the feature scale may specifically refer to the resolution of the feature graphs, and the feature scale may specifically refer to the feature graphs having the same resolution.
Therefore, the corresponding prediction added noise of the aimed noise adding step obtained after the multi-level up-sampling processing is actually the same as the noise characteristic of the aimed noise adding step input, so that the prediction added noise can be directly subtracted from the noise characteristic of the aimed noise adding step input, and the noise reduction processing is realized.
In a specific application, the number of layers of the downsampling process and the number of layers of the upsampling process are both 4, and as shown in fig. 5, for the downsampling process of the first layer, the feature scale of the downsampled noise feature output by the downsampling process is the same as the feature scale of the upsampled noise feature output by the corresponding upsampling process (i.e., the upsampling process of the 4 th layer). For the downsampling process of the second level, the feature scale of the downsampled noise features output by the downsampling process is the same as the feature scale of the upsampled noise features output by the corresponding upsampling process (i.e., the upsampling process of the 3 rd level).
In this embodiment, in this way, when the multi-level downsampling process is performed, the multi-scale downsampling object features and the multi-scale action gesture features are used as reference information, and when the multi-level upsampling process is performed, the multi-scale upsampling object features are used as reference information, and the multi-scale virtual object features and the multi-scale action gesture features can be used as reference information to realize accurate prediction of the corresponding prediction added noise of the noise adding step, thereby realizing accurate noise reduction.
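A schematic sketch of such a noise predictor with four conditioned downsampling levels and four conditioned upsampling levels. Channel widths, the additive conditioning on the upsampling path, and the placeholder fuse function are illustrative assumptions; the concrete concat/average/superimpose rule described in the later fusion embodiment is sketched after that section:

```python
import torch
import torch.nn as nn

def fuse(x, obj_feat, pose_feat):
    """Placeholder; the concat/average/superimpose rule is sketched in a later section.
    Assumes all three tensors share the same shape at each level."""
    return x + obj_feat + pose_feat

class NoisePredictor(nn.Module):
    """U-Net-style predictor: 4 conditioned down levels, then 4 conditioned up levels."""

    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        ins = (channels[0],) + channels[:-1]
        self.downs = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip(ins, channels))
        outs = channels[-2::-1] + (channels[0],)  # (256, 128, 64, 64)
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1)
            for c_in, c_out in zip(channels[::-1], outs))

    def forward(self, x, obj_down_feats, pose_feats, obj_up_feats):
        # Each down level is conditioned on one scale of downsampled object
        # features and one scale of action gesture features.
        for down, obj_f, pose_f in zip(self.downs, obj_down_feats, pose_feats):
            x = fuse(down(x), obj_f, pose_f)
        # Each up level is conditioned on one scale of upsampled object features.
        for up, obj_f in zip(self.ups, obj_up_feats):
            x = up(x) + obj_f
        return x  # predicted added noise, same scale as the input noise feature
```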
In one embodiment, each level of the multi-level downsampling corresponds to one scale of downsampled object features and one scale of action gesture features;
performing multi-level downsampling on the noise feature input at the targeted noise-adding step based on the multi-scale downsampled object features and the multi-scale action gesture features to obtain the target downsampled noise feature comprises:
performing first-level downsampling on the noise feature input at the targeted noise-adding step based on the downsampled object feature and action gesture feature corresponding to the first level, to obtain the downsampled noise feature output by the first level;
and, at each level after the first, performing the current level's downsampling on the downsampled noise feature output by the previous level based on the downsampled object feature and action gesture feature corresponding to the current level, to obtain the target downsampled noise feature after the multi-level downsampling.
Each level of the multi-level downsampling corresponds to one scale of downsampled object features and one scale of action gesture features; at the same level, the downsampled object feature, the action gesture feature, and the downsampled noise feature output by that level all have the same scale. For example, when the downsampled object feature and the action gesture feature are both feature maps, the scale may refer to the resolution of the feature maps, and the same scale means the feature maps have the same resolution.
Specifically, in the multi-level downsampling, the input to the first level is the noise feature input at the targeted noise-adding step, while at each subsequent level the input is the downsampled noise feature output by the previous level. The server performs the first-level downsampling on the noise feature input at the targeted noise-adding step based on the first level's downsampled object feature and action gesture feature to obtain the first level's output; at each subsequent level, it performs the current level's downsampling on the previous level's output based on the current level's downsampled object feature and action gesture feature, yielding the target downsampled noise feature after all levels.
In a specific application, with four downsampling levels, as shown in fig. 6, each level corresponds to one scale of downsampled object features and one scale of action gesture features. At the first level, the server performs first-level downsampling on the noise feature input at the targeted noise-adding step based on the first level's downsampled object feature and action gesture feature to obtain the first level's output; at the second level, it performs second-level downsampling on the first level's output based on the second level's downsampled object feature and action gesture feature; and so on, until the target downsampled noise feature is obtained after four levels of downsampling.
In this embodiment, first-level downsampling of the noise feature input at the targeted noise-adding step, guided by the first level's downsampled object feature and action gesture feature, yields the first level's output, and at each later level the current level's downsampling of the previous level's output is guided by the current level's downsampled object feature and action gesture feature. Progressive downsampling extracts increasingly important features, yielding the target downsampled noise feature.
In one embodiment, performing the first-level downsampling on the noise feature input at the targeted noise-adding step based on the downsampled object feature and action gesture feature corresponding to the first level, to obtain the downsampled noise feature output by the first level, comprises:
downsampling the noise feature input at the targeted noise-adding step to obtain a preliminary downsampled noise feature;
and fusing the preliminary downsampled noise feature with the downsampled object feature and the action gesture feature corresponding to the first-level downsampling, to obtain the downsampled noise feature output by the first level.
Specifically, at the first level the server downsamples the noise feature input at the targeted noise-adding step to obtain the preliminary downsampled noise feature, and then fuses the preliminary downsampled noise feature with the first level's downsampled object feature and action gesture feature to obtain the first level's output. The preliminary downsampled noise feature, the first level's downsampled object feature, and the first level's action gesture feature all have the same dimensions, and the fused first-level output has those same dimensions.
It can be understood that, at each level after the first, the server downsamples the previous level's output to obtain the current level's preliminary downsampled noise feature, and then fuses it with the current level's downsampled object feature and action gesture feature to obtain the current level's output.
In this embodiment, downsampling the noise feature input at the targeted noise-adding step reduces its data volume and extracts the more important features, and the fusion yields a downsampled noise feature that combines the downsampled object feature and the action gesture feature, so that the predicted added noise for the targeted noise-adding step can be predicted accurately from it, realizing accurate noise reduction.
In one embodiment, fusing the preliminary downsampling noise feature, the downsampling object feature corresponding to the downsampling process of the first hierarchy, and the action gesture feature to obtain the downsampling noise feature of the first hierarchy output includes:
splicing the preliminary downsampling noise characteristics and downsampling object characteristics corresponding to the downsampling process of the first level to obtain splicing noise characteristics;
Calculating an average value of the spliced noise characteristics based on the dimension of the preliminary downsampling noise characteristics to obtain average noise characteristics;
and superposing the average noise characteristic and the action gesture characteristic corresponding to the downsampling processing of the first level to obtain the downsampled noise characteristic output by the first level.
Specifically, as shown in fig. 7, when features are fused, the server will splice the preliminary downsampling noise feature and the downsampling object feature corresponding to the first-level downsampling process to obtain a spliced noise feature, and calculate an average value of the spliced noise feature based on the dimension of the preliminary downsampling noise feature to obtain an average noise feature, where the dimension of the average noise feature is the same as that of the preliminary downsampling noise feature. On the basis of obtaining the average noise characteristics, the server can superimpose the average noise characteristics and the action gesture characteristics corresponding to the downsampling processing of the first level, and superimpose the characteristic values of the same position points in the action gesture characteristics corresponding to the downsampling processing of the first level on the average noise characteristics to obtain the downsampling noise characteristics output by the first level.
In a specific application, when averaging the spliced noise feature, the server takes, for each position in the preliminary downsampled noise feature, the two feature values in the spliced noise feature that correspond to that position, computes their average, and uses that average as the feature value at the same position in the average noise feature.
In a specific application, as shown in fig. 8, take the preliminary downsampled noise feature and the first-level downsampled object feature to each have dimensions 5×2; the spliced noise feature then has dimensions 10×2. For the 1st position in the preliminary downsampled noise feature, the server extracts the two corresponding feature values from the spliced noise feature (those at the 1st and 6th positions), computes their average, and uses it as the feature value at the 1st position of the average noise feature. The feature value at every other position of the average noise feature is obtained in the same way. It can be understood that, of the two positions in the spliced noise feature corresponding to a given position, one is at the same position and the other lies N positions later, where N is the total number of positions of the preliminary downsampled noise feature along the splicing direction; N is 5 in fig. 8.
In a specific application, as shown in fig. 9, take the average noise feature and the action gesture feature corresponding to the first-level downsampling to each have dimensions 5×2 (that is, 5×2 feature points). The feature value at the 1st point of the downsampled noise feature output by the first level is obtained by superimposing the feature values at the 1st points of the average noise feature and of that action gesture feature; the feature value at every other point of the first-level output is obtained in the same way.
In this embodiment, splicing the preliminary downsampled noise feature with the first-level downsampled object feature and then averaging the result along the dimension of the preliminary downsampled noise feature yields an average noise feature that incorporates the downsampled object feature; superimposing the first-level action gesture feature onto it then yields a first-level output that combines both the downsampled object feature and the action gesture feature, so that the added noise corresponding to the noise-adding step can be predicted accurately from the downsampled noise feature, enabling accurate noise reduction.
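The fusion just described reduces, in code terms, to a concatenate-average-add pattern. The following minimal PyTorch sketch illustrates it under the assumption of 2-D feature tensors of matching shape; tensor shapes and names are illustrative, not taken from the patent:

```python
import torch

def fuse_first_level(noise_feat: torch.Tensor,
                     object_feat: torch.Tensor,
                     pose_feat: torch.Tensor) -> torch.Tensor:
    """Splice, average, then superpose, as in figs. 7-9 (a sketch)."""
    # Splice the preliminary downsampled noise feature with the downsampled
    # object feature: (N, C) + (N, C) -> (2N, C), e.g. 5x2 + 5x2 -> 10x2.
    spliced = torch.cat([noise_feat, object_feat], dim=0)
    # Average position i with position i+N, restoring the (N, C) shape;
    # this is the pairing of the 1st and 6th positions in fig. 8.
    n = noise_feat.shape[0]
    averaged = (spliced[:n] + spliced[n:]) / 2
    # Superpose the action gesture feature point by point, as in fig. 9.
    return averaged + pose_feat

# Example with the 5x2 shapes used above:
out = fuse_first_level(torch.randn(5, 2), torch.randn(5, 2), torch.randn(5, 2))
print(out.shape)  # torch.Size([5, 2])
```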
In one embodiment, each of the multi-level upsampling processes corresponds to a respective one of the scale upsampled object features;
based on the multi-scale up-sampling object feature, performing multi-level up-sampling processing on the target down-sampling noise feature to obtain the predicted added noise corresponding to the noise adding step, wherein the step of obtaining the predicted added noise comprises the following steps:
Taking the target downsampling noise feature as an input noise feature corresponding to the upsampling processing of the first level, and fusing the upsampling object feature corresponding to the upsampling processing of the first level with the input noise feature to obtain an upsampling noise feature output by the first level;
Connecting the up-sampling noise characteristics output by the previous level with the down-sampling noise characteristics corresponding to the up-sampling noise characteristics output by the previous level at each level after the first level to obtain connection sampling noise characteristics;
and performing the current-level upsampling on the connected sampling noise feature based on the upsampling object feature corresponding to the current level, so that after the multi-level upsampling the predicted added noise corresponding to the noise-adding step is obtained.
Wherein each level of up-sampling processing in the multi-level up-sampling processing corresponds to one scale up-sampling object feature, and for each level of up-sampling processing, the scale size of the up-sampling object feature corresponding to the level of up-sampling processing and the up-sampling noise feature output by the level of up-sampling processing are the same. The downsampled noise features corresponding to the upsampled noise features of the previous level output refer to downsampled noise features having the same feature scale as the upsampled noise features of the previous level output.
Specifically, in the multi-level upsampling process, the server uses the target downsampling noise feature as the input noise feature corresponding to the upsampling process of the first level, that is, the object of the upsampling process of the first level is the target downsampling noise feature, and after the first level, the object of the upsampling process is the connection sampling noise feature, that is, the feature obtained by connecting the upsampling noise feature output by the previous level with the downsampling noise feature corresponding to the upsampling noise feature output by the previous level.
Specifically, at the first level, the server fuses the up-sampling object feature corresponding to the up-sampling process of the first level with the input noise feature to obtain an up-sampling noise feature output by the first level, and at each level after the first level, the server connects the up-sampling noise feature output by the previous level with the down-sampling noise feature corresponding to the up-sampling noise feature output by the previous level to obtain a connection sampling noise feature, and performs the up-sampling process of the current level on the connection sampling noise feature based on the up-sampling object feature corresponding to the up-sampling process of the current level to obtain the predicted added noise corresponding to the noise adding step after the up-sampling process of the multiple levels.
In a specific application, take 4 levels of downsampling and 4 levels of upsampling as an example, as shown in fig. 10. Each level of upsampling corresponds to an upsampled object feature of one scale. At the first level, the server fuses the upsampled object feature corresponding to the first-level upsampling with the input noise feature to obtain the upsampled noise feature output by the first level. At the second level, the server connects the upsampled noise feature output by the first level with its corresponding downsampled noise feature (the 4th-level downsampled noise feature, not shown in fig. 10) to obtain a connected sampling noise feature, and performs the 2nd-level upsampling on it based on the upsampled object feature corresponding to the 2nd level; and so on, until after 4 levels of upsampling the predicted added noise corresponding to the noise-adding step is obtained.
In this embodiment, the input target downsampled noise feature is upsampled step by step while incorporating the multi-scale upsampled object features and the downsampled noise features obtained during downsampling. This gradual upsampling fully combines the low-level features from downsampling, the high-level features from upsampling, and the multi-scale upsampled object features to produce the predicted added noise corresponding to the noise-adding step, enabling accurate noise-reduction prediction.
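As a concrete illustration, the following PyTorch sketch models one such upsampling path under assumed channel counts; the fusion is simplified to element-wise addition, and all shapes are assumptions rather than the patented configuration:

```python
import torch
import torch.nn as nn

class UpsamplingComponent(nn.Module):
    """Minimal sketch of multi-level upsampling with skip connections."""

    def __init__(self, channels=(256, 128, 64, 32)):
        super().__init__()
        # Each level doubles the spatial size; input channels are doubled by
        # the concatenation with the matching downsampled noise feature.
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(c_in * 2, c_out, kernel_size=2, stride=2)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, target_feat, up_object_feats, down_noise_feats):
        # Level 1: fuse the first-level upsampled object feature with the
        # input noise feature (fusion modeled here as addition).
        x = target_feat + up_object_feats[0]
        for i, up in enumerate(self.ups):
            # Connect with the downsampled noise feature of the same scale
            # to form the "connection sampling noise feature".
            x = torch.cat([x, down_noise_feats[i]], dim=1)
            x = up(x)                       # current-level upsampling
            x = x + up_object_feats[i + 1]  # fuse this level's object feature
        return x  # after all levels: the predicted added noise
```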
In one embodiment, performing virtual object feature extraction on the first virtual object image to obtain a multi-scale virtual object feature includes:
Encoding the first virtual object image to obtain the characteristics of the encoded virtual object image;
Performing multi-level downsampling processing on the coded virtual object image characteristics to obtain multi-scale downsampled object characteristics;
Taking the minimum-scale downsampling object feature in the multiscale downsampling object features as the minimum-scale upsampling object feature, and performing multi-level upsampling processing on the minimum-scale upsampling object feature to obtain the multiscale upsampling object feature;
And obtaining the multi-scale virtual object features according to the multi-scale downsampled object features and the multi-scale upsampled object features.
Specifically, when the virtual object feature extraction is performed on the first virtual object image, the server encodes the first virtual object image to obtain encoded virtual object image features, performs multi-level downsampling processing on the encoded virtual object image features to obtain multi-scale downsampled object features, uses the minimum-scale downsampled object feature in the multi-scale downsampled object features as the minimum-scale upsampled object feature, performs multi-level upsampling processing on the minimum-scale upsampled object feature to obtain multi-scale upsampled object features, and uses the multi-scale downsampled object features and the multi-scale upsampled object features as the multi-scale virtual object features.
In a specific application, take the multi-scale downsampled object features to be downsampled object features of 4 scales and the multi-scale upsampled object features to be upsampled object features of 4 scales, as shown in fig. 11. The server may encode the first virtual object image with a pre-trained encoder to obtain the encoded virtual object image feature, perform 4 levels of downsampling on it to obtain downsampled object features of 4 scales, then take the smallest-scale downsampled object feature as the smallest-scale upsampled object feature and perform multi-level upsampling on it to obtain upsampled object features of 4 scales.
In this embodiment, after the first virtual object image is encoded into the encoded virtual object image feature, multi-level downsampling yields the multi-scale downsampled object features and multi-level upsampling then yields the multi-scale upsampled object features, producing rich multi-scale virtual object features that guide virtual object image generation. This improves the fidelity and texture of the at least two virtual objects in the generated virtual object image, and thus the overall generation effect.
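A compact sketch of this extraction pipeline follows, with plain convolutional layers standing in for whatever encoder and sampling blocks an implementation actually uses; all layer choices and channel counts are assumptions:

```python
import torch
import torch.nn as nn

class ObjectFeatureExtractor(nn.Module):
    """Encode, downsample 4 levels, then upsample 4 levels (a sketch)."""

    def __init__(self, c: int = 64):
        super().__init__()
        self.encode = nn.Conv2d(3, c, 3, padding=1)
        self.downs = nn.ModuleList(
            nn.Conv2d(c * 2**i, c * 2**(i + 1), 3, stride=2, padding=1)
            for i in range(4))
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(c * 2**(4 - i), c * 2**(3 - i), 2, stride=2)
            for i in range(4))

    def forward(self, image):
        x = self.encode(image)        # encoded virtual object image feature
        down_feats = []
        for down in self.downs:       # multi-level downsampling
            x = down(x)
            down_feats.append(x)
        x = down_feats[-1]            # smallest-scale feature reused as input
        up_feats = []
        for up in self.ups:           # multi-level upsampling
            x = up(x)
            up_feats.append(x)
        # Together, the two pyramids form the multi-scale object features.
        return down_feats, up_feats
```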
In one embodiment, the first virtual object image is obtained by:
Acquiring the initial avatar images of each of at least two virtual objects;
Extracting, for each virtual object, the target avatar image of that virtual object from its initial avatar image;
and performing position merging on the target avatar images of the at least two virtual objects to obtain the first virtual object image.
Here, the initial avatar image of a virtual object is an image that includes the avatar of that virtual object; besides the virtual object, it may also contain blank areas and the like. For example, for virtual object A and virtual object B, the initial avatar images may include blank areas in addition to the objects, as shown in fig. 12. The target avatar image of a virtual object is an image containing only the avatar of that virtual object, with no blank area. For example, given the initial avatar images of fig. 12, the target avatar images of virtual object A and virtual object B contain no blank area, as shown in fig. 13.
Specifically, to obtain the first virtual object image, the server first acquires the initial avatar images of the at least two virtual objects, extracts the target avatar image of each virtual object from its initial avatar image by virtual object segmentation, and performs random position merging on the target avatar images of the at least two virtual objects to obtain the first virtual object image.
In a specific application, virtual object segmentation means separating the virtual object from the initial avatar image. The server may segment the virtual object from the initial avatar image with a pre-trained semantic segmentation model or a segmentation tool, configured according to the actual application scenario. For example, the pre-trained semantic segmentation model may be a BiSeNet (bilateral segmentation network) model, and the segmentation tool may be one based on the Segment Anything segmentation model.
In a specific application, random position merging means that the standing positions of the at least two virtual objects are not constrained during merging, so long as the merge can be performed. For example, for virtual object A and virtual object B, the first virtual object image after random position merging may be as shown in fig. 14 or fig. 15.
In this embodiment, after the initial avatar images of the at least two virtual objects are acquired, the target avatar image of each virtual object is extracted from its initial avatar image, and the target avatar images are position-merged to obtain the first virtual object image. The first virtual object image can then serve as a control signal governing virtual object image generation, improving the generation effect.
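A minimal sketch of this construction with PIL follows, assuming a `segment_fn` that stands in for any pretrained segmentation model or tool (such as the BiSeNet or Segment Anything options mentioned above) and returns an RGBA cutout smaller than the canvas:

```python
import random
from PIL import Image

def build_first_object_image(initial_images, segment_fn,
                             canvas_size=(1024, 512)):
    """Segment each avatar, then merge the cutouts at random positions."""
    canvas = Image.new("RGB", canvas_size, "white")
    for img in initial_images:
        cutout = segment_fn(img)  # target avatar image, no blank area (RGBA)
        # Random position merging: any placement that fits the canvas.
        x = random.randint(0, canvas_size[0] - cutout.width)
        y = random.randint(0, canvas_size[1] - cutout.height)
        canvas.paste(cutout, (x, y), mask=cutout)  # alpha-composite the cutout
    return canvas
```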
In one embodiment, the action gesture graph is obtained by:
Acquiring an interactive object image for specifying an action gesture of interaction of at least two virtual objects; the number of the interactive objects in the interactive object image is the same as the number of the virtual objects;
And extracting the action gesture of the interactive object in the interactive object image to obtain an action gesture image.
The interactive object image refers to an image comprising at least two interactive objects, and the number of the interactive objects in the interactive object image is the same as the number of the virtual objects.
Specifically, the server acquires an interactive object image specifying the action gesture with which the at least two virtual objects interact, and extracts the action gestures of the interactive objects in that image to obtain an action gesture graph including those gestures. In a specific application, the server may perform this extraction with a pre-trained action gesture extraction model, configured according to the actual application scenario; for example, the model may be an OpenPose model, the action gesture graph then being the pose graph output by OpenPose. In a specific application, the action gesture graph contains the object positions of the virtual objects, the keypoints, and the connections and joint angles between the keypoints. For example, the action gesture graph extracted from the interactive object image may be as shown in fig. 16.
In this embodiment, on the basis of acquiring an interactive object image for specifying an action gesture for interaction of at least two virtual objects, an action gesture image can be obtained by extracting the action gesture of the interactive object in the interactive object image, and further the generation of the virtual object image can be controlled by using the action gesture image as a control signal, so that the generation effect of the virtual object image can be improved.
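For illustration, one readily available OpenPose-style extractor is the `OpenposeDetector` from the `controlnet_aux` package; the sketch below assumes that package and a hypothetical input path, and any extractor producing a keypoint-and-skeleton pose graph would fill the same role:

```python
from PIL import Image
from controlnet_aux import OpenposeDetector  # assumed dependency

detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

interaction_image = Image.open("interaction_reference.png")  # hypothetical path
pose_map = detector(interaction_image)  # action gesture graph (skeleton image)
pose_map.save("pose_map.png")
```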
In one embodiment, the second virtual object image is determined by a pre-trained virtual object image inference model; the virtual object image reasoning model comprises a virtual object reference control network, an encoder, a noise reduction network, an action gesture reference control network and a decoder;
The virtual object reference control network is used for extracting virtual object characteristics of the first virtual object image to obtain multi-scale virtual object characteristics;
The motion gesture reference control network is used for extracting motion gesture features of the motion gesture graph to obtain multi-scale motion gesture features;
the encoder is used for acquiring sampling noise characteristics for generating the virtual object image;
the noise reduction network is used for carrying out multi-step noise reduction on the sampling noise characteristics based on the multi-scale virtual object characteristics and the multi-scale action gesture characteristics to obtain noise-reduced virtual object image characteristics;
The decoder is used for decoding the characteristics of the virtual object image after noise reduction to obtain a second virtual object image.
Specifically, the pre-trained virtual object image inference model is a model for generating the second virtual object image. As shown in fig. 17, it includes a virtual object reference control network, an encoder, a noise reduction network, an action gesture reference control network, and a decoder. The virtual object reference control network extracts virtual object features from the first virtual object image to obtain the multi-scale virtual object features; the action gesture reference control network extracts action gesture features from the action gesture graph to obtain the multi-scale action gesture features; the encoder acquires the sampling noise feature for generating the virtual object image; the noise reduction network performs multi-step noise reduction on the sampling noise feature based on the multi-scale virtual object features and action gesture features to obtain the noise-reduced virtual object image features; and the decoder decodes those features to obtain the second virtual object image. The sampling noise feature is obtained by encoding an acquired sampling noise signal.
In this embodiment, accurate reasoning of the second virtual object image can be achieved by using a virtual object image reasoning model including a virtual object reference control network, an encoder, a noise reduction network, an action gesture reference control network, and a decoder, so as to improve the generation effect of the virtual object image.
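In outline, inference threads the two control networks' features through the denoising loop. The sketch below shows that flow with assumed attribute names (`object_control_net`, `pose_control_net`, and so on are illustrative identifiers, not the patent's):

```python
def generate_second_image(model, first_image, pose_map, steps: int = 50):
    """Sketch of the five-component inference flow."""
    obj_feats = model.object_control_net(first_image)  # multi-scale object features
    pose_feats = model.pose_control_net(pose_map)      # multi-scale pose features
    latent = model.encoder.sample_noise()              # sampling noise feature
    for t in reversed(range(steps)):                   # multi-step noise reduction
        latent = model.denoise_net(latent, t, obj_feats, pose_feats)
    return model.decoder(latent)                       # second virtual object image
```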
In one embodiment, the noise reduction network includes a plurality of noise reducers; in the case of taking the sampled noise characteristics as noise characteristics subjected to multi-step noise addition, each noise reducer corresponds to each step of the multi-step noise addition, respectively;
For each step of the multi-step noise adding, the noise reducer corresponding to the noise-adding step in question is used to predict the noise added at that step based on the multi-scale virtual object features, the multi-scale action gesture features, and the noise feature input for that step, obtaining the predicted added noise corresponding to that step;
and to perform, according to the predicted added noise, noise reduction on the noise feature input for that step, obtaining the noise-reduced feature.
Specifically, the noise reduction network includes a plurality of noise reducers, one for each step of the multi-step noise adding, each performing one noise-reduction pass. For each step, the corresponding noise reducer predicts the noise added at that step from the multi-scale virtual object features, the multi-scale action gesture features, and the noise feature input for that step, yielding the predicted added noise corresponding to that step.
In this embodiment, for each step of the multi-step noise adding, the noise reducer corresponding to that step can accurately predict the noise added at that step, and the noise feature input for that step can then be accurately noise-reduced according to the predicted added noise, yielding the noise-reduced feature.
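The patent does not fix a particular sampler; under the standard DDPM formulation, one noise-reduction step given the predicted added noise would look like the following sketch:

```python
import torch

def denoise_step(noisy, predicted_noise, t, alphas, alphas_cumprod):
    """One DDPM reverse step (a sketch; variance choice sigma_t^2 = beta_t)."""
    coef = (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t])
    mean = (noisy - coef * predicted_noise) / torch.sqrt(alphas[t])
    if t > 0:
        # Add the step's sampling noise, except at the final step.
        mean = mean + torch.sqrt(1 - alphas[t]) * torch.randn_like(noisy)
    return mean
```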
In one embodiment, the multi-scale virtual object features include multi-scale downsampled object features and multi-scale upsampled object features; the noise reducer comprises a first downsampling assembly and a first upsampling assembly;
the first downsampling component is used for performing multi-level downsampling processing on the noise characteristics input by the noise adding step based on the multi-scale downsampling object characteristics and the multi-scale action gesture characteristics to obtain target downsampling noise characteristics;
The first up-sampling component is used for carrying out multi-level up-sampling processing on the target down-sampling noise characteristic based on the multi-scale up-sampling object characteristic to obtain the predicted added noise corresponding to the noise adding step.
Specifically, as shown in fig. 18, the noise reducer includes a first downsampling component and a first upsampling component. During noise reduction, the first downsampling component performs multi-level downsampling on the noise feature input for the noise-adding step in question, based on the multi-scale downsampled object features and the multi-scale action gesture features, to obtain the target downsampled noise feature. The first upsampling component then performs multi-level upsampling on the target downsampled noise feature, based on the multi-scale upsampled object features, to obtain the predicted added noise corresponding to that step.
In this embodiment, the multi-scale downsampled object features and multi-scale action gesture features serve as reference information during the multi-level downsampling, and the multi-scale upsampled object features serve as reference information during the multi-level upsampling. The multi-scale virtual object features and action gesture features thus together enable accurate prediction of the noise added at the step in question, and hence accurate noise reduction.
In one embodiment, a virtual object reference control network includes an encoding component, a second downsampling component, and a second upsampling component;
the coding component is used for coding the first virtual object image to obtain coding virtual object image characteristics;
The second downsampling component is used for performing multi-level downsampling processing on the coded virtual object image characteristics to obtain multi-scale downsampled object characteristics;
The second up-sampling component is used for taking the minimum-scale down-sampling object feature in the multi-scale down-sampling object features as the minimum-scale up-sampling object feature, and performing multi-level up-sampling processing on the minimum-scale up-sampling object feature to obtain the multi-scale up-sampling object feature.
Specifically, as shown in fig. 19, the virtual object reference control network includes an encoding component, a second downsampling component and a second upsampling component, when the virtual object feature extraction is performed on the first virtual object image, the encoding component encodes the first virtual object image to obtain an encoded virtual object image feature, on the basis of the encoded virtual object image feature, the second downsampling component performs multi-level downsampling processing on the encoded virtual object image feature to obtain a multi-scale downsampling object feature, and the second upsampling component uses the minimum-scale downsampling object feature in the multi-scale downsampling object feature as the minimum-scale upsampling object feature and performs multi-level upsampling processing on the minimum-scale upsampling object feature to obtain the multi-scale upsampling object feature.
In this embodiment, after the first virtual object image is encoded into the encoded virtual object image feature, multi-level downsampling yields the multi-scale downsampled object features and multi-level upsampling then yields the multi-scale upsampled object features, producing rich multi-scale virtual object features that guide virtual object image generation. This improves the fidelity and texture of the at least two virtual objects in the generated virtual object image, and thus the overall generation effect.
In one embodiment, the second downsampling assembly includes a plurality of second downsampling layers, each second downsampling layer for obtaining downsampled object features of a scale, and inputting the obtained downsampled object features into the first downsampling assembly at a first downsampling layer of a corresponding hierarchy;
The second upsampling component includes a plurality of second upsampling layers, each of which obtains upsampled object features of one scale and inputs them into the first upsampling layer at the corresponding level in the first upsampling component.
Specifically, as shown in fig. 20, the second downsampling assembly includes a plurality of second downsampling layers, each of which is configured to obtain downsampled object features of a scale, and input the obtained downsampled object features into the first downsampling layer of the first downsampling assembly at a corresponding level, where the corresponding level is the first downsampling layer in which the outputted downsampled noise features have the same feature scale as the downsampled object features. The second upsampling component comprises a plurality of second upsampling layers, each second upsampling layer is used for obtaining upsampled object features of one scale, and the obtained upsampled object features are input into a first upsampling layer which is positioned at a corresponding level in the first upsampling component, wherein the corresponding level is the first upsampling layer, the output upsampled noise features of which are the same as the feature scale of the upsampled object features.
In this embodiment, the downsampled object features obtained by each second downsampling layer are output to the first downsampling layer of the corresponding level, and the upsampled object features obtained by each second upsampling layer are output to the first upsampling layer of the corresponding level, injecting the multi-scale object features into the noise reduction network with the dimensions of each injected feature matching those of its first downsampling or first upsampling layer. The multi-scale virtual object features thus provide reference information for the virtual object image, so that noise-reduced virtual object image features accurately representing the image are obtained, from which the second virtual object image can be decoded.
In one embodiment, the motion gesture reference control network comprises a plurality of third downsampling layers, each third downsampling layer being configured to obtain motion gesture features of one scale and to input the obtained motion gesture features into a first downsampling layer of the first downsampling assembly at a corresponding level.
Specifically, as shown in fig. 21, the action gesture reference control network includes a plurality of third downsampling layers, each of which obtains action gesture features of one scale and inputs them into the first downsampling layer of the first downsampling component at the corresponding level, the corresponding level being the first downsampling layer whose output downsampled noise features have the same feature scale as the action gesture features.
In this embodiment, the action gesture features obtained by each third downsampling layer are output to the corresponding first downsampling layers, injecting the multi-scale action gesture features into the noise reduction network with the dimensions of each injected feature matching those of its first downsampling layer. The multi-scale action gesture features thus provide reference information for the action gesture of the virtual object interaction, so that noise-reduced virtual object image features accurately representing that gesture are obtained, from which the second virtual object image can be decoded.
In one embodiment, the virtual object image inference model is obtained by a training step comprising:
Acquiring a plurality of training samples;
Training, for each training sample, the initial image inference model according to the sample object image and the sample posture graph in that training sample, to obtain the virtual object image inference model; the sample object image includes at least two sample objects, and the sample posture graph is obtained by action gesture extraction from the sample object image.
Here, the training samples are samples for training the virtual object image inference model. Each training sample includes a sample object image and a sample posture graph; each sample object image includes at least two sample objects, a sample object being a virtual object appearing in the sample object image. It can be understood that the at least two sample objects may interact in any action gesture in the sample object image; that is, the sample object image places no restriction on the gesture with which they interact.
The sample gesture image is a gesture image obtained by extracting the action gesture of a sample object image, and comprises the action gesture of interaction of at least two sample objects in the sample object image. The initial image reasoning model refers to an image reasoning model which is not subjected to parameter training, and the virtual object image reasoning model can be obtained by training the initial image reasoning model.
Specifically, the server acquires a plurality of training samples, and trains the initial image reasoning model according to the sample object image and the sample posture diagram in the training samples for each training sample to obtain a virtual object image reasoning model. The sample object image comprises at least two sample objects, and the sample gesture image is obtained by extracting the action gesture of the sample object image and comprises at least two interaction action gestures of the sample objects.
In a specific application, the initial image inference model includes an initial object control network, an initial encoder, an initial noise reduction network, an initial gesture control network, and an initial decoder. The training process may be as shown in fig. 22: for each training sample, the initial object control network extracts virtual object features from the sample object image to obtain multi-scale sample object features; the initial encoder encodes the sample object image to obtain sample encoding features; noise is acquired and added to the sample encoding features to obtain sample noise features; the initial gesture control network extracts action gesture features from the sample posture graph to obtain multi-scale sample gesture features; the initial noise reduction network performs multi-step noise reduction on the sample noise features based on the multi-scale sample object features and sample gesture features to obtain noise-reduced sample object image features; and the initial decoder decodes these features to obtain a predicted sample object image. Based on the predicted sample object image of each training sample, the initial image inference model is adjusted to obtain the virtual object image inference model (the adjustment process is not shown in fig. 22).
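The patent computes its loss by comparing the decoded predicted sample object image with the original; in Stable-Diffusion-style implementations the usual concrete surrogate is a noise-prediction MSE in latent space, sketched below with assumed attribute names:

```python
import torch
import torch.nn.functional as F

def training_step(model, sample_image, sample_pose):
    """One adjustment-stage step (a sketch; optimizer handling omitted)."""
    obj_feats = model.object_control_net(sample_image)
    pose_feats = model.pose_control_net(sample_pose)
    latent = model.encoder(sample_image)        # sample encoding feature
    noise = torch.randn_like(latent)
    t = torch.randint(0, model.num_steps, (latent.shape[0],))
    noisy = model.add_noise(latent, noise, t)   # sample noise feature
    pred = model.denoise_net(noisy, t, obj_feats, pose_feats)
    return F.mse_loss(pred, noise)              # loss used to adjust the model
```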
In this embodiment, by acquiring a plurality of training samples, the initial image inference model can be trained by using the sample object image and the sample gesture image in each training sample, so as to acquire the virtual object image inference model, thereby generating the virtual object image by using the virtual object image inference model, realizing accurate inference on the virtual object image, and improving the generation effect of the virtual object image.
In one embodiment, obtaining a plurality of training samples comprises:
Acquiring the initial avatar images of a plurality of sample objects and, for each sample object, extracting the target avatar image of that sample object from its initial avatar image;
Performing position merging multiple times on the target avatar images of the plurality of sample objects to obtain a plurality of sample object images;
Respectively extracting the action gestures of the plurality of sample object images to obtain the sample posture graph corresponding to each sample object image;
and for each sample object image, taking the sample object image together with its corresponding sample posture graph as a training sample.
Here, the initial avatar image of a sample object is an image including the avatar of that sample object; besides the sample object, it may also contain blank areas and the like. The target avatar image of a sample object is an image containing only the avatar of that sample object, with no blank area.
Specifically, to obtain the plurality of training samples, the server first acquires the initial avatar images of the plurality of sample objects and, for each sample object, extracts the target avatar image from the initial avatar image by virtual object segmentation. It then performs position merging multiple times on the target avatar images to obtain a plurality of sample object images, extracts the action gestures of each sample object image to obtain its corresponding sample posture graph, and takes each sample object image together with its sample posture graph as a training sample.
In a specific application, the server may segment a sample object from its initial avatar image with a pre-trained semantic segmentation model or a segmentation tool. Once the target avatar images of the sample objects are obtained, each round of position merging selects the target avatar images of at least two sample objects and merges them at random positions, producing one sample object image.
In a specific application, take the processing of the initial avatar images of two sample objects into a training sample as an example, as shown in fig. 23: the server first acquires the initial avatar images of the two sample objects; for each sample object, it extracts the target avatar image from the initial avatar image by virtual object segmentation; it then merges the two target avatar images at random positions to obtain a sample object image, extracts the action gestures of that image to obtain the corresponding sample posture graph, and takes the sample object image together with its sample posture graph as a training sample.
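Combining the earlier segmentation/merging and pose-extraction sketches, the whole sample-construction loop reduces to a few lines; all callables here are stand-ins for the components described above:

```python
def build_training_samples(object_image_sets, segment_fn, merge_fn, pose_fn):
    """Segment, merge at random positions, extract pose (a sketch)."""
    samples = []
    for initial_images in object_image_sets:
        cutouts = [segment_fn(img) for img in initial_images]  # target avatars
        sample_image = merge_fn(cutouts)     # random position merging
        sample_pose = pose_fn(sample_image)  # sample posture graph
        samples.append((sample_image, sample_pose))
    return samples
```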
In this embodiment, starting from the initial avatar images of the plurality of sample objects, a plurality of training samples can be produced by first extracting the target avatar images, then merging positions, and finally extracting the action gestures. This greatly reduces the data requirements for training, and the generated training samples let the model perceive the required mashup virtual-object image semantics directly, eliminating the need for manual drawing and similar approaches, and effectively improving training convenience and the fidelity of the virtual object image.
In one embodiment, prior to training the initial image inference model, the method further comprises:
The initial image inference model is pre-trained based on at least one of the initial visual image of each of the plurality of sample objects and the plurality of sample object images.
Specifically, before training the initial image inference model, the server also pre-trains it based on at least one of the initial avatar images of the plurality of sample objects and the plurality of sample object images, so that the model thoroughly absorbs the style of the virtual objects to be generated.
In a specific application, the initial image reasoning model comprises an initial object control network, an initial encoder, an initial noise reduction network, an initial gesture control network and an initial decoder, and the initial object control network, the initial encoder, the initial noise reduction network and the initial decoder are mainly trained during pre-training.
In a specific application, when pre-training the initial image inference model, at least one of the initial avatar images of the plurality of sample objects and the plurality of sample object images is used as a pre-training sample and input into the initial object control network to obtain multi-scale pre-training sample object features. The initial encoder encodes the pre-training sample to obtain pre-training sample encoding features, to which noise is added to obtain pre-training sample noise features. The initial noise reduction network performs multi-step noise reduction on the pre-training sample noise features based on the multi-scale pre-training sample object features to obtain noise-reduced pre-training sample object image features, which the initial decoder decodes into a predicted pre-training sample object image. Based on the predicted pre-training sample object image of each pre-training sample, the initial image inference model is adjusted to obtain the pre-trained initial image inference model.
In this embodiment, pre-training the initial image inference model on at least one of the initial avatar images of the sample objects and the sample object images lets the model thoroughly absorb the style of the virtual objects to be generated, which helps produce a virtual object image inference model capable of accurate prediction. Virtual object images can then be generated and inferred accurately with that model, improving the generation effect.
The present application provides a virtual object image generation method based on a reference control mechanism. Given an action gesture graph specifying the desired gesture, together with a first virtual object image containing the corresponding number of objects, it can generate a second virtual object image in which the mashup avatars of the at least two virtual objects are reproduced with high fidelity in the specified gesture. This achieves controllable generation of gesture-interaction images combining different virtual objects, producing the mashup effect and improving the generation effect of virtual object images.
In one embodiment, the virtual object image generation method of the present application will be described taking as an example the determination of the second virtual object image by the pre-trained virtual object image inference model. Specifically, taking the training samples including two sample objects and the number of virtual objects being two as an example, the overall framework of the virtual object image generating method of the present application is shown in fig. 24, and includes a training stage for training an initial image inference model and an inference stage for inferring the virtual object image inference model obtained by training, where the initial image inference model includes an initial object control network, an initial encoder, an initial noise reduction network, an initial gesture control network, and an initial decoder, and the virtual object image inference model includes a virtual object reference control network, an encoder, a noise reduction network, an action gesture reference control network, and a decoder. Wherein the initial encoder, initial decoder, encoder and decoder are not shown in the figures.
Specifically, as shown in fig. 24, the training phase mainly involves generating training samples of mashup sample objects (implemented by the first module in fig. 24) and constructing a controllable generation model based on reference control (i.e., the initial image inference model, corresponding to the second module in fig. 24). For training-sample generation, an innovation of the present application is lowering the bar for, and the difficulty of, preparing training samples: the initial avatar images of different sample objects serve as the construction basis, the target avatar images of the different sample objects are extracted first, then mashup image synthesis with random positions is performed to obtain the sample object images, and the sample posture graphs of those images are generated at the same time, providing data for the model's reference control mechanism.
Specifically, the reference-control-based controllable generation model is a key innovation of the present application. It is built on a diffusion model (comprising the initial encoder, the initial noise reduction network, and the initial decoder; specifically, a Stable Diffusion model may be used), and its two main innovations are virtual object reference control and action gesture reference control. The virtual object reference control mechanism improves the fidelity of the virtual object avatars: the initial object control network is built from an upsampling-and-downsampling network whose attention layers all interact with the base generation network (chiefly the initial noise reduction network in fig. 24), transferring the multi-level semantics of the mashup virtual-object avatars into the generation network. The action gesture reference control mechanism improves the fidelity of the interaction gesture: the input of the initial gesture control network is the action gesture graph of the specified gesture, the network consists of a downsampling architecture, and it delivers the gesture information to the generation network at the network's early stage, so that the finally generated interaction gesture matches the specified gesture.
Specifically, as shown in fig. 24, the training phase comprises a pre-training stage and an adjustment stage. The pre-training stage trains on at least one of the initial avatar images of the sample objects and the sample object images, mainly so that the model thoroughly absorbs the style of the sample objects and of the scenario in which they are used; for example, if that scenario is a game, the model absorbs the game's style in this way. During pre-training, the weights of the initial object control network can be updated through the computed loss. As shown in fig. 24, in the pre-training stage the input of the initial object control network is a pre-training sample, and the input of the initial noise reduction network is the pre-training sample noise feature obtained by encoding the pre-training sample and adding noise (the pre-training-stage latent in fig. 24). The training samples used in the adjustment stage are the mashup data generated by the first module; in this stage both control networks, the initial object control network and the initial gesture control network, are trained, so that the model learns control over the action gesture. After training, the model has mastered controllable generation of mashup combinations of virtual objects. The use of the first and second modules in the training phase is described below.
First, the first module can be understood as a data module whose chief purpose is controlling the avatars of the virtual objects so that mashup data (i.e., mashup sample objects and mashup virtual objects) can be generated. In the training phase, its input is the initial avatar images of the sample objects to be mashed up; the target avatar images of the sample objects can be extracted with a semantic segmentation model or segmentation tool, and finally merged to generate the sample object images. Note that the pre-training stage requires preparing a large number of initial avatar images of sample objects in the style of the scenario where the virtual objects are used (such as the style of the game concerned), to pre-train the first and second modules so that each model thoroughly absorbs that style. In the adjustment stage, the initial avatar images of the at least two sample objects to be combined are selected, the target avatar images are extracted from them with a semantic segmentation model or segmentation tool, and the target avatar images of the at least two sample objects are then merged at arbitrary positions.
It should be noted that the sample object image is input into the initial encoder, which encodes it to obtain the sample encoding feature; noise is then acquired and added to the sample encoding feature to obtain the sample noise feature (the adjustment-stage latent, i.e. latent-space feature, in fig. 24). The sample noise feature is input into the initial noise reduction network, which denoises it using the features output by the initial object control network and the initial gesture control network to obtain the sample object image feature; the initial decoder then decodes this feature into the predicted sample object image, so that the loss can be computed by comparing the sample object image in the training sample with the predicted one, and the model adjusted accordingly. The initial avatar images of the sample objects are used mainly for pre-training the whole model, while the sample object images are used mainly for adjusting it.
In a specific application, take constructing a training sample from the initial avatar images of two sample objects, as shown in fig. 25. In the adjustment stage, the initial avatar images of the two sample objects are obtained first; the target avatar images of the sample objects are extracted from them with a semantic segmentation model or segmentation tool; the two target avatar images are merged at random positions to obtain the sample object image; the action gestures of the sample object image are extracted to obtain its action gesture graph; and the sample object image together with its action gesture graph is used as a training sample to train the generation network.
Next, as shown in fig. 24, the second module is divided into three modules and two phases (a pre-training phase and an adjustment phase). The three modules are a basic generation network, an initial object control network and an initial pose control network, respectively.
The base generation network is mainly a diffusion model that generates images from text; specifically, a Stable Diffusion model may be used. As shown in fig. 26, the Stable Diffusion architecture divides into a VAE encoder-decoder (a variational autoencoder comprising the VAE Encoder, i.e. the initial encoder, and the VAE Decoder, i.e. the initial decoder) and the diffusion network (a U-Net network, e.g. U-Net plus scheduler). During training, the input sample object image is encoded by the VAE encoder and noise is added to produce the random image latent (i.e., the sample noise feature), which is input into the diffusion network, specifically a U-Net, for noise-reduction learning; the denoising result is the processed image latent (i.e., the sample object image feature), which the VAE decoder finally decodes.
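The encode/add-noise/decode path of fig. 26 can be exercised with the `diffusers` library as below; the library, checkpoint id, and tensor shapes are all assumptions for illustration, not part of the patent:

```python
import torch
from diffusers import AutoencoderKL, DDPMScheduler

# Checkpoint id is an assumption; any Stable Diffusion VAE would do.
vae = AutoencoderKL.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="vae")
scheduler = DDPMScheduler(num_train_timesteps=1000)

image = torch.randn(1, 3, 512, 512)  # stand-in for a sample object image in [-1, 1]
with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latent)
    t = torch.randint(0, 1000, (1,))
    noisy_latent = scheduler.add_noise(latent, noise, t)  # random image latent
    # ... the U-Net denoising loop, guided by both control networks, runs here ...
    decoded = vae.decode(noisy_latent / vae.config.scaling_factor).sample
```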
The initial object control network is a key innovation in this model. Its purpose is to let the model attend, at multiple levels and throughout training and generation, to the style and the avatar details of the at least two virtual objects to be mashed up, providing guidance for virtual object image generation and thereby improving the fidelity and texture of the finally generated image. The initial object control network comprises a downsampling-and-upsampling U-Net identical in structure to the U-Net in the diffusion model, so during the pre-training stage it can be initialized with the diffusion model's weight parameters. Directly reusing the diffusion model's U-Net makes the object reference control more compatible with the diffusion model on the one hand, and on the other allows multi-level semantic information (i.e., the multi-scale sample object features and multi-scale virtual object features) to be injected into the diffusion model, with each injected layer's dimensions matching the diffusion model's.
The structure of the initial object control network may be as shown in fig. 27. In the training phase, the sample object image is input, encoded by the encoder of the variational autoencoder, and injected into the U-Net as the latent feature. After the multi-level downsampling and upsampling within the U-Net, the output of each layer is passed to the corresponding layer of the diffusion model's U-Net to be merged: the features of corresponding layers are spliced first, and because the two U-Nets have identical structures the feature dimensions match, so the spliced feature has twice the original dimension. To stay compatible with the feature and network dimensions, the spliced feature is averaged back to the original dimension, after which the next computation in the U-Net continues. In this way the feature information of every layer of the initial object control network is injected into every layer of the diffusion model, improving the whole model's perception of style and its guidance of the details of the at least two mashup virtual objects.
The initial gesture control network is another key innovation in this model; it is added to the model for training in the adjustment stage. It is built as a downsampling structure, namely the first half of a U-Net. Adopting the first half of a U-Net makes the model more compatible when computing with the diffusion model, while keeping the feature fusion tighter.
The structure of the initial gesture control network may be as shown in fig. 28. In the training phase, its input is the sample posture graph which, to stay consistent with the sample object image, may be obtained by action gesture extraction from that image. The main difference from the initial object control network is that the sample posture graph is input directly, without encoding. Its content is simple, everything other than the action gesture being background, so an encoder would extract little useful information; to avoid information loss, the posture graph is used directly as input.
The output of each layer of the initial gesture control network is likewise passed to the corresponding layer of the diffusion model's U-Net, merged as shown in fig. 28 by direct addition into the diffusion model: since the gesture information supplements and guides the finally drawn action gesture, no weighting against the original features is needed, and it can be added directly. That is, as shown in fig. 28, for a downsampling layer of the diffusion model's U-Net, the features obtained after downsampling are spliced with the sample object features of the same scale, the spliced features are averaged, and the result is superimposed with the sample gesture features of the same scale, yielding the downsampled noise feature output by that layer.
It should be noted that, in the scheme of the application, since the structure of the initial object control network and the structure of the initial gesture control network both correspond to the U-net network, this is equivalent to adding 1.5 U-net networks on top of the generation network. To increase the training speed and the generation speed of the virtual object image, the initial object control network and the initial gesture control network can therefore be distilled to eliminate some intermediate layers, reducing the parameter scale of the model on the premise that the final effect does not degrade. A minimal sketch of one possible distillation objective is given below.
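The patent does not specify the distillation objective, so the following is only a sketch under common assumptions: a student control network with some intermediate layers removed is trained to reproduce the per-layer features that the teacher would inject into the diffusion model. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer):
    # Teacher: the full control network; student: the same network with some
    # intermediate layers removed. Both are assumed to return the list of
    # per-layer features that would be injected into the diffusion model,
    # so the two lists have the same length.
    with torch.no_grad():
        teacher_feats = teacher(batch)
    student_feats = student(batch)
    # Match each injected feature so the final effect is preserved.
    loss = sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```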
Specifically, in the inference stage, when at least two virtual objects need to be mashed up, the initial avatar images of the at least two virtual objects can be acquired first, and a first virtual object image is then generated by the first module to serve as the input of the virtual object reference control network, which improves the fidelity of the finally generated virtual object image. An interactive object image specifying the action gesture for the interaction of the at least two virtual objects is then input; the action gesture extraction network in the first module extracts the action gestures of the interactive objects in that image to obtain an action gesture graph, which serves as the input of the action gesture reference control network and controls the standing positions and action gestures of the at least two virtual objects in the finally generated virtual object image. Furthermore, as shown in fig. 24, a sampling noise feature (i.e., the latent space feature of the inference stage in fig. 24) for generating the virtual object image must also be acquired as the input of the noise reduction network for the noise reduction processing.
In a specific application, taking two virtual objects to be mashed up as an example, the inference stage proceeds as shown in fig. 29. The input of the virtual object reference control network is a first virtual object image including the two virtual objects; the number of interactive objects in the interactive object image is likewise two. An action gesture graph can be obtained by extracting the action gestures of the interactive objects in the interactive object image, and this graph is the input of the action gesture reference control network. An encoder is used to obtain the sampling noise feature (the latent, i.e. the latent space feature) for generating the virtual object image, which is input into the noise reduction network. After the multi-step noise reduction processing of the noise reduction network, the noise-reduced virtual object image feature is obtained, and decoding it with a decoder yields a second virtual object image of the two virtual objects interacting according to the action gesture in the interactive object image. A compact sketch of this pipeline follows.
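In the sketch below, `model` and its sub-module names (`object_control`, `gesture_control`, `encoder`, `denoiser`, `decoder`) are assumed names bundling the networks of fig. 29; the fixed step count and the choice of Gaussian noise for the latent are likewise illustrative assumptions.

```python
import torch

@torch.no_grad()
def generate_second_image(first_image, gesture_graph, model, steps: int = 50):
    obj_feats = model.object_control(first_image)     # multi-scale virtual object features
    ges_feats = model.gesture_control(gesture_graph)  # multi-scale action gesture features
    # Sampling noise feature: Gaussian noise shaped like the encoder's latent.
    z = torch.randn_like(model.encoder(first_image))
    for t in reversed(range(steps)):                  # multi-step noise reduction
        z = model.denoiser(z, t, obj_feats, ges_feats)
    return model.decoder(z)                           # second virtual object image
```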
It should be noted that, when generating the virtual object image, different second virtual object images can be generated by selecting different interactive object images. Fig. 30 gives examples of some interactive object images and second virtual object images; it can be seen that in each generated second virtual object image, the two virtual objects interact according to the action gesture of the interactive objects in the corresponding interactive object image.
The virtual object image generation method has the following advantages:
Firstly, the application constructs a fully automatic, controllable generation method for mashup virtual object images. The method only needs to perform adjustment training of the model based on a small number of initial avatar images of the virtual objects, and in actual use it can automatically generate corresponding, highly faithful virtual object images according to the provided action gestures.
Secondly, the application creatively constructs a pre-training-adjustment learning mechanism for the diffusion model on the premise of only a small number of initial avatar images of virtual objects: pre-training improves the model's ability to represent the style of the scene in which the virtual objects are used, and the model is then adjusted on the basis of the small number of virtual objects to be mashed up, thereby completing the learning of mashup virtual object images.
In the training process, the basic data are the target avatar images of the sample objects, which are mixed and combined in different ways to generate sample object images with random standing poses as training samples. This mode greatly reduces the demand for data. The model is creatively trained with the sample object images and sample gesture maps of the mashed-up sample objects, so that it can simultaneously perceive the semantics of the required mashup sample object image without processing such as matting, which effectively improves the convenience of the whole scheme and the fidelity of the mashup sample object images.
Thirdly, the scheme of the application is not limited to particular usage scenes of the virtual objects, so initial avatar images of virtual objects under multiple usage scenes can be obtained for training, and the model can master the capability of generating mashup virtual object images for different scenes, yielding a widely applicable model. For example, taking the usage scene of a virtual object as a game, virtual objects of different games can be mixed together to make training samples, so that the model masters the capability of generating combined mashup images of virtual objects across different games, providing more possibilities and gameplay extensions.
Fourthly, the scheme of the application creatively constructs a network module for gesture reference control that assists in guiding generation. The module is creatively built with a downsampled network structure, and its intermediate computed features are linked with the downsampling layers of the U-net network in the diffusion model. By inputting an action gesture graph into this module, the action gestures and positions of the finally generated mashup virtual objects can be controlled to be consistent with the action gestures in the input graph, thereby increasing the interaction between the mashup virtual objects in the final image, improving the controllability of virtual object image generation, and increasing the quality of the finally generated virtual object image.
Finally, the scheme of the application creatively constructs a network structure for object reference control that guides the generation of the mashup virtual object image. The application innovatively constructs upsampling and downsampling networks so that the avatar information of the mashed-up virtual objects (namely the multi-scale virtual object features) is injected into the generation network at multiple levels through the object reference control mechanism, letting the model deeply perceive the avatars of the at least two virtual objects to be generated, which improves the fidelity of the final mashup and the plausibility of the mashup image. In this way, the fidelity of the elements related to the virtual objects, such as the props, clothing and accessories they use, can also be improved.
In one embodiment, the virtual object image generating method of the application can be applied to the controllable generation of character mashup images in games. On the basis of combining a plurality of single-character game avatar images into a plurality of first mashup character images, each first mashup character image can be input into the model, and through the object reference control mechanism and the action gesture reference control mechanism, the model can generate, on the basis of a designated action gesture graph, a second mashup character image in which the characters interact according to the designated action gesture.
When the virtual object image generation method of the application is applied to the controllable generation of character mashup images in games, it can be applied in the following scenarios:
Firstly, the scheme of the application can be applied to game operations and promotion scenarios. For a newly released or soon-to-be-released game, when the platform plans an operations campaign, posters of a game character CP (Coupling) or of a group of characters are often designed for the game's promotional launch. With the scheme of the application, polished posters of game characters interacting in specified standing poses can be generated quickly and automatically according to the selected character CP or group of characters, which improves the efficiency of operations campaigns and greatly reduces the cost of game promotion.
Secondly, the scheme of the application can be applied to the generation of commercial IP (intellectual property) derivatives around games, especially for console games or social network games with well-defined characters. It can quickly generate interactive pictures or posters between different characters, providing material for the IP commercialization of games and increasing the options for derivatives.
Finally, the scheme of the application can be applied in practice to game production, providing a picture or element generation tool for making games. For games made in a manner similar to card games or illustrated storylines, game character pictures for the current plot or card can be generated quickly and in real time for the player during play, according to the relationships among characters designed in advance, thereby providing more gameplay for the player.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; the order of their execution is not necessarily sequential, and they may be performed in turns or alternately with at least part of the other steps or stages.
Based on the same inventive concept, an embodiment of the application also provides a virtual object image generating apparatus for implementing the virtual object image generating method described above. The implementation of the solution provided by the apparatus is similar to that described in the above method, so for the specific limitations in the embodiments of the virtual object image generating apparatus provided below, reference may be made to the limitations of the virtual object image generating method described above, which are not repeated here.
In one embodiment, as shown in fig. 31, there is provided a virtual object image generating apparatus including: an image acquisition module 3102, a feature extraction module 3104, a noise reduction module 3106, and a decoding module 3108, wherein:
an image acquisition module 3102 for acquiring an action gesture graph and a first virtual object image including at least two virtual objects; the action gesture graph is used for designating the action gesture of the interaction of the at least two virtual objects;
the feature extraction module 3104 is configured to perform virtual object feature extraction on the first virtual object image to obtain a multi-scale virtual object feature, and perform motion gesture feature extraction on the motion gesture graph to obtain a multi-scale motion gesture feature;
the noise reduction module 3106 is configured to obtain a sampling noise feature for generating a virtual object image, and perform multi-step noise reduction processing on the sampling noise feature based on the multi-scale virtual object feature and the multi-scale motion gesture feature, so as to obtain a denoised virtual object image feature;
a decoding module 3108, configured to decode the noise-reduced virtual object image feature to obtain a second virtual object image; the second virtual object image includes the at least two virtual objects interacting according to the action gesture.
The virtual object image generating device, on the basis of acquiring an action gesture graph and a first virtual object image including at least two virtual objects, extracts virtual object features from the first virtual object image to obtain multi-scale virtual object features, and extracts action gesture features from the action gesture graph to obtain multi-scale action gesture features. On the basis of acquiring a sampling noise feature for generating a virtual object image, the device performs multi-step noise reduction processing on the sampling noise feature based on the multi-scale virtual object features and the multi-scale action gesture features. The multi-scale virtual object features provide reference information for the virtual object avatars, and the multi-scale action gesture features provide reference information for the action gestures of the virtual objects' interaction, so that a noise-reduced virtual object image feature is obtained that accurately represents the virtual object avatars and the action gestures of their interaction; decoding this feature then yields the second virtual object image. In the whole process, the action gesture graph and the first virtual object image serve as control signals, and the multi-scale virtual object features and the multi-scale action gesture features serve as reference information to refine the generation of the virtual object image, thereby improving the generation effect of the virtual object image.
In one embodiment, the noise reduction module is further configured to take the sampled noise feature as a noise feature that has undergone multi-step noise adding, to perform, starting from the last step of the multi-step noise adding, inverse noise reduction processing on the noise feature input at each step based on the multi-scale virtual object features and the multi-scale action gesture features, and to take the noise reduction feature obtained by noise reduction of the noise feature input at the first step as the noise-reduced virtual object image feature.
In one embodiment, for each step of the multi-step noise adding, the noise reduction module is further configured to predict the added noise corresponding to the aimed noise adding step based on the multi-scale virtual object features, the multi-scale action gesture features and the noise feature input at the aimed noise adding step, obtaining the predicted added noise corresponding to the aimed noise adding step, and to perform noise reduction processing on the noise feature input at that step according to the predicted added noise, obtaining the noise reduction feature. A simplified sketch of such a reverse step is given below.
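Given the predicted added noise for a step, the noise reduction itself can be written in a standard diffusion parameterization. The patent does not fix a particular sampler, so the deterministic (DDIM-style) update below is an assumption for illustration.

```python
import torch

def denoise_step(z_t: torch.Tensor, t: int, eps_hat: torch.Tensor,
                 alphas_cumprod: torch.Tensor) -> torch.Tensor:
    # eps_hat is the predicted added noise for the aimed noise adding step.
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    # Estimate the clean latent implied by the predicted noise ...
    x0_hat = (z_t - (1.0 - a_t).sqrt() * eps_hat) / a_t.sqrt()
    # ... then step back to the previous, less noisy latent.
    return a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps_hat
```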
In one embodiment, the multi-scale virtual object features include multi-scale downsampled object features and multi-scale upsampled object features, and the noise reduction module is further configured to perform multi-level downsampling processing on the noise feature input at the aimed noise adding step based on the multi-scale downsampled object features and the multi-scale action gesture features to obtain a target downsampled noise feature, and to perform multi-level upsampling processing on the target downsampled noise feature based on the multi-scale upsampled object features to obtain the predicted added noise corresponding to the aimed noise adding step.
In one embodiment, each level of the multi-level downsampling processing corresponds to a downsampled object feature of one scale and an action gesture feature of one scale. The noise reduction module is further configured to perform first-level downsampling processing on the noise feature input at the aimed noise adding step based on the downsampled object feature and the action gesture feature corresponding to the first level, obtaining the downsampled noise feature output by the first level, and, at each level after the first, to perform the current level's downsampling processing on the downsampled noise feature output by the previous level based on the downsampled object feature and the action gesture feature corresponding to the current level, obtaining the target downsampled noise feature after the multi-level downsampling processing.
In one embodiment, the noise reduction module is further configured to perform downsampling processing on the noise feature input at the aimed noise adding step to obtain a preliminary downsampled noise feature, and to fuse the preliminary downsampled noise feature with the downsampled object feature and the action gesture feature corresponding to the first-level downsampling processing to obtain the downsampled noise feature output by the first level.

In one embodiment, the noise reduction module is further configured to splice the preliminary downsampled noise feature with the downsampled object feature corresponding to the first-level downsampling processing to obtain a spliced noise feature, to calculate the average of the spliced noise feature based on the dimension of the preliminary downsampled noise feature to obtain an average noise feature, and to superimpose the average noise feature with the action gesture feature corresponding to the first-level downsampling processing to obtain the downsampled noise feature output by the first level.
In one embodiment, each level of the multi-level upsampling processing corresponds to an upsampled object feature of one scale. The noise reduction module is further configured to take the target downsampled noise feature as the input noise feature corresponding to the first-level upsampling processing, and to fuse the upsampled object feature corresponding to the first level with the input noise feature to obtain the upsampled noise feature output by the first level; and, at each level after the first, to perform the current level's upsampling processing on the upsampled noise feature output by the previous level based on the upsampled object feature corresponding to the current level, obtaining, after the multi-level upsampling processing, the predicted added noise corresponding to the aimed noise adding step. A sketch of one such guided upsampling level follows.
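The sketch below uses hypothetical names; the skip connection to the matching downsampling level mirrors claim 8, and simple addition stands in for the fusion with the upsampled object feature, whose exact operator the text leaves open.

```python
import torch
import torch.nn as nn

class GuidedUpLayer(nn.Module):
    """One upsampling level of the noise predictor, guided by the
    same-scale upsampled object feature."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Input width doubles because the previous level's output is connected
        # with the matching downsampled noise feature (skip connection).
        self.up = nn.ConvTranspose2d(in_channels * 2, out_channels,
                                     kernel_size=2, stride=2)

    def forward(self, prev_up, skip_down, up_object_feat):
        x = torch.cat([prev_up, skip_down], dim=1)  # connection sampling noise feature
        x = self.up(x)
        return x + up_object_feat                   # fuse with upsampled object feature
```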
In one embodiment, the feature extraction module is further configured to encode the first virtual object image to obtain an encoded virtual object image feature, to perform multi-level downsampling processing on the encoded virtual object image feature to obtain the multi-scale downsampled object features, to take the minimum-scale downsampled object feature among them as the minimum-scale upsampled object feature and perform multi-level upsampling processing on it to obtain the multi-scale upsampled object features, and to obtain the multi-scale virtual object features from the multi-scale downsampled object features and the multi-scale upsampled object features. A structural sketch of this extraction is given below.
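In the sketch below the channel widths are illustrative and plain convolutions stand in for whatever blocks the actual network uses; the class name is hypothetical.

```python
import torch
import torch.nn as nn

class VirtualObjectFeatureExtractor(nn.Module):
    """Encode the first virtual object image, collect one downsampled feature
    per scale, seed the upsampling path with the minimum-scale feature, and
    collect one upsampled feature per scale."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.encode = nn.Conv2d(3, channels[0], kernel_size=3, padding=1)
        self.downs = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], kernel_size=3,
                      stride=2, padding=1)
            for i in range(len(channels) - 1))
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(channels[i + 1], channels[i],
                               kernel_size=2, stride=2)
            for i in reversed(range(len(channels) - 1)))

    def forward(self, image):
        x = self.encode(image)          # encoded virtual object image feature
        down_feats = []
        for down in self.downs:         # multi-level downsampling
            x = down(x)
            down_feats.append(x)        # one downsampled object feature per scale
        x = down_feats[-1]              # minimum-scale feature seeds the up path
        up_feats = []
        for up in self.ups:             # multi-level upsampling
            x = up(x)
            up_feats.append(x)          # one upsampled object feature per scale
        return down_feats, up_feats     # together: multi-scale virtual object features
```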
In one embodiment, the image acquisition module is further configured to acquire the initial avatar images of each of the at least two virtual objects, to extract, for each virtual object, a target avatar image of the virtual object from its initial avatar image, and to perform standing-position merging on the target avatar images of the at least two virtual objects to obtain the first virtual object image.
In one embodiment, the image acquisition module is further configured to acquire an interactive object image for specifying the action gesture of the interaction of the at least two virtual objects, the number of interactive objects in the interactive object image being the same as the number of virtual objects, and to perform action gesture extraction on the interactive objects in the interactive object image to obtain the action gesture graph.
In one embodiment, the second virtual object image is determined by a pre-trained virtual object image reasoning model that comprises a virtual object reference control network, an encoder, a noise reduction network, an action gesture reference control network and a decoder. The feature extraction module is further configured to perform virtual object feature extraction on the first virtual object image through the virtual object reference control network to obtain the multi-scale virtual object features, and to perform action gesture feature extraction on the action gesture graph through the action gesture reference control network to obtain the multi-scale action gesture features. The noise reduction module is further configured to obtain, through the encoder, the sampling noise feature for generating the virtual object image, and to perform, through the noise reduction network, multi-step noise reduction processing on the sampling noise feature based on the multi-scale virtual object features and the multi-scale action gesture features to obtain the noise-reduced virtual object image feature. The decoding module is further configured to decode the noise-reduced virtual object image feature through the decoder to obtain the second virtual object image.
In one embodiment, the noise reduction network includes a plurality of noise reducers. With the sampled noise feature taken as a noise feature that has undergone multi-step noise adding, each noise reducer corresponds to one step of the multi-step noise adding. The noise reduction module is further configured, for each step, to predict, through the noise reducer corresponding to the aimed noise adding step, the added noise corresponding to that step based on the multi-scale virtual object features, the multi-scale action gesture features and the noise feature input at that step, obtaining the predicted added noise, and to perform noise reduction processing on the input noise feature according to the predicted added noise, obtaining the noise reduction feature.
In one embodiment, the multi-scale virtual object features include multi-scale downsampled object features and multi-scale upsampled object features, and the noise reducer includes a first downsampling component and a first upsampling component. The noise reduction module is further configured to perform, through the first downsampling component, multi-level downsampling processing on the noise feature input at the aimed noise adding step based on the multi-scale downsampled object features and the multi-scale action gesture features to obtain a target downsampled noise feature, and to perform, through the first upsampling component, multi-level upsampling processing on the target downsampled noise feature based on the multi-scale upsampled object features to obtain the predicted added noise corresponding to the aimed noise adding step.
In one embodiment, the virtual object reference control network includes an encoding component, a second downsampling component and a second upsampling component. The feature extraction module is further configured to encode the first virtual object image through the encoding component to obtain an encoded virtual object image feature, to perform multi-level downsampling processing on the encoded virtual object image feature through the second downsampling component to obtain the multi-scale downsampled object features, and to take the minimum-scale downsampled object feature as the minimum-scale upsampled object feature and perform multi-level upsampling processing on it through the second upsampling component to obtain the multi-scale upsampled object features.
In one embodiment, the second downsampling component includes a plurality of second downsampling layers, and the feature extraction module is further configured to obtain a downsampled object feature of one scale through each second downsampling layer and input it into the first downsampling layer at the corresponding level of the first downsampling component. The second upsampling component includes a plurality of second upsampling layers, and the feature extraction module is further configured to obtain an upsampled object feature of one scale through each second upsampling layer and input it into the first upsampling layer at the corresponding level of the first upsampling component.
In one embodiment, the action gesture reference control network includes a plurality of third downsampling layers, and the feature extraction module is further configured to obtain an action gesture feature of one scale through each third downsampling layer and input it into the first downsampling layer at the corresponding level of the first downsampling component.
In one embodiment, the virtual object image generating device further includes a training module configured to obtain a plurality of training samples and, for each training sample, train an initial image reasoning model according to the sample object image and the sample gesture map in the training sample to obtain the virtual object image reasoning model. The sample object image includes at least two sample objects, and the sample gesture map is obtained by performing action gesture extraction on the sample object image.
In one embodiment, the training module is further configured to obtain the initial avatar images of each of a plurality of sample objects, to extract, for each sample object, a target avatar image of the sample object from its initial avatar image, to perform multiple standing-position merges on the target avatar images of the plurality of sample objects to obtain a plurality of sample object images, to perform action gesture extraction on each sample object image to obtain its corresponding sample gesture map, and, for each sample object image, to take the sample object image and its corresponding sample gesture map as one training sample. A sketch of this data construction appears below.
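In the sketch below, every helper callable (`extract_avatar`, `merge_standing`, `extract_gesture`) is an assumed stand-in for the corresponding step described above, and pairwise mixing with a fixed number of merges per pair is an illustrative choice.

```python
from itertools import combinations

def build_training_samples(initial_avatars: dict, extract_avatar,
                           merge_standing, extract_gesture,
                           merges_per_pair: int = 3):
    # Cut each sample object's target avatar out of its initial avatar image.
    targets = {name: extract_avatar(img) for name, img in initial_avatars.items()}
    samples = []
    # Mix pairs of different sample objects ...
    for a, b in combinations(targets, 2):
        # ... at several random standing positions per pair.
        for _ in range(merges_per_pair):
            sample_object_image = merge_standing([targets[a], targets[b]])
            sample_gesture_map = extract_gesture(sample_object_image)
            samples.append((sample_object_image, sample_gesture_map))
    return samples
```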
In one embodiment, the virtual object image generating device further includes a pre-training module configured to pre-train the initial image reasoning model based on at least one of the initial avatar images of the plurality of sample objects and the plurality of sample object images.
The respective modules in the above-described virtual object image generating apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device, which may be a server or a terminal, is provided, and in this embodiment, an example in which the computer device is a server is described, and an internal structure thereof may be as shown in fig. 32. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing training samples and other data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a virtual object image generation method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 32 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can take various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (20)

1. A virtual object image generation method, the method comprising:
Acquiring an action gesture image and a first virtual object image comprising at least two virtual objects; the action gesture graph is used for designating action gestures of interaction of the at least two virtual objects;
Extracting virtual object features of the first virtual object image to obtain multi-scale virtual object features, and extracting action gesture features of the action gesture image to obtain multi-scale action gesture features;
Acquiring sampling noise characteristics for generating a virtual object image, and performing multi-step noise reduction processing on the sampling noise characteristics based on the multi-scale virtual object characteristics and the multi-scale action gesture characteristics to obtain noise-reduced virtual object image characteristics;
Decoding the noise-reduced virtual object image features to obtain a second virtual object image; the second virtual object image includes the at least two virtual objects that interact according to the motion gesture.
2. The method of claim 1, wherein performing a multi-step noise reduction process on the sampled noise features based on the multi-scale virtual object features and the multi-scale motion gesture features to obtain noise reduced virtual object image features comprises:
Taking the sampled noise characteristic as a noise characteristic subjected to multi-step noise adding;
Starting from the last step of multi-step noise adding, performing inverse noise reduction processing on the noise characteristics input in each step based on the multi-scale virtual object characteristics and the multi-scale action gesture characteristics;
and taking the noise reduction characteristics obtained by noise reduction processing of the noise characteristics input in the first step as the noise-reduced virtual object image characteristics.
3. The method of claim 2, wherein, for each step of the multi-step noise adding, performing noise reduction processing on the noise feature input at the aimed noise adding step comprises:
Predicting the corresponding added noise of the aimed noise adding step based on the multi-scale virtual object characteristics, the multi-scale action gesture characteristics and the noise characteristics of the aimed noise adding step input to obtain the corresponding predicted added noise of the aimed noise adding step;
And according to the predicted added noise, carrying out noise reduction processing on the noise characteristics of the aimed noise adding step input to obtain noise reduction characteristics.
4. The method of claim 3, wherein the multi-scale virtual object features comprise multi-scale downsampled object features and multi-scale upsampled object features; and predicting the added noise corresponding to the aimed noise adding step based on the multi-scale virtual object features, the multi-scale action gesture features and the noise feature input at the aimed noise adding step comprises:
Based on the multi-scale downsampling object features and the multi-scale action gesture features, performing multi-level downsampling processing on the noise features input by the noise adding step to obtain target downsampling noise features;
And performing multi-level up-sampling processing on the target down-sampling noise characteristic based on the multi-scale up-sampling object characteristic to obtain the predicted added noise corresponding to the noise adding step.
5. The method of claim 4, wherein each of the multi-level downsampling processes corresponds to a scale downsampled object feature and a scale action pose feature, respectively;
The step of performing multi-level downsampling processing on the noise characteristics input by the noise adding step based on the multi-scale downsampling object characteristics and the multi-scale action gesture characteristics to obtain target downsampled noise characteristics includes:
Based on the downsampling object features and the action gesture features corresponding to the downsampling process of the first level, the downsampling process of the first level is carried out on the noise features input by the noise adding step to be aimed, and downsampling noise features output by the first level are obtained;
And at each level after the first level, performing the downsampling processing of the current level on the downsampling noise characteristics output by the previous level based on the downsampling object characteristics and the action gesture characteristics corresponding to the downsampling processing of the current level, so as to obtain target downsampling noise characteristics after the downsampling processing of multiple levels.
6. The method of claim 5, wherein performing the first-level downsampling processing on the noise feature input at the aimed noise adding step based on the downsampled object feature and the action gesture feature corresponding to the first-level downsampling processing to obtain the downsampled noise feature output by the first level comprises:
Performing downsampling processing on the noise characteristics input by the aimed noise adding step to obtain preliminary downsampled noise characteristics;
And fusing the preliminary downsampling noise characteristics, downsampling object characteristics corresponding to the downsampling processing of the first level and action gesture characteristics to obtain downsampling noise characteristics of the first level output.
7. The method of claim 6, wherein fusing the preliminary downsampling noise feature, the downsampling object feature corresponding to the first-level downsampling process, and the action gesture feature to obtain the first-level output downsampling noise feature comprises:
Splicing the preliminary downsampling noise characteristics and downsampling object characteristics corresponding to the downsampling process of the first level to obtain spliced noise characteristics;
calculating the average value of the spliced noise characteristics based on the dimension of the preliminary downsampling noise characteristics to obtain average noise characteristics;
and superposing the average noise characteristic and the action gesture characteristic corresponding to the first-level downsampling processing to obtain the first-level output downsampling noise characteristic.
8. The method of claim 4, wherein each of the multi-level upsampling processes corresponds to a respective one of the scale upsampled object features;
the step of performing multi-level up-sampling processing on the target down-sampling noise feature based on the multi-scale up-sampling object feature to obtain the predicted added noise corresponding to the noise adding step includes:
Taking the target downsampling noise feature as an input noise feature corresponding to the upsampling processing of the first level, and fusing the upsampling object feature corresponding to the upsampling processing of the first level with the input noise feature to obtain an upsampling noise feature of the first level output;
connecting, at each level after the first level, the up-sampling noise feature output by the previous level with its corresponding down-sampling noise feature to obtain a connection sampling noise feature;
And based on the up-sampling object characteristics corresponding to the up-sampling processing of the current level, carrying out the up-sampling processing of the current level on the connection sampling noise characteristics to obtain the corresponding prediction added noise of the noise adding step subjected to the up-sampling processing of the multiple levels.
9. The method according to any one of claims 1 to 8, wherein the performing virtual object feature extraction on the first virtual object image to obtain a multi-scale virtual object feature includes:
Encoding the first virtual object image to obtain the characteristics of the encoded virtual object image;
performing multi-level downsampling processing on the coded virtual object image characteristics to obtain multi-scale downsampled object characteristics;
Taking the minimum-scale downsampling object feature in the multiscale downsampling object features as the minimum-scale upsampling object feature, and performing multi-level upsampling processing on the minimum-scale upsampling object feature to obtain the multiscale upsampling object feature;
And obtaining the multi-scale virtual object features according to the multi-scale downsampled object features and the multi-scale upsampled object features.
10. The method according to any one of claims 1 to 8, wherein the first virtual object image is obtained by:
Acquiring initial image images of at least two virtual objects respectively;
Extracting a target image of the virtual object from the initial image of the virtual object for each virtual object;
and performing station merging on the target image images of the at least two virtual objects to obtain a first virtual object image.
11. The method according to any one of claims 1 to 8, wherein the second virtual object image is determined by a pre-trained virtual object image reasoning model; the virtual object image reasoning model comprises a virtual object reference control network, an encoder, a noise reduction network, an action gesture reference control network and a decoder;
the virtual object reference control network is used for extracting virtual object characteristics of the first virtual object image to obtain multi-scale virtual object characteristics;
the action gesture reference control network is used for extracting action gesture characteristics of the action gesture graph to obtain multi-scale action gesture characteristics;
The encoder is used for acquiring sampling noise characteristics for generating a virtual object image;
the noise reduction network is used for carrying out multi-step noise reduction on the sampling noise characteristics based on the multi-scale virtual object characteristics and the multi-scale action gesture characteristics to obtain noise-reduced virtual object image characteristics;
the decoder is used for decoding the noise-reduced virtual object image characteristics to obtain a second virtual object image.
12. The method of claim 11, wherein the noise reduction network comprises a plurality of noise reducers; in the case that the sampled noise characteristic is taken as a noise characteristic subjected to multi-step noise addition, each noise reducer corresponds to each step of the multi-step noise addition respectively;
For each step of the multi-step denoising, the denoising device corresponding to the aimed denoising step is used for predicting the corresponding added noise of the aimed denoising step based on the multi-scale virtual object characteristics, the multi-scale action gesture characteristics and the noise characteristics of the aimed denoising step input to obtain the corresponding predicted added noise of the aimed denoising step; and according to the predicted added noise, carrying out noise reduction processing on the noise characteristics of the aimed noise adding step input to obtain noise reduction characteristics.
13. The method of claim 12, wherein the multi-scale virtual object features include multi-scale downsampled object features and multi-scale upsampled object features; the noise reducer comprises a first downsampling assembly and a first upsampling assembly;
The first downsampling component is used for performing multi-level downsampling processing on the noise characteristics input by the targeted noise adding step based on the multi-scale downsampling object characteristics and the multi-scale action gesture characteristics to obtain target downsampling noise characteristics;
the first up-sampling component is used for carrying out multi-level up-sampling processing on the target down-sampling noise characteristic based on the multi-scale up-sampling object characteristic to obtain the predicted added noise corresponding to the noise adding step.
14. The method of claim 13, wherein the virtual object reference control network comprises an encoding component, a second downsampling component, and a second upsampling component;
the coding component is used for coding the first virtual object image to obtain coding virtual object image characteristics;
the second downsampling component is used for performing multi-level downsampling processing on the coded virtual object image characteristics to obtain multi-scale downsampled object characteristics;
The second upsampling component is configured to take a minimum-scale downsampling object feature of the multiscale downsampling object features as a minimum-scale upsampling object feature, and perform multi-level upsampling processing on the minimum-scale upsampling object feature to obtain a multiscale upsampling object feature.
15. The method of claim 14, wherein the second downsampling assembly comprises a plurality of second downsampling layers, each of the second downsampling layers for deriving downsampled object features of a scale, and inputting the derived downsampled object features into a first downsampling layer of the first downsampling assembly at a respective level;
The second upsampling component comprises a plurality of second upsampling layers, each of which is used for obtaining upsampled object features of one scale, and inputting the obtained upsampled object features into a first upsampling layer in a corresponding level in the first upsampling component.
16. The method of claim 13, wherein the motion gesture reference control network comprises a plurality of third downsampling layers, each for deriving motion gesture features of a scale, and inputting the derived motion gesture features into a first downsampling layer of the first downsampling assembly at a respective level.
17. A virtual object image generation apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an action gesture image and a first virtual object image comprising at least two virtual objects; the action gesture graph is used for designating action gestures of interaction of the at least two virtual objects;
The feature extraction module is used for extracting the virtual object features of the first virtual object image to obtain multi-scale virtual object features, and extracting the action gesture features of the action gesture graph to obtain multi-scale action gesture features;
The noise reduction module is used for acquiring sampling noise characteristics for generating a virtual object image, and performing multi-step noise reduction processing on the sampling noise characteristics based on the multi-scale virtual object characteristics and the multi-scale action gesture characteristics to obtain noise-reduced virtual object image characteristics;
The decoding module is used for decoding the noise-reduced virtual object image characteristics to obtain a second virtual object image; the second virtual object image includes the at least two virtual objects that interact according to the motion gesture.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 16 when the computer program is executed.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 16.
20. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 16.