CN117894038A - Method and device for generating object pose in an image

Method and device for generating object pose in an image

Info

Publication number
CN117894038A
CN117894038A (application CN202311848079.9A)
Authority
CN
China
Prior art keywords: image, feature, processing, noise, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311848079.9A
Other languages
Chinese (zh)
Inventor
石雅洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202311848079.9A
Publication of CN117894038A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/042 - Knowledge-based neural networks; Logical representations of neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/30 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to the technical field of computer vision, and provides a method and a device for generating object poses in an image. The method includes the following steps: performing skeleton pose feature extraction on a target pose image and on an image to obtain a pose feature map of the target pose image and a pose feature map of the image; performing feature fusion processing based on the global feature map of the image and the pose feature map of the image to obtain a plurality of fusion feature maps of the image; performing noise prediction on a noise image based on the plurality of fusion feature maps of the image, the pose feature map of the target pose image, and the feature map of the noise image, so as to obtain a noise feature map corresponding to the noise image; and generating a target image based on the noise feature map and the feature map of the noise image. Target images of the same object in different poses can thus be generated, which addresses the prior-art problem of the high cost of generating multiple images of the same object in different poses.

Description

Method and device for generating object pose in an image
Technical Field
The present disclosure relates to the field of computer vision, and in particular to a method and an apparatus for generating an object pose in an image.
Background
As artificial intelligence technology continues to make iterative breakthroughs, AI content generation is being applied ever more widely, and in recent years image generation techniques based on deep learning and diffusion models have made great progress. In the prior art, generating images of the same person in different poses requires collecting multiple images of that person and performing LoRA fine-tuning or full-parameter training of an image generation model on them, which is costly; and pose-control methods based on a control network (e.g., ControlNet) cannot guarantee that the generated object is the same person.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a readable storage medium for generating an object pose in an image, so as to solve the prior-art problem that generating multiple images of the same object in different poses is costly.
In a first aspect of the embodiments of the present disclosure, a method for generating an object pose in an image is provided, including: acquiring an image and a target pose image, wherein the object pose in the image differs from the object pose in the target pose image; inputting the image into a parallel network to perform noise-adding processing on the image, so as to obtain a feature map of a noise image corresponding to the image; performing skeleton pose feature extraction on the target pose image to obtain a pose feature map of the target pose image, and performing skeleton pose feature extraction on the image to obtain a pose feature map of the image; performing convolution processing on the image to obtain a global feature map of the image, and performing feature fusion processing based on the global feature map of the image and the pose feature map of the image to obtain a plurality of fusion feature maps of the image; performing noise prediction on the noise image based on the plurality of fusion feature maps of the image, the pose feature map of the target pose image, and the feature map of the noise image, so as to obtain a noise feature map corresponding to the noise image; and generating a target image based on the noise feature map and the feature map of the noise image, wherein the object pose in the target image is the same as the object pose in the target pose image.
In a second aspect of the embodiments of the present disclosure, an apparatus for generating an object pose in an image is provided, including: an acquisition module configured to acquire an image and a target pose image, wherein the object pose in the image differs from the object pose in the target pose image; a noise-adding module configured to input the image into a parallel network to perform noise-adding processing on the image, so as to obtain a feature map of a noise image corresponding to the image; a pose feature extraction module configured to perform skeleton pose feature extraction on the target pose image to obtain a pose feature map of the target pose image, and to perform skeleton pose feature extraction on the image to obtain a pose feature map of the image; a fusion module configured to perform convolution processing on the image to obtain a global feature map of the image, and to perform feature fusion processing based on the global feature map of the image and the pose feature map of the image to obtain a plurality of fusion feature maps of the image; a noise prediction module configured to perform noise prediction on the noise image based on the plurality of fusion feature maps of the image, the pose feature map of the target pose image, and the feature map of the noise image, so as to obtain a noise feature map corresponding to the noise image; and a target image generation module configured to generate a target image based on the noise feature map and the feature map of the noise image, wherein the object pose in the target image is the same as the object pose in the target pose image.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: a parallel network is designed, and an image and a target pose image are input into the parallel network. The image contains a target object whose pose is to be changed; the pose of the target object in the image is a first pose, and the pose of the object in the target pose image is a second pose. The network outputs a target image in which the pose of the target object is the second pose, and when target pose images whose objects have different second poses are input into the diffusion model, multiple images of the target object in different poses can be output. Within the parallel network, noise-adding processing is performed on the image to obtain the feature map of a noise image corresponding to the image, and skeleton pose feature extraction is performed on the target pose image and on the image to obtain the pose feature map of the target pose image and the pose feature map of the image. Feature fusion processing is performed on the global feature map of the image and the pose feature map of the image to obtain a plurality of fusion feature maps of the image; noise prediction is performed on the noise image based on the plurality of fusion feature maps of the image, the pose feature map of the target pose image, and the feature map of the noise image to obtain a noise feature map corresponding to the noise image; and the target image is generated from the noise feature map and the feature map of the noise image, the pose of the target object in the target image being the second pose. One network of the parallel network adds noise to and denoises the image, while the other extracts image features; skeleton pose features of the target pose image and of the image are extracted and computed interactively, and through the interactive computation of the two networks, the target object with the changed pose is finally output.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be obtained from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;
Fig. 2 is a flowchart of a method for generating an object pose in an image according to an embodiment of the present disclosure;
Fig. 3 is a flowchart of another method for generating an object pose in an image according to an embodiment of the present disclosure;
Fig. 4 is a flowchart of yet another method for generating an object pose in an image according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of an apparatus for generating an object pose in an image according to an embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
An object pose generation method and apparatus in an image according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 1, 2, and 3, a server 4, and a network 5.

The terminal devices 1, 2, and 3 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and supporting communication with the server 4, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like; when they are software, they may be installed in electronic devices such as those listed above. The terminal devices 1, 2, and 3 may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not limited by the embodiments of the present disclosure. Further, various applications may be installed on the terminal devices 1, 2, and 3, such as data processing applications, instant messaging tools, social platform software, search applications, and shopping applications.

The server 4 may be a server that provides various services, for example a background server that receives requests transmitted from a terminal device with which a communication connection has been established; the background server may receive and analyze such requests and generate processing results. The server 4 may be a single server, a server cluster formed by multiple servers, or a cloud computing service center, which is not limited by the embodiments of the present disclosure.

The server 4 may be hardware or software. When the server 4 is hardware, it may be any of various electronic devices that provide services to the terminal devices 1, 2, and 3. When the server 4 is software, it may be multiple pieces of software or software modules, or a single piece of software or software module, providing services to the terminal devices 1, 2, and 3, which is not limited by the embodiments of the present disclosure.

The network 5 may be a wired network using coaxial cable, twisted pair, or optical fiber connections, or a wireless network that interconnects communication devices without wiring, for example Bluetooth, Near Field Communication (NFC), or infrared, which is not limited by the embodiments of the present disclosure.

A user can establish a communication connection with the server 4 via the network 5 through the terminal devices 1, 2, and 3 to receive or transmit information. Specifically, the server 4 acquires an image and a target pose image from the terminal device 1, 2, or 3, wherein the object pose in the image differs from the object pose in the target pose image; inputs the image into a parallel network to perform noise-adding processing on it, obtaining the feature map of a noise image corresponding to the image; performs skeleton pose feature extraction on the target pose image to obtain a pose feature map of the target pose image, and on the image to obtain a pose feature map of the image; performs convolution processing on the image to obtain a global feature map of the image, and performs feature fusion processing based on the global feature map of the image and the pose feature map of the image to obtain a plurality of fusion feature maps of the image; performs noise prediction on the noise image based on the plurality of fusion feature maps of the image, the pose feature map of the target pose image, and the feature map of the noise image, obtaining a noise feature map corresponding to the noise image; and generates a target image based on the noise feature map and the feature map of the noise image, wherein the object pose in the target image is the same as the object pose in the target pose image.

It should be noted that the specific types, numbers, and combinations of the terminal devices 1, 2, and 3, the server 4, and the network 5 may be adjusted according to the actual requirements of the application scenario, which is not limited by the embodiments of the present disclosure.
Fig. 2 is a flowchart of a method for generating an object pose in an image according to an embodiment of the present disclosure. The method of fig. 2 may be performed by the terminal device or the server of fig. 1. As shown in fig. 2, the method includes:
Step 201, acquiring an image and a target pose image, wherein the pose of the object in the image differs from the pose of the object in the target pose image.
In some embodiments, the image contains a target object whose pose is a first pose; the pose of the object in the target pose image is a second pose; the first pose differs from the second pose; and the target object in the image and the object in the target pose image may be the same or different. For example, if the image is an image of Li Bai standing and the target pose image is an image of another person bending over, inputting the two images into the diffusion model outputs an image of Li Bai bending over; if the target pose image is an image of Bai Juyi running, an image of Li Bai running is output. When the poses of the objects in the input target pose images differ, images of Li Bai in various other poses can be obtained. The information of the image and of the target pose image thus provides important guidance for generating a target image containing the target object.
Step 202, inputting the image into the parallel network to perform noise-adding processing on the image, so as to obtain a feature map of the noise image corresponding to the image.
In some embodiments, the parallel network includes two parallel neural networks: one is a diffusion model and the other is a feature fusion model. The diffusion model includes a noise-adding module and a denoising module; the noise-adding module is the part that adds noise data to the input image, and the denoising module is the part that performs noise prediction on the noise image to which noise data has been added. The noise-adding module of the parallel network adds random noise data over t time steps to the image, where t is a positive integer; different time steps correspond to different noise-adding steps, and the random noise data may be Gaussian noise data. In the course of adding random noise data over t time steps, the larger the time step t, the larger the amount of random noise. At each time step, Gaussian noise is added to the previous noise image; for example, when t is 5, Gaussian noise is added to the image z0 input into the parallel network to obtain a noise image z1, Gaussian noise is added to the noise image z1 to obtain a noise image z2, and so on until a noise image z5 is obtained. Specifically, noise-adding processing means adding, at each time step, a random noise value that follows a Gaussian distribution to every pixel of the image; this can affect the gray value or color value of the pixel and simulates the spatial diffusion behavior of pixel values in the image, yielding a noise image corresponding to the image that contains both the image information and the information of the added Gaussian noise data. Feature extraction is then performed on the noise image, for example by convolving it with a convolution layer, to obtain the feature map of the noise image corresponding to the image; this map represents the feature information of the noise image and facilitates subsequent tasks such as noise prediction and target image generation. By adding random noise data to the image, the parallel network can simulate the degradation of the image under various influences and generate a series of noise images; these noise images can serve as training data from which to learn how to generate the target image, and thus to train the parallel network. The parallel network further includes a first convolution layer; after the noise image is obtained, the first convolution layer performs convolution processing on the noise image to obtain the feature map of the noise image.
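As an illustration of the noise-adding process just described, the following sketch implements a standard DDPM-style forward process; the `add_noise` function, the linear beta schedule, and the tensor shapes are assumptions of this sketch and are not prescribed by the disclosure.

```python
import torch

def add_noise(z0: torch.Tensor, t: int, betas: torch.Tensor) -> torch.Tensor:
    """Add t steps of Gaussian noise to z0 (DDPM-style forward process)."""
    z = z0
    for step in range(t):
        beta = betas[step]                 # noise amount grows with the time step
        eps = torch.randn_like(z)          # Gaussian noise with the image's shape
        z = torch.sqrt(1.0 - beta) * z + torch.sqrt(beta) * eps
    return z

betas = torch.linspace(1e-4, 0.02, 1000)   # illustrative linear schedule
z0 = torch.randn(1, 3, 64, 64)             # stand-in for the input image
z5 = add_noise(z0, t=5, betas=betas)       # the z0 -> z1 -> ... -> z5 chain above
```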
Step 203, performing skeleton pose feature extraction on the target pose image to obtain a pose feature map of the target pose image, and performing skeleton pose feature extraction on the image to obtain a pose feature map of the image.
In some embodiments, skeleton pose feature extraction may use deep learning techniques to extract the position and pose information of human skeleton keypoints from images or video. Specifically, skeleton pose feature extraction may be performed on the target pose image through a graph convolutional neural network to obtain the pose feature map of the target pose image, and on the image through a graph convolutional neural network to obtain the pose feature map of the image. The modules for extracting skeleton pose features from the target pose image and from the image include, but are not limited to, skeleton-based neural networks, the OpenPose pose estimation model, and the like. The resulting pose feature maps help the diffusion network recognize and understand the poses and actions of the objects in the image and in the target pose image, supporting the subsequent tasks of fusing the global feature map of the image with its pose feature map, performing noise prediction on the noise image, and generating a target image in which the pose of the target object is the second pose.
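A minimal sketch of the pose extraction interface follows; the `SkeletonPoseExtractor` class, its toy convolutional backbone, and the keypoint count of 18 are hypothetical stand-ins for the graph convolutional network or OpenPose model mentioned above.

```python
import torch
import torch.nn as nn

class SkeletonPoseExtractor(nn.Module):
    """Illustrative stand-in for the skeleton pose extraction network."""
    def __init__(self, num_joints: int = 18):
        super().__init__()
        self.backbone = nn.Sequential(     # toy convolutional backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_joints, 3, stride=2, padding=1),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.backbone(img)          # per-keypoint heatmaps as the pose feature map

extractor = SkeletonPoseExtractor()
pose_map_img = extractor(torch.randn(1, 3, 256, 256))  # pose feature map of the image
pose_map_tgt = extractor(torch.randn(1, 3, 256, 256))  # pose feature map of the target pose image
```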
Step 204, performing convolution processing on the image to obtain a global feature map of the image, and performing feature fusion processing on the global feature map of the image and the pose feature map of the image to obtain a plurality of fusion feature maps of the image.
In some embodiments, the parallel network further includes a second convolution layer, which may be a 3x3 convolution layer. Features are extracted from the image through the second convolution layer, converting the image into data a computer can process and producing a global feature map of the image; this map covers the information of all pixels in the image and captures its overall feature information, such as color, texture, and shape. The global feature map of the image and the pose feature map of the image are input into the feature fusion model of the parallel network for feature fusion processing, yielding richer and more accurate feature representations, namely the plurality of fusion feature maps of the image. These fusion feature maps contain both the global information and the pose information of the image; using them for tasks such as noise prediction on the noise image gives more accurate noise predictions and allows an accurate, faithful target image containing the target object in the second pose to be generated.
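A small sketch of the convolution and additive fusion described above, assuming the pose feature map has already been brought to the same shape as the global feature map; the channel count of 64 is an assumption of this sketch.

```python
import torch
import torch.nn as nn

conv3x3 = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # the second (3x3) convolution layer

image = torch.randn(1, 3, 64, 64)
global_feat = conv3x3(image)                            # global feature map of the image

# Additive fusion with the pose feature map, which this sketch assumes has been
# projected to the same shape as the global feature map.
pose_feat = torch.randn(1, 64, 64, 64)
fused = global_feat + pose_feat                         # first initial fusion feature map
```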
Step 205, performing noise prediction on the noise image based on the plurality of fusion feature maps of the image, the pose feature map of the target pose image, and the feature map of the noise image, so as to obtain a noise feature map corresponding to the noise image.
In some embodiments, the plurality of fusion feature maps of the image, the pose feature map of the image, the pose feature map of the target pose image, and the feature map of the noise image are input into the denoising module of the parallel network, and noise prediction is performed t times on the noise image using the adaptive capacity of the trained parallel network; each prediction yields one noise feature map, so t noise feature maps corresponding to the noise image are obtained. In the embodiments of the present disclosure, the process of generating a target image containing the target object in the second pose is modeled as the denoising process of a diffusion model, and the predicted noise feature maps serve as conditions of that denoising process. This improves the accuracy of the target image, allows an accurate and faithful target image to be generated, and enables the parallel network to generate images of the target object in different poses from a single image containing the target object.
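For illustration, the conditioning interface of one noise prediction step might look as follows; `predict_noise` is a hypothetical name, and its placeholder body stands in for the denoising branch sketched further below.

```python
from typing import List
import torch

def predict_noise(fused_maps: List[torch.Tensor],
                  pose_map_tgt: torch.Tensor,
                  noisy_feat: torch.Tensor) -> torch.Tensor:
    """One noise prediction step, conditioned on the fusion feature maps of the
    image and the pose feature map of the target pose image."""
    # Placeholder body: the real predictor is the denoising branch of the
    # parallel network; calling it t times yields t noise feature maps.
    return torch.zeros_like(noisy_feat)
```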
Step 206, generating a target image based on the noise feature map and the feature map of the noise image, wherein the object pose in the target image is the same as the object pose in the target pose image.
In some embodiments, generating the target image based on the noise feature maps and the feature map of the noise image may consist of subtracting the t noise feature maps from the feature map of the noise image to obtain a noise-free feature map of the target image, and generating the target image from that feature map. The target image contains the target object in the second pose; that is, the pose of the target object in the target image is the same as the pose of the object in the target pose image. In the inference stage of the parallel network, images of the object in a reference image in different poses can be generated from that single reference image and a plurality of target pose images.
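Putting the two previous sketches together, generation could proceed by repeatedly predicting and subtracting noise feature maps, then decoding; `decode_to_image` and all tensor shapes are assumptions of this sketch.

```python
import torch
import torch.nn as nn

decode_to_image = nn.Conv2d(64, 3, kernel_size=3, padding=1)  # toy decoder head

t = 5
z = torch.randn(1, 64, 64, 64)              # feature map of the noise image
fused_maps = [torch.randn(1, 64, 64, 64)]   # stand-ins for the fusion feature maps
pose_map_tgt = torch.randn(1, 64, 64, 64)   # pose feature map of the target pose image

for step in range(t):
    noise_map = predict_noise(fused_maps, pose_map_tgt, z)  # from the sketch above
    z = z - noise_map                        # subtract one predicted noise feature map
target_image = decode_to_image(z)            # its object pose matches the target pose image
```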
According to the method for generating an object pose in an image provided by the embodiments of the present disclosure, a parallel network is designed, and an image and a target pose image are input into it. The image contains a target object whose pose is to be changed; the pose of the target object in the image is a first pose, and the pose of the object in the target pose image is a second pose. The network outputs a target image in which the pose of the target object is the second pose, and when target pose images whose objects have different second poses are input into the diffusion model, multiple images of the target object in different poses can be output. The image input into the parallel network undergoes noise-adding processing to obtain the feature map of the corresponding noise image, and skeleton pose feature extraction is performed on the target pose image and on the image to obtain the pose feature map of the target pose image and the pose feature map of the image. Feature fusion processing is performed on the global feature map of the image and the pose feature map of the image to obtain a plurality of fusion feature maps of the image; noise prediction is performed on the noise image based on these fusion feature maps, the pose feature map of the target pose image, and the feature map of the noise image to obtain the noise feature map corresponding to the noise image; and the target image is generated from the noise feature map and the feature map of the noise image, with the pose of the target object in the target image being the second pose. The parallel network comprises two networks: one adds noise to and denoises the image, while the other extracts image features; at the same time, skeleton pose features of the target pose image and of the image are extracted and computed interactively. Through the interactive computation of the two networks, the target object with the changed pose is finally output, which solves the problem of the high cost of generating multiple target images of the same object in different poses.
In some embodiments, the parallel network further includes a human-pose-model skeleton pose extraction network and a first fully connected layer, and performing skeleton pose feature extraction on the target pose image to obtain the pose feature map of the target pose image includes: performing skeleton pose feature extraction on the target pose image through the human-pose-model skeleton pose extraction network to obtain an initial pose feature map of the target pose image; and performing feature transformation on the initial pose feature map of the target pose image through the first fully connected layer to obtain the pose feature map of the target pose image.
In some embodiments, the skeleton pose extraction network of the human pose model may be a graph convolutional neural network. The target pose image is input into the graph convolutional neural network for skeleton pose feature extraction, and the position and pose information of the skeleton keypoints of the object in the target pose image is extracted to obtain an initial pose feature map of the target pose image. To reduce the number of parameters in subsequent calculations and speed up computation, the initial pose feature map of the target pose image may be input into the first fully connected layer for feature transformation and dimensionality reduction, yielding the pose feature map of the target pose image, whose resolution may be 64x64; this enables more accurate and robust generation of object pose images.
In some embodiments, the parallel network further includes a second fully connected layer, and performing skeleton pose feature extraction on the image to obtain the pose feature map of the image includes: performing skeleton pose feature extraction on the image through the human-pose-model skeleton pose extraction network to obtain an initial pose feature map of the image; and performing feature transformation on the initial pose feature map of the image through the second fully connected layer to obtain the pose feature map of the image.
In some embodiments, the human-pose-model skeleton pose extraction network may be a graph convolutional neural network or an open pose estimation model; the present disclosure does not specifically limit its structure. The image is input into the graph convolutional neural network for skeleton pose feature extraction, and the position and pose information of the skeleton joints of the object in the image is extracted to obtain an initial pose feature map of the image. To reduce the number of parameters in subsequent calculations and speed up computation, the initial pose feature map of the image may be input into the second fully connected layer for feature transformation and dimensionality reduction, yielding the pose feature map of the image, whose resolution may be 64x64; this enables more accurate and robust object pose generation.
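A sketch of the fully connected feature transformation applied to an initial pose feature map, assuming 18 skeleton keypoints and the 64x64 output resolution mentioned above; the exact sizes are illustrative, not prescribed by the disclosure.

```python
import torch
import torch.nn as nn

fc = nn.Linear(18 * 64 * 64, 64 * 64)       # sizes are assumptions of this sketch

init_pose = torch.randn(1, 18, 64, 64)      # initial pose feature map (18 keypoints)
pose_map = fc(init_pose.flatten(1)).view(1, 1, 64, 64)  # reduced 64x64 pose feature map
```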
In some embodiments, the parallel network further includes a first residual processing module, a second residual processing module, a first feature processing module, a second feature processing module, a third feature processing module, and a fourth feature processing module, and performing feature fusion processing based on the global feature map of the image and the pose feature map of the image to obtain the plurality of fusion feature maps of the image includes: adding the global feature map of the image and the pose feature map of the image to obtain a first initial fusion feature map of the image, inputting the first initial fusion feature map of the image into the first residual processing module for feature transformation to obtain a first residual processing result, and downsampling the first residual processing result to obtain a first downsampling result; adding the first downsampling result and the pose feature map of the image to obtain a second initial fusion feature map of the image, inputting the second initial fusion feature map of the image into the second residual processing module for feature transformation to obtain a second residual processing result, and downsampling the second residual processing result to obtain a second downsampling result; inputting the second downsampling result into the first feature processing module for feature processing to obtain a first feature processing result, and downsampling the first feature processing result to obtain a third downsampling result; inputting the third downsampling result into the second feature processing module for feature processing to obtain a second feature processing result; adding the two second feature processing results to obtain a third initial fusion feature map of the image, inputting the third initial fusion feature map of the image into the third feature processing module for feature processing to obtain a third feature processing result, and upsampling the third feature processing result to obtain a first upsampling result; adding the first upsampling result and the first feature processing result to obtain a fourth initial fusion feature map of the image, and inputting the fourth initial fusion feature map of the image into the fourth feature processing module for feature processing to obtain a fourth feature processing result; and determining the first feature processing result, the second feature processing result, the third feature processing result, and the fourth feature processing result as the plurality of fusion feature maps of the image.
In some embodiments, the feature fusion model includes the first residual processing module, the second residual processing module, the first feature processing module, the second feature processing module, the third feature processing module, and the fourth feature processing module. The first residual processing module includes three residual processing units; such a residual processing unit contains only a residual calculation layer, and a residual calculation layer consists of a 1x1 convolution layer, a batch normalization layer, a 3x3 convolution layer, a PReLU activation layer, and a 1x1 convolution layer. The second residual processing module includes four residual processing units. The first feature processing module includes six feature processing units, where a feature processing unit consists of a residual calculation layer, a self-attention layer, and a cross-attention layer. The second feature processing module includes seven feature processing units, the third feature processing module includes seven feature processing units, and the fourth feature processing module includes six feature processing units.
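The two building blocks just described can be sketched as follows; the skip connection in `ResidualUnit`, the head count, and the attention layout are assumptions of this sketch, since the text specifies only the layer composition.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual calculation layer: 1x1 conv, batch norm, 3x3 conv, PReLU, 1x1 conv."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 1),
            nn.BatchNorm2d(ch),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.PReLU(),
            nn.Conv2d(ch, ch, 1),
        )

    def forward(self, x):
        return x + self.body(x)              # skip connection is an assumption

class FeatureProcessingUnit(nn.Module):
    """Feature processing unit: residual layer, self-attention, cross-attention."""
    def __init__(self, ch: int, heads: int = 4):
        super().__init__()
        self.res = ResidualUnit(ch)
        self.self_attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(ch, heads, batch_first=True)

    def forward(self, x, context):
        b, c, h, w = x.shape
        seq = self.res(x).flatten(2).transpose(1, 2)     # (B, H*W, C)
        seq, _ = self.self_attn(seq, seq, seq)
        seq, _ = self.cross_attn(seq, context, context)  # attend to conditioning features
        return seq.transpose(1, 2).view(b, c, h, w)
```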
The global feature map of the image and the pose feature map of the image are input into the feature fusion model of the parallel network. Adding the global feature map of the image and the pose feature map of the image fuses the global information of the image with its pose information, producing the first initial fusion feature map of the image. This map is input into the first residual processing module for feature transformation: a series of convolutions, activation functions, and pooling operations apply a nonlinear transformation to it, yielding the first residual processing result, whose resolution may be 64x64. The first residual processing result is downsampled to obtain the first downsampling result, whose resolution may be 32x32.

The first downsampling result and the pose feature map of the image are added, fusing their information to obtain the second initial fusion feature map of the image. This map is input into the second residual processing module for feature transformation through a series of convolutions, activation functions, and pooling operations, yielding the second residual processing result, whose resolution may be 32x32; downsampling it gives the second downsampling result, whose resolution may be 16x16.

The second downsampling result is input into the first feature processing module, and through a series of residual, self-attention, and cross-attention processing the first feature processing result is obtained, whose resolution may be 16x16; downsampling it gives the third downsampling result, whose resolution may be 8x8.

The third downsampling result is input into the second feature processing module, and through a series of residual, self-attention, and cross-attention processing the second feature processing result is obtained, whose resolution may be 8x8.

The two second feature processing results are added to obtain the third initial fusion feature map of the image, which is input into the third feature processing module; through a series of residual, self-attention, and cross-attention processing the third feature processing result is obtained, whose resolution may be 8x8, and upsampling it gives the first upsampling result, whose resolution may be 16x16.

The first upsampling result and the first feature processing result are added to obtain the fourth initial fusion feature map of the image, which is input into the fourth feature processing module; through a series of residual, self-attention, and cross-attention processing the fourth feature processing result is obtained, whose resolution may be 16x16.

The first, second, third, and fourth feature processing results are determined as the plurality of fusion feature maps of the image. Using these fusion feature maps for tasks such as noise prediction on the noise image yields more accurate noise predictions and allows an accurate, faithful target image containing the target object in the second pose to be generated.
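The data flow of the fusion branch described in the preceding paragraphs can be sketched under the stated resolutions (64 -> 32 -> 16 -> 8 -> 16); average pooling, nearest-neighbor upsampling, pooling the pose feature map to match shapes, and mirroring "adding the two second feature processing results" as `f2 + f2` are all assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def fusion_branch(global_feat, pose_feat, res1, res2, fp1, fp2, fp3, fp4):
    """Fusion branch data flow. The module arguments are assumed callables
    built from the ResidualUnit / FeatureProcessingUnit stacks sketched above."""
    x = global_feat + pose_feat                    # first initial fusion map, 64x64
    r1 = res1(x)                                   # first residual result, 64x64
    d1 = F.avg_pool2d(r1, 2)                       # first downsampling, 32x32

    r2 = res2(d1 + F.avg_pool2d(pose_feat, 2))     # second residual result, 32x32
    d2 = F.avg_pool2d(r2, 2)                       # second downsampling, 16x16

    f1 = fp1(d2)                                   # first feature processing result, 16x16
    d3 = F.avg_pool2d(f1, 2)                       # third downsampling, 8x8
    f2 = fp2(d3)                                   # second feature processing result, 8x8

    f3 = fp3(f2 + f2)                              # third initial fusion map -> 8x8 result
    u1 = F.interpolate(f3, scale_factor=2)         # first upsampling, 16x16
    f4 = fp4(u1 + f1)                              # fourth feature processing result, 16x16
    return [f1, f2, f3, f4]                        # the fusion feature maps of the image
```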
In some embodiments, the parallel network includes a third residual processing module, a fourth residual processing module, a fifth feature processing module, a sixth feature processing module, a seventh feature processing module, an eighth feature processing module, a fifth residual processing module, and a sixth residual processing module, and performing noise prediction on the noise image based on the plurality of fusion feature maps of the image, the pose feature map of the target pose image, and the feature map of the noise image to obtain the noise feature map corresponding to the noise image includes: adding the feature map of the noise image and the pose feature map of the target pose image to obtain a first fusion feature map of the noise image, inputting the first fusion feature map of the noise image into the third residual processing module for feature transformation to obtain a third residual processing result, and downsampling the third residual processing result to obtain a fourth downsampling result; adding the fourth downsampling result and the pose feature map of the target pose image to obtain a second fusion feature map of the noise image, inputting the second fusion feature map of the noise image into the fourth residual processing module for feature transformation to obtain a fourth residual processing result, and downsampling the fourth residual processing result to obtain a fifth downsampling result; performing feature processing through the fifth feature processing module based on the fifth downsampling result, the first feature processing result, the pose feature map of the image, and the pose feature map of the target pose image to obtain a fifth feature processing result, and downsampling the fifth feature processing result to obtain a sixth downsampling result; performing feature processing through the sixth feature processing module based on the sixth downsampling result, the second feature processing result, the pose feature map of the image, and the pose feature map of the target pose image to obtain a sixth feature processing result; adding the two sixth feature processing results to obtain a third fusion feature map of the noise image, performing feature processing through the seventh feature processing module based on the third fusion feature map of the noise image, the third feature processing result, the pose feature map of the image, and the pose feature map of the target pose image to obtain a seventh feature processing result, and upsampling the seventh feature processing result to obtain a second upsampling result; adding the second upsampling result and the fifth feature processing result to obtain a fourth fusion feature map of the noise image, performing feature processing through the eighth feature processing module based on the fourth fusion feature map of the noise image, the fourth feature processing result, the pose feature map of the image, and the pose feature map of the target pose image to obtain an eighth feature processing result, and upsampling the eighth feature processing result to obtain a third upsampling result; adding the third upsampling result, the pose feature map of the target pose image, and the fourth residual processing result to obtain a fifth fusion feature map of the noise image, inputting the fifth fusion feature map of the noise image into the fifth residual processing module for feature transformation to obtain a fifth residual processing result, and upsampling the fifth residual processing result to obtain a fourth upsampling result; adding the fourth upsampling result, the pose feature map of the target pose image, and the third residual processing result to obtain a sixth fusion feature map of the noise image, and inputting the sixth fusion feature map of the noise image into the sixth residual processing module for feature transformation to obtain a sixth residual processing result; and performing noise prediction processing on the feature map of the noise image based on the sixth residual processing result to obtain the noise feature map corresponding to the noise image.
In some embodiments, the denoising module of the diffusion model further includes the third residual processing module, the fourth residual processing module, the fifth feature processing module, the sixth feature processing module, the seventh feature processing module, the eighth feature processing module, the fifth residual processing module, and the sixth residual processing module. The third residual processing module includes three residual processing units; such a residual processing unit contains only a residual calculation layer, and a residual calculation layer consists of a 1x1 convolution layer, a batch normalization layer, a 3x3 convolution layer, a PReLU activation layer, and a 1x1 convolution layer. The fourth residual processing module includes four residual processing units. The fifth feature processing module includes six feature processing units, where a feature processing unit consists of a residual calculation layer, a self-attention layer, and a cross-attention layer. The sixth feature processing module includes seven feature processing units, the seventh feature processing module includes seven feature processing units, the eighth feature processing module includes six feature processing units, the fifth residual processing module includes four residual processing units, and the sixth residual processing module includes three residual processing units.
In some embodiments, the plurality of fusion feature maps of the image, the pose feature map of the image, the pose feature map of the target pose image, and the feature map of the noise image are input into the diffusion model. The feature map of the noise image and the pose feature map of the target pose image are added, fusing the information of the noise image with the pose information of the target pose image to obtain the first fusion feature map of the noise image. This map is input into the third residual processing module for feature transformation through a series of convolutions, activation functions, and pooling operations, yielding the third residual processing result, whose resolution may be 64x64; downsampling it gives the fourth downsampling result, whose resolution may be 32x32.

The fourth downsampling result and the pose feature map of the target pose image are added, fusing their information to obtain the second fusion feature map of the noise image. This map is input into the fourth residual processing module for feature transformation through a series of convolutions, activation functions, and pooling operations, yielding the fourth residual processing result, whose resolution may be 32x32; downsampling it gives the fifth downsampling result, whose resolution may be 16x16.

The fifth downsampling result, the first feature processing result, the pose feature map of the image, and the pose feature map of the target pose image are input into the fifth feature processing module, and through a series of residual, self-attention, and cross-attention processing the fifth feature processing result is obtained, whose resolution may be 16x16; downsampling it gives the sixth downsampling result, whose resolution may be 8x8.

The sixth downsampling result, the second feature processing result, the pose feature map of the image, and the pose feature map of the target pose image are input into the sixth feature processing module, and through a series of residual, self-attention, and cross-attention processing the sixth feature processing result is obtained, whose resolution may be 8x8.

The two sixth feature processing results are added to obtain the third fusion feature map of the noise image. The third fusion feature map of the noise image, the third feature processing result, the pose feature map of the image, and the pose feature map of the target pose image are input into the seventh feature processing module, and through a series of residual, self-attention, and cross-attention processing the seventh feature processing result is obtained, whose resolution may be 8x8; upsampling it gives the second upsampling result, whose resolution may be 16x16.

The second upsampling result and the fifth feature processing result are added, fusing their information to obtain the fourth fusion feature map of the noise image. The fourth fusion feature map of the noise image, the fourth feature processing result, the pose feature map of the image, and the pose feature map of the target pose image are input into the eighth feature processing module, and through a series of residual, self-attention, and cross-attention processing the eighth feature processing result is obtained, whose resolution may be 16x16; upsampling it gives the third upsampling result, whose resolution may be 32x32.

The third upsampling result, the pose feature map of the target pose image, and the fourth residual processing result are added to obtain the fifth fusion feature map of the noise image, a richer and more accurate feature representation. It is input into the fifth residual processing module for feature transformation through a series of convolutions, activation functions, and pooling operations, yielding the fifth residual processing result, whose resolution may be 32x32; upsampling it gives the fourth upsampling result, whose resolution may be 64x64.

The fourth upsampling result, the pose feature map of the target pose image, and the third residual processing result are added to obtain the sixth fusion feature map of the noise image, which is input into the sixth residual processing module for feature transformation through a series of convolutions, activation functions, and pooling operations, yielding the sixth residual processing result, whose resolution may be 64x64.

Noise prediction processing is performed on the feature map of the noise image based on the sixth residual processing result, yielding the noise feature map corresponding to the noise image; this map describes the noise features predicted by the parallel network in the noise image, such as the type, degree, and distribution of the noise. The noise prediction processing is performed t times on the feature map of the noise image, and repeating this process t times yields the t noise feature maps corresponding to the noise image. The subsequent target image generation task can then be performed based on these t noise feature maps, improving the accuracy of object pose generation in the image.
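Correspondingly, the denoising branch data flow might be sketched as follows; combining the conditioning inputs by simple addition (pooled to matching sizes) stands in for the cross-attention conditioning described above, and the module dictionary `mods` is a convenience of this sketch.

```python
import torch
import torch.nn.functional as F

def denoise_branch(noisy_feat, pose_tgt, pose_img, fused, mods):
    """Denoising branch data flow (64 -> 32 -> 16 -> 8 -> 16 -> 32 -> 64).
    `fused` holds the fusion feature maps [f1, f2, f3, f4] of the image;
    `mods` maps names to the residual / feature processing modules."""
    f1, f2, f3, f4 = fused
    x = noisy_feat + pose_tgt                               # first fusion map, 64x64
    r3 = mods["res3"](x)                                    # third residual result
    d4 = F.avg_pool2d(r3, 2)                                # fourth downsampling, 32x32
    r4 = mods["res4"](d4 + F.avg_pool2d(pose_tgt, 2))       # fourth residual result
    d5 = F.avg_pool2d(r4, 2)                                # fifth downsampling, 16x16

    f5 = mods["fp5"](d5 + f1 + F.avg_pool2d(pose_img, 4) + F.avg_pool2d(pose_tgt, 4))
    d6 = F.avg_pool2d(f5, 2)                                # sixth downsampling, 8x8
    f6 = mods["fp6"](d6 + f2 + F.avg_pool2d(pose_img, 8) + F.avg_pool2d(pose_tgt, 8))

    f7 = mods["fp7"](f6 + f6 + f3 + F.avg_pool2d(pose_img, 8) + F.avg_pool2d(pose_tgt, 8))
    u2 = F.interpolate(f7, scale_factor=2)                  # second upsampling, 16x16
    f8 = mods["fp8"](u2 + f5 + f4 + F.avg_pool2d(pose_img, 4) + F.avg_pool2d(pose_tgt, 4))
    u3 = F.interpolate(f8, scale_factor=2)                  # third upsampling, 32x32

    r5 = mods["res5"](u3 + F.avg_pool2d(pose_tgt, 2) + r4)  # fifth residual result
    u4 = F.interpolate(r5, scale_factor=2)                  # fourth upsampling, 64x64
    r6 = mods["res6"](u4 + pose_tgt + r3)                   # sixth residual result
    return r6  # noise prediction on the noise image's feature map is based on this
```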
Referring to fig. 3, the parallel network includes a diffusion model noise adding module 301, a first convolution layer 302, a human body posture model skeleton posture extraction network 303, a first full connection layer 304, a second full connection layer 305, a second convolution layer 306, a first residual processing module 307, a second residual processing module 308, a first feature processing module 309, a second feature processing module 310, a third feature processing module 311, a fourth feature processing module 312, a third residual processing module 313, a fourth residual processing module 314, a fifth feature processing module 315, a sixth feature processing module 316, a seventh feature processing module 317, an eighth feature processing module 318, a fifth residual processing module 319 and a sixth residual processing module 320. The image is input into the diffusion model noise adding module 301 for noise adding processing to obtain a noise image corresponding to the image, and the noise image is input into the first convolution layer 302 for convolution processing to obtain a feature map of the noise image. The target gesture image is input into the human body posture model skeleton posture extraction network 303 for skeleton gesture feature extraction to obtain an initial gesture feature map of the target gesture image, and the initial gesture feature map of the target gesture image is input into the first full connection layer 304 for feature transformation to obtain the gesture feature map of the target gesture image. Likewise, the image is input into the human body posture model skeleton posture extraction network 303 to obtain an initial gesture feature map of the image, and the initial gesture feature map of the image is input into the second full connection layer 305 for feature transformation to obtain the gesture feature map of the image. The image is also input into the second convolution layer 306 for convolution processing to obtain the global feature map of the image.
The global feature map of the image and the gesture feature map of the image are added to obtain a first initial fusion feature map of the image, which is input into the first residual processing module 307 for feature transformation to obtain a first residual processing result; the first residual processing result is downsampled to obtain a first downsampling result. The first downsampling result and the gesture feature map of the image are added to obtain a second initial fusion feature map of the image, which is input into the second residual processing module 308 for feature transformation to obtain a second residual processing result; the second residual processing result is downsampled to obtain a second downsampling result. The second downsampling result is input into the first feature processing module 309 for feature processing to obtain a first feature processing result, which is downsampled to obtain a third downsampling result. The third downsampling result is input into the second feature processing module 310 for feature processing to obtain a second feature processing result. The two second feature processing results are added to obtain a third initial fusion feature map of the image, which is input into the third feature processing module 311 for feature processing to obtain a third feature processing result; the third feature processing result is upsampled to obtain a first upsampling result. The first upsampling result and the first feature processing result are added to obtain a fourth initial fusion feature map of the image, which is input into the fourth feature processing module 312 for feature processing to obtain a fourth feature processing result.
The feature map of the noise image and the gesture feature map of the target gesture image are added to obtain a first fusion feature map of the noise image, which is input into the third residual processing module 313 for feature transformation to obtain a third residual processing result; the third residual processing result is downsampled to obtain a fourth downsampling result. The fourth downsampling result and the gesture feature map of the target gesture image are added to obtain a second fusion feature map of the noise image, which is input into the fourth residual processing module 314 for feature transformation to obtain a fourth residual processing result; the fourth residual processing result is downsampled to obtain a fifth downsampling result. The fifth downsampling result, the first feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image are input into the fifth feature processing module 315 for feature processing to obtain a fifth feature processing result, which is downsampled to obtain a sixth downsampling result. The sixth downsampling result, the second feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image are input into the sixth feature processing module 316 for feature processing to obtain a sixth feature processing result. The two sixth feature processing results are added to obtain a third fusion feature map of the noise image; this map, the third feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image are input into the seventh feature processing module 317 for feature processing to obtain a seventh feature processing result, which is upsampled to obtain a second upsampling result. The second upsampling result and the fifth feature processing result are added to obtain a fourth fusion feature map of the noise image; this map, the fourth feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image are input into the eighth feature processing module 318 for feature processing to obtain an eighth feature processing result, which is upsampled to obtain a third upsampling result. The third upsampling result, the gesture feature map of the target gesture image and the fourth residual processing result are added to obtain a fifth fusion feature map of the noise image, which is input into the fifth residual processing module 319 for feature transformation to obtain a fifth residual processing result; the fifth residual processing result is upsampled to obtain a fourth upsampling result. The fourth upsampling result, the gesture feature map of the target gesture image and the third residual processing result are added to obtain a sixth fusion feature map of the noise image, which is input into the sixth residual processing module 320 for feature transformation to obtain a sixth residual processing result.
Noise prediction processing is then performed on the feature map of the noise image based on the sixth residual processing result to obtain the feature map of the noise corresponding to the noise image. Subsequent target image generation tasks can be performed based on this noise feature map, improving the accuracy of object gesture generation in the image.
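The overall data flow of fig. 3 can be condensed into a structural sketch. All residual and feature processing modules below are identity stand-ins, so only the wiring is shown (an attention-based feature processing unit is sketched after the fig. 4 discussion below); the fuse helper resizes the pose map purely so the stub runs with a single tensor, whereas the network itself keeps resolutions aligned. This is an assumed reading of the figure, not the implementation of the present disclosure.

```python
import torch
import torch.nn.functional as F

def down(x): return F.avg_pool2d(x, 2)                 # halve resolution
def up(x):   return F.interpolate(x, scale_factor=2.0) # double resolution

def fuse(x, pose):
    # Add gesture features; resizing is an illustration convenience only.
    return x + F.interpolate(pose, size=x.shape[-2:])

def res(x): return x                  # stand-in residual processing module
def feat(x, skip=None, p_img=None, p_tgt=None):
    return x                          # stand-in feature processing module

def image_branch(global_feat, pose_img):
    f = down(res(fuse(global_feat, pose_img)))  # module 307 + downsample
    f = down(res(fuse(f, pose_img)))            # module 308 + downsample
    r1 = feat(f)                                # module 309 result
    r2 = feat(down(r1))                         # module 310 result
    r3 = feat(r2 + r2)                          # module 311: literal reading of
                                                # "adding the two second results"
    r4 = feat(up(r3) + r1)                      # module 312 result
    return r1, r2, r3, r4

def noise_branch(noise_feat, pose_tgt, pose_img, r1, r2, r3, r4):
    s3 = res(fuse(noise_feat, pose_tgt))        # module 313 result
    s4 = res(fuse(down(s3), pose_tgt))          # module 314 result
    f5 = feat(down(s4), r1, pose_img, pose_tgt) # module 315 result
    f6 = feat(down(f5), r2, pose_img, pose_tgt) # module 316 result
    f7 = feat(f6 + f6, r3, pose_img, pose_tgt)  # module 317 result
    f8 = feat(up(f7) + f5, r4, pose_img, pose_tgt)  # module 318 result
    f = up(res(fuse(up(f8), pose_tgt) + s4))    # module 319 + upsample
    return res(fuse(f, pose_tgt) + s3)          # module 320: noise map basis

r = image_branch(torch.randn(1, 8, 64, 64), torch.randn(1, 8, 64, 64))
out = noise_branch(torch.randn(1, 8, 64, 64), torch.randn(1, 8, 64, 64),
                   torch.randn(1, 8, 64, 64), *r)
print(out.shape)  # torch.Size([1, 8, 64, 64])
```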
In some embodiments, performing, by the fifth feature processing module, feature processing based on the fifth downsampling result, the first feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a fifth feature processing result includes: performing convolution processing on the fifth downsampling result to obtain a convolved feature vector; pooling the gesture feature map of the image to obtain a gesture feature vector of the image, and pooling the gesture feature map of the target gesture image to obtain a gesture feature vector of the target gesture image; performing matrix mapping on the convolved feature vector based on a query matrix to obtain a query vector; splicing the gesture feature vector of the image, the gesture feature vector of the target gesture image and the convolved feature vector to obtain a spliced feature vector, performing matrix mapping on the spliced feature vector based on a key matrix to obtain a key vector, and performing matrix mapping on the spliced feature vector based on a value matrix to obtain a value vector; performing self-attention encoding based on the query vector, the key vector and the value vector to obtain a self-attention encoding result; and performing cross-attention processing with the self-attention encoding result as the query vector and the first feature processing result as the key and value vectors to obtain the fifth feature processing result.
In some embodiments, the fifth feature processing module includes six feature processing units, each comprising a residual calculation layer, a self-attention layer and a cross-attention layer. The fifth downsampling result is input into the residual calculation layer of the fifth feature processing module for convolution processing, yielding a richer and more accurate feature representation, i.e., the convolved feature vector. The gesture feature map of the image and the gesture feature map of the target gesture image are input into the fifth feature processing module; the gesture feature map of the image is pooled along the width and height dimensions to extract key gesture features, giving the gesture feature vector of the image, and the gesture feature map of the target gesture image is pooled in the same way, giving the gesture feature vector of the target gesture image. The query matrix (to_q) is a parameter of the parallel network learned during training; mapping the convolved feature vector through the query matrix projects it into another space or dimension, producing the query vector (query). The gesture feature vector of the image, the gesture feature vector of the target gesture image and the convolved feature vector are spliced, fusing the information of the three vectors into a spliced feature vector. The key matrix (to_k) and the value matrix (to_v) are likewise parameters learned during training: mapping the spliced feature vector through the key matrix projects it into another space or dimension, producing the key vector (key), and mapping it through the value matrix produces the value vector (value). Self-attention encoding is performed based on the query, key and value vectors: the dot product of the query vector and the key vector gives a weight distribution representing their similarity, and multiplying this weight distribution by the value vector gives the weighted value vector, i.e., the self-attention encoding result. Cross-attention processing is then performed with the self-attention encoding result as the query vector and the first feature processing result as the key and value vectors: matching the query against the key vectors gives a weight distribution, which weights the value vectors to produce the final feature representation, i.e., the output of the feature processing unit. Feature processing proceeds through the six feature processing units of the fifth feature processing module as described above, yielding the fifth feature processing result.
As shown in fig. 4, the fifth feature processing module 315 includes a first feature processing unit 410, a second feature processing unit 420, a third feature processing unit 430, a fourth feature processing unit 440, a fifth feature processing unit 450 and a sixth feature processing unit 460; each feature processing unit comprises a residual calculation layer, a self-attention layer and a cross-attention layer, as illustrated for the first feature processing unit 410 with its residual calculation layer 411, self-attention layer 412 and cross-attention layer 413. The fifth downsampling result is input into the residual calculation layer 411 of the first feature processing unit 410 for convolution processing to obtain the convolved feature vector. The gesture feature map of the image and the gesture feature map of the target gesture image are input into the fifth feature processing module 315, and self-attention encoding is performed based on the convolved feature vector, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a self-attention encoding result. The first feature processing result is input into the cross-attention layer 413, and cross-attention processing is performed based on the first feature processing result and the self-attention encoding result to obtain a cross-attention processing result. The cross-attention processing result, the first feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image are then passed through the remaining feature processing units (the second feature processing unit 420, the third feature processing unit 430, the fourth feature processing unit 440, the fifth feature processing unit 450 and the sixth feature processing unit 460) of the fifth feature processing module 315, each following the processing mode of the first feature processing unit, to finally obtain the fifth feature processing result.
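Read as attention, the unit described above maps onto a small PyTorch module. The sketch below follows the to_q, to_k and to_v naming from the text: queries come from the convolved features, keys and values come from the spliced pose-and-feature sequence, and the self-attention result is used directly as the cross-attention query against the first feature processing result. Mean pooling, a shared channel width for all inputs and single-head attention are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FeatureProcessingUnit(nn.Module):
    """One feature processing unit: residual convolution, gesture-conditioned
    self-attention, then cross-attention against the image branch's result."""

    def __init__(self, dim: int):
        super().__init__()
        self.res_conv = nn.Conv2d(dim, dim, 3, padding=1)  # residual calculation layer
        self.to_q = nn.Linear(dim, dim, bias=False)        # query matrix (to_q)
        self.to_k = nn.Linear(dim, dim, bias=False)        # key matrix (to_k)
        self.to_v = nn.Linear(dim, dim, bias=False)        # value matrix (to_v)

    def forward(self, x, pose_img, pose_tgt, image_feat):
        b, c, h, w = x.shape
        feat = x + self.res_conv(x)                  # convolved feature vector
        tokens = feat.flatten(2).transpose(1, 2)     # (b, h*w, c)

        # Pool the gesture feature maps over width and height into vectors.
        p_img = pose_img.mean(dim=(2, 3)).unsqueeze(1)   # (b, 1, c)
        p_tgt = pose_tgt.mean(dim=(2, 3)).unsqueeze(1)   # (b, 1, c)

        # Queries from convolved features; keys/values from the spliced
        # sequence of gesture vectors and convolved features.
        q = self.to_q(tokens)
        spliced = torch.cat([p_img, p_tgt, tokens], dim=1)
        k, v = self.to_k(spliced), self.to_v(spliced)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        sa = attn @ v                                # self-attention encoding result

        # Cross-attention: the self-attention result is the query; the first
        # feature processing result supplies keys and values directly.
        kv = image_feat.flatten(2).transpose(1, 2)
        attn2 = torch.softmax(sa @ kv.transpose(1, 2) / c ** 0.5, dim=-1)
        out = attn2 @ kv
        return out.transpose(1, 2).reshape(b, c, h, w)

unit = FeatureProcessingUnit(dim=32)
y = unit(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16),
         torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
print(y.shape)  # torch.Size([1, 32, 16, 16])
```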
In some embodiments, generating the target image based on the feature map of the noise and the feature map of the noise image includes: subtracting the feature map of the noise from the feature map of the noise image to obtain a feature map of the target image; and inputting the feature map of the target image into a decoder for decoding processing to obtain the target image containing the target object with the target gesture.
In some embodiments, the t noise feature maps are subtracted from the feature map of the noise image, removing the influence of the noise features and leaving a purer feature map of the target image. The feature map of the target image is input into a decoder for decoding processing; the decoder is a deep learning model, for example the decoder of a Transformer model, which converts the feature map of the target image into a pixel representation of the target image. Through the decoder, a target image containing the target object is obtained, in which the target object has the second gesture, i.e., the same gesture as the object in the target gesture image. In the use stage of the parallel network, images of the target object in different gestures can therefore be generated based on a single reference image and a plurality of target gesture images.
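A minimal sketch of this generation step, assuming the t predicted noise feature maps are removed by direct subtraction and that decoder wraps the Transformer-style decoder mentioned above (its interface is hypothetical):

```python
from typing import Callable, List
import torch

def generate_target_image(noise_image_feat: torch.Tensor,
                          noise_maps: List[torch.Tensor],
                          decoder: Callable[[torch.Tensor], torch.Tensor]):
    """Strip the predicted noise from the noisy feature map, then decode
    the purified feature map into the pixel-space target image."""
    target_feat = noise_image_feat
    for noise in noise_maps:          # subtract each of the t noise maps
        target_feat = target_feat - noise
    return decoder(target_feat)       # target object in the target gesture
```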
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 5 is a schematic diagram of an object gesture generating device in an image according to an embodiment of the present disclosure. As shown in fig. 5, the object gesture generating device in the image includes:
An acquiring module 501, configured to acquire an image and a target gesture image, where the gesture of an object in the image is different from the gesture of the object in the target gesture image;
The noise adding module 502 is configured to input an image into a parallel network to perform noise adding processing on the image, so as to obtain a feature map of a noise image corresponding to the image;
The gesture feature extraction module 503 is configured to perform skeletal gesture feature extraction on the target gesture image to obtain a gesture feature map of the target gesture image, and perform skeletal gesture feature extraction on the image to obtain a gesture feature map of the image;
the fusion module 504 is configured to perform convolution processing on the image to obtain a global feature map of the image, and perform feature fusion processing based on the global feature map of the image and the gesture feature map of the image to obtain a plurality of fusion feature maps of the image;
The noise prediction module 505 is configured to perform noise prediction on the noise image based on the multiple fusion feature maps of the image, the gesture feature map of the target gesture image, and the feature map of the noise image, so as to obtain a feature map of noise corresponding to the noise image;
an object gesture image generation module 506, configured to generate a target image based on the feature map of the noise and the feature map of the noise image, where the object gesture of the target image is the same as the object gesture in the target gesture image.
According to the technical solution provided by the embodiments of the present disclosure, a parallel network is designed, and the image and the target gesture image are input into the parallel network. The image contains the target object whose gesture is to be changed: the gesture of the target object in the image is the first gesture, and the gesture of the object in the target gesture image is the second gesture. The network outputs a target image in which the gesture of the target object is the second gesture; when target gesture images with different second gestures are input into the diffusion model, multiple images of the target object in different gestures can be obtained. The image is input into the parallel network for noise adding processing to obtain the feature map of the corresponding noise image, and skeleton gesture feature extraction is performed on the target gesture image and on the image to obtain the gesture feature map of the target gesture image and the gesture feature map of the image. Feature fusion processing on the global feature map of the image and the gesture feature map of the image yields a plurality of fusion feature maps of the image; noise prediction on the noise image based on the plurality of fusion feature maps of the image, the gesture feature map of the target gesture image and the feature map of the noise image yields the noise feature map corresponding to the noise image; and the target image, in which the gesture of the target object is the second gesture, is generated from the noise feature map and the feature map of the noise image. The parallel network comprises two networks: one adds noise to and removes noise from the image, and the other extracts image features while skeleton gesture feature extraction and interactive computation are performed on the target gesture image and the image. Through the interactive computation of the two networks, the target object with the changed gesture is finally output. In the use stage, target images of the same object in different gestures can thus be generated without additional training, which addresses the high cost of generating images of the same object in different gestures in the prior art.
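At inference time, the scheme amounts to reusing one reference image with several target gesture images. The snippet below is purely illustrative: load_image and parallel_network_pipeline are hypothetical helpers standing in for the components of fig. 3, not functions defined in this disclosure.

```python
from typing import Any

def load_image(path: str) -> Any:
    """Hypothetical I/O helper; stands in for actual image loading."""
    ...

def parallel_network_pipeline(reference: Any, target_gesture: Any) -> Any:
    """Hypothetical wrapper around the fig. 3 components."""
    ...

# One reference image, several target gesture images, no retraining.
reference = load_image("reference_person.jpg")
gestures = [load_image(p) for p in ("gesture_a.jpg", "gesture_b.jpg")]
targets = [parallel_network_pipeline(reference, g) for g in gestures]
```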
In some embodiments, the parallel network includes a human body posture model skeleton posture extraction network and a first full connection layer, and the gesture feature extraction module 503 is configured to perform skeleton gesture feature extraction on the target gesture image through the human body posture model skeleton posture extraction network to obtain an initial gesture feature map of the target gesture image, and to perform feature transformation on the initial gesture feature map of the target gesture image through the first full connection layer to obtain the gesture feature map of the target gesture image.
In some embodiments, the parallel network includes a second full connection layer, and the gesture feature extraction module 503 is configured to perform skeleton gesture feature extraction on the image through the human body posture model skeleton posture extraction network to obtain an initial gesture feature map of the image, and to perform feature transformation on the initial gesture feature map of the image through the second full connection layer to obtain the gesture feature map of the image.
In some embodiments, the parallel network includes a first residual processing module, a second residual processing module, a first feature processing module, a second feature processing module, a third feature processing module and a fourth feature processing module, and the fusion module 504 is configured to: add the global feature map of the image and the gesture feature map of the image to obtain a first initial fusion feature map of the image, input the first initial fusion feature map of the image into the first residual processing module for feature transformation to obtain a first residual processing result, and downsample the first residual processing result to obtain a first downsampling result; add the first downsampling result and the gesture feature map of the image to obtain a second initial fusion feature map of the image, input the second initial fusion feature map of the image into the second residual processing module for feature transformation to obtain a second residual processing result, and downsample the second residual processing result to obtain a second downsampling result; input the second downsampling result into the first feature processing module for feature processing to obtain a first feature processing result, and downsample the first feature processing result to obtain a third downsampling result; input the third downsampling result into the second feature processing module for feature processing to obtain a second feature processing result; add the two second feature processing results to obtain a third initial fusion feature map of the image, input the third initial fusion feature map of the image into the third feature processing module for feature processing to obtain a third feature processing result, and upsample the third feature processing result to obtain a first upsampling result; add the first upsampling result and the first feature processing result to obtain a fourth initial fusion feature map of the image, and input the fourth initial fusion feature map of the image into the fourth feature processing module for feature processing to obtain a fourth feature processing result; and determine the first feature processing result, the second feature processing result, the third feature processing result and the fourth feature processing result as the plurality of fusion feature maps of the image.
In some embodiments, the parallel network includes a third residual processing module, a fourth residual processing module, a fifth feature processing module, a sixth feature processing module, a seventh feature processing module, an eighth feature processing module, a fifth residual processing module and a sixth residual processing module, and the noise prediction module 505 is configured to: add the feature map of the noise image and the gesture feature map of the target gesture image to obtain a first fusion feature map of the noise image, input the first fusion feature map of the noise image into the third residual processing module for feature transformation to obtain a third residual processing result, and downsample the third residual processing result to obtain a fourth downsampling result; add the fourth downsampling result and the gesture feature map of the target gesture image to obtain a second fusion feature map of the noise image, input the second fusion feature map of the noise image into the fourth residual processing module for feature transformation to obtain a fourth residual processing result, and downsample the fourth residual processing result to obtain a fifth downsampling result; perform, by the fifth feature processing module, feature processing based on the fifth downsampling result, the first feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a fifth feature processing result, and downsample the fifth feature processing result to obtain a sixth downsampling result; perform, by the sixth feature processing module, feature processing based on the sixth downsampling result, the second feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a sixth feature processing result; add the two sixth feature processing results to obtain a third fusion feature map of the noise image, perform, by the seventh feature processing module, feature processing based on the third fusion feature map of the noise image, the third feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a seventh feature processing result, and upsample the seventh feature processing result to obtain a second upsampling result; add the second upsampling result and the fifth feature processing result to obtain a fourth fusion feature map of the noise image, perform, by the eighth feature processing module, feature processing based on the fourth fusion feature map of the noise image, the fourth feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain an eighth feature processing result, and upsample the eighth feature processing result to obtain a third upsampling result; add the third upsampling result, the gesture feature map of the target gesture image and the fourth residual processing result to obtain a fifth fusion feature map of the noise image, input the fifth fusion feature map of the noise image into the fifth residual processing module for feature transformation to obtain a fifth residual processing result, and upsample the fifth residual processing result to obtain a fourth upsampling result; add the fourth upsampling result, the gesture feature map of the target gesture image and the third residual processing result to obtain a sixth fusion feature map of the noise image, and input the sixth fusion feature map of the noise image into the sixth residual processing module for feature transformation to obtain a sixth residual processing result; and perform noise prediction processing on the feature map of the noise image based on the sixth residual processing result to obtain the feature map of noise corresponding to the noise image.
In some embodiments, the noise prediction module 505 is configured to: perform convolution processing on the fifth downsampling result to obtain a convolved feature vector; pool the gesture feature map of the image to obtain a gesture feature vector of the image, and pool the gesture feature map of the target gesture image to obtain a gesture feature vector of the target gesture image; perform matrix mapping on the convolved feature vector based on a query matrix to obtain a query vector; splice the gesture feature vector of the image, the gesture feature vector of the target gesture image and the convolved feature vector to obtain a spliced feature vector, perform matrix mapping on the spliced feature vector based on a key matrix to obtain a key vector, and perform matrix mapping on the spliced feature vector based on a value matrix to obtain a value vector; perform self-attention encoding based on the query vector, the key vector and the value vector to obtain a self-attention encoding result; and perform cross-attention processing with the self-attention encoding result as the query vector and the first feature processing result as the key and value vectors to obtain the fifth feature processing result.
In some embodiments, the object gesture image generation module 506 is configured to subtract the feature map of the noise from the feature map of the noise image to obtain a feature map of the target image, and to input the feature map of the target image into a decoder for decoding processing to obtain the target image containing the target object with the target gesture.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure.
Fig. 6 is a schematic diagram of an electronic device 6 provided by an embodiment of the present disclosure. As shown in fig. 6, the electronic device 6 of this embodiment includes a processor 601, a memory 602 and a computer program 603 stored in the memory 602 and executable on the processor 601. When executing the computer program 603, the processor 601 implements the steps of the method embodiments described above, or performs the functions of the modules/units of the apparatus embodiments described above.
The electronic device 6 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or the like. The electronic device 6 may include, but is not limited to, the processor 601 and the memory 602. Those skilled in the art will appreciate that fig. 6 is merely an example of the electronic device 6 and does not limit it; the device may include more or fewer components than shown, or different components.
The processor 601 may be a central processing unit (CPU) or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.
The memory 602 may be an internal storage unit of the electronic device 6, for example a hard disk or memory of the electronic device 6. The memory 602 may also be an external storage device of the electronic device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the electronic device 6. The memory 602 may also include both internal and external storage units of the electronic device 6. The memory 602 is used to store the computer program and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium (e.g., a computer-readable storage medium). Based on this understanding, the present disclosure may implement all or part of the flow of the methods in the above embodiments by instructing the relevant hardware through a computer program, which may be stored in a computer-readable storage medium; when executed by a processor, the computer program may implement the steps of the method embodiments described above. The computer program may comprise computer program code in source code form, object code form, executable file or some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (10)

1. A method for generating a pose of an object in an image, comprising:
acquiring an image and a target gesture image, wherein the gesture of an object in the image is different from the gesture of the object in the target gesture image;
inputting the image into a parallel network to perform noise adding processing on the image to obtain a feature map of a noise image corresponding to the image;
Extracting skeleton gesture features from the target gesture image to obtain a gesture feature map of the target gesture image, and extracting skeleton gesture features from the image to obtain a gesture feature map of the image;
Performing convolution processing on the image to obtain a global feature map of the image, and performing feature fusion processing based on the global feature map of the image and the gesture feature map of the image to obtain a plurality of fusion feature maps of the image;
Performing noise prediction on the noise image based on the plurality of fusion feature maps of the image, the gesture feature map of the target gesture image and the feature map of the noise image to obtain a feature map of noise corresponding to the noise image;
And generating a target image based on the feature map of the noise and the feature map of the noise image, wherein the object gesture of the target image is the same as the object gesture in the target gesture image.
2. The method of claim 1, wherein the parallel network comprises a human body posture model skeleton posture extraction network and a first full connection layer, and wherein the performing skeleton gesture feature extraction on the target gesture image to obtain a gesture feature map of the target gesture image comprises:
Extracting skeleton gesture features from the target gesture image through the human body posture model skeleton posture extraction network to obtain an initial gesture feature map of the target gesture image;
And performing feature transformation on the initial gesture feature map of the target gesture image through the first full connection layer to obtain the gesture feature map of the target gesture image.
3. The method of claim 2, wherein the parallel network comprises a second full connection layer, and wherein the performing skeleton gesture feature extraction on the image to obtain a gesture feature map of the image comprises:
Extracting skeleton gesture features from the image through the human body posture model skeleton posture extraction network to obtain an initial gesture feature map of the image;
And performing feature transformation on the initial gesture feature map of the image through the second full connection layer to obtain the gesture feature map of the image.
4. The method of claim 1, wherein the parallel network includes a first residual processing module, a second residual processing module, a first feature processing module, a second feature processing module, a third feature processing module and a fourth feature processing module, and wherein the performing feature fusion processing based on the global feature map of the image and the gesture feature map of the image to obtain a plurality of fusion feature maps of the image comprises:
Adding the global feature map of the image and the gesture feature map of the image to obtain a first initial fusion feature map of the image, inputting the first initial fusion feature map of the image into the first residual processing module to perform feature transformation to obtain a first residual processing result, and performing downsampling processing on the first residual processing result to obtain a first downsampling result;
Adding the first downsampling result and the gesture feature map of the image to obtain a second initial fusion feature map of the image, inputting the second initial fusion feature map of the image into the second residual processing module to perform feature transformation to obtain a second residual processing result, and performing downsampling processing on the second residual processing result to obtain a second downsampling result;
Inputting the second downsampling result into the first feature processing module for feature processing to obtain a first feature processing result, and downsampling the first feature processing result to obtain a third downsampling result;
inputting the third downsampling result into the second feature processing module to perform feature processing to obtain a second feature processing result;
Adding the two second feature processing results to obtain a third initial fusion feature map of the image, inputting the third initial fusion feature map of the image into the third feature processing module to perform feature processing to obtain a third feature processing result, and performing up-sampling processing on the third feature processing result to obtain a first up-sampling result;
Adding the first upsampling result and the first feature processing result to obtain a fourth initial fusion feature map of the image, and inputting the fourth initial fusion feature map of the image into the fourth feature processing module to perform feature processing to obtain a fourth feature processing result;
And determining the first feature processing result, the second feature processing result, the third feature processing result and the fourth feature processing result as the plurality of fusion feature maps of the image.
5. The method of claim 4, wherein the parallel network includes a third residual processing module, a fourth residual processing module, a fifth feature processing module, a sixth feature processing module, a seventh feature processing module, an eighth feature processing module, a fifth residual processing module and a sixth residual processing module, and wherein the performing noise prediction on the noise image based on the plurality of fusion feature maps of the image, the gesture feature map of the target gesture image and the feature map of the noise image to obtain the feature map of noise corresponding to the noise image comprises:
Adding the feature map of the noise image and the gesture feature map of the target gesture image to obtain a first fusion feature map of the noise image, inputting the first fusion feature map of the noise image into the third residual processing module to perform feature transformation to obtain a third residual processing result, and performing downsampling processing on the third residual processing result to obtain a fourth downsampling result;
Adding the fourth downsampling result and the gesture feature map of the target gesture image to obtain a second fusion feature map of the noise image, inputting the second fusion feature map of the noise image into the fourth residual processing module to perform feature transformation to obtain a fourth residual processing result, and performing downsampling processing on the fourth residual processing result to obtain a fifth downsampling result;
Performing, by the fifth feature processing module, feature processing based on the fifth downsampling result, the first feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a fifth feature processing result, and performing downsampling processing on the fifth feature processing result to obtain a sixth downsampling result;
Performing, by the sixth feature processing module, feature processing based on the sixth downsampling result, the second feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a sixth feature processing result;
Adding the two sixth feature processing results to obtain a third fusion feature map of the noise image, performing, by the seventh feature processing module, feature processing based on the third fusion feature map of the noise image, the third feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a seventh feature processing result, and performing upsampling processing on the seventh feature processing result to obtain a second upsampling result;
Adding the second upsampling result and the fifth feature processing result to obtain a fourth fusion feature map of the noise image, performing, by the eighth feature processing module, feature processing based on the fourth fusion feature map of the noise image, the fourth feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain an eighth feature processing result, and performing upsampling processing on the eighth feature processing result to obtain a third upsampling result;
Adding the third upsampling result, the gesture feature map of the target gesture image and the fourth residual processing result to obtain a fifth fusion feature map of the noise image, inputting the fifth fusion feature map of the noise image into the fifth residual processing module to perform feature transformation to obtain a fifth residual processing result, and performing upsampling processing on the fifth residual processing result to obtain a fourth upsampling result;
Adding the fourth upsampling result, the gesture feature map of the target gesture image and the third residual processing result to obtain a sixth fusion feature map of the noise image, and inputting the sixth fusion feature map of the noise image into the sixth residual processing module to perform feature transformation to obtain a sixth residual processing result;
And performing noise prediction processing on the feature map of the noise image based on the sixth residual processing result to obtain the feature map of noise corresponding to the noise image.
6. The method according to claim 5, wherein the performing, by the fifth feature processing module, feature processing based on the fifth downsampling result, the first feature processing result, the gesture feature map of the image and the gesture feature map of the target gesture image to obtain a fifth feature processing result comprises:
Performing convolution processing on the fifth downsampling result to obtain a convolved feature vector;
Pooling the gesture feature map of the image to obtain a gesture feature vector of the image, and pooling the gesture feature map of the target gesture image to obtain a gesture feature vector of the target gesture image;
Performing matrix mapping on the convolved feature vector based on a query matrix to obtain a query vector;
Splicing the gesture feature vector of the image, the gesture feature vector of the target gesture image and the convolved feature vector to obtain a spliced feature vector, performing matrix mapping on the spliced feature vector based on a key matrix to obtain a key vector, and performing matrix mapping on the spliced feature vector based on a value matrix to obtain a value vector;
performing self-attention coding based on the query vector, the key vector and the value vector to obtain a self-attention coding result;
and performing cross attention processing by taking the self attention coding result as a query vector and the first feature processing result as a key vector and a value vector to obtain the fifth feature processing result.
7. The method of claim 1, wherein the generating a target image based on the feature map of the noise and the feature map of the noise image comprises:
subtracting the feature map of the noise from the feature map of the noise image to obtain the feature map of the target image;
And inputting the feature map of the target image into a decoder for decoding processing to obtain a target image containing the target object with the target gesture.
8. A device for generating a pose of an object in an image, comprising:
the acquisition module is used for acquiring an image and a target gesture image, wherein the gesture of an object in the image is different from that of the object in the target gesture image;
The noise adding module is used for inputting the image into a parallel network to carry out noise adding processing on the image so as to obtain a feature map of a noise image corresponding to the image;
The gesture feature extraction module is used for performing skeleton gesture feature extraction on the target gesture image to obtain a gesture feature map of the target gesture image, and performing skeleton gesture feature extraction on the image to obtain a gesture feature map of the image;
the fusion module is used for performing convolution processing on the image to obtain a global feature map of the image, and performing feature fusion processing based on the global feature map of the image and the gesture feature map of the image to obtain a plurality of fusion feature maps of the image;
The noise prediction module is used for performing noise prediction on the noise image based on the plurality of fusion feature maps of the image, the gesture feature map of the target gesture image and the feature map of the noise image to obtain a feature map of noise corresponding to the noise image;
And the object gesture image generation module is used for generating a target image based on the feature map of the noise and the feature map of the noise image, wherein the object gesture of the target image is the same as the object gesture in the target gesture image.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.