CN115471658A - Action migration method and device, terminal equipment and storage medium - Google Patents


Info

Publication number
CN115471658A
CN115471658A (application CN202211154081.1A)
Authority
CN
China
Prior art keywords
image
graph
segmentation
map
foreground
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211154081.1A
Other languages
Chinese (zh)
Inventor
刘鑫辰
刘武
杨权威
梅涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN202211154081.1A
Publication of CN115471658A
Priority to PCT/CN2023/097712 (WO2024060669A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 Arrangements using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an action migration method and apparatus, a terminal device and a storage medium. The method includes: acquiring a key point connection graph of a first object in a driving image and a first segmentation map of each preset region of a second object in a source image, where the key point connection graph characterizes the driving pose of the first object; generating, according to the key point connection graph and the first segmentation map, a second segmentation map in which each preset region conforms to the driving pose; generating a second foreground map of the second object in the driving pose according to the second segmentation map and the first foreground map of the source image; and fusing the second foreground map with the first background map of the source image to obtain an action migration image. With this technical solution, a realistic action migration image is obtained. The conventional Warp operation is abandoned, so the method adapts better to large pose differences, ensures the realism of the person in the generated video, and improves the visual experience of the user.

Description

Action migration method and device, terminal equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to an action migration method, an action migration device, terminal equipment and a storage medium.
Background
The action migration refers to the generation of a new video based on the source image and the driving video, wherein the new video contains the character in the source image, and the character performs the same action as the character in the driving video.
In the prior art, affine transformation (which may be referred to as Warp operation) is usually performed on a source image or an encoded feature image thereof according to a driving video to generate a new video.
In the process of implementing the present invention, the inventor finds that at least the following technical problems exist in the prior art:
when the pose of the person in the source image differs drastically from that of the person in the driving video, the affine transformation cannot be performed accurately, so the realism of the person in the generated video is poor, which seriously degrades the visual experience of the user.
Disclosure of Invention
The embodiment of the invention provides an action migration method and apparatus, a terminal device and a storage medium, which can better adapt to scenarios with drastic pose differences, ensure the realism of the person in the generated video, and improve the visual experience of the user.
In a first aspect, an embodiment of the present invention provides an action migration method, including:
acquiring a key point connection diagram of a first object in a driving image and a first segmentation diagram of each preset area of a second object in a source image; the keypoint connectivity map characterizes a driving pose of the first object;
generating a second segmentation graph of which each preset area accords with the driving posture according to the key point connection graph and the first segmentation graph;
generating a second foreground map of the second object under the driving gesture according to the second segmentation map and the first foreground map of the source image;
and fusing the second foreground image and the first background image of the source image to obtain an action migration image.
In a second aspect, an embodiment of the present invention provides an action migration apparatus, including:
the image acquisition module is used for acquiring a key point connection graph of a first object in a driving image and a first segmentation graph of each preset area of a second object in a source image; the keypoint connectivity graph characterizes a driving pose of the first object;
the first generation module is used for generating a second segmentation graph of which each preset area accords with the driving posture according to the key point connection graph and the first segmentation graph;
a second generation module, configured to generate a second foreground map of the second object in the driving pose according to the second segmentation map and the first foreground map of the source image;
and the synthesis module is used for fusing the second foreground image and the first background image of the source image to obtain an action migration image.
In a third aspect, an embodiment of the present invention provides a terminal device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the action migration method according to any embodiment of the invention.
In a fourth aspect, the embodiments of the present invention provide a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the action migration method according to any of the embodiments of the present invention.
According to the action migration method and apparatus, the terminal device and the storage medium provided by the embodiments of the invention, the key point connection graph of the first object in the driving image and the first segmentation map of each preset region of the second object in the source image are used to obtain the second segmentation map in which each preset region conforms to the driving pose, so that the conversion of the segmentation map corresponding to the second object from the source pose to the driving pose is realized; further, the second foreground map of the second object in the driving pose is generated according to the second segmentation map and the first foreground map of the source image, so that the conversion of the foreground map corresponding to the second object from the source pose to the driving pose is realized and the texture of the second object in the source image is given to the second segmentation map; furthermore, the second foreground map and the first background map of the source image are fused to obtain a realistic action migration image. The method abandons the conventional Warp operation, can better adapt to situations with large pose differences, ensures the realism of the person in the generated video, and improves the visual experience of the user.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the technical solutions in the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for action migration according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for action migration according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method for action migration according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating an architecture of a partial generation network according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method for action migration according to an embodiment of the present invention;
FIG. 6 is a flow chart of a method for action migration according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating an architecture of an overall synthetic network according to an embodiment of the present invention;
FIG. 8 is a flow chart of a method for action migration according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an action migration apparatus according to an embodiment of the present invention;
fig. 10 shows a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described through embodiments with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. In the following embodiments, optional features and examples are provided in each embodiment, and various features described in the embodiments may be combined to form a plurality of alternatives, and each numbered embodiment should not be regarded as only one technical solution. According to the technical scheme, the data acquisition, storage, use, processing and the like meet relevant regulations of national laws and regulations.
Fig. 1 shows a flowchart of an action migration method according to an embodiment of the present invention, which is applicable to a case of migrating an object gesture in an image and/or a video, for example, a case of human action migration. The method may be performed by an action migration apparatus, which is implemented in software and/or hardware, preferably configured in a terminal device, such as a computer device.
As shown in fig. 1, the action migration method provided in the embodiment of the present invention may include the following steps:
s110, acquiring a key point connection diagram of a first object in a driving image and a first segmentation diagram of each preset area of a second object in a source image; the keypoint connectivity map characterizes a driving pose of the first object.
In the embodiment of the present invention, action migration refers to a process of generating a new image based on a source image and a driving image, where the new image includes the second object in the source image, and the second object performs the same action as the first object in the driving image. The driving image refers to an image having a driving pose. Optionally, there are multiple driving images, and the motion or pose of the object changes across the driving images in a preset sequence; for example, a driving image may be a video frame in a driving video. The first object in the driving image is a person or another region of interest, which is not limited herein. The key point connection graph refers to a pose connection graph in which the key points of the first object are connected in a predefined connection manner, and can represent the driving pose of the first object, where each key point may correspond to a body part of the first object. The source image refers to an image in which the second object has a source pose, where the second object in the source image is a person or another region of interest. The first segmentation map refers to a segmentation map of each preset region of the second object, and may include, but is not limited to, body part region segmentation maps and a background region segmentation map. The number of channels of the first segmentation map may be determined according to the number of divided body parts, for example, 18 channels, 6 channels, 5 channels, etc., which is not limited herein.
It should be noted that the first object and the second object are used to distinguish the source image from the object in the driving image, and are not necessarily used to describe a specific order or sequence.
For example, the first object may be a driving person, the second object may be a source person, and the preset regions may be body parts of the person or other regions of interest. To represent the pose of the human body, the human body key point detection model OpenPose, an open-source two-dimensional human key point detection model, can be used to predict the driving image and obtain the two-dimensional key points of the driving person; further, the two-dimensional key points of the driving person are connected in a predefined connection manner to obtain the key point connection graph P_d of the driving person. The key point connection graph may be an RGB key point connection graph of size 3 × H × W, where H × W represents the resolution of the image. To represent the human body layout, the application uses the human parsing model SCHP (Self-Correction for Human Parsing) to obtain an 18-channel semantic segmentation map of the source image; considering the texture characteristics of different parts of the human body, the 18 channels of the semantic segmentation map are merged into 6 channels, namely head, upper clothes, lower clothes, shoes, limbs and background, thereby obtaining the first segmentation map L_s ∈ R^(6×H×W) of each preset region of the source person in the source image.
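As a non-authoritative illustration of the preprocessing described above, the following Python sketch builds the two model inputs, assuming OpenPose key points and an SCHP 18-channel parsing result are already available; the skeleton pairs and the grouping of SCHP labels into the six regions are assumptions used only for illustration, not values specified by the invention.

```python
# Illustrative sketch (not the patent's implementation): building the key point
# connection graph and the 6-channel first segmentation map from precomputed
# OpenPose key points and an SCHP label map.
import numpy as np
import cv2

# Hypothetical grouping of SCHP part labels into the 6 regions named in the text
# (background, head, upper clothes, lower clothes, shoes, limbs); the label indices
# are assumptions for illustration only.
REGION_GROUPS = {
    0: [0],                 # background
    1: [1, 2, 4, 13],       # head
    2: [5, 6, 7, 10, 11],   # upper clothes
    3: [9, 12],             # lower clothes
    4: [8, 18, 19],         # shoes / socks
    5: [3, 14, 15, 16, 17], # limbs
}

# A predefined skeleton: pairs of key point indices to connect (assumed subset).
SKELETON = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
            (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13)]

def keypoint_connection_graph(keypoints, h, w):
    """Draw the driving pose as an RGB key point connection graph of shape (3, H, W)."""
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for a, b in SKELETON:
        pa, pb = keypoints[a], keypoints[b]
        if pa is None or pb is None:
            continue  # skip undetected key points
        cv2.line(canvas, tuple(map(int, pa)), tuple(map(int, pb)), (0, 255, 0), 3)
    return canvas.transpose(2, 0, 1)  # channel-first

def merge_parsing_to_6ch(parsing_labels):
    """Merge an 18-class SCHP label map (H, W) into a 6-channel one-hot segmentation map."""
    h, w = parsing_labels.shape
    seg = np.zeros((6, h, w), dtype=np.float32)
    for region, labels in REGION_GROUPS.items():
        seg[region] = np.isin(parsing_labels, labels).astype(np.float32)
    return seg  # L_s with shape (6, H, W)
```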
And S120, generating a second segmentation graph of which each preset area accords with the driving posture according to the key point connection graph and the first segmentation graph.
In this embodiment, the second segmentation map is an object preset area analysis map of the second object in the driving posture. In other words, the second segmentation map includes a plurality of preset regions of the second object which conform to the driving gesture.
Specifically, the keypoint connection diagram and the first segmentation diagram can be input into a first neural network model which is trained in advance to obtain a second segmentation diagram, so that the transformation from the source posture to the driving posture of the second object corresponding to the segmentation diagram is realized. The network structure of the first neural network model is not limited herein, and for example, the network structure may be composed of at least one encoder and at least one decoder.
And S130, generating a second foreground image of the second object under the driving posture according to the second segmentation image and the first foreground image of the source image.
In this embodiment, the second foreground map refers to a preset region foreground map of the second object in the driving posture, where the foreground map refers to a region of the object excluding the background in the image. In other words, the second foreground map is composed of a plurality of preset region foreground maps of the second object in accordance with the driving pose.
Specifically, the second segmentation graph and the first foreground graph of the source image can be input into a second neural network model which is trained in advance to obtain preset region foreground graphs under a plurality of driving postures of the second object, the preset region foreground graphs under the driving postures are combined to obtain the second foreground graph, and the conversion of the foreground graph corresponding to the second object from the source posture to the driving posture is achieved. The network structure of the second neural network model is not limited herein, and for example, the network structure may be composed of at least one encoder and at least one decoder.
And S140, fusing the second foreground image and the first background image of the source image to obtain an action migration image.
In this embodiment, the motion migration image refers to an image that uses the background of the source image as the background and includes the second object in the driving posture, that is, the motion migration image refers to an image that migrates the posture of the first object to the second object.
In some optional embodiments, the foreground in the second foreground map may be embedded into the source image at a position corresponding to the first background map, so as to implement image fusion. In some optional embodiments, texture alignment may be performed on the first foreground image and the second foreground image, and the foreground in the second foreground image after texture alignment is embedded into the position corresponding to the first background image of the source image, so as to implement image fusion, where an image fusion manner is not limited.
According to the action migration method provided by the embodiment of the invention, the key point connection graph of the first object in the driving image and the first segmentation map of each preset region of the second object in the source image are used to obtain the second segmentation map in which each preset region conforms to the driving pose, so that the conversion of the segmentation map corresponding to the second object from the source pose to the driving pose is realized; further, the second foreground map of the second object in the driving pose is generated according to the second segmentation map and the first foreground map of the source image, so that the conversion of the foreground map corresponding to the second object from the source pose to the driving pose is realized and the texture of the second object in the source image is given to the second segmentation map; furthermore, the second foreground map and the first background map of the source image are fused to obtain a realistic action migration image. The method abandons the conventional Warp operation, can better adapt to situations with large pose differences, ensures the realism of the person in the generated video, and improves the visual experience of the user.
Referring to fig. 2, fig. 2 is a schematic flow chart of an action migration method provided in an embodiment of the present disclosure, and the method of the present embodiment may be combined with various alternatives of the action migration method provided in the foregoing embodiment. The action migration method provided by the embodiment is further optimized. Optionally, after generating the second segmentation map in which each preset region conforms to the driving posture, the method further includes: determining an alignment parameter according to the first segmentation graph and the second segmentation graph; before generating a second foreground map of a second object in a driving posture according to the second segmentation map and the first foreground map of the source image, the method further comprises the following steps: and transforming the first foreground image according to the alignment parameters so as to align the first foreground image with the second segmentation image.
As shown in fig. 2, the method of this embodiment may include:
s210, acquiring a key point connection graph of a first object in a driving image and a first segmentation graph of each preset area of a second object in a source image; the keypoint connectivity map characterizes the driving pose of the first object.
And S220, generating a second segmentation graph of which each preset area accords with the driving posture according to the key point connection graph and the first segmentation graph.
And S230, determining an alignment parameter according to the first segmentation chart and the second segmentation chart.
And S240, transforming the first foreground image according to the alignment parameters so as to align the first foreground image with the second segmentation image.
And S250, generating a second foreground image of the second object under the driving posture according to the second segmentation image and the first foreground image of the source image.
And S260, fusing the second foreground image with the first background image of the source image to obtain an action migration image.
In this embodiment, the alignment parameter refers to a parameter for aligning the first foreground map and the second segmentation map. The first foreground graph is aligned with the second segmentation graph through the alignment parameters, so that the situation that the source character and the driving character have great difference in size and spatial position can be avoided.
In some alternative embodiments, determining the alignment parameter from the first segmentation map and the second segmentation map comprises at least one of: determining a scaling parameter according to the size of each preset area in the first segmentation graph and the second segmentation graph; and determining a displacement parameter according to the center coordinates of each preset area in the first segmentation drawing and the second segmentation drawing.
The scaling parameter refers to a parameter that controls how much the image is scaled. Specifically, a first mask height is determined based on the first segmentation map, and a second mask height is determined based on the second segmentation map; the scaling parameter is determined based on the first mask height and the second mask height. The displacement parameter refers to a parameter that characterizes the positional shift of the image. Specifically, a first mask center coordinate is determined based on the first segmentation map, and a second mask center coordinate is determined based on the second segmentation map; the displacement parameter is determined based on the first mask center coordinate and the second mask center coordinate.
Illustratively, the alignment parameters can be calculated by the following formulas:

R = H_d / H_s

where R represents the scaling parameter, H_d denotes the height of the body mask in the second segmentation map, and H_s denotes the height of the body mask in the first segmentation map; and

c = [c_x, c_y]^T

where c represents the displacement parameter, c_x represents the difference in the horizontal direction between the first mask center coordinate and the second mask center coordinate, and c_y represents the difference in the vertical direction between the first mask center coordinate and the second mask center coordinate.
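A minimal sketch of computing these alignment parameters from binary body masks is given below; it assumes the masks are derived from the first and second segmentation maps and is only an illustrative reading of the formulas above, not the patent's exact procedure.

```python
# Illustrative sketch of the global alignment parameters: scaling from the mask
# heights, displacement from the mask centers.
import numpy as np

def alignment_params(mask_src, mask_drv):
    """mask_src / mask_drv: binary body masks (H, W) from the first / second segmentation map."""
    ys_s, xs_s = np.nonzero(mask_src)
    ys_d, xs_d = np.nonzero(mask_drv)

    H_s = ys_s.max() - ys_s.min() + 1   # height of the body mask in the first segmentation map
    H_d = ys_d.max() - ys_d.min() + 1   # height of the body mask in the second segmentation map
    R = H_d / H_s                       # scaling parameter

    center_s = np.array([xs_s.mean(), ys_s.mean()])
    center_d = np.array([xs_d.mean(), ys_d.mean()])
    c = center_d - center_s             # displacement parameter c = [c_x, c_y]^T
    return R, c
```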
It should be noted that the first foreground map is transformed through the alignment parameters, so that the first foreground map is aligned with the second segmentation map, thereby laying an image quality foundation for subsequently generating the second foreground map of the second object under the driving posture based on the second segmentation map and the first foreground map, and improving the image quality of the second foreground map.
On the basis of the above embodiment, the embodiment of the invention adds the steps of determining the alignment parameters according to the first segmentation map and the second segmentation map, and transforming the first foreground map according to the alignment parameters so as to align the first foreground map with the second segmentation map. In addition, the embodiment of the present invention and the action migration method proposed by the above embodiment belong to the same inventive concept; technical details not described in detail in this embodiment can be found in the above embodiment, and this embodiment has the same beneficial effects as the above embodiment.
Referring to fig. 3, fig. 3 is a schematic flow chart of an action migration method provided in an embodiment of the present disclosure, and the method of the present embodiment may be combined with various alternatives of the action migration method provided in the foregoing embodiments. The action migration method provided by this embodiment is further optimized. Optionally, the second segmentation map is generated by a first generative adversarial network, and generating the second segmentation map by the first generative adversarial network includes: encoding the first segmentation map by a first encoder to obtain a first feature map; encoding the key point connection graph by a second encoder to obtain a second feature map; and decoding the fused map of the first feature map and the second feature map by a first decoder to obtain the second segmentation map.
As shown in fig. 3, the method of this embodiment may include:
s310, acquiring a key point connection graph of a first object in a driving image and a first segmentation graph of each preset area of a second object in a source image; the keypoint connectivity map characterizes a driving pose of the first object.
And S320, coding the first segmentation graph through the first coder to obtain a first feature graph.
And S330, coding the key point connection graph through a second coder to obtain a second characteristic graph.
S340, decoding the fused graph of the first characteristic graph and the second characteristic graph through a first decoder to obtain a second segmentation graph.
And S350, generating a second foreground image of the second object under the driving posture according to the second segmentation image and the first foreground image of the source image.
And S360, fusing the second foreground image with the first background image of the source image to obtain an action migration image.
In this embodiment, the second segmentation map may be generated by a first generative adversarial network, which may include a first encoder, a second encoder, and a first decoder. The first encoder is used to encode the first segmentation map, the second encoder is used to encode the key point connection graph, and the first decoder is used to decode the fused map of the encoding results of the first encoder and the second encoder, thereby obtaining a clearer and sharper second segmentation map. It should be noted that the terms first encoder, second encoder and first decoder are used to distinguish the roles of the encoders and decoders, and are not necessarily used to describe a specific order or sequence.
In some alternative embodiments, if the driving image is a video frame, the method may further include: acquiring historical second segmentation maps corresponding to a preset number of video frames before the current video frame; encoding each historical second segmentation graph through a third encoder to obtain a third feature graph; correspondingly, decoding the fused graph of the first feature graph and the second feature graph by the first decoder to obtain a second segmentation graph, including: and decoding the fused graph of the first feature graph, the second feature graph and the third feature graph through a first decoder to obtain a second segmentation graph.
Here, the historical second segmentation maps refer to one or more second segmentation maps before the current video frame. It should be noted that inputting the historical second segmentation maps into the first generative adversarial network enables the first generative adversarial network to effectively extract the relationship between video frames, thereby improving the temporal consistency of the video.
On the basis of the above embodiment, after obtaining the third feature map, the method further includes: decoding the fused graph of the second feature graph and the third feature graph through a second decoder to obtain optical flow parameters and weight parameters; after obtaining the second segmentation map, the method further comprises the following steps: and adjusting the second segmentation map according to the historical second segmentation map, the optical flow parameter and the weight parameter which correspond to the video frame before the current video frame.
In this implementation, the second decoder is configured to decode the fused graph of the second feature graph and the third feature graph to obtain optical flow parameters and weight parameters, where the optical flow parameters refer to the instantaneous speed of pixel motion of the moving object on the observation imaging plane.
Illustratively, fig. 4 shows an architecture schematic diagram of a local generation network provided by an embodiment of the present invention, where the local generation network generates the region images of the source person in the driving pose. The local generation network includes a first generative adversarial network, which may be a Layout GAN with a vid2vid framework. Specifically, the first generative adversarial network includes three encoders and two decoders. The first encoder may be denoted E_1^l, where the superscript l is the identifier of Layout GAN and characterizes that the encoder belongs to the Layout GAN network; it encodes the first segmentation map L_s to obtain the first feature map F_1. The second encoder E_2^l encodes the key point connection graphs of the current time and of the two historical times, spliced along the channel dimension, to obtain the second feature map F_2, where t represents the current time and t-1, t-2 represent the two consecutive historical times before the current time. The third encoder E_3^l encodes the two historical second segmentation maps of the times before the current time to obtain the third feature map F_3. Further, the first decoder D_1^l decodes the added features F_1 + F_2 + F_3 to obtain the raw result L'_d^t; similarly, the second decoder D_2^l decodes the added features F_2 + F_3 to obtain the optical flow parameter O and its weight parameter w, from which the final second segmentation map L_d^t is obtained. Layout GAN can therefore be formulated as:

F_1 = E_1^l(L_s), F_2 = E_2^l({P_d^t, P_d^(t-1), P_d^(t-2)}), F_3 = E_3^l({L_d^(t-1), L_d^(t-2)})

L'_d^t = D_1^l(F_1 + F_2 + F_3), (O, w) = D_2^l(F_2 + F_3)

L_d^t = w × Warp(L_d^(t-1), O) + (1 − w) × L'_d^t

where + represents point-to-point addition, × represents point-to-point multiplication, and { } represents splicing the inputs along the channel dimension; P_d denotes the key point connection graph, L'_d^t the raw result, and L_d^t the final second segmentation map. Warp(I, O) represents the affine transformation of the image I according to the optical flow parameter O. It is emphasized that the optical flow and Warp operations used in this implementation are based on neighboring frames, for the purpose of improving the temporal consistency of the generated video, rather than on the transformation between the source image and the driving image.
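The temporal fusion described above can be illustrated with the following hedged PyTorch sketch; the grid_sample-based warp and the blending convention are assumptions used for illustration, not the patent's exact implementation.

```python
# Hedged sketch of the Layout GAN temporal fusion: the raw decoded segmentation map
# is blended with the previous segmentation map warped by the predicted optical flow,
# weighted point-to-point by w.
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp `image` (B, C, H, W) with optical flow `flow` (B, 2, H, W) via grid_sample."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2, H, W) pixel coords
    coords = grid.unsqueeze(0) + flow                              # displaced coordinates
    # normalize to [-1, 1] for grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)        # (B, H, W, 2)
    return F.grid_sample(image, sample_grid, align_corners=True)

def fuse_segmentation(raw_seg, prev_seg, flow, w):
    """Blend the raw result with the warped previous segmentation map (all batched tensors)."""
    warped_prev = warp(prev_seg, flow)
    return w * warped_prev + (1.0 - w) * raw_seg
```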
In some optional embodiments, the training step of the first generative adversarial network may include: acquiring a third segmentation map of each preset region of the first object in the sample driving image; determining a first loss between the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and training the first generative adversarial network according to the first loss.
The third segmentation map refers to a segmentation map of each preset region of the first object in the driving image.
Specifically, the first generative adversarial network may be obtained by training in advance with a plurality of sample driving images and sample source images, where the sample driving images and the sample source images may be paired training data, that is, the person in the sample driving images and the person in the sample source images are the same person; for example, a frontal human video frame in a segment of video is selected as the sample source image, and the video is used as the sample driving video. The frontal human video frame is selected as the source image because it contains more details of the appearance of the source person. During training of the first generative adversarial network, the first loss may be a cross-entropy loss: the cross-entropy loss between the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image is calculated, and the network parameters of the first generative adversarial network are adjusted based on this cross-entropy loss, so that the loss gradually decreases and stabilizes until network training is completed, thereby obtaining the trained first generative adversarial network.
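A hedged sketch of such a cross-entropy objective follows, assuming the predicted second segmentation map is produced as 6-channel per-pixel logits and the third segmentation map provides integer region labels; both assumptions are for illustration only.

```python
# Hedged sketch of the cross-entropy training loss for the Layout GAN generator.
import torch
import torch.nn.functional as F

def layout_ce_loss(pred_seg_logits, gt_region_labels):
    """pred_seg_logits: (B, 6, H, W); gt_region_labels: (B, H, W) with values in [0, 5]."""
    return F.cross_entropy(pred_seg_logits, gt_region_labels)
```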
The embodiment of the invention refines the technical characteristics of generating the second segmentation chart on the basis of the embodiment. In addition, the embodiment of the present invention and the action migration method proposed by the above embodiment belong to the same inventive concept, and the technical details that are not described in detail in the embodiment can be referred to the above embodiment, and the embodiment has the same beneficial effects as the above embodiment.
Referring to fig. 5, fig. 5 is a schematic flow chart of an action migration method provided in an embodiment of the present disclosure, and the method of the present embodiment may be combined with various alternatives of the action migration method provided in the foregoing embodiments. The action migration method provided by this embodiment is further optimized. Optionally, the second foreground map is generated by a second generative adversarial network, and generating the second foreground map by the second generative adversarial network may include: encoding the second segmentation map by a fourth encoder to obtain a fourth feature map; encoding the first foreground map by a fifth encoder to obtain a fifth feature map; and decoding the fused map of the fourth feature map and the fifth feature map by a third decoder to obtain the second foreground map.
As shown in fig. 5, the method of this embodiment may include:
s410, acquiring a key point connection diagram of a first object in a driving image and a first segmentation diagram of each preset area of a second object in a source image; the keypoint connectivity map characterizes a driving pose of the first object.
And S420, generating a second segmentation graph of which each preset area accords with the driving attitude according to the key point connection graph and the first segmentation graph.
And S430, coding the second division graph through a fourth coder to obtain a fourth characteristic graph.
And S440, coding the first foreground image through a fifth coder to obtain a fifth characteristic image.
And S450, decoding the fused image of the fourth feature image and the fifth feature image through a third decoder to obtain a second foreground image.
And S460, fusing the second foreground image and the first background image of the source image to obtain an action migration image.
In this embodiment, the second foreground map is generated by a second generative adversarial network, which may include a fourth encoder, a fifth encoder, and a third decoder, where the fourth encoder is used to encode the second segmentation map, the fifth encoder is used to encode the first foreground map, and the third decoder is used to decode the fused map of the fourth feature map and the fifth feature map, thereby obtaining the second foreground map. It should be noted that the terms fourth encoder, fifth encoder and third decoder are used to distinguish the roles of the encoders and decoders, and are not necessarily used to describe a specific order or sequence.
On the basis of the foregoing embodiment, if the driving image is a video frame, encoding the first foreground map by a fifth encoder to obtain a fifth feature map, which may include: acquiring historical second foreground images corresponding to a preset number of video frames before the current video frame; and coding the fusion graph of the first foreground graph and each historical second foreground graph through a fifth coder to obtain a fifth feature graph.
Here, the historical second foreground maps refer to one or more second foreground maps before the current video frame. It should be noted that inputting the historical second foreground maps into the second generative adversarial network enables the second generative adversarial network to effectively extract the relationship between video frames, thereby improving the temporal consistency of the video.
Illustratively, fig. 4 shows an architecture diagram of the local generation network provided by an embodiment of the present invention. The local generation network further includes a second generative adversarial network, which in this embodiment may be a Region GAN with a vid2vid framework. Since the Region GAN is only used to generate initial region images, this embodiment uses a single generator to generate the 5 regions of the human body (without generating the background region). This not only saves computational resources but also prevents overfitting of the model. In particular, the second generative adversarial network can include a fourth encoder E_1^r, a fifth encoder E_2^r and a third decoder D^r, where the superscript r is the identifier of the Region GAN and characterizes that the encoder or decoder belongs to the Region GAN network. The fourth encoder E_1^r encodes the mask L_{d,i}^t of the i-th preset region at the current time t to obtain the fourth feature map F_{4,i}; the fifth encoder E_2^r encodes, spliced along the channel dimension, the historical second foreground maps before the current time t together with I_{s,i}, the first foreground map of the i-th preset region, to obtain the fifth feature map F_{5,i}. The original Region GAN can therefore be expressed as:

Î_{d,i}^t = D^r(E_1^r(L_{d,i}^t) + E_2^r({Î_{d,i}^(t-1), Î_{d,i}^(t-2), I_{s,i}}))

where D^r represents the third decoder. Because the source person and the driving person may differ greatly in size and spatial location, the feature maps F_{4,i} and F_{5,i} may be misaligned. To this end, the invention proposes a Global Alignment Module (GAM) that performs an affine transformation on the first foreground map FG_s so that it matches the corresponding second segmentation map. First, the human body masks M_s and M_d are calculated from L_s and L_d^t; further, the first foreground map FG_s of the source image I_s can be obtained. The overall flow of the global alignment module is expressed by the following formula:

FG_s' = GAM(FG_s, R, c)

where FG_s' represents the first foreground map after alignment by the global alignment module, R denotes the scaling parameter, and c = [c_x, c_y]^T represents the displacement parameter; dividing FG_s' yields the aligned foreground map I'_{s,i} of each preset region. The final Region GAN can therefore be expressed as:

Î_{d,i}^t = D^r(E_1^r(L_{d,i}^t) + E_2^r({Î_{d,i}^(t-1), Î_{d,i}^(t-2), I'_{s,i}}))
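The global alignment module can be illustrated by the following sketch, which scales the source foreground by R about the image center and then shifts it by c; the use of cv2.warpAffine and the choice of scaling center are assumptions made for illustration, not the patent's exact procedure.

```python
# Hedged sketch of the Global Alignment Module (GAM): the source foreground FG_s is
# rescaled by R and shifted by c so that it roughly matches the body layout implied
# by the second segmentation map.
import numpy as np
import cv2

def global_align(fg_src, R, c):
    """fg_src: (H, W, 3) source foreground; R: scalar scaling; c = (c_x, c_y) displacement."""
    h, w = fg_src.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    # scale about the image center, then translate by the displacement parameter
    M = np.array([[R, 0.0, (1 - R) * cx + c[0]],
                  [0.0, R, (1 - R) * cy + c[1]]], dtype=np.float32)
    return cv2.warpAffine(fg_src, M, (w, h))
```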
on the basis of the foregoing embodiment, the second training step of generating the countermeasure network may include: acquiring a third segmentation map of each preset area of the first object in the sample driving image; determining a second loss between a second foreground map corresponding to the sample source image and a foreground true value map corresponding to the sample source image; determining a second foreground map corresponding to the sample source image and a third loss of a third segmentation map corresponding to the sample drive image; and training the second generation countermeasure network according to the second loss and the third loss.
In this embodiment, the second loss may include a reconstruction loss, a perceptual loss; the third loss may be a countermeasure loss, i.e., an image distribution loss.
Specifically, the second generative countermeasure network may be trained in advance by a plurality of sample-driven images and sample source images, wherein the sample-driven images and the sample source images may be paired training data. In the trained second generated confrontation network, the second loss can comprise reconstruction loss and perception loss, the confrontation loss of a third segmentation image corresponding to the sample driving image is calculated by calculating a second foreground image corresponding to the sample source image and the reconstruction loss and the perception loss between the foreground true value images corresponding to the sample source image, the second foreground image corresponding to the sample source image and the confrontation loss of the third segmentation image corresponding to the sample driving image are calculated, and network parameters of the second generated confrontation network are adjusted based on the reconstruction loss, the perception loss and the confrontation loss, so that the reconstruction loss, the perception loss and the confrontation loss are gradually reduced and tend to be stable until the network training is completed, and the second generated confrontation network is obtained.
Illustratively, the present embodiment employs an L1 reconstruction loss, which focuses more on the subtle differences between the generated image and the real image than an L2 reconstruction loss would. The calculation formula is as follows:

L_rec = Σ_i ||Î_{d,i}^t − I_{d,i}^t||_1

where Î_{d,i}^t denotes the predicted second foreground map of the i-th region in the driving pose at time t, and I_{d,i}^t denotes the ground-truth image of the i-th region of the driving image in the driving pose at time t.
The perceptual loss is used to constrain the generated image and the real image to be close in a multi-dimensional feature space. The perceptual loss, including a feature content loss and a feature style loss, may be expressed as:

L_per = Σ_j ||φ_j(Î_{d,i}^t) − φ_j(I_{d,i}^t)||_1 + Σ_j ||G(φ_j(Î_{d,i}^t)) − G(φ_j(I_{d,i}^t))||_1

where φ_j represents the j-th layer of the pre-trained VGG-19 model, and G represents computing the Gram matrix of a feature map.
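A hedged PyTorch sketch of such a perceptual loss (content term plus Gram-matrix style term on pre-trained VGG-19 features) is given below; the selected layers and the use of an L1 distance for both terms are assumptions, not values specified by the invention.

```python
# Hedged sketch of a VGG-19 perceptual loss with content and Gram-matrix style terms.
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    def __init__(self, layer_ids=(3, 8, 17, 26)):  # relu1_2, relu2_2, relu3_4, relu4_4 (assumed)
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)
        self.l1 = nn.L1Loss()

    def features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    @staticmethod
    def gram(f):
        b, c, h, w = f.shape
        f = f.view(b, c, h * w)
        return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

    def forward(self, fake, real):
        content, style = 0.0, 0.0
        for ff, fr in zip(self.features(fake), self.features(real)):
            content = content + self.l1(ff, fr)                      # feature content loss
            style = style + self.l1(self.gram(ff), self.gram(fr))    # feature style loss
        return content + style
```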
The purpose of the adversarial loss is to make the synthesized image have a distribution similar to that of the real image. In order to make the network focus on multi-scale image details, the invention uses the multi-scale conditional discriminator proposed in pix2pixHD, which takes the synthesized image Î_{d,i}^t and the corresponding region mask L_{d,i}^t as input. The expression is as follows:

L_adv = E[log D(I_{d,i}^t, L_{d,i}^t)] + E[log(1 − D(Î_{d,i}^t, L_{d,i}^t))]

Thus, the loss function of the generator is as follows:

L_G = L_adv + λ_rec L_rec + λ_per L_per

where λ_rec and λ_per weight the reconstruction loss and the perceptual loss, respectively.
Further, the training processes of the first generative adversarial network and the second generative adversarial network include the training of the discriminators, whose loss function is:

L_D = −E[log D(I_{d,i}^t, L_{d,i}^t)] − E[log(1 − D(Î_{d,i}^t, L_{d,i}^t))]
on the basis of the embodiment, the embodiment of the invention adds the detail characteristic of determining the second foreground image. In addition, the embodiment of the present invention and the action migration method proposed in the above embodiment belong to the same inventive concept, and the technical details that are not described in detail in the embodiment can be referred to the above embodiment, and the embodiment and the above embodiment have the same beneficial effects.
Referring to fig. 6, fig. 6 is a schematic flow chart of an action migration method provided in an embodiment of the present disclosure, and the method of the present embodiment may be combined with various alternatives of the action migration method provided in the foregoing embodiment. The action migration method provided by the embodiment is further optimized. Optionally, after generating the second foreground map of the second object in the driving posture, the method further includes: determining texture enhancement parameters according to the first foreground image and the second foreground image; and performing texture enhancement on the second foreground image according to the texture enhancement parameter and the first foreground image.
As shown in fig. 6, the method of this embodiment may include:
s510, acquiring a key point connection diagram of a first object in a driving image and a first segmentation diagram of each preset area of a second object in a source image; the keypoint connectivity map characterizes a driving pose of the first object.
And S520, generating a second segmentation graph of which each preset area accords with the driving posture according to the key point connection graph and the first segmentation graph.
S530, according to the second segmentation graph and the first foreground graph of the source image, a second foreground graph of a second object under the driving posture is generated.
And S540, determining texture enhancement parameters according to the first foreground image and the second foreground image.
And S550, performing texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
And S560, fusing the second foreground image and the first background image of the source image to obtain an action migration image.
In this embodiment, the texture enhancement parameter refers to an adjustment parameter for enhancing the texture of the image, and the features of the first foreground map and the features of the second foreground map are aligned by the texture enhancement parameter to retain more details, such as the texture of the clothes and the edge of the body.
In some optional embodiments, determining the texture enhancement parameter according to the first foreground map and the second foreground map may include: the first foreground image is coded through a sixth coder to obtain a sixth feature image; the second foreground image is coded through a seventh coder to obtain a seventh characteristic image; expanding the sixth characteristic diagram and the seventh characteristic diagram according to the channel to respectively obtain an eighth characteristic diagram and a ninth characteristic diagram; and taking the correlation matrix of the eighth feature map and the ninth feature map as a texture enhancement parameter.
Exemplarily, fig. 7 is a schematic architecture diagram of the overall synthesis network disclosed in this embodiment. The purpose of the overall synthesis network is to integrate the region images generated by the local generation network into the final action migration image and, at the same time, to add a suitable background image to the generated action migration image; the loss of the overall synthesis network is the loss between the generated action migration image and the driving image. Specifically, the sixth encoder E_1^g (the superscript g identifying the overall synthesis network) encodes the first foreground map FG_s to obtain the sixth feature map F_6, and the seventh encoder E_2^g encodes the second foreground maps, i.e., the second foreground map Î_d^t of the current time t together with the historical second foreground maps before the current time, to obtain the seventh feature map F_7. Since the second foreground map has the same texture as the first foreground map but a different pose, the present embodiment proposes a Texture Alignment Module (TAM) to better fuse the feature maps F_6 and F_7. As shown in fig. 7, the feature maps F_6 and F_7 are first unfolded along the channel dimension into H_1 ∈ R^(C×N) and H_2 ∈ R^(C×N), respectively, where C represents the number of channels and N the number of spatial positions. The cosine distance is then used to calculate the correlation matrix (i.e., the texture enhancement parameter) A between the two feature maps. The calculation formula is as follows:

A_{i,j} = (H_{1,i})^T H_{2,j} / (||H_{1,i}|| ||H_{2,j}||)

where H_{1,i} represents the feature of H_1 at position i and, likewise, H_{2,j} represents the feature of H_2 at position j, and (H_{1,i})^T H_{2,j} denotes matrix multiplication.
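The correlation matrix of the texture alignment module can be illustrated with the following sketch, which flattens both feature maps over the spatial positions and compares them by cosine similarity; shapes and naming are assumptions made for illustration.

```python
# Hedged PyTorch sketch of the Texture Alignment Module's correlation matrix:
# unit-normalized features at every spatial position are compared pairwise.
import torch
import torch.nn.functional as F

def texture_correlation(f6, f7):
    """f6, f7: feature maps (B, C, H, W); returns the correlation matrix (B, H*W, H*W)."""
    b, c, h, w = f6.shape
    h1 = F.normalize(f6.view(b, c, h * w), dim=1)   # H_1: unit-norm features per position
    h2 = F.normalize(f7.view(b, c, h * w), dim=1)   # H_2
    return torch.bmm(h1.transpose(1, 2), h2)        # cosine similarity for all position pairs
```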
In some optional embodiments, performing texture enhancement on the second foreground map according to the texture enhancement parameter and the first foreground map may include: determining a texture enhancement map according to the eighth feature map and the texture enhancement parameter; fusing the texture enhancement map and the ninth feature map along the channel dimension to obtain a tenth feature map; and decoding the tenth feature map by a fourth decoder to obtain the texture-enhanced second foreground map.
Illustratively, the tenth feature map F_10 is obtained by the following formula:

F_10 = {H_1 A, H_2}

where H_1 A is the texture enhancement map obtained by applying the correlation matrix A to the eighth feature map H_1, and { } denotes splicing along the channel dimension. Finally, the fourth decoder D_1^g decodes F_10 to obtain the texture-enhanced second foreground map. Therefore, the generation of the texture-enhanced second foreground map as a whole can be formulated as:

FG_d^t = D_1^g({H_1 A, H_2})
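A short sketch of applying the texture enhancement parameter follows; it redistributes the source features with the correlation matrix and splices the result with the generated features along the channel dimension, leaving the fourth decoder out for brevity.

```python
# Hedged sketch of applying the texture enhancement parameter; shapes follow the
# unfolded notation above.
import torch

def texture_enhance(h1, h2, corr):
    """h1, h2: (B, C, N) unfolded feature maps; corr: (B, N, N) correlation matrix."""
    texture_map = torch.bmm(h1, corr)            # texture enhancement map, (B, C, N)
    return torch.cat([texture_map, h2], dim=1)   # tenth feature map, (B, 2C, N)
```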
on the basis of the embodiment, the embodiment of the invention adds the technical characteristic of texture enhancement to realize better fusion of the characteristic diagram. In addition, the embodiment of the present invention and the action migration method proposed in the above embodiment belong to the same inventive concept, and the technical details that are not described in detail in the embodiment can be referred to the above embodiment, and the embodiment and the above embodiment have the same beneficial effects.
Referring to fig. 8, fig. 8 is a schematic flow chart of an action migration method provided in an embodiment of the present disclosure, and the method of the present embodiment may be combined with various alternatives of the action migration method provided in the foregoing embodiment. The action migration method provided by the embodiment is further optimized. Optionally, fusing the second foreground image with the first background image of the source image may include: according to the second segmentation graph and the key point connection graph, determining a posture mask graph; determining a second background image according to the posture mask image and the first background image; and fusing the second foreground image and the second background image.
As shown in fig. 8, the method of this embodiment may include:
s610, acquiring a key point connection diagram of a first object in a driving image and a first segmentation diagram of each preset area of a second object in a source image; the keypoint connectivity map characterizes the driving pose of the first object.
And S620, generating a second segmentation graph of which each preset area accords with the driving posture according to the key point connection graph and the first segmentation graph.
And S630, generating a second foreground image of the second object under the driving posture according to the second segmentation image and the first foreground image of the source image.
And S640, according to the second segmentation graph and the key point connection graph, determining a posture mask graph.
And S650, determining a second background image according to the posture mask image and the first background image.
And S660, fusing the second foreground image and the second background image to obtain the motion migration image.
In this embodiment, after the second foreground image of the second object in the driving posture is obtained, a reasonable background needs to be added to the second foreground image to obtain a more realistic motion migration image.
Specifically, the second segmentation graph and the key point connection graph are input into an eighth encoder to obtain an eleventh feature graph, and the eleventh feature graph is then decoded by a fifth decoder to obtain a posture mask map, where the posture mask map is a soft mask containing the posture of the first object. The corresponding positions of the first background image are then masked out using the posture mask map, so as to obtain a second background image in which the region of the driving posture is occluded.
In some optional embodiments, the fifth decoder decodes the fused graph of the seventh feature map and the eleventh feature map to obtain the posture mask map. The advantage of using the fused graph is that the contour and the posture can be refined, yielding a more accurate contour map.
Illustratively, an eighth encoder is used to encode the second segmentation maps and the key point connection graphs, namely the second segmentation map at the current time t together with the two historical second segmentation maps before time t, and the key point connection graph at time t together with the two historical key point connection graphs before time t, obtaining an eleventh feature graph. The sum of the eleventh feature graph and the seventh feature graph is sent to a fifth decoder to obtain the posture mask map m, and the final motion migration image is then obtained by compositing under this mask:

I_t = m * I_fg + (1 - m) * I_bg

where I_fg denotes the texture-enhanced second foreground map, I_bg denotes the first background map of the source image, and * denotes element-wise multiplication.
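A minimal sketch of this compositing step, assuming the posture mask m is a soft mask in [0, 1] with the same spatial size as the images; the variable names are illustrative only.

```python
import torch

def composite(mask: torch.Tensor, foreground: torch.Tensor, background: torch.Tensor) -> torch.Tensor:
    """mask: (B, 1, H, W) soft posture mask; foreground, background: (B, 3, H, W) images."""
    # Keep the generated person where the mask is on (the driving-pose region),
    # and keep the source background elsewhere (the second background map).
    return mask * foreground + (1.0 - mask) * background
```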
On the basis of the above embodiments, the second object includes a virtual object. The virtual object may be a digital human, a virtual customer service agent or a virtual anchor; the second object may also be a real person.
Illustratively, the action migration method of this embodiment can be applied to the motion driving and generation of digital humans, virtual customer service agents and virtual anchors, improving the fidelity and richness of the actions of existing virtual characters and thereby improving the user experience.
In addition, the invention also designs a discriminator specifically for the face region in order to generate realistic faces. During training, the face crops of the source image and of the action migration image are input into this discriminator and the model is trained accordingly, so that the foreground faces produced by the generator are more realistic.
Exemplarily, given a source image I_s and a driving video {I_d^t, t = 1, ..., N}, where I_d^t represents the video frame of the driving video at time t, the goal of this embodiment is to generate a new video in which the person in the source image performs the motion of the person in the driving video. The whole scheme can be formulated as:

I_g^t = F(I_s, I_d^t), t = 1, ..., N

where F(·, ·) represents the generative model in this embodiment, N represents the number of driving video frames, and I_g^t represents the generated target video frame, in which the pose of the source character is consistent with that of the driving character.
In the training stage, in order to obtain paired training data as supervision, this embodiment selects a front-facing human-body video frame from a video clip as the source image and uses the video itself as the driving video. A front-facing human-body frame is chosen as the source image because it contains more appearance details of the source person. This embodiment adopts a staged training strategy: the Layout GAN and the Region GAN are first trained for 10 epochs each, and the outputs of the Region GAN are then used to train the overall synthesis network for another 10 epochs.
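The staged schedule can be summarized with the following rough sketch; the trainer objects and their methods are hypothetical placeholders, not components named in this disclosure.

```python
def staged_training(layout_gan, region_gan, synthesis_net, dataloader, epochs: int = 10):
    # Stage 1: train the Layout GAN and the Region GAN separately.
    for _ in range(epochs):
        layout_gan.train_one_epoch(dataloader)
        region_gan.train_one_epoch(dataloader)
    # Stage 2: use the Region GAN outputs to train the overall synthesis network.
    for _ in range(epochs):
        synthesis_net.train_one_epoch(dataloader, foreground_provider=region_gan)
```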
In the inference stage, the choice of driving video is not restricted: any single clear motion video can be used. Moreover, this embodiment supports end-to-end inference.
The generation framework proposed in this embodiment achieves optimal or comparable results on both public datasets (the iPER and SoloDance datasets). The experimental results on the two datasets are shown in Table 1 and Table 2, respectively. SSIM and PSNR are similarity-based evaluation metrics; larger values indicate better generated image quality. LPIPS and FID are feature-distance-based evaluation metrics; smaller values indicate better generated image quality. TCM evaluates the temporal consistency of the generated video; larger is better.
Table 1 Experimental results on the iPER dataset
Method SSIM PSNR LPIPS FID TCM
EDN 0.840 23.39 0.076 56.29 0.361
FSV2V 0.780 20.44 0.110 110.99 0.184
LWGAN 0.825 21.43 0.091 77.99 0.197
C2F 0.849 24.27 0.072 55.07 0.687
The invention 0.856 25.33 0.065 53.04 0.793
Table 2 Experimental results on the SoloDance dataset (the table values are provided as an image in the original document; the reported metrics include SSIM, PSNR and their masked variants Mask-SSIM and Mask-PSNR, discussed below)
As can be seen from Table 1, the embodiment of the invention achieves the best results on all evaluation metrics on the iPER dataset. As can be seen from Table 2, although the SSIM and PSNR metrics are not optimal, the Mask-SSIM and Mask-PSNR metrics are optimal (the values in parentheses in Table 2 are computed by setting the background region of the image to 0 using the human-body mask and then calculating SSIM and PSNR). This indicates that our method generates better human image quality than C2F.
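The masked metrics can be reproduced along the following lines; this is a sketch assuming uint8 images, a boolean human-body mask, and scikit-image's metric functions.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def masked_metrics(generated: np.ndarray, target: np.ndarray, body_mask: np.ndarray):
    """generated, target: (H, W, 3) uint8 images; body_mask: (H, W) boolean human-body mask."""
    # Zero out the background before computing the metrics (Mask-SSIM / Mask-PSNR).
    gen = generated * body_mask[..., None]
    tgt = target * body_mask[..., None]
    mask_ssim = structural_similarity(gen, tgt, channel_axis=-1)
    mask_psnr = peak_signal_noise_ratio(tgt, gen)
    return mask_ssim, mask_psnr
```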
Compared with other methods, the present method handles drastic pose changes better while preserving the appearance details of the source character. In addition, the motion migration images generated by the invention generally have clearer facial details. This is due to the progressive generation model of the invention, in which the initial face-region image provides an important template for the final sharp face.
On the basis of the above embodiments, this embodiment of the invention adds the technical feature of determining the second background image. In addition, this embodiment and the action migration method proposed in the above embodiments belong to the same inventive concept; technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
Fig. 9 is a schematic structural diagram of a motion migration apparatus according to an embodiment of the present invention. The embodiment of the present invention is applicable to situations where the motion of an object in an image or a video is migrated, for example, migrating the motion of a human body. The motion migration apparatus provided by the present invention can implement the action migration method provided by the above embodiments.
As shown in fig. 9, the action migration apparatus in the embodiment of the present invention may include:
an image obtaining module 710, configured to obtain a keypoint connection map of a first object in a driving image and a first segmentation map of each preset region of a second object in a source image; the key point connection graph represents the driving posture of the first object;
the first generating module 720 is configured to generate a second segmentation map, where each preset region conforms to the driving posture, according to the key point connection map and the first segmentation map;
a second generating module 730, configured to generate a second foreground map of the second object in the driving posture according to the second segmentation map and the first foreground map of the source image;
and the synthesizing module 740 is configured to fuse the second foreground image with the first background image of the source image to obtain an action migration image.
In some optional embodiments, the motion migration apparatus further comprises:
the alignment parameter determining module is used for determining an alignment parameter according to the first segmentation graph and the second segmentation graph;
and the image alignment module is used for transforming the first foreground image according to the alignment parameters so as to align the first foreground image with the second segmentation image.
In some optional embodiments, determining the alignment parameter from the first segmentation map and the second segmentation map comprises at least one of the following (a sketch is given after this list):
determining a scaling parameter according to the size of each preset area in the first segmentation map and the second segmentation map;
and determining a displacement parameter according to the center coordinates of each preset area in the first segmentation map and the second segmentation map.
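As a minimal sketch of these two parameters, assuming each preset area is available as a non-empty binary mask in the first and second segmentation maps (the helper name and the mask representation are assumptions):

```python
import numpy as np

def alignment_params(region_src: np.ndarray, region_tgt: np.ndarray):
    """region_src, region_tgt: (H, W) non-empty binary masks of the same preset area
    in the first and the second segmentation map."""
    ys_s, xs_s = np.nonzero(region_src)
    ys_t, xs_t = np.nonzero(region_tgt)
    # Scaling parameter from the region sizes (areas).
    scale = np.sqrt(region_tgt.sum() / region_src.sum())
    # Displacement parameter from the region centre coordinates.
    center_src = np.array([xs_s.mean(), ys_s.mean()])
    center_tgt = np.array([xs_t.mean(), ys_t.mean()])
    displacement = center_tgt - center_src
    return scale, displacement
```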
In some alternative embodiments, the first generating module 720 includes:
the first coding unit is used for coding the first segmentation graph through a first coder to obtain a first characteristic graph;
the second coding unit is used for coding the key point connection graph through a second coder to obtain a second feature graph;
and the first decoding unit is used for decoding the fused graph of the first characteristic graph and the second characteristic graph through a first decoder to obtain a second segmentation graph.
In some optional embodiments, if the driving image is a video frame, the apparatus further comprises:
the historical second segmentation map acquisition module is used for acquiring historical second segmentation maps corresponding to a preset number of video frames before the current video frame;
the historical segmentation graph encoding module is used for encoding each historical second segmentation graph through a third encoder to obtain a third feature graph;
a first decoding unit further configured to:
and decoding the fused graph of the first characteristic graph, the second characteristic graph and the third characteristic graph through a first decoder to obtain a second segmentation graph.
In some optional embodiments, the apparatus is further configured to:
decode the fused graph of the second feature graph and the third feature graph through a second decoder to obtain optical flow parameters and weight parameters;
and adjust the second segmentation graph according to the historical second segmentation graphs corresponding to the video frames before the current video frame, the optical flow parameters and the weight parameters.
In some optional embodiments, the training step of the first generative adversarial network comprises:
acquiring a third segmentation map of each preset area of the first object in a sample driving image;
determining a first loss between a second segmentation map corresponding to a sample source image and the third segmentation map corresponding to the sample driving image;
training the first generative adversarial network according to the first loss.
In some optional embodiments, the second generating module 730 includes:
the fourth coding unit is used for coding the second segmentation graph through a fourth coder to obtain a fourth feature graph;
a fifth encoding unit, configured to encode the first foreground map through a fifth encoder to obtain a fifth feature map;
and the third decoding unit is used for decoding the fusion graph of the fourth feature graph and the fifth feature graph through a third decoder to obtain a second foreground graph.
In some optional embodiments, if the driving image is a video frame, the fifth encoding unit is further configured to:
acquiring historical second foreground images corresponding to a preset number of video frames before the current video frame;
and coding the fusion graph of the first foreground graph and each historical second foreground graph through a fifth coder to obtain a fifth feature graph.
In some optional embodiments, the training step of the second generative adversarial network comprises:
acquiring a third segmentation map of each preset area of the first object in a sample driving image;
determining a second loss between a second foreground map corresponding to a sample source image and a foreground ground-truth map corresponding to the sample source image;
determining a third loss between the second foreground map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image;
and training the second generative adversarial network according to the second loss and the third loss.
In some optional embodiments, the apparatus further comprises:
the texture enhancement parameter determining module is used for determining a texture enhancement parameter according to the first foreground image and the second foreground image;
and the texture enhancement module is used for carrying out texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
In some optional embodiments, the texture enhancement parameter determining module is specifically configured to:
coding the first foreground image through a sixth coder to obtain a sixth characteristic image;
the second foreground image is coded through a seventh coder to obtain a seventh characteristic image;
expanding the sixth characteristic diagram and the seventh characteristic diagram according to the channel to respectively obtain an eighth characteristic diagram and a ninth characteristic diagram;
and taking the correlation matrix of the eighth feature map and the ninth feature map as a texture enhancement parameter.
In some optional embodiments, the texture enhancement module is specifically configured to:
determining a texture enhancement map according to the eighth feature map and the texture enhancement parameter;
concatenating the texture enhancement map and the ninth feature graph along the channel dimension to obtain a tenth feature graph;
and decoding the tenth feature map through a fourth decoder to obtain a second foreground map after texture enhancement.
In some alternative embodiments, the synthesis module 740 is specifically configured to:
according to the second segmentation graph and the key point connection graph, determining a posture mask graph;
determining a second background image according to the attitude mask image and the first background image;
and fusing the second foreground image and the second background image.
In some alternative embodiments, the second object comprises a virtual object.
The motion migration apparatus provided by the embodiment of the present invention and the motion migration method provided by the above embodiments belong to the same inventive concept; technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
Fig. 10 shows a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention. The terminal device in the embodiments of the present invention may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle mounted terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The terminal device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 10, the terminal device 900 may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM903, various programs and data necessary for the operation of the terminal apparatus 900 are also stored. The processing apparatus 901, the ROM 902, and the RAM903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication means 909 may allow the terminal apparatus 900 to perform wireless or wired communication with other apparatuses to exchange data. While fig. 10 illustrates a terminal apparatus 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above-described functions defined in the action migration method provided by the embodiment of the present invention or the like are executed.
The terminal provided by the embodiment of the present invention and the action migration method provided by the above embodiment belong to the same inventive concept, and the technical details that are not described in detail in the embodiment of the present invention can be referred to the above embodiment, and the embodiment of the present invention has the same beneficial effects as the above embodiment.
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the action migration method provided in the above-described embodiment.
It should be noted that the computer readable storage medium in the above embodiments of the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) or FLASH Memory (FLASH), an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In yet another embodiment of the invention, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer-readable storage medium may be included in the terminal device or may exist separately without being mounted in the terminal device.
The computer-readable storage medium carries one or more programs which, when executed by the terminal device, cause the terminal device to:
acquiring a key point connection diagram of a first object in a driving image and a first segmentation diagram of each preset area of a second object in a source image; the key point connection graph represents the driving posture of the first object; generating a second segmentation graph of which each preset area accords with the driving posture according to the key point connection graph and the first segmentation graph; generating a second foreground image of a second object under the driving posture according to the second segmentation image and the first foreground image of the source image; and fusing the second foreground image and the first background image of the source image to obtain an action migration image.
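As a rough illustration of that per-frame flow, the following Python sketch strings the steps together; every component passed in is a hypothetical callable standing in for the encoders and decoders described above, not an implementation defined by this disclosure.

```python
def transfer_motion(source_image, driving_frames, keypoint_extractor, segmenter,
                    foreground_splitter, layout_net, foreground_net, fusion_net):
    """All model components are assumed callables; shapes and interfaces are illustrative."""
    first_seg = segmenter(source_image)                      # first segmentation map of the source person
    first_fg, first_bg = foreground_splitter(source_image)   # first foreground / first background map
    outputs = []
    for frame in driving_frames:
        keypoints = keypoint_extractor(frame)                # keypoint connection map (driving pose)
        second_seg = layout_net(keypoints, first_seg)        # second segmentation map in the driving pose
        second_fg = foreground_net(second_seg, first_fg)     # second foreground map
        outputs.append(fusion_net(second_fg, first_bg, second_seg, keypoints))
    return outputs
```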
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in some detail by the above embodiments, the invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the invention, and the scope of the invention is determined by the scope of the appended claims.

Claims (16)

1. An action migration method, comprising:
acquiring a key point connection diagram of a first object in a driving image and a first segmentation diagram of each preset area of a second object in a source image; the keypoint connectivity graph characterizes a driving pose of the first object;
generating a second segmentation map in which each preset area conforms to the driving pose, according to the key point connection graph and the first segmentation graph;
generating a second foreground map of the second object under the driving pose, according to the second segmentation map and the first foreground map of the source image;
and fusing the second foreground image and the first background image of the source image to obtain an action migration image.
2. The method according to claim 1, wherein after the generating of the second segmentation map in which each preset area conforms to the driving pose, the method further comprises:
determining an alignment parameter according to the first segmentation map and the second segmentation map;
before the generating a second foreground map of the second object in the driving pose according to the second segmentation map and the first foreground map of the source image, further comprising:
and transforming the first foreground map according to the alignment parameters so as to align the first foreground map with the second segmentation map.
3. The method according to claim 2, wherein the determining an alignment parameter from the first segmentation map and the second segmentation map comprises at least one of:
determining a scaling parameter according to the size of each preset area in the first segmentation map and the second segmentation map;
and determining a displacement parameter according to the center coordinates of each preset area in the first segmentation map and the second segmentation map.
4. The method of claim 1, wherein the second segmentation map is generated by a first generative adversarial network, and generating the second segmentation map by the first generative adversarial network comprises:
coding the first segmentation graph through a first coder to obtain a first characteristic graph;
coding the key point connection graph through a second coder to obtain a second feature graph;
and decoding the fused graph of the first characteristic graph and the second characteristic graph through a first decoder to obtain a second segmentation graph.
5. The method of claim 4, wherein if the driving image is a video frame, the method further comprises:
acquiring historical second segmentation maps corresponding to a preset number of video frames before the current video frame;
encoding each historical second segmentation graph through a third encoder to obtain a third feature graph;
correspondingly, the decoding, by the first decoder, the fused graph of the first feature graph and the second feature graph to obtain the second segmentation graph includes:
and decoding the fused graph of the first feature graph, the second feature graph and the third feature graph through a first decoder to obtain a second segmentation graph.
6. The method of claim 5, further comprising, after said obtaining the third feature map:
decoding the fused graph of the second feature graph and the third feature graph through a second decoder to obtain optical flow parameters and weight parameters;
after the obtaining the second segmentation map, further comprising:
and adjusting the second segmentation graph according to a historical second segmentation graph corresponding to a video frame before the current video frame, the optical flow parameter and the weight parameter.
7. The method of claim 4, wherein the training step of the first generative adversarial network comprises:
acquiring a third segmentation map of each preset area of the first object in a sample driving image;
determining a first loss between a second segmentation map corresponding to a sample source image and the third segmentation map corresponding to the sample driving image;
training the first generative adversarial network according to the first loss.
8. The method of claim 1, wherein the second foreground map is generated by a second generative adversarial network, and generating the second foreground map by the second generative adversarial network comprises:
coding the second segmentation graph through a fourth coder to obtain a fourth feature graph;
coding the first foreground image through a fifth coder to obtain a fifth feature image;
and decoding the fusion graph of the fourth feature graph and the fifth feature graph through a third decoder to obtain a second foreground graph.
9. The method of claim 8, wherein if the driving image is a video frame, the encoding the first foreground map by a fifth encoder to obtain a fifth feature map comprises:
acquiring historical second foreground images corresponding to a preset number of video frames before the current video frame;
and encoding the first foreground image and the fused image of each historical second foreground image through a fifth encoder to obtain a fifth feature image.
10. The method of claim 8, wherein the training step of the second generative adversarial network comprises:
acquiring a third segmentation map of each preset area of the first object in a sample driving image;
determining a second loss between a second foreground map corresponding to a sample source image and a foreground ground-truth map corresponding to the sample source image;
determining a third loss between the second foreground map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image;
training the second generative adversarial network according to the second loss and the third loss.
11. The method of claim 1, further comprising, after said generating a second foreground map of the second object under the driving pose:
determining a texture enhancement parameter according to the first foreground image and the second foreground image;
and performing texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
12. The method of claim 1, wherein the fusing the second foreground map with the first background map of the source image comprises:
according to the second segmentation graph and the key point connection graph, determining a posture mask graph;
determining a second background image according to the posture mask image and the first background image;
and fusing the second foreground image and the second background image.
13. The method of any of claims 1-12, wherein the second object comprises a virtual object.
14. An action migration apparatus, comprising:
the image acquisition module is used for acquiring a key point connection graph of a first object in a driving image and a first segmentation graph of each preset area of a second object in a source image; the keypoint connectivity graph characterizes a driving pose of the first object;
the first generation module is used for generating a second segmentation graph of which each preset area accords with the driving posture according to the key point connection graph and the first segmentation graph;
a second generation module, configured to generate a second foreground map of the second object in the driving pose according to the second segmentation map and the first foreground map of the source image;
and the synthesis module is used for fusing the second foreground image and the first background image of the source image to obtain an action migration image.
15. A terminal device, characterized in that the terminal device comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the action migration method of any one of claims 1-13.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the action migration method according to any one of claims 1 to 13.
CN202211154081.1A 2022-09-21 2022-09-21 Action migration method and device, terminal equipment and storage medium Pending CN115471658A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211154081.1A CN115471658A (en) 2022-09-21 2022-09-21 Action migration method and device, terminal equipment and storage medium
PCT/CN2023/097712 WO2024060669A1 (en) 2022-09-21 2023-06-01 Action migration method and apparatus, and terminal device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211154081.1A CN115471658A (en) 2022-09-21 2022-09-21 Action migration method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115471658A true CN115471658A (en) 2022-12-13

Family

ID=84335422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211154081.1A Pending CN115471658A (en) 2022-09-21 2022-09-21 Action migration method and device, terminal equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115471658A (en)
WO (1) WO2024060669A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664603A (en) * 2023-07-31 2023-08-29 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
WO2024060669A1 (en) * 2022-09-21 2024-03-28 北京京东尚科信息技术有限公司 Action migration method and apparatus, and terminal device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10671855B2 (en) * 2018-04-10 2020-06-02 Adobe Inc. Video object segmentation by reference-guided mask propagation
CN115471658A (en) * 2022-09-21 2022-12-13 北京京东尚科信息技术有限公司 Action migration method and device, terminal equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024060669A1 (en) * 2022-09-21 2024-03-28 北京京东尚科信息技术有限公司 Action migration method and apparatus, and terminal device and storage medium
CN116664603A (en) * 2023-07-31 2023-08-29 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN116664603B (en) * 2023-07-31 2023-12-12 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2024060669A1 (en) 2024-03-28

Similar Documents

Publication Publication Date Title
EP3998552A1 (en) Image processing method and apparatus, and electronic device
CN115471658A (en) Action migration method and device, terminal equipment and storage medium
CN112040311B (en) Video image frame supplementing method, device and equipment and storage medium
CN111901598A (en) Video decoding and encoding method, device, medium and electronic equipment
CN110827380A (en) Image rendering method and device, electronic equipment and computer readable medium
WO2023232056A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
Zhao et al. Laddernet: Knowledge transfer based viewpoint prediction in 360◦ video
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN116524151A (en) Method, apparatus and computer program product for generating an avatar
WO2024056030A1 (en) Image depth estimation method and apparatus, electronic device and storage medium
CN110689478A (en) Image stylization processing method and device, electronic equipment and readable medium
CN113610034A (en) Method, device, storage medium and electronic equipment for identifying person entity in video
CN112714263A (en) Video generation method, device, equipment and storage medium
CN116824005A (en) Image processing method and device, storage medium and electronic equipment
CN113808157B (en) Image processing method and device and computer equipment
CN116310615A (en) Image processing method, device, equipment and medium
CN113486787A (en) Face driving and live broadcasting method and device, computer equipment and storage medium
CN116228895B (en) Video generation method, deep learning model training method, device and equipment
CN114418835B (en) Image processing method, device, equipment and medium
CN115661238B (en) Method and device for generating travelable region, electronic equipment and computer readable medium
CN115937338B (en) Image processing method, device, equipment and medium
US20240251083A1 (en) Context-based image coding
CN118038030A (en) Image generation method and device, electronic equipment and storage medium
Pei DMA-SGCN for Video Motion Recognition: A Tool for Advanced Sports Analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination