CN112241708A - Method and apparatus for generating new person image from original person image - Google Patents


Info

Publication number
CN112241708A
CN112241708A
Authority
CN
China
Prior art keywords
image
background
original
foreground
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011120139.1A
Other languages
Chinese (zh)
Inventor
王宝锋
张武强
方志杰
郭子杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mercedes Benz Group AG
Original Assignee
Daimler AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daimler AG
Priority to CN202011120139.1A
Publication of CN112241708A
Legal status: Pending

Classifications

    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06N 3/045 Combinations of networks
    • G06T 3/04 Context-preserving transformations, e.g. by using an importance map
    • G06T 5/30 Erosion or dilatation, e.g. thinning
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06V 10/40 Extraction of image or video features
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20221 Image fusion; Image merging
    • G06T 2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the field of artificial intelligence, and in particular to a method for generating a new person image from an original person image, comprising: providing a first original person image; acquiring pose key points of the first original person image; segmenting the first original person image into a foreground image and a background image; inputting the foreground image and the pose key point data into a foreground feature extraction model to extract an appearance feature vector and a pose feature vector, and inputting the background image into a background feature extraction model to extract a background feature vector; and inputting the appearance feature vector, the pose feature vector and the background feature vector into a synthesis model to synthesize a reconstructed image. The method further comprises: inputting first and second original person images into a trained neural network model to synthesize a new person image having the appearance of the person in the first original person image and the background and person pose of the second original person image. The invention also relates to an apparatus for generating a new person image from an original person image.

Description

Method and apparatus for generating new person image from original person image
Technical Field
The present invention relates to a method for generating a new person image from an original person image. The invention also relates to an apparatus for generating a new person image from an original person image.
Background
In recent years, with the development of artificial intelligence techniques such as deep learning and neural networks, generative models, represented by the Generative Adversarial Network (GAN) and the Variational Autoencoder (VAE), have advanced greatly and are widely used to generate data such as images and speech.
In the field of image generation, methods for generating person images have also become a research focus. However, current person-generation networks and algorithms usually crop the region containing the person as a rectangular block (or patch) and feed the cropped image into the network for training, and they usually focus only on restoring and reconstructing the foreground (i.e. person) region while neglecting reconstruction of the background. Because background and foreground are not decoupled, such methods on the one hand have poor background reconstruction capability: the generated images have blurred backgrounds in which the specific scene cannot be identified. On the other hand, background pixels dilute the network's computing capacity, so the restoration of the foreground image, especially of high-frequency information such as fine details, cannot be optimal. In addition, because this form of data input cannot effectively control the background, images generated in this way are usually limited to the extent of the person, the semantic consistency between foreground and background is poor, images with full scene information cannot be generated, and the generalization of the generative model to different application scenarios is severely limited.
Furthermore, in the field of video entertainment, for example in the movie and video game production industries, there is a need for character "appearance transfer", i.e. transforming the appearance of the person in image A into the appearance of the person in image B without changing the pose and background of image A. However, it is difficult for existing generative networks to achieve a truly natural appearance transfer.
Therefore, it is desirable to provide a person image generation method that can control the pose, foreground and background of the person image and can generate person images in which pose, foreground and background are well integrated.
Disclosure of Invention
The object of the invention is achieved by providing a method for generating a new person image from an original person image, the method comprising at least the steps of:
i) providing a first original character image;
ii) obtaining pose key point data of the person in the first original person image;
iii) segmenting the first original person image into a foreground image and a background image;
iv) inputting the foreground image and the pose key point data into a foreground feature vector extraction model to extract a character appearance feature vector z_a and a character pose key point feature vector z_c, and inputting the background image into a background feature vector extraction model to extract a background feature vector z_b; and
v) inputting the character appearance feature vector z_a, the character pose key point feature vector z_c and the background feature vector z_b into an image synthesis model to synthesize a reconstructed image of the first original person image.
According to another aspect of the invention, the object of the invention is also achieved by a method for generating a new person image from an original person image, the method comprising at least the steps of:
i') providing a first original personal image and a second original personal image different from the first original personal image;
ii') obtaining pose key point data of the respective person in the first original person image and the second original person image;
iii') segmenting the first original person image and the second original person image into a foreground image and a background image, respectively;
iv') inputting the foreground image of the first original character image and the pose key point data of the second original character image into the foreground feature vector extraction model to extract a character appearance feature vector z_a and a character pose key point feature vector z_c, and inputting the background image of the second original character image into a background feature vector extraction model to extract a background feature vector z_b; and
v') inputting the character appearance feature vector z_a, the character pose key point feature vector z_c and the background feature vector z_b into an image synthesis model to synthesize a new person image, which has the background and character pose of the second original character image and the character appearance of the first original character image.
According to an alternative embodiment of the invention, the foreground feature vector extraction model is configured as a foreground generative network for reconstructing or generating foreground images, and the extracted character appearance feature vector z_a and character pose key point feature vector z_c are dimension-reduced features extracted from the foreground image and the pose key point data within the foreground generative network.
According to an alternative embodiment of the invention, the background feature vector extraction model is configured as a background generative network for reconstructing the background image, and the extracted background feature vector z_b is a dimension-reduced feature extracted from the background image within the background generative network.
According to an alternative embodiment of the invention, the foreground or background generative network model is constructed using any of the following generative networks: generative adversarial neural networks, variational autoencoders, or models derived from them.
According to an alternative embodiment of the invention, step iii) or iii') is performed in the following way:
a) generating a character mask based on the pose key point data;
b) performing image segmentation on the first original character image and the second original character image using the character mask, to generate foreground images containing only the character and background images containing only the background.
According to an alternative embodiment of the invention, step a) is performed in the following manner:
-connecting pose key points to each other based on real human skeletal structure to generate a skeletal binary image;
-performing dilation and/or erosion processing on the skeleton binary image; and
-filling isolated zero-valued void regions in the dilated and/or eroded skeleton binary image in order to generate a human mask.
According to still another aspect of the present invention, the object of the present invention is also achieved by a method for generating a new personal image from an original personal image, the method comprising: inputting the first original character image or the first and second original character images to a trained neural network model composed of a foreground generating network model, a background generating network model and a synthesizing network model to synthesize a new character image; the neural network model is trained in the following way:
i ") providing a training image containing a person;
ii ") obtaining pose key point data of the person in the training image;
iii ") segmenting the training image into a foreground image and a background image;
iv ") inputting the foreground image and the pose key point data into a foreground generation type network model to train the foreground generation type network model, and inputting the background image into a background generation type network model to train the background generation type network model; and
v") extracting, within the foreground generative network model, a character appearance feature vector z_a and a character pose key point feature vector z_c as dimension-reduced features from the foreground image and the pose key point data, and extracting, within the background generative network model, a background feature vector z_b as a dimension-reduced feature from the background image, and inputting them into the synthesis network model to train the synthesis network model.
According to an alternative embodiment of the invention, the foreground generative network model, the background generative network model and the synthesis network model are trained independently, interactively or jointly.
According to a further aspect of the invention, the object of the invention is also achieved by an apparatus for generating a new person image from an original person image, the apparatus comprising a processor and a computer readable storage means communicatively connected to the processor, the computer readable storage means having stored thereon a computer program for implementing the method described herein, when the computer program is executed by the processor.
According to yet another aspect of the invention, the object of the invention is also achieved by an apparatus for generating a new person image from an original person image, said apparatus being configured for implementing the method described herein and comprising:
a pose key point recognition means configured to determine pose key point data of a person in the input original person image;
a person mask generation model configured to generate a person mask;
a front-background segmentation model configured to segment an input original character image into a foreground image and a background image;
a foreground feature vector extraction model configured to extract a character appearance feature vector z_a and a character pose key point feature vector z_c from the foreground image and the pose key point data;
a background feature vector extraction model configured to extract a background feature vector z_b from the background image; and
an image synthesis model configured to synthesize a new person image from the character appearance feature vector z_a, the character pose key point feature vector z_c and the background feature vector z_b.
According to the invention, the following is achieved: during training, the foreground and the background are first decoupled and learned by two independent generative networks, and are then fused by a synthesis network, completing the mixed training of the entire image generation model.
The invention thus provides a pedestrian image generation method based on generative networks and mixed foreground-background training. By decoupling the pedestrian from the background and fusing them again at different training stages, it effectively improves the image quality of both the foreground and the background of the generated person image as well as their semantic consistency, and greatly improves the generalization of the generative model across application scenarios.
Further advantages and advantageous embodiments of the inventive subject matter are apparent from the description, the drawings and the claims.
Drawings
Further features and advantages of the present invention will be further elucidated by the following detailed description of an embodiment thereof, with reference to the accompanying drawings. The attached drawings are as follows:
fig. 1 is a schematic block diagram of an apparatus 100 for generating a new personal image from an original personal image according to an exemplary embodiment of the present invention;
FIG. 2 illustrates a schematic diagram of pose keypoints, according to an exemplary embodiment of the invention;
FIG. 3 shows a flow diagram of an image segmentation process according to an exemplary embodiment of the present invention;
FIG. 4 is a block schematic diagram of a neural network model for generating a new person image from an original person image, according to an exemplary embodiment of the present invention;
FIG. 5 shows a flow diagram of a model training method 200 for training a neural network model, according to an example embodiment of the present invention;
FIG. 6 shows a flowchart of an image segmentation step according to an exemplary embodiment of the present invention;
FIG. 7 shows a flowchart of the steps of generating a human mask according to an exemplary embodiment of the invention;
FIG. 8 illustrates a flowchart of a method for synthesizing a new person image from two original person images, according to an exemplary embodiment of the present invention; and
fig. 9 shows a flowchart of a method for reconstructing an original person image according to an exemplary embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and exemplary embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of the invention. In the drawings, the same or similar reference numerals refer to the same or equivalent parts.
Fig. 1 is a schematic block diagram of an apparatus 100 for generating a new personal image from an original personal image according to an exemplary embodiment of the present invention. The device 100 includes a processor 10 and a computer readable storage device 20 communicatively coupled to the processor 10. The computer-readable storage means 20 have stored therein a computer program for implementing the method for generating a person image, which will be explained in detail below, when the computer program is executed by the processor 10.
According to an exemplary embodiment, a display device 30 is provided in communicative connection with the processor 10. By means of the display device 30, the user can view the original personal image to be processed by the device 100 and the new personal image generated by the device 100.
According to an exemplary embodiment, an input device 40 is provided in communicative connection with the processor 10. By means of the input device 40, the user can select or input an original person image to be processed by the device 100. The input device 40 may include, for example: a keyboard, a mouse, and/or a touch screen.
According to an exemplary embodiment, a camera 50 is provided in communicative connection with the processor 10. By means of the camera device 50, the user can take a person image as an original person image to be processed by the device 100. The imaging device 50 is, for example, an in-vehicle imaging device.
According to an exemplary embodiment, a personal image set composed of a plurality of personal images is provided. The original personal image set may be stored in the computer readable storage device 20 or another storage device communicatively connected to the processor 10.
Fig. 4 is a schematic block diagram of a neural network model 400 for generating a new human image from an original human image according to an exemplary embodiment of the present invention.
The neural network model 400 mainly includes: a pose key point recognition model 410 configured to recognize the person's pose key points in the input original person image; a character mask generation model 420 configured to generate a character mask I_Mask that just covers the entire person in the input original person image; a front-background segmentation model 430 configured, for example, to segment the input original person image into a foreground image and a background image based on the character mask I_Mask; a foreground feature vector extraction model 440 configured to extract a character appearance feature vector z_a and a character pose key point feature vector z_c; a background feature vector extraction model 450 configured to extract a background feature vector z_b; and an image synthesis model 460 configured to synthesize a new person image from the character appearance feature vector z_a, the character pose key point feature vector z_c and the background feature vector z_b.
In an example, the foreground feature vector extraction model 440, the background feature vector extraction model 450, and the image synthesis model 460 are configured as a neural network model of a suitable form, such as a generative network model, and in particular as a foreground generative network model, a background generative network model, and a synthetic network model, respectively, trained by the model training method 200, which will be described in detail below with reference to fig. 5.
In one example, when the input original character image is accompanied by annotated pose keypoint information, the pose keypoint recognition model may be omitted.
Fig. 5 shows a flowchart of a model training method 200 for training a neural network model 400 for generating new personal images from original personal images, according to an exemplary embodiment of the present invention.
According to the model training method 200, in step S210, an original character image is provided. Illustratively, the original person image may be any one of the above-mentioned original person image sets. Alternatively, the original person image is a person, such as a pedestrian image, captured by the user by means of the camera 50, such as an in-vehicle camera, or a frame of person image captured from a video stream.
Next, in step S220, pose key point data of the person in the original person image is acquired. Pose key points generally include, but are not limited to: the left and right eyes, left and right ears, nose, mouth, neck, left and right shoulders, left and right hips, left and right elbows, left and right wrists, left and right knees, left and right ankles, etc., as shown by the white dots 50 in fig. 2.
In one example, pose key point data may be obtained by manually annotating the image. In another example, the pose key point data may be computed by inputting the original person image into the pose key point recognition model 410. The pose key point recognition model may be constructed using human pose estimation algorithms such as OpenPose, PifPaf, HR-Net, and the like.
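By way of illustration only, the fragment below obtains person pose key points with torchvision's pre-trained Keypoint R-CNN; it is merely a stand-in for the OpenPose/PifPaf/HR-Net detectors mentioned above and not the patent's model 410, and its 17 COCO key points differ slightly from the key point set shown in fig. 2.

```python
# Hedged sketch: pose key point extraction with a pre-trained Keypoint R-CNN
# (a stand-in for OpenPose / PifPaf / HR-Net; not the patent's actual model 410).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# torchvision >= 0.13 API; older versions use pretrained=True instead of weights="DEFAULT".
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def get_pose_keypoints(image_path, score_thresh=0.9):
    """Return an (N, 17, 3) array of [x, y, visibility] key points, one row per detected person."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]
    keep = out["scores"] > score_thresh
    return out["keypoints"][keep].cpu().numpy()  # COCO order: nose, eyes, ears, shoulders, ...

# keypoints = get_pose_keypoints("person.jpg")
```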
According to an exemplary embodiment of the present invention, pre-labeling of pose keypoints may be performed for each image in the original character image set. In this case, the original personal image is provided in step S210 together with the pose key point data of the person in the original personal image.
Additionally or alternatively, the labeled original person image set may be divided into a training subset data_train, a validation subset data_val and a test subset data_test.
Then, in step S230, the foreground and background of the original character image are segmented based on the pose key point data acquired in step S220, for example, by means of the character mask generation model 420 and the front-background segmentation model 430 to obtain a foreground image and a background image.
According to an exemplary embodiment, step S220 further comprises (see fig. 6):
in step S221, a character mask I is generated based on the pose key point dataMask
Then, in step S222, the human mask I is usedMaskImage segmentation is performed on the original person image to generate a foreground image containing only persons and a background image containing only the background.
To this end, referring to fig. 3, fig. 3 shows an image segmentation process according to an exemplary embodiment of the present invention, wherein an original person image 31 is segmented into a background image 33 and a foreground image 34 by means of a generated person mask 32.
Further, step S221 may further, exemplarily, include (see fig. 7):
in step S2211, connecting the acquired pose key points to each other based on the real human skeleton structure to generate a character skeleton binary image;
in step S2212, a closing operation (expansion) and then erosion (erosion) processing is performed on the skeleton binary image; and
in step S2213, a filling process (fills) is performed on the expanded and eroded skeleton binary image to fill isolated zero-value hole regions in the skeleton binary image, thereby obtaining a human mask IMask
In an example, the scale of the dilation and/or erosion operator may be determined according to the human skeleton size.
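A minimal sketch of steps S2211 to S2213 using OpenCV is given below; the skeleton connectivity, the line thickness and the kernel size are illustrative assumptions rather than values prescribed by the patent.

```python
# Hedged sketch of steps S2211-S2213: skeleton binary image -> closing (dilation then erosion) -> hole filling.
import cv2
import numpy as np

# Illustrative skeleton: pairs of key point indices to connect (COCO-style indices, assumed).
SKELETON = [(5, 6), (5, 7), (7, 9), (6, 8), (8, 10), (5, 11), (6, 12),
            (11, 12), (11, 13), (13, 15), (12, 14), (14, 16), (0, 5), (0, 6)]

def person_mask(keypoints, image_shape, thickness=15, kernel_size=25):
    """keypoints: (K, 3) array of [x, y, visibility]; returns a binary character mask I_Mask."""
    h, w = image_shape[:2]
    skel = np.zeros((h, w), np.uint8)
    for a, b in SKELETON:                                    # S2211: draw the bones
        if keypoints[a, 2] > 0 and keypoints[b, 2] > 0:
            pa = tuple(int(v) for v in keypoints[a, :2])
            pb = tuple(int(v) for v in keypoints[b, :2])
            cv2.line(skel, pa, pb, 255, thickness)
    k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    closed = cv2.morphologyEx(skel, cv2.MORPH_CLOSE, k)      # S2212: dilation followed by erosion
    # S2213: fill isolated zero-valued holes enclosed by the silhouette via flood fill from a corner.
    flood = closed.copy()
    cv2.floodFill(flood, np.zeros((h + 2, w + 2), np.uint8), (0, 0), 255)
    return closed | cv2.bitwise_not(flood)
```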
Alternatively, the person mask may be generated by other suitable methods known in the art, such as manual semantic segmentation labeling (i.e., manually labeling the pixels of the original person image that contain the person) or image pre-processing.
According to an exemplary embodiment, the acquired foreground image and background image may be the same size as the original person image, except that the background area in the foreground image is zero-valued, while the foreground area in the background image is zero-valued, as shown in fig. 3.
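Under this convention, the segmentation of step S222 reduces to element-wise masking, e.g. as in the following minimal sketch (assuming a binary person mask of the same height and width as the image):

```python
# Hedged sketch: split an original person image into foreground/background images of equal size,
# zeroing the complementary region in each (as described for fig. 3).
import numpy as np

def split_foreground_background(image, mask):
    """image: (H, W, 3) uint8 array; mask: (H, W) binary character mask I_Mask."""
    m = (mask > 0)[..., None]           # broadcast the mask over the colour channels
    foreground = np.where(m, image, 0)  # person pixels only, background zeroed
    background = np.where(m, 0, image)  # background pixels only, person zeroed
    return foreground, background
```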
Next, in step S240, the foreground image obtained in step S230 and the pose key point data obtained in step S220 are input into a foreground generation type network model to train the foreground generation type network model, wherein the foreground generation type network model is configured to implement reconstruction and generation of a foreground character. The foreground generative network model corresponds to or includes the foreground feature vector extraction model 440 described previously.
In general, the foreground generative network model may be described by the following equation:
x̂_f = G_f(x_a, x_c)
where x_c denotes the input pose key point data, x_a denotes the input foreground image data, and x̂_f denotes the generated or reconstructed foreground image.
According to an exemplary embodiment, the foreground generative network model is constructed using any of the following generative networks: generative adversarial neural networks, variational autoencoders, or models derived from them.
According to an exemplary embodiment, the generator of the foreground generative network model comprises an encoding (encode) and a decoding (decode) process. During training and generation, the generator encodes and dimension-reduces the data x_c and x_a until they reach a bottleneck layer, yielding the dimension-reduced feature vectors z_a and z_c. The features of this layer contain the principal component information of the character's structure and appearance and are highly controllable. The dimension-reduced feature vectors z_a and z_c then enter the decoding process to obtain the generated or reconstructed foreground image x̂_f.
During training, in order to guide the model G_f to converge and to obtain better image restoration and generalization capability, a suitably designed loss function L_Gf constrains the training of the foreground generative network model.
According to an exemplary embodiment of the present invention, the loss function of the foreground generative network model is:
L_Gf = L(x̂_f, x_a) + λ1 · Div(·) + λ2 · GAN(·)
where Div(·) denotes the divergence of the feature vectors, L(·) denotes an L1 or L2 norm loss, GAN(·) denotes a GAN loss, and λ1 and λ2 denote weight parameters. The detailed definitions of these terms are not intended to limit the present invention.
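To make the encoding-bottleneck-decoding structure more concrete, the PyTorch fragment below sketches one possible toy form of such a generator G_f: two small encoders reduce x_a and x_c to bottleneck vectors z_a and z_c, and a decoder reconstructs a 32x32 foreground. The layer sizes, the 18-channel heat-map encoding of the key points and the loss weighting are illustrative assumptions, not the patent's actual architecture.

```python
# Hedged sketch of a foreground generator G_f with an explicit bottleneck (assumed toy architecture).
import torch
import torch.nn as nn

class ForegroundGenerator(nn.Module):
    def __init__(self, z_dim=128):
        super().__init__()
        def enc(in_ch):                       # simple convolutional encoder down to a vector
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 4, 2, 1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, z_dim))
        self.enc_appearance = enc(3)          # x_a: foreground image
        self.enc_pose = enc(18)               # x_c: key points rendered as 18 heat-map channels (assumed)
        self.decoder = nn.Sequential(         # bottleneck -> reconstructed 32x32 foreground (toy resolution)
            nn.Linear(2 * z_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, x_a, x_c):
        z_a, z_c = self.enc_appearance(x_a), self.enc_pose(x_c)
        x_f_hat = self.decoder(torch.cat([z_a, z_c], dim=1))
        return x_f_hat, z_a, z_c

# Illustrative loss in the spirit of the formula above (GAN term omitted for brevity):
# loss = l1(x_f_hat, x_a) + lambda1 * divergence_term + lambda2 * gan_loss
```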
According to an exemplary embodiment, before the foreground image is input to the generator, image pre-processing is first performed on the foreground image, and the pre-processed foreground image is then input to the generator for the encoding and decoding processes. Illustratively, the image pre-processing comprises a series of operations such as limb cropping, rotation and/or normalization of the foreground image based on the pose key point data. The present invention is not particularly limited with respect to the features and details of the image pre-processing operations.
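Purely as an illustration of pose-guided pre-processing (the patent deliberately leaves the details open), the fragment below rotates a foreground crop so that the shoulder line is horizontal and normalizes the pixel values; the key point indices and the specific operations are assumptions.

```python
# Hedged sketch of pose-guided pre-processing: rotate the foreground so the shoulder line
# is horizontal, then normalize pixel values to [0, 1] (illustrative choices only).
import cv2
import numpy as np

def preprocess_foreground(foreground, keypoints, l_shoulder=5, r_shoulder=6):
    """foreground: (H, W, 3) uint8 image; keypoints: (K, 3) [x, y, visibility] array."""
    (x1, y1), (x2, y2) = keypoints[l_shoulder, :2], keypoints[r_shoulder, :2]
    angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))        # tilt of the shoulder line
    h, w = foreground.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    upright = cv2.warpAffine(foreground, rot, (w, h))
    return upright.astype(np.float32) / 255.0                # normalization step
```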
In step S250, the background image acquired in step S230 is input into a background generation type network model to train the background generation type network model, wherein the background generation type network model is configured to be used for realizing reconstruction of the background. The background generated network model may correspond to or include the background feature vector extraction model 450 previously described.
Unlike the foreground-generated network model, the background-generated network model focuses on the reconstruction of the background, and does not need to change or migrate the content contained in the background.
In general, the background generative network model may be described by the following equation:
x̂_b = G_b(x_b)
where x_b denotes the input background image data and x̂_b denotes the reconstructed background image.
According to an exemplary embodiment, the background generative network model is constructed using any of the following generative networks: generative adversarial neural networks, variational autoencoders, or models derived from them.
Without loss of generality, and similarly to the foreground generative network model, the generator of the background generative network model also comprises encoding and decoding processes. During training and generation, the generator first encodes and dimension-reduces the input background image data x_b until it reaches the bottleneck layer, yielding the dimension-reduced feature vector z_b. The features of this layer generally contain the principal component information of the background and are highly controllable. The feature vector z_b then enters the decoding process to obtain the reconstructed background image x̂_b.
Further, during training, in order to guide the model G_b to converge and to achieve better background reconstruction capability, a suitably designed loss function L_Gb constrains the training of the background generative network model.
According to an exemplary embodiment of the present invention, the loss function of the background generative network model is:
L_Gb = L(Φ(x̂_b), Φ(x_b))
where Φ denotes a visual feature vector extractor, which may be a network such as VGG or ResNet, several layers thereof, or the original image pixels themselves; L denotes a similarity equation used to measure the similarity of two visual feature vectors, which may be the known L1 distance equation and/or L2 distance equation.
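As an illustration, one concrete choice of Φ (a truncated VGG16) and L (the L1 distance) could be implemented as follows; the cut-off layer and the use of PyTorch are assumptions, and inputs are assumed to be already normalized.

```python
# Hedged sketch of the background reconstruction loss: L1 distance between
# VGG features of the reconstructed and original background (one possible choice of Phi and L).
import torch
import torch.nn as nn
from torchvision.models import vgg16

class BackgroundPerceptualLoss(nn.Module):
    def __init__(self, layers=16):            # first 16 layers of VGG16 features, assumed cut-off
        super().__init__()
        self.phi = vgg16(weights="DEFAULT").features[:layers].eval()
        for p in self.phi.parameters():
            p.requires_grad_(False)            # Phi acts as a fixed feature extractor
        self.l1 = nn.L1Loss()

    def forward(self, x_b_hat, x_b):
        return self.l1(self.phi(x_b_hat), self.phi(x_b))
```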
According to an exemplary embodiment, the background image acquired in step S230 may be directly input to the generator of the background-generating network model to implement the encoding and decoding processes. Alternatively, before inputting the background image into the generator of the background generating network model, the background image may be first area-planned or cropped to obtain a combination of a series of image blocks (patch) and then the obtained combination of image blocks may be input into the generator to perform the encoding and decoding processes.
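The patch-based alternative mentioned above could, for example, look like the following sketch; the patch size and the non-overlapping grid layout are assumptions.

```python
# Hedged sketch: crop a background image into a combination of fixed-size image blocks (patches)
# before feeding it to the background generator (the patch size is an assumed value).
import numpy as np

def background_patches(background, patch=64):
    """background: (H, W, 3) array; returns a list of (patch, patch, 3) blocks."""
    h, w = background.shape[:2]
    return [background[i:i + patch, j:j + patch]
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]
```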
Then, in step S260, the character appearance feature vector z_a and the character pose key point feature vector z_c of the encoder bottleneck layer obtained in step S240, together with the background feature vector z_b of the encoder bottleneck layer obtained in step S250, are input into the synthesis network model to train the synthesis network model.
In general, the synthesis network model may be described by the following equation:
x̂ = G_s(z_a, z_c, z_b)
where x̂ denotes the synthesized person image, which is a reconstruction of the entire original person image.
Further, during the training of the synthesis network model, in order to guide the model G_s to converge and to promote the fusion of foreground and background so that the generated image achieves a good global effect, a suitably designed loss function L_Gs constrains the training of the synthesis network model.
According to an example of the present invention, the loss function of the synthesis network model is:
L_Gs = L(Φ(x̂), Φ(x))
where x denotes the original person image, Φ denotes a visual feature vector extractor, which may be a network such as VGG or ResNet, several layers thereof, or the original image pixels themselves, and L denotes a similarity equation used to measure the similarity of two visual feature vectors, which may be an L1 distance equation and/or an L2 distance equation, etc.
In the model training method according to the present invention, a foreground generative network model for generating or reconstructing a foreground image, a background generative network model for reconstructing a background image, and a synthetic network model may be trained independently, interactively, or jointly. In particular, the training of the three network models may be implemented in any suitable order or in any suitable interactive or joint manner.
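For instance, one admissible schedule, namely independent pre-training of the two generative models followed by training of the synthesis model on their bottleneck features, could be organized as sketched below; the schedule, the helper methods and the epoch counts are purely illustrative placeholders.

```python
# Hedged sketch of one possible training schedule (independent pre-training, then synthesis training).
# All train_step/encode helpers are hypothetical placeholders for the losses and models described above.
def train_all(dataset, g_f, g_b, g_s, epochs_fg=10, epochs_bg=10, epochs_syn=10):
    for _ in range(epochs_fg):                  # stage 1: foreground generative network
        for x_a, x_c, x_b, x in dataset:
            g_f.train_step(x_a, x_c)            # minimizes the foreground loss
    for _ in range(epochs_bg):                  # stage 2: background generative network
        for x_a, x_c, x_b, x in dataset:
            g_b.train_step(x_b)                 # minimizes the background loss
    for _ in range(epochs_syn):                 # stage 3: synthesis network on bottleneck features
        for x_a, x_c, x_b, x in dataset:
            z_a, z_c = g_f.encode(x_a, x_c)
            z_b = g_b.encode(x_b)
            g_s.train_step(z_a, z_c, z_b, x)    # minimizes the synthesis loss against the full image
    return g_f, g_b, g_s
```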
Fig. 8 shows a flowchart of a method 300 for synthesizing a new personal image from two original personal images according to an exemplary embodiment of the present invention. The method 300 may be implemented by inputting two original human images to be synthesized into the neural network model 400 trained by the model training method 200 explained above in connection with fig. 5.
In the method 300, in step S310, a first original person image is provided.
Then, in step S320, pose keypoint data in the first original person image is acquired, for example, by inputting the first original person image into the pose keypoint recognition model 410.
In step S330, the first original character image is segmented into a foreground image and a background image based on the pose key point data acquired in step S320, for example, by means of the character mask generation model 420 and the front-background segmentation model 430.
On the other hand, in step S340, a second original personal image different from the first original personal image is provided.
Next, in step S350, pose keypoint data in the second original person image is acquired, for example, by inputting the second original person image into the pose keypoint recognition model 410.
In step S360, the second original character image is segmented into a foreground image and a background image based on the pose key point data acquired in step S350, for example, by means of the character mask generation model 420 and the front-background segmentation model 430.
Then, in step S370, a character appearance feature vector z_a and a character pose key point feature vector z_c are extracted as dimension-reduced features (e.g., features of the encoder bottleneck layer) from the foreground image of the first original person image and the pose key point data of the second original person image, for example by inputting the foreground image of the first original person image obtained in step S330 and the pose key point data of the second original person image obtained in step S350 into the foreground feature vector extraction model 440.
On the other hand, in step S380, a background feature vector z_b is extracted as a dimension-reduced feature (e.g., a feature of the encoder bottleneck layer) from the background image of the second original person image, for example by inputting the background image of the second original person image acquired in step S360 into the background feature vector extraction model 450.
Next, in step S390, a new person image is synthesized, for example by inputting the character appearance feature vector z_a and the character pose key point feature vector z_c extracted in step S370, together with the background feature vector z_b extracted in step S380, into the image synthesis model 460. The new person image has the background and person pose of the second original person image but the appearance (i.e. look and dress) of the person in the first original person image.
Thus, the method 300 may be essentially understood as a method of transforming the appearance of a person in one image of a person into the appearance of a person in another image of the person while maintaining the background and pose of the person. The method 300 may be used in a variety of situations, such as in the field of data enhancement or video entertainment.
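Putting the pieces together, the inference path of method 300 can be summarized by the following sketch, in which all model arguments are hypothetical callables standing in for models 410 to 460 rather than the patent's concrete implementations.

```python
# Hedged end-to-end sketch of method 300: appearance from image A, pose and background from image B.
def transfer_appearance(image_a, image_b, keypoint_model, mask_fn, split_fn,
                        foreground_encoder, background_encoder, synthesis_net):
    # Steps S310-S360: key points, masks and foreground/background splits for both images.
    kp_a = keypoint_model(image_a)
    kp_b = keypoint_model(image_b)
    fg_a, _ = split_fn(image_a, mask_fn(kp_a, image_a.shape))
    _, bg_b = split_fn(image_b, mask_fn(kp_b, image_b.shape))
    # Step S370: appearance vector from image A's foreground, pose vector from image B's key points.
    z_a, z_c = foreground_encoder(fg_a, kp_b)
    # Step S380: background vector from image B's background.
    z_b = background_encoder(bg_b)
    # Step S390: fuse the three vectors into the new person image.
    return synthesis_net(z_a, z_c, z_b)
```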
FIG. 9 shows a flowchart of a method 500 for reconstructing an image of an original person, according to an example embodiment of the present invention. The method 500 may be implemented by inputting an original human image to be reconstructed into the neural network model 400 trained by the model training method 200 explained above in connection with fig. 5.
In the method 500, in step S510, an original person image is provided.
Then, in step S520, pose keypoint data in the original person image is acquired, for example, by inputting the original person image into the pose keypoint recognition model 410.
In step S530, the original character image is segmented into a foreground image and a background image based on the pose key point data acquired in step S520, for example, by means of the character mask generation model 420 and the front-background segmentation model 430.
Then, in step S540, a character appearance feature vector z_a and a character pose key point feature vector z_c are extracted as dimension-reduced features (e.g., features of the encoder bottleneck layer) from the foreground image and the pose key point data, for example by inputting the foreground image acquired in step S530 and the pose key point data acquired in step S520 into the foreground feature vector extraction model 440.
On the other hand, in step S550, a background feature vector z_b is extracted as a dimension-reduced feature (e.g., a feature of the encoder bottleneck layer) from the background image, for example by inputting the background image acquired in step S530 into the background feature vector extraction model 450.
Next, in step S560, a new person image is synthesized, for example by inputting the character appearance feature vector z_a and the character pose key point feature vector z_c extracted in step S540, together with the background feature vector z_b extracted in step S550, into the image synthesis model 460. The new person image is a reconstruction or restoration of the entire input original person image.
The method 500 may be used in a variety of situations, such as data enhancement or video entertainment.
According to the invention, as the three network models of the foreground, the background and the synthesis are adopted, the generated image not only has vivid foreground and background, but also has natural transition between the foreground and the background and obviously enhanced scene semantic consistency.
Although some embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the invention. The appended claims and their equivalents are intended to cover all such modifications, substitutions and changes as fall within the true scope and spirit of the invention.

Claims (10)

1. A method (500) for generating a new person image from an original person image, the method (500) comprising at least the steps of:
i) providing a first original character image;
ii) obtaining pose key point data of the person in the first original person image;
iii) segmenting the first original person image into a foreground image and a background image;
iv) inputting the foreground image and the pose key point data into a foreground feature vector extraction model (440) to extract a character appearance feature vector z_a and a character pose key point feature vector z_c, and inputting the background image into a background feature vector extraction model (450) to extract a background feature vector z_b; and
v) inputting the character appearance feature vector z_a, the character pose key point feature vector z_c and the background feature vector z_b into an image synthesis model (460) to synthesize a reconstructed image of the first original person image.
2. A method (300) for generating a new person image from an original person image, said method (300) comprising at least the steps of:
i') providing a first original personal image and a second original personal image different from the first original personal image;
ii') obtaining pose key point data of the respective person in the first original person image and the second original person image;
iii') segmenting the first original person image and the second original person image into a foreground image and a background image, respectively;
iv') inputting the foreground image of the first original character image and the pose key point data of the second original character image into a foreground feature vector extraction model (440) to extract a character appearance feature vector z_a and a character pose key point feature vector z_c, and inputting the background image of the second original character image into a background feature vector extraction model (450) to extract a background feature vector z_b; and
v') inputting the character appearance feature vector z_a, the character pose key point feature vector z_c and the background feature vector z_b into an image synthesis model (460) to synthesize a new person image having the background and character pose of the second original character image and the character appearance of the first original character image.
3. The method (300,500) of claim 1 or 2,
the foreground feature vector extraction model (440) is configured as a foreground-generating network for reconstructing or generating foreground images, the extracted character appearance feature vectors
Figure FDA0002731709680000021
And character pose key point feature vectors
Figure FDA0002731709680000022
The feature is a dimension reduction feature extracted from foreground image and posture key point data in a foreground generation type network; and/or
The background feature vector extraction model (450) is configured as a background generating network for reconstructing a background image, the extracted background feature vectors
Figure FDA0002731709680000023
Is a dimension reduction feature extracted from the background image in the background generation type network.
4. The method (300,500) of claim 3, wherein
a foreground or background generative network model is constructed using any one of the following generative networks: generative adversarial neural networks (GAN), variational autoencoders (VAE) and models derived therefrom.
5. The method (300,500) according to any one of the preceding claims, wherein step iii) or iii') is performed in the following way:
a) generating a character mask based on the pose key point data;
b) the first and second original person images are image-segmented using the person mask to generate a foreground image containing substantially only persons and a background image containing substantially only backgrounds.
6. The method (300,500) according to claim 5, wherein step a) is performed in the following manner:
mutually connecting the posture key points based on the real human skeleton structure to generate a skeleton binary image;
performing expansion and/or corrosion treatment on the skeleton binary image; and
isolated zero-valued void regions in the dilated and/or eroded skeleton binary image are filled in to generate a human mask.
7. A method (300,500) for generating a new person image from an original person image, the method comprising: inputting the first original character image or the first and second original character images into a trained neural network model composed of a foreground generation type network model, a background generation type network model and a synthesis network model to synthesize a new character image; the neural network model is trained in the following way:
i ") providing a training image containing a person;
ii ") obtaining pose key point data of the person in the training image;
iii ") segmenting the training image into a foreground image and a background image;
iv ") inputting the foreground image and the pose key point data into a foreground generation type network model to train the foreground generation type network model, and inputting the background image into a background generation type network model to train the background generation type network model; and
v") extracting, within the foreground generative network model, a character appearance feature vector z_a and a character pose key point feature vector z_c as dimension-reduced features from the foreground image and the pose key point data, and extracting, within the background generative network model, a background feature vector z_b as a dimension-reduced feature from the background image, and inputting them into the synthesis network model to train the synthesis network model.
8. The method (300,500) of claim 7,
the foreground generative network model, the background generative network model and the synthetic network model are trained independently, interactively or jointly.
9. An apparatus (100) for generating a new person image from an original person image, the apparatus (100) comprising a processor (10) and a computer readable storage device (20) communicatively connected to the processor (10), the computer readable storage device (20) having stored thereon a computer program for carrying out the method (300,500) according to any one of the preceding claims, when the computer program is executed by the processor (10).
10. An apparatus (400) for generating a new person image from an original person image, the apparatus (400) being configured to implement the method (300,500) according to any one of claims 1-8 and comprising:
a pose key point recognition means (410) configured to determine pose key point data of a person in the input original person image;
a person mask generation model (420) configured to generate a person mask;
a front-background segmentation model (430) configured to segment an input original character image into a foreground image and a background image;
a foreground feature vector extraction model (440) configured to extract a character appearance feature vector z_a and a character pose key point feature vector z_c from the foreground image and the pose key point data;
a background feature vector extraction model (450) configured to extract a background feature vector z_b from the background image; and
an image synthesis model (460) configured to synthesize a new person image from the character appearance feature vector z_a, the character pose key point feature vector z_c and the background feature vector z_b.
CN202011120139.1A 2020-10-19 2020-10-19 Method and apparatus for generating new person image from original person image Pending CN112241708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011120139.1A CN112241708A (en) 2020-10-19 2020-10-19 Method and apparatus for generating new person image from original person image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011120139.1A CN112241708A (en) 2020-10-19 2020-10-19 Method and apparatus for generating new person image from original person image

Publications (1)

Publication Number Publication Date
CN112241708A true CN112241708A (en) 2021-01-19

Family

ID=74169181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011120139.1A Pending CN112241708A (en) 2020-10-19 2020-10-19 Method and apparatus for generating new person image from original person image

Country Status (1)

Country Link
CN (1) CN112241708A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991484A (en) * 2021-04-28 2021-06-18 中国科学院计算技术研究所数字经济产业研究院 Intelligent face editing method and device, storage medium and equipment
CN113919998A (en) * 2021-10-14 2022-01-11 天翼数字生活科技有限公司 Image anonymization method based on semantic and attitude map guidance
WO2023060918A1 (en) * 2021-10-14 2023-04-20 天翼数字生活科技有限公司 Image anonymization method based on guidance of semantic and pose graphs
CN113919998B (en) * 2021-10-14 2024-05-14 天翼数字生活科技有限公司 Picture anonymizing method based on semantic and gesture graph guidance

Similar Documents

Publication Publication Date Title
Wang et al. A state-of-the-art review on image synthesis with generative adversarial networks
CN111340122B (en) Multi-modal feature fusion text-guided image restoration method
Din et al. A novel GAN-based network for unmasking of masked face
CN110222668B (en) Multi-pose facial expression recognition method based on generation countermeasure network
CN110796593A (en) Image processing method, device, medium and electronic equipment based on artificial intelligence
CN112241708A (en) Method and apparatus for generating new person image from original person image
CN114724214B (en) Micro-expression editing method and system based on facial action unit
KR102373606B1 (en) Electronic apparatus and method for image formation, and program stored in computer readable medium performing the same
CN111612687B (en) Automatic makeup method for face image
CN114187165A (en) Image processing method and device
CN114863533A (en) Digital human generation method and device and storage medium
CN114663274A (en) Portrait image hair removing method and device based on GAN network
Tan et al. Style2talker: High-resolution talking head generation with emotion style and art style
CN117333604A (en) Character face replay method based on semantic perception nerve radiation field
Choi et al. Improving diffusion models for virtual try-on
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving
CN112990123B (en) Image processing method, apparatus, computer device and medium
CN116385606A (en) Speech signal driven personalized three-dimensional face animation generation method and application thereof
He et al. Fa-gans: Facial attractiveness enhancement with generative adversarial networks on frontal faces
CN115035219A (en) Expression generation method and device and expression generation model training method and device
CN109657589B (en) Human interaction action-based experiencer action generation method
Jiang et al. Multi-modality deep network for jpeg artifacts reduction
Xia et al. 3D information guided motion transfer via sequential image based human model refinement and face-attention GAN
Gao et al. Complex manga coloring method based on improved Pix2Pix Model
Zheng et al. Attributes and semantic constrained GAN for face sketch-photo synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination