CN113326934A - Neural network training method, and method and device for generating images and videos - Google Patents

Neural network training method, and method and device for generating images and videos

Info

Publication number
CN113326934A
Authority
CN
China
Prior art keywords
image
sample
network
source
video
Prior art date
Legal status
Granted
Application number
CN202110602135.5A
Other languages
Chinese (zh)
Other versions
CN113326934B (en)
Inventor
鲁超 (Lu Chao)
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN202110602135.5A
Publication of CN113326934A
Application granted
Publication of CN113326934B
Legal status: Active
Anticipated expiration

Classifications

    • G06N 3/088: Computing arrangements based on biological models; neural networks; learning methods; non-supervised learning, e.g. competitive learning
    • G06T 5/77: Image enhancement or restoration; retouching; inpainting; scratch removal
    • G06T 2207/10004: Image acquisition modality; still image; photographic image
    • G06T 2207/10016: Image acquisition modality; video; image sequence
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/20088: Special algorithmic details; trinocular vision calculations; trifocal tensor

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a neural network training method, a method for generating images and videos, and a corresponding device, and relates to the technical field of image and video processing, in particular to the technical field of artificial intelligence. The scheme comprises the following steps: acquiring a sample source image and a sample reference image, wherein the sample source image comprises a sample source object and the sample reference image comprises a sample reference object; inputting the sample source image and the sample reference image into an image generation network, and obtaining a prediction generation image output by the image generation network, wherein the prediction generation image comprises the sample source object, and the posture of the sample source object in the prediction generation image is consistent with the posture of the sample reference object in the sample reference image; inputting the prediction generation image into an image restoration network, and obtaining, from the image restoration network, a predicted restoration image for the prediction generation image; determining a loss value based on the sample reference image and the predicted restoration image; and adjusting a parameter of the image restoration network based on the loss value.

Description

Neural network training method, and method and device for generating images and videos
Technical Field
The present disclosure relates to the field of image and video processing technologies, and in particular, to a method and an apparatus for training a neural network, a method and an apparatus for generating an image using a neural network, a method and an apparatus for generating a video using a neural network, an electronic device, a storage medium, and a computer program product.
Background
With the popularity of short-video applications (apps), more and more users use mobile terminals, such as mobile phones, to capture and share short videos. In some cases, when a user sees an interesting short video, the user may attempt an imitation shot, that is, shoot his or her own video while mimicking the postures and motions of a person in the video. However, for most users such imitation shooting is difficult, and it is often hard to reproduce the posture or motion in the original video.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a neural network, a method and apparatus for generating an image using a neural network, a method and apparatus for generating a video using a neural network, an electronic device, a storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a computer-implemented method of training a neural network, the neural network including an image generation network and an image restoration network, the method including: acquiring a sample source image and a sample reference image, wherein the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object; inputting the sample source image and the sample reference image into the image generation network, and obtaining a prediction generation image output by the image generation network, wherein the sample source object is included in the prediction generation image, and the posture of the sample source object in the prediction generation image is consistent with the posture of the sample reference object in the sample reference image; inputting the prediction generation image into the image restoration network, and obtaining, from the image restoration network, a predicted restoration image for the prediction generation image; determining a loss value based on the sample reference image and the predicted restoration image; and adjusting a parameter of the image restoration network based on the loss value.
According to another aspect of the present disclosure, there is also provided a method for generating an image by using a neural network, the neural network being obtained by training according to the training method and including an image generation network and an image restoration network, the method including: inputting a source image and a reference image into the image generation network, and obtaining a generated image output by the image generation network, wherein the source image comprises a source object, the reference image comprises a reference object, the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image; inputting the generated image into the image restoration network, and obtaining a restoration image output by the image restoration network and aiming at the generated image; and taking the restored image as a result image.
According to another aspect of the present disclosure, there is also provided a method for generating a video using a neural network, the neural network being obtained by training according to the training method and including an image generation network and an image restoration network, the method including: acquiring a source image and a reference video, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object; for each of the plurality of reference image frames, performing the following: inputting the source image and the reference image frame into the image generation network, and obtaining a generated image output by the image generation network, wherein the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image frame; inputting the generated image into the image restoration network, and obtaining a restoration image output by the image restoration network and aiming at the generated image; and splicing a plurality of restored images corresponding to the plurality of reference image frames to generate a result video.
According to another aspect of the present disclosure, there is also provided a training apparatus of a neural network, the neural network including an image generation network and an image restoration network, the apparatus including: a sample acquisition module configured to acquire a sample source image and a sample reference image, wherein the sample source image comprises a sample source object and the sample reference image comprises a sample reference object; a prediction generation module configured to input the sample source image and the sample reference image into the image generation network, and obtain a prediction generated image output by the image generation network, wherein the sample source object is included in the prediction generated image, and the posture of the sample source object in the prediction generated image is consistent with the posture of the sample reference object in the sample reference image; a predictive restoration module configured to input the prediction generated image into the image restoration network, to obtain a predictive restoration image for the prediction generated image output by the image restoration network; a loss calculation module configured to determine a loss value based on the sample reference image and the predictive restoration image; and a parameter adjustment module configured to adjust a parameter of the image restoration network based on the loss value.
According to another aspect of the present disclosure, there is also provided an apparatus for generating an image using a neural network, the neural network including an image generation network and an image restoration network, the apparatus including: an image generation module configured to input a source image and a reference image into the image generation network, and obtain a generated image output by the image generation network, wherein the source image includes a source object therein, the reference image includes a reference object therein, the generated image includes the source object therein, and a posture of the source object in the generated image is consistent with a posture of the reference object in the reference image; an image restoration module configured to input the generated image into the image restoration network, and obtain a restored image for the generated image output by the image restoration network; and an image output module configured to take the repair image as a result image.
According to another aspect of the present disclosure, there is also provided an apparatus for generating a video using a neural network, the neural network including an image generation network and an image restoration network, the apparatus including: an acquisition module configured to acquire a source image and a reference video, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object; a generating module comprising an image generating unit and an image restoration unit, wherein the image generating unit is configured to, for each of the plurality of reference image frames, perform the following: inputting the source image and the reference image frame into the image generation network, and obtaining a generated image output by the image generation network, wherein the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image frame; the image restoration unit is configured to input the generated image into the image restoration network, and obtain a restoration image for the generated image output by the image restoration network; and a video output module configured to splice the restoration images corresponding to the plurality of reference image frames to generate a result video.
According to another aspect of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements a method according to any of the above aspects.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a method according to any of the above aspects.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program realizes the method according to any of the above aspects when executed by a processor.
According to one or more embodiments of the present disclosure, a neural network includes an image generation network and an image inpainting network. In the training process of the neural network, a sample source image and a sample reference image are input into the image generation network, a prediction generated image output by the image generation network is obtained, the prediction generated image comprises a sample source object, and the posture of the included sample source object is consistent with the posture of a sample reference object in a sample reference image, so that the posture of the reference object can be migrated to the source object, and automatic and efficient posture migration is realized. The predicted and generated image is input into an image restoration network, and parameters of the image restoration network are adjusted according to the difference (namely loss value) between the predicted and restored image output by the image restoration network and a real sample reference image, so that the image restoration network can learn the image quality, the quality of the image output by the image restoration network is improved, and a clear and vivid posture migration result image (or result video) can be generated based on the neural network.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a method of training a neural network according to an embodiment of the present disclosure;
FIG. 3 shows a block diagram of a neural network according to an embodiment of the present disclosure;
FIG. 4 shows a flow diagram for constructing a sample video set according to an embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a method of generating an image using a neural network, in accordance with an embodiment of the present disclosure;
FIGS. 6A-6D are schematic diagrams illustrating a process of generating an image according to the method illustrated in FIG. 5;
FIG. 7 shows a flow diagram of a method of generating video using a neural network in accordance with an embodiment of the present disclosure;
FIG. 8 shows a block diagram of a training apparatus for a neural network according to an embodiment of the present disclosure;
FIG. 9 is a block diagram illustrating an apparatus for generating an image using a neural network according to an embodiment of the present disclosure;
FIG. 10 is a block diagram illustrating a structure of an apparatus for generating a video using a neural network according to an embodiment of the present disclosure; and
FIG. 11 shows a block diagram of an exemplary electronic device, according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the related art, when a user sees an interesting video, the user may attempt an imitation shot, that is, shoot his or her own video while mimicking the posture and motion of a person in the video. However, for most users such imitation shooting is difficult, and it is often hard to reproduce the posture or motion in the original video. Therefore, it is desirable to provide a posture migration scheme, so that when a user sees a video of interest, the user only needs to upload a photo of himself or herself, and the posture or motion in the specified video can be migrated to the user, so as to automatically generate an imitation video of the user without performing a complicated imitation shooting operation.
In order to realize efficient and high-quality posture migration, the disclosure provides a neural network and a training scheme thereof, and a scheme for generating images and videos by using the trained neural network. Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods described herein may be implemented, according to an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes a client device 110, a server 120, and a network 130 communicatively coupling the client device 110 and the server 120.
The client device 110 includes a display 114 and a client application (APP) 112 displayable via the display 114. The client application 112 may be an application program that needs to be downloaded and installed before running, a web application (web app) running in a browser, or a lightweight applet (lite app). In the case where the client application 112 is an application program that needs to be downloaded and installed before running, the client application 112 may be installed on the client device 110 in advance and activated. In the case where the client application 112 is a web application running in a browser, the user 102 can run the client application 112 directly by accessing a specific site in the browser, without installing it. In the case where the client application 112 is an applet, the user 102 can run the client application 112 directly on the client device 110, without installing it, by searching for the client application 112 in a host application (e.g., by its name) or by scanning a graphical code (e.g., a barcode or two-dimensional code) of the client application 112. In some embodiments, the client device 110 may be any type of mobile computing device, including a mobile computer, a mobile phone, a wearable computing device (e.g., a smart watch or a head-mounted device, including smart glasses), or another type of mobile device. In some embodiments, the client device 110 may alternatively be a stationary computing device, such as a desktop computer, a server computer, or another type of stationary computing device.
The server 120 is typically a server deployed by an Internet Service Provider (ISP) or Internet Content Provider (ICP). Server 120 may represent a single server, a cluster of multiple servers, a distributed system, or a cloud server providing an underlying cloud service (such as cloud database, cloud computing, cloud storage, cloud communications). It will be understood that although the server 120 is shown in fig. 1 as communicating with only one client device 110, the server 120 may provide background services for multiple client devices simultaneously.
Examples of network 130 include a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a combination of communication networks such as the Internet. The network 130 may be a wired or wireless network. In some embodiments, data exchanged over network 130 is processed using techniques and/or formats including hypertext markup language (HTML), extensible markup language (XML), and the like. In addition, all or some of the links may also be encrypted using encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), internet protocol security (IPsec), and so on. In some embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
For purposes of embodiments of the present disclosure, in the example of fig. 1, client application 112 may be an image or video application that may provide functionality to a user to generate gesture migration images and/or gesture migration videos. Accordingly, server 120 may be a server for use with the client application to provide services for generating pose migration images and/or pose migration videos to client application 112 running in client device 110.
Specifically, the server 120 may execute the neural network training method of the embodiment of the present disclosure to train the neural network. The user may upload or designate a source image (the source image may be, for example, a user's own photograph) and designate a reference image through the client application 112, and accordingly, the server 120 may execute the method for generating an image according to the embodiment of the present disclosure, process the source image and the reference image by using a trained neural network, and migrate the pose of the reference object in the reference image to the source object, so as to obtain a high-quality pose migration result image. The user may also upload or designate a source image (the source image may be, for example, a user's own photograph) and designate a reference video through the client application 112, and accordingly, the server 120 may execute the method for generating a video according to the embodiment of the present disclosure, process the source image and the reference video by using a trained neural network, migrate the pose of the reference object in each image frame of the reference video to the source object, and obtain a high-quality pose migration result video. Alternatively, the client device 110 may also perform a training method of a neural network to train the neural network, and the method of generating images and/or the method of generating videos of the embodiments of the present disclosure are performed by the client application 112 running in the client device 110 using the trained neural network to generate high-quality pose migration images and/or pose migration videos.
Fig. 2 shows a flow diagram of a method 200 of training a neural network according to an embodiment of the present disclosure. The method 200 may be performed at a server (e.g., the server 120 shown in fig. 1), that is, the execution subject of the steps of the method 200 may be the server 120 shown in fig. 1. It is to be appreciated that method 200 may also be performed at a client device (e.g., client device 110 shown in fig. 1).
The neural network trained according to the method 200 may be used to generate pose migration images and/or pose migration videos. Fig. 3 illustrates a structure of a neural network 300 according to an embodiment of the present disclosure, and as illustrated in fig. 3, the neural network 300 includes an image generation network 310 and an image restoration network 320 which are cascaded.
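For illustration only, the cascaded structure of fig. 3 can be sketched as two sub-networks applied in sequence. The sketch below assumes PyTorch; the two sub-network modules are placeholders for whatever concrete architectures are chosen (e.g., the GAN-based networks described later), not a definitive implementation.

```python
import torch
import torch.nn as nn

class PoseTransferNetwork(nn.Module):
    """Cascade of an image generation network (310) and an image restoration network (320)."""

    def __init__(self, generation_net: nn.Module, restoration_net: nn.Module):
        super().__init__()
        self.generation_net = generation_net    # placeholder, e.g. a Liquid Warping GAN generator
        self.restoration_net = restoration_net  # placeholder, e.g. a pix2pixHD-style generator

    def forward(self, source_image: torch.Tensor, reference_image: torch.Tensor):
        # The generation network migrates the posture of the reference object onto the source object.
        generated = self.generation_net(source_image, reference_image)
        # The restoration network repairs the quality of the generated image.
        restored = self.restoration_net(generated)
        return generated, restored
```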
Returning to fig. 2, the method 200 may include:
step S210, obtaining a sample source image and a sample reference image, wherein the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object;
step S220, inputting a sample source image and a sample reference image into an image generation network, and obtaining a prediction generation image output by the image generation network, wherein the prediction generation image comprises a sample source object, and the posture of the sample source object in the prediction generation image is consistent with the posture of a sample reference object in a sample reference image;
step S230, inputting the predicted generated image into an image restoration network, and obtaining a predicted restoration image output by the image restoration network and aiming at the predicted generated image;
step S240, determining a loss value based on the sample reference image and the predicted restoration image; and
step S250, adjusting parameters of the image restoration network based on the loss value.
According to an embodiment of the present disclosure, a neural network includes an image generation network and an image inpainting network. In the training process of the neural network, a sample source image and a sample reference image are input into the image generation network, a prediction generated image output by the image generation network is obtained, the prediction generated image comprises a sample source object, and the posture of the included sample source object is consistent with the posture of a sample reference object in a sample reference image, so that the posture of the reference object can be migrated to the source object, and automatic and efficient posture migration is realized. And inputting the predicted and generated image into an image restoration network, and adjusting parameters of the image restoration network according to the difference between the predicted and restored image output by the image restoration network and a real sample reference image, so that the image restoration network can learn the image quality, the quality of the image output by the image restoration network is improved, and a clear and vivid posture migration result image (or result video) can be generated based on the neural network.
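As a concrete illustration of steps S210-S250, a single training iteration might look like the sketch below. This assumes PyTorch, an L2 reconstruction loss, and a pre-trained, frozen image generation network; none of these choices is mandated by the method, which only requires that the loss be computed between the sample reference image and the predicted restoration image and that the image restoration network's parameters be adjusted.

```python
import torch
import torch.nn.functional as F

def train_step(generation_net, restoration_net, optimizer, sample_source, sample_reference):
    """One training iteration in which only the image restoration network is updated."""
    # S220: obtain the prediction generated image; the generation network is assumed frozen here.
    with torch.no_grad():
        predicted_generated = generation_net(sample_source, sample_reference)
    # S230: obtain the predicted restoration image for the prediction generated image.
    predicted_restored = restoration_net(predicted_generated)
    # S240: the sample reference image serves as ground truth for the predicted restoration image.
    loss = F.mse_loss(predicted_restored, sample_reference)
    # S250: adjust the parameters of the image restoration network.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```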
The various steps of method 200 are described in detail below.
In step S210, a sample source image and a sample reference image are obtained, the sample source image includes a sample source object, and the sample reference image includes a sample reference object.
The sample source object and the sample reference object may each be any object capable of assuming a posture, including but not limited to a real person, a cartoon animal, an anthropomorphic object, and the like. For the purpose of posture migration, the sample source object in the sample source image typically has a different posture from the sample reference object in the sample reference image. A posture may include, for example, a static action of the object.
According to some embodiments, the sample source object and the sample reference object may be the same object or different objects. Preferably, the sample source object and the sample reference object may be the same object, so that in subsequent steps S240 and S250, when the loss value is determined from the sample reference image and the predicted repaired image and the parameters of the image repair network are adjusted based on the loss value, the sample reference image can better serve as the true value (ground truth) of the predicted repaired image, further improving the repair effect. The following example explains why using the same object as both the sample source object and the sample reference object improves the repair effect of the image repair network.
For example, the sample source image a1 includes a sample source object o1, and the pose of the sample source object o1 is p1, so the object and pose in the sample source image a1 can be represented as a two-dimensional vector (o1, p1). Similarly, the sample reference image a2 includes a sample reference object o2 whose pose is p2, so the object and pose in the sample reference image a2 can be represented as (o2, p2). The image generation network processes the sample source image a1 and the sample reference image a2 to generate a prediction generated image a3; the prediction generated image a3 includes the sample source object o1, and the sample source object o1 has the pose p2 of the sample reference object o2, so the object and pose in the prediction generated image a3 can be represented as (o1, p2). The image repair network processes the prediction generated image a3 to generate a predicted repaired image a4 for the prediction generated image a3; the predicted repaired image a4 includes the same object and pose as the prediction generated image a3, i.e., the sample source object o1 and the pose p2. The image repair network repairs the quality of the prediction generated image a3 such that the quality (e.g., color, sharpness, smoothness of edges, etc.) of the predicted repaired image a4 is improved relative to the prediction generated image a3. The object and pose in the predicted repaired image a4 may thus be represented as (o1, p2).
In order for the image repair network to learn image quality, a loss value is determined from the predicted repaired image a4 and its true value, and the parameters of the image repair network are adjusted according to the loss value, so that the difference between the predicted repaired image a4 output by the image repair network and the true value is as small as possible. The true value of the predicted repaired image a4 should be a real image that includes the same object and pose. Since the prediction generated image a3 is generated by the image generation network, it is not a real image, i.e., it cannot serve as the true value corresponding to the predicted repaired image a4. Only the sample source image a1 or the sample reference image a2 can serve as this true value. For the purpose of pose migration, the pose p1 in the sample source image a1 is different from the pose p2 in the sample reference image a2, and the pose in the predicted repaired image a4 is p2, so the true value corresponding to the predicted repaired image a4 should be the sample reference image a2. Further, by making the sample reference image a2 include a sample reference object o2 identical to the sample source object o1 included in the predicted repaired image a4, that is, by making the sample source object and the sample reference object the same object, it can be ensured that the sample reference image a2 is the true value of the predicted repaired image a4.
In the above embodiment, the sample source object and the sample reference object are the same object. It is understood that in other embodiments, the sample source object and the sample reference object may be different objects. In the case that the sample source object and the sample reference object are different objects, the sample reference image may be subjected to certain processing (for example, to adapt to size scaling of the different objects), and the processed sample reference image is used as a true value of the predicted repaired image.
As described above, to facilitate training of the image inpainting network (i.e., adjusting parameters of the image inpainting network), according to some embodiments, the sample source object and the sample reference object may be the same object, in which case the sample source image and the sample reference image may be different image frames in a video for the same object. Illustratively, the sample source image and the sample reference image may be obtained by extracting (either randomly or according to a certain rule) image frames of a single video for the same object. By decimating image frames of a single video multiple times, multiple (sample source image, sample reference image) image pairs can be obtained and the neural network trained based on these image pairs.
For example, two image frames frame1 and frame2 may be randomly extracted from the dance video B of dancer A, as the sample source image and the sample reference image respectively, resulting in a first image pair. Two further image frames frame3 and frame4 are randomly extracted from the dance video B, as the sample source image and the sample reference image respectively, resulting in a second image pair. These two image pairs can be used to train the neural network. In each pair, the sample source object in the sample source image and the sample reference object in the sample reference image are both dancer A.
According to further embodiments, a sample video set may be constructed, the sample video set including a plurality of sample videos, each sample video corresponding to an object. The sample source image and the sample reference image may be different image frames in a sample video for the same object in a sample video set. Illustratively, the sample source image and the sample reference image may be obtained by extracting a plurality of sample videos in the sample video set (which may be randomly extracted or extracted according to a certain rule), and further extracting image frames of the extracted sample videos (which may be randomly extracted or extracted according to a certain rule). By decimating the sample video of the sample video set and the image frames of the sample video a plurality of (sample source image, sample reference image) image pairs are obtained and the neural network is trained based on these image pairs.
For example, the sample video set includes three sample videos: a dance video B1 of dancer A1, a fighting video B2 of cartoon character A2, and a walk show video B3 of model A3. The walk show video B3 is extracted from the sample video set, and two image frames frame1 and frame2 are extracted from it as the sample source image and the sample reference image respectively, giving a first image pair in which the sample source object and the sample reference object are both model A3. The fighting video B2 is extracted from the sample video set, and two image frames frame3 and frame4 are extracted from it as the sample source image and the sample reference image respectively, giving a second image pair in which the sample source object and the sample reference object are both cartoon character A2. The fighting video B2 is extracted again, and two image frames frame5 and frame6 are extracted from it as the sample source image and the sample reference image respectively, giving a third image pair in which the sample source object and the sample reference object are both cartoon character A2. These three image pairs may be used to train the neural network.
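The pair-sampling procedure described above could be sketched as follows; the frame-decoding helper read_video_frames is hypothetical and stands in for any video decoding utility.

```python
import random

def sample_image_pair(sample_videos):
    """Draw one (sample source image, sample reference image) pair from the sample video set.

    Both frames come from the same sample video, so the sample source object and the
    sample reference object are the same object.
    """
    video = random.choice(sample_videos)                      # pick one sample video at random
    frames = read_video_frames(video)                         # hypothetical decoding helper
    src_idx, ref_idx = random.sample(range(len(frames)), 2)   # two distinct frame indices
    return frames[src_idx], frames[ref_idx]
```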
According to the embodiment, the sample video set can include a large number (for example, tens of thousands) of sample videos with different objects (for example, real characters, cartoon animals and the like) and different postures (for example, dancing, sports, walking and showing, fighting and the like), and the training samples (namely, the sample source images and the sample reference images) are constructed based on the sample video set, so that the richness of the training samples can be improved, the neural network generated by training has good universality, and a good posture migration effect can be achieved for various objects and postures.
According to some embodiments, as shown in fig. 4, a sample video set may be constructed according to steps S410-S430.
In step S410, a plurality of original videos are acquired. The original video may be an unprocessed video uploaded by the user, and may be, for example, a dance video, a sports video, and the like uploaded by the user.
In step S420, for each original video, object detection is performed on each image frame in the original video. According to some embodiments, object detection may be performed on an image frame by a preset object detection model. That is, the image frame is input into the preset object detection model, and the object detection model outputs whether or not an object is included in the image frame; in the case where an object is included, the position of the object may be further output. Specifically, the object detection model may be, for example, a Fast R-CNN, YOLO, or Cascade R-CNN model.
In step S430, image frames in the original video, which do not include the object, are removed to obtain a sample video.
For example, suppose the original video A includes 100 image frames frame1-frame100, and the object detection in step S420 determines that no object is included in frames frame20-frame39; then frames frame20-frame39 are removed from the original video A. The remaining image frames frame1-frame19 and frame40-frame100 make up the sample video.
By removing image frames that do not include an object, the object may be included in each image frame of the sample video. Therefore, the sample source image and the sample reference image for training the neural network can be obtained by extracting the image frame of any sample video, so that the acquisition efficiency of the training sample is improved, and the training efficiency of the neural network is improved.
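A sketch of steps S420 and S430 is given below. The detect_objects callable is an assumed wrapper around whatever object detection model is used; it is presumed to return a (possibly empty) list of detections for a frame.

```python
def build_sample_video(original_frames, detect_objects):
    """Keep only the image frames of an original video that include an object."""
    sample_video = []
    for frame in original_frames:
        detections = detect_objects(frame)   # S420: object detection on each image frame
        if len(detections) > 0:              # S430: remove frames that include no object
            sample_video.append(frame)
    return sample_video
```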
It is to be understood that the sample video set is not limited to be constructed by the above method, and for example, the sample video may be directly obtained by recording different objects respectively.
In the above embodiments, the sample source image and the sample reference image may be different image frames in a video for the same object. It should be noted that the sample source image and the sample reference image may alternatively be different images obtained by photographing the same object.
According to some embodiments, as shown in fig. 4, the method for constructing a sample video set may further include step S440. In step S440, the duration of the sample video is adjusted to a preset duration.
According to some embodiments, in a case where the frame rate (i.e., the number of image frames included per second) of each sample video is the same, each sample video may be respectively decimated so that each sample video includes the same number of image frames, thereby adjusting each sample video to the same duration, i.e., the preset duration. According to some embodiments, the preset duration may be set to a small value, such as 15-30 seconds.
According to the embodiment, the time length of each sample video is adjusted to the preset time length, so that each sample video has the same time length. The image frames are extracted from the sample video with the same time length and are used as training samples (namely sample source images and sample reference images), so that the training samples can be uniformly distributed in different objects and different postures, the richness of the training samples is improved, the neural network generated by training has good universality, and good posture migration effect can be achieved for various objects and postures.
According to some embodiments, as shown in fig. 4, the method for constructing a sample video set may further include step S450. In step S450, the size of the image frames of the sample video is adjusted to a preset size. The predetermined size may be 512 × 512, 256 × 256, etc. By adjusting the size of the image frame of each sample video to a preset size, the normalization of the image frame size can be realized, so that training samples (i.e. sample source images and sample reference images) extracted from the sample video have the same size, and the size adjustment of the training samples is not needed, thereby improving the training efficiency of the neural network.
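Steps S440 and S450 could be realized as in the sketch below, which assumes all sample videos share a common frame rate, uniformly decimates frames to a target count, and resizes each frame with OpenCV; the target count and the 512 x 512 preset size are only example values.

```python
import cv2
import numpy as np

def normalize_sample_video(frames, target_frame_count, preset_size=(512, 512)):
    """Adjust a sample video to a preset duration (S440) and a preset frame size (S450)."""
    # S440: uniform decimation so that every sample video keeps the same number of frames,
    # which at a common frame rate corresponds to the same (preset) duration.
    indices = np.linspace(0, len(frames) - 1, num=target_frame_count).astype(int)
    decimated = [frames[i] for i in indices]
    # S450: resize every frame to the preset size, e.g. 512 x 512 or 256 x 256.
    return [cv2.resize(frame, preset_size) for frame in decimated]
```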
After the sample source image and the sample reference image are obtained in step S210, step S220 may be executed to input the sample source image and the sample reference image into the image generation network, and obtain a prediction generation image output by the image generation network. And the sample source object in the prediction generation image is consistent with the sample reference object in the sample reference image in posture.
The image generation network may, for example, have a generative adversarial network (GAN) structure. A generative adversarial network is an unsupervised learning model that includes a generator and a discriminator, which learn by competing against each other to produce the desired output. More specifically, in some embodiments, the image generation network may be a Liquid Warping GAN. In the case where the image generation network adopts a GAN structure, the sample source image and the sample reference image may be input to the generator of the GAN, which outputs the prediction generated image.
The image generation network may be trained in advance before step S220 is executed, so that training efficiency of a subsequent image inpainting network can be improved. The image generation network may be trained, for example, using a sample generation source image and a sample generation reference image, where the sample generation source image and the sample generation reference image are different image frames in a video for the same object.
According to some embodiments, similar to the obtaining of the sample source image and the sample reference image in step S210, the sample generation source image and the sample generation reference image used for training the image generation network may also be different image frames in the sample video for the same object in the aforementioned sample video set. For example, the sample generation source image and the sample generation reference image may be obtained by decimating a plurality of sample videos in the sample video set and further decimating image frames of the decimated sample videos. By decimating the sample video of the sample video set and the image frames of the sample video a plurality of (sample-generating source images, sample-generating reference images) image pairs are obtained and the image-generating network is trained on the basis of these image pairs.
In some embodiments, the sample generation source image and the sample generation reference image used for training the image generation network may be the same as the sample source image and the sample reference image in step S210 (specifically, the sample generation source image and the sample source image are the same, and the sample generation reference image and the sample reference image are the same), or may be different from the sample source image and the sample reference image in step S210 (specifically, the sample generation source image and the sample source image are different, and the sample generation reference image and the sample reference image are different).
It should be noted that, in the case that the sample generation source image and the sample generation reference image used for training the image generation network are the same as the sample source image and the sample reference image in step S210, the image generation network may also be trained simultaneously with the image inpainting network.
After the predicted generated image is obtained through step S220, step S230 may be performed to input the predicted generated image into the image inpainting network, and obtain a predicted inpainting image for the predicted generated image output by the image inpainting network.
According to some embodiments, the image inpainting network may include a generative adversarial network structure. More specifically, in some embodiments, the image repair network may be, for example, a pix2pixHD network, which includes a GAN structure. After the prediction generated image is input into the image repair network, the image repair network processes it and outputs the predicted restoration image. The image quality (e.g., color, sharpness, edge smoothness, etc.) of the predicted restoration image is generally superior to that of the prediction generated image.
After obtaining a predicted repair image for the prediction generated image output by the image repair network through step S230, step S240 may be performed to determine a loss value based on the sample reference image and the predicted repair image.
In the embodiment of the disclosure, the sample reference image is used as the true value corresponding to the predicted repaired image, and the loss value is determined based on the sample reference image and the predicted repaired image, and the loss value can measure the difference between the predicted repaired image and the true value thereof. In the subsequent step S250, the parameters of the image restoration network are adjusted to reduce the difference (i.e., reduce the loss value) continuously, so that the image restoration network can learn the image quality, thereby improving the quality of the predicted restoration image output by the image restoration network.
Specifically, the loss value may be calculated in various ways. According to some embodiments, the Mean Square Error (MSE) of the sample reference image and the predicted fix image may be used as the loss value. According to other embodiments, the feature vectors of the sample reference image and the prediction restoration image may be extracted separately, and the distance between the feature vectors of the sample reference image and the prediction restoration image may be used as the loss value. It should be understood that the way in which the loss value is calculated is not limited to the two embodiments listed above.
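The two loss variants mentioned above can be sketched as follows (PyTorch assumed; feature_extractor is a hypothetical fixed feature network, named only to illustrate the second option).

```python
import torch
import torch.nn.functional as F

def mse_loss(sample_reference, predicted_restored):
    # Option 1: mean square error between the sample reference image and the predicted restoration image.
    return F.mse_loss(predicted_restored, sample_reference)

def feature_distance_loss(sample_reference, predicted_restored, feature_extractor):
    # Option 2: distance between feature vectors extracted from the two images.
    ref_feat = feature_extractor(sample_reference)
    res_feat = feature_extractor(predicted_restored)
    return torch.norm(ref_feat - res_feat, p=2, dim=-1).mean()
```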
According to some embodiments, at least one of the image generation network and the image inpainting network includes a generative adversarial network. For example, the image generation network and the image repair network may each include a generative adversarial network; the image generation network may be, for example, a Liquid Warping GAN, and the image repair network may be, for example, a pix2pixHD network including a GAN structure.
Based on the method 200, a neural network may be trained. The neural network can be used to generate posture migration images or posture migration videos, realizing efficient, high-quality posture migration.
Fig. 5 illustrates a flow diagram of a method 500 of generating an image using a neural network trained by the method 200, according to an embodiment of the present disclosure. The method 500 may be performed at a server (e.g., the server 120 shown in fig. 1), that is, the execution subject of the steps of the method 500 may be the server 120 shown in fig. 1. It is to be appreciated that method 500 may also be performed at a client device (e.g., client device 110 shown in fig. 1).
As shown in fig. 5, method 500 may include:
step S510, inputting a source image and a reference image into an image generation network, and obtaining a generated image output by the image generation network, wherein the source image comprises a source object, the reference image comprises a reference object, the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image;
step S520, inputting the generated image into an image restoration network to obtain a restoration image aiming at the generated image and output by the image restoration network; and
step S530 is to take the restored image as a result image.
According to an embodiment of the present disclosure, a neural network includes an image generation network and an image inpainting network. The image generation network can process a source image comprising a source object and a reference image comprising a reference object, and migrate the posture of the reference object to the source object to obtain a generated image, so that automatic and efficient posture migration is realized. The image restoration network can carry out quality restoration on the generated image to obtain a restored image, and the restored image is used as a result image of attitude migration, so that the image quality of the result image is improved, the result image is clearer and more vivid, and high-quality attitude migration is realized.
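A minimal sketch of steps S510-S530, assuming PyTorch and trained sub-networks as above:

```python
import torch

@torch.no_grad()
def generate_result_image(generation_net, restoration_net, source_image, reference_image):
    """Pose migration inference following method 500."""
    generated = generation_net(source_image, reference_image)  # S510: migrate the reference pose to the source object
    restored = restoration_net(generated)                      # S520: repair the generated image
    return restored                                            # S530: the restored image is the result image
```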
According to some embodiments, the source image in step S510 described above may be an image uploaded or specified by a user through a client application in a client device (e.g., the client device 110 shown in fig. 1); the source image may be, for example, a photograph of the user. The reference image may be an image, specified by the user, whose object pose the user wants to mimic. In this case, based on the method 500, the pose of the object in the image that the user wants to mimic can be migrated to the user in the source image, generating a high-quality pose migration result image.
Fig. 6A-6D are schematic diagrams illustrating a process of generating an image according to the method 500 illustrated in fig. 5. Specifically, fig. 6A-6D are a source image, a reference image, a generated image, and a restored image (i.e., a resultant image), respectively. As shown in fig. 6A, a source object 610 is included in the source image, the source object 610 having a walking pose. As shown in fig. 6B, the reference image includes a reference object 620, and the reference object 620 has a basketball playing posture. Inputting the source image shown in fig. 6A and the reference image shown in fig. 6B into the image generation network, the generated image shown in fig. 6C can be obtained, the generated image includes the source object 610, and the posture of the source object 610 is consistent with the posture of the reference object 620 in fig. 6B, that is, the posture of the reference object 620 is migrated to the source object 610, so that the posture migration is realized. However, as shown in fig. 6C, the generated image is not realistic enough, blurry at the edge of the source object 610, and the image quality is not high. The generated image shown in fig. 6C is input to an image restoration network, and a restored image as shown in fig. 6D, that is, a resultant image, can be obtained. As can be seen by comparing fig. 6C and 6D, the resulting image shown in fig. 6D is more realistic, the edge is clearer, the image quality is higher than the generated image shown in fig. 6C, and high-quality pose migration is realized.
Fig. 7 shows a flow diagram of a method 700 of generating a video using a neural network trained by the method 200, according to an embodiment of the present disclosure. The method 700 may be performed at a server (e.g., the server 120 shown in fig. 1), that is, the execution subject of the steps of the method 700 may be the server 120 shown in fig. 1. It is to be appreciated that method 700 may also be performed at a client device (e.g., client device 110 shown in fig. 1).
As shown in fig. 7, method 700 may include:
step S710, a source image and a reference video are obtained, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object;
step S720, for each of a plurality of reference image frames, performing the following operations: inputting a source image and the reference image frame into an image generation network, and obtaining a generated image output by the image generation network, wherein the generated image comprises a source object, and the posture of the source object in the generated image is consistent with that of a reference object in the reference image frame; inputting the generated image into an image restoration network, and obtaining a restoration image aiming at the generated image and output by the image restoration network; and
step S730, splicing a plurality of repaired images corresponding to the plurality of reference image frames to generate a result video.
According to an embodiment of the present disclosure, the neural network includes an image generation network and an image inpainting network. The image generation network can process a source image including a source object and a reference image frame including a reference object, migrating the pose of the reference object to the source object to obtain a generated image corresponding to each reference image frame, thereby achieving automatic and efficient pose migration. The image restoration network can restore the quality of each generated image to obtain a repaired image corresponding to each reference image frame, and the repaired images corresponding to the reference image frames are spliced into a result video, so that the quality of the result video is improved and the result video is clearer and more realistic, achieving high-quality pose migration.
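By way of illustration only, the per-frame processing of steps S710-S730 may be sketched as follows, reusing the placeholder networks from the sketch above. The function name, the tensor shapes, and the simple stacking used as a stand-in for video splicing are assumptions made purely for illustration.

import torch

def generate_result_video(source_image, reference_frames, generation_net, inpainting_net):
    # source_image: (1, 3, H, W) tensor; reference_frames: iterable of (1, 3, H, W) tensors.
    restored_frames = []
    with torch.no_grad():
        for reference_frame in reference_frames:
            # Step S720: pose migration followed by quality restoration for each frame.
            generated = generation_net(source_image, reference_frame)
            restored = inpainting_net(generated)
            restored_frames.append(restored)
    # Step S730: splice the restored frames; here simply stacked along a time axis.
    return torch.stack(restored_frames, dim=1)  # shape (1, T, 3, H, W)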
According to some embodiments, the source image in step S710 described above may be an image uploaded or specified by a user through a client application in a client device (e.g., client device 110 shown in fig. 1), and may be, for example, the user's own photograph. The reference video may be a video, specified by the user, whose object pose the user wants to mimic. In this case, based on the method 700, the object pose in the specified video can be migrated to the user in the source image, generating a high-quality pose migration result video.
According to another aspect of the present disclosure, a training apparatus of a neural network is also provided. Fig. 8 shows a block diagram of a training apparatus 800 of a neural network according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 may include a sample acquisition module 810, a prediction generation module 820, a prediction restoration module 830, a loss calculation module 840, and a parameter adjustment module 850.
The sample acquisition module 810 can be configured to acquire a sample source image including a sample source object and a sample reference image including a sample reference object.
The prediction generation module 820 may be configured to input the sample source image and the sample reference image into an image generation network, obtain a prediction generated image output by the image generation network, wherein the sample source object is included in the prediction generated image, and a pose of the sample source object in the prediction generated image coincides with a pose of the sample reference object in the sample reference image.
The predictive restoration module 830 may be configured to input the predictive-generated image into an image restoration network, obtaining a predictive restoration image for the predictive-generated image output by the image restoration network.
The loss calculation module 840 may be configured to determine a loss value based on the sample reference image and the predictive restoration image.
The parameter adjustment module 850 may be configured to adjust a parameter of the image inpainting network based on the loss value.
According to an embodiment of the present disclosure, the neural network includes an image generation network and an image inpainting network. During training, a sample source image and a sample reference image are input into the image generation network to obtain a predicted generated image output by the image generation network; the predicted generated image includes the sample source object, whose pose is consistent with the pose of the sample reference object in the sample reference image, so that the pose of the reference object is migrated to the source object, achieving automatic and efficient pose migration. The predicted generated image is then input into the image restoration network, and the parameters of the image restoration network are adjusted according to the difference (i.e., the loss value) between the predictive restoration image output by the image restoration network and the real sample reference image, so that the image restoration network learns to improve image quality. The quality of the images output by the image restoration network is thereby improved, and a clear and realistic pose migration result image (or result video) can be generated based on the neural network.
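By way of illustration only, a single training step of the kind performed by the modules 810-850 may be sketched as follows. The choice of an L1 loss and of the Adam optimizer are assumptions made purely for illustration; the disclosure only requires that the loss value reflect the difference between the predictive restoration image and the sample reference image and that the parameters of the image restoration network be adjusted accordingly.

import torch
import torch.nn as nn

def train_step(sample_source, sample_reference, generation_net, inpainting_net, optimizer):
    # Prediction generation module 820: migrate the pose of the sample reference
    # object onto the sample source object (the generation network is not updated here).
    with torch.no_grad():
        predicted_generated = generation_net(sample_source, sample_reference)

    # Prediction restoration module 830: restore the quality of the predicted generated image.
    predicted_restored = inpainting_net(predicted_generated)

    # Loss calculation module 840: difference between the predictive restoration image
    # and the real sample reference image (L1 loss chosen only as an example).
    loss = nn.functional.l1_loss(predicted_restored, sample_reference)

    # Parameter adjustment module 850: adjust only the image restoration network.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example optimizer over the inpainting network's parameters (assumed settings):
# optimizer = torch.optim.Adam(inpainting_net.parameters(), lr=1e-4)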
According to another aspect of the present disclosure, there is also provided an apparatus for generating an image using a neural network. Fig. 9 shows a block diagram of an apparatus 900 for generating an image using a neural network according to an embodiment of the present disclosure. As shown in fig. 9, the apparatus 900 may include an image generation module 910, an image inpainting module 920, and an image output module 930.
The image generation module 910 may be configured to input a source image and a reference image into an image generation network, obtain a generated image output by the image generation network, wherein the source image includes a source object therein, the reference image includes a reference object therein, the generated image includes the source object therein, and a pose of the source object in the generated image coincides with a pose of the reference object in the reference image.
The image inpainting module 920 may be configured to input the generated image into an image inpainting network, obtaining an inpainting image for the generated image output by the image inpainting network.
The image output module 930 may be configured to take the repair image as a result image.
According to an embodiment of the present disclosure, the neural network includes an image generation network and an image inpainting network. The image generation network can process a source image including a source object and a reference image including a reference object, migrating the pose of the reference object to the source object to obtain a generated image, thereby achieving automatic and efficient pose migration. The image restoration network can restore the quality of the generated image to obtain a repaired image, which is used as the result image of the pose migration; the image quality of the result image is thereby improved, making it clearer and more realistic, and high-quality pose migration is realized.
According to another aspect of the present disclosure, there is also provided an apparatus for generating a video using a neural network. Fig. 10 shows a block diagram of an apparatus 1000 for generating a video using a neural network according to an embodiment of the present disclosure. As shown in fig. 10, the apparatus 1000 may include an acquisition module 1010, a generation module 1020, and a video output module 1030.
The acquisition module 1010 may be configured to acquire a source image including a source object therein and a reference video including a plurality of reference image frames each including a reference object therein.
The generating module 1020 may include an image generating unit 1022 and an image inpainting unit 1024, wherein the image generating unit 1022 may be configured to, for each of the plurality of reference image frames, perform the following operations: inputting the source image and the reference image frame into an image generation network, and obtaining a generated image output by the image generation network, wherein the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image frame.
The image inpainting unit 1024 may be configured to input the generated image into an image inpainting network, and obtain an inpainting image for the generated image output by the image inpainting network.
The video output module 1030 may be configured to stitch a plurality of repaired images corresponding to the plurality of reference image frames to generate a resultant video.
According to an embodiment of the present disclosure, the neural network includes an image generation network and an image inpainting network. The image generation network can process a source image including a source object and a reference image frame including a reference object, migrating the pose of the reference object to the source object to obtain a generated image corresponding to each reference image frame, thereby achieving automatic and efficient pose migration. The image restoration network can restore the quality of each generated image to obtain a repaired image corresponding to each reference image frame, and the repaired images corresponding to the reference image frames are spliced into a result video, so that the quality of the result video is improved and the result video is clearer and more realistic, achieving high-quality pose migration.
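By way of illustration only, the splicing performed by the video output module 1030 may be realized with a standard video-writing library. The following minimal sketch uses OpenCV; the codec, frame rate, output path, and the assumption that frame values lie in [0, 1] are illustrative choices, not requirements of this disclosure.

import cv2
import numpy as np
import torch

def stitch_frames_to_video(restored_frames, output_path="result.mp4", fps=25):
    # restored_frames: list of (1, 3, H, W) tensors with values in [0, 1].
    height, width = restored_frames[0].shape[-2], restored_frames[0].shape[-1]
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    for frame in restored_frames:
        # Convert the CHW float tensor to the HWC uint8 BGR image expected by OpenCV.
        image = (frame.squeeze(0).permute(1, 2, 0).cpu().numpy() * 255).astype(np.uint8)
        writer.write(cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
    writer.release()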
It should be understood that the various modules of the apparatus 800 shown in fig. 8 may correspond to the various steps in the method 200 described with reference to fig. 2, the various modules of the apparatus 900 shown in fig. 9 may correspond to the various steps in the method 500 described with reference to fig. 5, and the various modules of the apparatus 1000 shown in fig. 10 may correspond to the various steps in the method 700 described with reference to fig. 7. Thus, the operations, features and advantages described above with respect to the methods 200, 500, 700 are equally applicable to the apparatus 800, 900, 1000 and the modules/units comprised thereby. Certain operations, features and advantages may not be described in detail herein for the sake of brevity.
Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein may be divided into multiple modules and/or at least some of the functionality of multiple modules may be combined into a single module. For example, the loss calculation module 840 and the parameter adjustment module 850 described above may be combined into a single module in some embodiments.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to figs. 8-10 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the sample acquisition module 810, the prediction generation module 820, the prediction restoration module 830, the loss calculation module 840, the parameter adjustment module 850, the image generation module 910, the image restoration module 920, the image output module 930, the acquisition module 1010, the generation module 1020 (including the image generation unit 1022 and the image restoration unit 1024), and the video output module 1030 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip comprising one or more components of a processor (e.g., a central processing unit (CPU), microcontroller, microprocessor, or digital signal processor (DSP)), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
According to another aspect of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements a method according to the above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to the above.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program realizes the method according to the above when executed by a processor.
Referring to fig. 11, a block diagram of an electronic device 1100, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices may be different types of computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the electronic device 1100 may include at least one processor 1101, a working memory 1102, an input unit 1104, a display unit 1105, a speaker 1106, a storage unit 1107, a communication unit 1108, and other output units 1109, which can communicate with each other through a system bus 1103.
Processor 1101 may be a single processing unit or multiple processing units, all of which may include single or multiple computing units or multiple cores. The processor 1101 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitry, and/or any devices that manipulate signals based on operational instructions. The processor 1101 may be configured to retrieve and execute computer readable instructions stored in the working memory 1102, the storage unit 1107, or other computer readable medium, such as program code for the operating system 1102a, program code for the application program 1102b, and the like.
Working memory 1102 and storage unit 1107 are examples of computer-readable storage media for storing instructions that are executed by processor 1101 to perform the various functions described above. The working memory 1102 may include both volatile and non-volatile memory (e.g., RAM, ROM, etc.). Further, storage unit 1107 may include a hard disk drive, solid state drive, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and so forth. Both working memory 1102 and storage unit 1107 may be collectively referred to herein as memory or a computer-readable storage medium, and may be a non-transitory medium capable of storing computer-readable, processor-executable program instructions as computer program code, which may be executed by processor 1101 as a particular machine configured to implement the operations and functions described in the examples herein.
The input unit 1104 may be any type of device capable of inputting information to the electronic device 1100; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output units may be any type of device capable of presenting information and may include, but are not limited to, the display unit 1105, the speaker 1106, and the other output units 1109, which may include, but are not limited to, video/audio output terminals, vibrators, and/or printers. The communication unit 1108 allows the electronic device 1100 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The application program 1102b in the working memory 1102 may be loaded to perform the various methods and processes described above, such as steps S210-S250 in fig. 2. For example, in some embodiments, the methods 200, 400, 500, 700 described above may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1107. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 1100 via the storage unit 1107 and/or the communication unit 1108. When the computer program is loaded and executed by the processor 1101, one or more steps of the methods 200, 400, 500, 700 described above may be performed. Alternatively, in other embodiments, the processor 1101 may be configured to perform the methods 200, 400, 500, 700 by any other suitable means (e.g., by way of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatuses are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced by equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Notably, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (16)

1. A computer-implemented method of training a neural network, wherein the neural network comprises an image generation network and an image inpainting network, the method comprising:
acquiring a sample source image and a sample reference image, wherein the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object;
inputting the sample source image and the sample reference image into the image generation network, and obtaining a prediction generation image output by the image generation network, wherein the sample source object is included in the prediction generation image, and the posture of the sample source object in the prediction generation image is consistent with the posture of the sample reference object in the sample reference image;
inputting the predicted generated image into the image restoration network, and obtaining a predicted restoration image output by the image restoration network and aiming at the predicted generated image;
determining a loss value based on the sample reference picture and the predictive restoration picture; and
adjusting a parameter of the image inpainting network based on the loss value.
2. The method of claim 1, wherein the sample source image and the sample reference image are different image frames in a video for the same object.
3. The method of claim 1, further comprising: constructing a sample video set, wherein the sample video set comprises a plurality of sample videos, and each sample video corresponds to an object;
wherein the sample source image and the sample reference image are different image frames in a sample video for the same object in the sample video set.
4. The method of claim 3, wherein said constructing a sample video set comprises:
acquiring a plurality of original videos;
for each of the plurality of original videos, performing the following operations:
carrying out object detection on each image frame in the original video; and
removing the image frames which do not comprise the object from the original video, to obtain a sample video.
5. The method of claim 3 or 4, wherein said constructing a sample video set further comprises: adjusting the time length of the sample video to a preset time length.
6. The method of any of claims 3-5, wherein the constructing a sample video set further comprises: adjusting the size of the image frames of the sample video to a preset size.
7. The method of any of claims 1-6, wherein at least one of the image generation network and the image inpainting network comprises generating an antagonistic network.
8. The method of any one of claims 1-7, wherein the image generation network is trained using sample generation source images and sample generation reference images, the sample generation source images and the sample generation reference images being different image frames in a video for the same object.
9. A method of generating an image using a neural network, wherein the neural network is obtained by training according to the training method of any one of claims 1 to 8, the neural network comprises an image generation network and an image inpainting network, the method comprises:
inputting a source image and a reference image into the image generation network, and obtaining a generated image output by the image generation network, wherein the source image comprises a source object, the reference image comprises a reference object, the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image;
inputting the generated image into the image restoration network, and obtaining a restoration image output by the image restoration network and aiming at the generated image; and
taking the repaired image as a result image.
10. A method of generating a video using a neural network, wherein the neural network is obtained by training according to the training method of any one of claims 1-8, the neural network comprises an image generation network and an image inpainting network, and the method comprises:
acquiring a source image and a reference video, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object;
for each of the plurality of reference image frames, performing the following:
inputting the source image and the reference image frame into the image generation network, and obtaining a generated image output by the image generation network, wherein the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image frame; and
inputting the generated image into the image restoration network, and obtaining a restoration image output by the image restoration network and aiming at the generated image; and
splicing a plurality of restored images corresponding to the plurality of reference image frames to generate a result video.
11. An apparatus for training a neural network, the neural network including an image generation network and an image inpainting network, the apparatus comprising:
the system comprises a sample acquisition module, a sample acquisition module and a sample reference module, wherein the sample acquisition module is configured to acquire a sample source image and a sample reference image, the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object;
a prediction generation module configured to input the sample source image and the sample reference image into the image generation network, and obtain a prediction generated image output by the image generation network, wherein the sample source object is included in the prediction generated image, and the posture of the sample source object in the prediction generated image is consistent with the posture of the sample reference object in the sample reference image;
a predictive restoration module configured to input the predictive-generated image into the image restoration network, to obtain a predictive restoration image for the predictive-generated image output by the image restoration network;
a loss calculation module configured to determine a loss value based on the sample reference image and the predictive restoration image; and
a parameter adjustment module configured to adjust a parameter of the image inpainting network based on the loss value.
12. An apparatus for generating an image using a neural network, wherein the neural network includes an image generation network and an image inpainting network, the apparatus comprising:
an image generation module configured to input a source image and a reference image into the image generation network, and obtain a generated image output by the image generation network, wherein the source image includes a source object therein, the reference image includes a reference object therein, the generated image includes the source object therein, and a posture of the source object in the generated image is consistent with a posture of the reference object in the reference image;
an image restoration module configured to input the generated image into the image restoration network, and obtain a restored image for the generated image output by the image restoration network; and
an image output module configured to take the repair image as a result image.
13. An apparatus for generating a video using a neural network, wherein the neural network comprises an image generation network and an image inpainting network, the apparatus comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is configured to acquire a source image and a reference video, the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object;
a generating module including an image generating unit and an image inpainting unit, wherein,
the image generation unit is configured to, for each of the plurality of reference image frames, perform the following operations: inputting the source image and the reference image frame into the image generation network, and obtaining a generated image output by the image generation network, wherein the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image frame;
the image restoration unit is configured to input the generated image into the image restoration network, and obtain a restoration image for the generated image output by the image restoration network; and
a video output module configured to splice a plurality of repaired images corresponding to the plurality of reference image frames to generate a result video.
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program which, when executed by the at least one processor, implements the method according to any one of claims 1-10.
15. A non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-10.
16. A computer program product comprising a computer program, wherein the computer program realizes the method according to any of claims 1-10 when executed by a processor.
CN202110602135.5A 2021-05-31 2021-05-31 Training method of neural network, method and device for generating images and videos Active CN113326934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110602135.5A CN113326934B (en) 2021-05-31 2021-05-31 Training method of neural network, method and device for generating images and videos

Publications (2)

Publication Number Publication Date
CN113326934A true CN113326934A (en) 2021-08-31
CN113326934B CN113326934B (en) 2024-03-29

Family

ID=77422745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110602135.5A Active CN113326934B (en) 2021-05-31 2021-05-31 Training method of neural network, method and device for generating images and videos

Country Status (1)

Country Link
CN (1) CN113326934B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019227479A1 (en) * 2018-06-01 2019-12-05 华为技术有限公司 Method and apparatus for generating face rotation image
CN111783658A (en) * 2020-07-01 2020-10-16 河北工业大学 Two-stage expression animation generation method based on double generation countermeasure network
CN112232220A (en) * 2020-10-19 2021-01-15 戴姆勒股份公司 Method for generating human image, training method and device and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIQIAN MA ET AL.: "Pose Guided Person Image Generation" *
LIQIAN MA ET AL.: "Pose Guided Person Image Generation", 《ARXIV:1705.09368V6》 *
WEN LIU ET AL.: "Liquid Warping GAN: A Unified Framework for Human Motion Imitation , Appearance Transfer and Novel View Synthesis" *
黄友文 等: "融合反馈机制的姿态引导任务图像生成", 《激光与光电子学进展》 *

Also Published As

Publication number Publication date
CN113326934B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
US11790209B2 (en) Recurrent neural networks for data item generation
US10839578B2 (en) Artificial-intelligence enhanced visualization of non-invasive, minimally-invasive and surgical aesthetic medical procedures
US10534998B2 (en) Video deblurring using neural networks
CN111476871B (en) Method and device for generating video
WO2019242222A1 (en) Method and device for use in generating information
CN109997168B (en) Method and system for generating output image
KR20190082270A (en) Understanding and creating scenes using neural networks
US20180285696A1 (en) System and method for facilitating logo-recognition training of a recognition model
US8763054B1 (en) Cross-platform video display
US11951622B2 (en) Domain adaptation using simulation to simulation transfer
CN110070076B (en) Method and device for selecting training samples
CN112527115A (en) User image generation method, related device and computer program product
WO2020251523A1 (en) Modifying sensor data using generative adversarial models
CN111199540A (en) Image quality evaluation method, image quality evaluation device, electronic device, and storage medium
CN115578515A (en) Training method of three-dimensional reconstruction model, and three-dimensional scene rendering method and device
CN113223121A (en) Video generation method and device, electronic equipment and storage medium
CN113326934B (en) Training method of neural network, method and device for generating images and videos
CN114299598A (en) Method for determining fixation position and related device
CN116246014B (en) Image generation method and device, storage medium and electronic equipment
US20240053877A1 (en) Interactive Graphical User Interfaces for Deployment and Application of Neural Network Models using Cross-Device Node-Graph Pipelines
KR102683409B1 (en) Method, program, and apparatus for building an exam environment
US20240238967A1 (en) Domain Adaptation Using Simulation to Simulation Transfer
CN114022805A (en) Video data identification method, device, equipment and storage medium
CN115375570A (en) Method and related equipment for completing shielding information
KR20240018150A (en) Method and apparatus for removing image noise

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant