CN113326934B - Training method of neural network, method and device for generating images and videos

Info

Publication number: CN113326934B
Authority: CN (China)
Prior art keywords: image, sample, network, source, generation
Legal status: Active
Application number: CN202110602135.5A
Other languages: Chinese (zh)
Other versions: CN113326934A (en)
Inventor: 鲁超
Current Assignee: Shanghai Bilibili Technology Co Ltd
Original Assignee: Shanghai Bilibili Technology Co Ltd

Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN202110602135.5A
Publication of CN113326934A
Application granted
Publication of CN113326934B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/77 Retouching; Inpainting; Scratch removal
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20088 Trinocular vision calculations; trifocal tensor

Abstract

The disclosure provides a training method of a neural network and a method and device for generating images and videos, and relates to the technical field of image and video processing, in particular to the technical field of artificial intelligence. The scheme comprises the following steps: acquiring a sample source image and a sample reference image, wherein the sample source image comprises a sample source object and the sample reference image comprises a sample reference object; inputting the sample source image and the sample reference image into an image generation network to obtain a predicted generated image output by the image generation network, wherein the predicted generated image comprises the sample source object, and the pose of the sample source object in the predicted generated image is consistent with the pose of the sample reference object in the sample reference image; inputting the predicted generated image into an image restoration network to obtain a predicted repair image, output by the image restoration network, for the predicted generated image; determining a loss value based on the sample reference image and the predicted repair image; and adjusting parameters of the image restoration network based on the loss value.

Description

Training method of neural network, method and device for generating images and videos
Technical Field
The present disclosure relates to the field of image and video processing technologies, in particular to the field of artificial intelligence technologies, and more particularly to a neural network training method and apparatus, a method and apparatus for generating an image using a neural network, a method and apparatus for generating a video using a neural network, an electronic device, a storage medium, and a computer program product.
Background
With the popularity of short video applications (apps), more and more users are beginning to use mobile terminals such as cell phones to capture and share short videos. In some cases, when a user sees an interesting short video, the user may attempt an imitation shot, i.e., imitate the pose and motion of a person in the video to shoot a video of his or her own. However, imitation shooting is difficult for most users, who often struggle to reproduce the poses or actions in the original video.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a neural network, a method and apparatus for generating an image using a neural network, a method and apparatus for generating a video using a neural network, an electronic device, a storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a computer-implemented training method of a neural network, the neural network including an image generation network and an image restoration network, the method comprising: acquiring a sample source image and a sample reference image, wherein the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object; inputting the sample source image and the sample reference image into the image generation network to obtain a predicted generated image output by the image generation network, wherein the predicted generated image comprises the sample source object, and the pose of the sample source object in the predicted generated image is consistent with the pose of the sample reference object in the sample reference image; inputting the predicted generated image into the image restoration network to obtain a predicted repair image, output by the image restoration network, for the predicted generated image; determining a loss value based on the sample reference image and the predicted repair image; and adjusting parameters of the image restoration network based on the loss value.
According to another aspect of the present disclosure, there is also provided a method of generating an image using a neural network, the neural network being trained according to the above training method and including an image generation network and an image restoration network, the method including: inputting a source image and a reference image into the image generation network to obtain a generated image output by the image generation network, wherein the source image comprises a source object, the reference image comprises a reference object, the generated image comprises the source object, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image; inputting the generated image into the image restoration network to obtain a repair image, output by the image restoration network, for the generated image; and taking the repair image as a result image.
According to another aspect of the present disclosure, there is also provided a method of generating video using a neural network, the neural network being trained according to the above training method and including an image generation network and an image restoration network, the method including: acquiring a source image and a reference video, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object; for each of the plurality of reference image frames, performing the following: inputting the source image and the reference image frame into the image generation network to obtain a generated image output by the image generation network, wherein the generated image comprises the source object, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image frame; and inputting the generated image into the image restoration network to obtain a repair image, output by the image restoration network, for the generated image; and stitching the plurality of repair images respectively corresponding to the plurality of reference image frames to generate a result video.
According to another aspect of the present disclosure, there is also provided a training apparatus of a neural network including an image generation network and an image restoration network, the apparatus including: a sample acquisition module configured to acquire a sample source image and a sample reference image, wherein the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object; a prediction generation module configured to input the sample source image and the sample reference image into the image generation network and obtain a predicted generated image output by the image generation network, wherein the predicted generated image comprises the sample source object, and the pose of the sample source object in the predicted generated image is consistent with the pose of the sample reference object in the sample reference image; a prediction restoration module configured to input the predicted generated image into the image restoration network and obtain a predicted repair image, output by the image restoration network, for the predicted generated image; a loss calculation module configured to determine a loss value based on the sample reference image and the predicted repair image; and a parameter adjustment module configured to adjust parameters of the image restoration network based on the loss value.
According to another aspect of the present disclosure, there is also provided an apparatus for generating an image using a neural network including an image generation network and an image restoration network, the apparatus including: an image generation module configured to input a source image and a reference image into the image generation network and obtain a generated image output by the image generation network, wherein the source image comprises a source object, the reference image comprises a reference object, the generated image comprises the source object, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image; an image restoration module configured to input the generated image into the image restoration network and obtain a repair image, output by the image restoration network, for the generated image; and an image output module configured to take the repair image as a result image.
According to another aspect of the present disclosure, there is also provided an apparatus for generating video using a neural network including an image generation network and an image restoration network, the apparatus including: an acquisition module configured to acquire a source image and a reference video, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object; a generation module including an image generation unit and an image restoration unit, wherein the image generation unit is configured to perform, for each of the plurality of reference image frames, the following operations: inputting the source image and the reference image frame into the image generation network to obtain a generated image output by the image generation network, wherein the generated image comprises the source object, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image frame; the image restoration unit is configured to input the generated image into the image restoration network and obtain a repair image, output by the image restoration network, for the generated image; and a video output module configured to stitch the repair images corresponding to the plurality of reference image frames to generate a result video.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements a method according to any of the above aspects.
According to another aspect of the present disclosure there is also provided a non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method according to any of the above aspects.
According to another aspect of the present disclosure there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a method according to any of the above aspects.
In accordance with one or more embodiments of the present disclosure, a neural network includes an image generation network and an image restoration network. During training of the neural network, a sample source image and a sample reference image are input into the image generation network to obtain a predicted generated image output by the image generation network; the predicted generated image comprises the sample source object, and the pose of that sample source object is consistent with the pose of the sample reference object in the sample reference image, so that the pose of the reference object is migrated onto the source object, realizing automatic and efficient pose migration. The predicted generated image is input into the image restoration network, and parameters of the image restoration network are adjusted according to the difference (i.e., the loss value) between the predicted repair image output by the image restoration network and the real sample reference image, so that the image restoration network learns image quality. This improves the quality of the images output by the image restoration network, so that a clear and realistic pose migration result image (or result video) can be generated based on the neural network.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a method of training a neural network, according to an embodiment of the present disclosure;
FIG. 3 illustrates a block diagram of a neural network, according to an embodiment of the present disclosure;
FIG. 4 illustrates a flowchart of constructing a sample video collection according to an embodiment of the present disclosure;
FIG. 5 illustrates a flowchart of a method of generating an image using a neural network, according to an embodiment of the present disclosure;
FIGS. 6A-6D are schematic diagrams illustrating a process of generating an image according to the method illustrated in FIG. 5;
FIG. 7 illustrates a flowchart of a method of generating video using a neural network, according to an embodiment of the present disclosure;
FIG. 8 shows a block diagram of a training device of a neural network, according to an embodiment of the present disclosure;
FIG. 9 shows a block diagram of an apparatus for generating an image using a neural network, according to an embodiment of the present disclosure;
FIG. 10 shows a block diagram of an apparatus for generating video using a neural network, according to an embodiment of the present disclosure; and
FIG. 11 shows a block diagram of an exemplary electronic device, according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the related art, when a user sees an interesting video, the user may attempt an imitation shot, that is, imitate the pose and motion of a person in the video to shoot a video of his or her own. However, imitation shooting is difficult for most users, who often struggle to reproduce the poses or actions in the original video. Therefore, it is desirable to provide a pose migration scheme, so that when a user sees a video of interest, the user only needs to upload his or her own photograph to migrate the pose or action in the specified video onto his or her own body and automatically generate his or her own imitation video, without performing a complicated imitation shooting operation.
In order to achieve efficient, high quality pose migration, the present disclosure provides a neural network and training schemes therefor, and schemes for generating images and videos using trained neural networks. Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes a client device 110, a server 120, and a network 130 communicatively coupling the client device 110 with the server 120.
Client device 110 includes a display 114 and a client application (APP) 112 that is displayable via display 114. The client application 112 may be an application program that needs to be downloaded and installed before running, a web page program (webapp) running in a browser, or a lightweight applet (lite app). In the case where the client application 112 is an application program that needs to be downloaded and installed before running, the client application 112 may be pre-installed on the client device 110 and activated. In the case where the client application 112 is a web page program running in a browser, the user 102 may directly run the client application 112 by accessing a particular site in the browser, without installing the client application 112. In the case where the client application 112 is an applet, the user 102 may run the client application 112 directly on the client device 110 by searching for the client application 112 in a host application (e.g., by the name of the client application 112) or by scanning a graphical code (e.g., a bar code or QR code) of the client application 112, without installing the client application 112. In some embodiments, the client device 110 may be any type of mobile computer device, including a mobile computer, a mobile phone, a wearable computer device (e.g., a smart watch or a head-mounted device, including smart glasses), or other type of mobile device. In some embodiments, client device 110 may alternatively be a stationary computer device, such as a desktop, server computer, or other type of stationary computer device.
Server 120 is typically a server deployed by an Internet Service Provider (ISP) or Internet Content Provider (ICP). Server 120 may represent a single server, a cluster of multiple servers, a distributed system, or a cloud server providing basic cloud services (such as cloud databases, cloud computing, cloud storage, cloud communication). It will be appreciated that although server 120 is shown in fig. 1 as communicating with only one client device 110, server 120 may provide background services for multiple client devices simultaneously.
Examples of network 130 include a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a combination of communication networks such as the Internet. The network 130 may be a wired or wireless network. In some embodiments, the data exchanged over the network 130 is processed using techniques and/or formats including HyperText Markup Language (HTML), eXtensible Markup Language (XML), and the like. In addition, all or some of the links may also be encrypted using encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec), and the like. In some embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
For purposes of embodiments of the present disclosure, in the example of fig. 1, client application 112 may be an image or video application that provides a user with functionality to generate pose-migrated images and/or pose-migrated videos. Correspondingly, server 120 may be a server used with the client application, providing services for generating pose-migrated images and/or pose-migrated videos to the client application 112 running in client device 110.
In particular, the server 120 may perform the training method of the neural network of the embodiments of the present disclosure to train the neural network. The user may upload or designate a source image (the source image may be, for example, a photograph of the user) and designate a reference image through the client application 112; accordingly, the server 120 may execute the method for generating an image according to the embodiments of the present disclosure, process the source image and the reference image using the trained neural network, and migrate the pose of the reference object in the reference image onto the source object, so as to obtain a high-quality pose migration result image. The user may also upload or designate a source image and designate a reference video through the client application 112; accordingly, the server 120 may execute the method for generating a video according to the embodiments of the present disclosure, process the source image and the reference video using the trained neural network, and migrate the pose of the reference object in each image frame of the reference video onto the source object, so as to obtain a high-quality pose migration result video. Alternatively, the client device 110 may also perform the training method of the neural network to train the neural network, and the trained neural network may be utilized by the client application 112 running in the client device 110 to perform the method of generating images and/or the method of generating videos of embodiments of the present disclosure, generating high-quality pose migration images and/or pose migration videos.
Fig. 2 shows a flowchart of a neural network training method 200, according to an embodiment of the present disclosure. The method 200 may be performed at a server (e.g., the server 120 shown in fig. 1), i.e., the subject of execution of the steps of the method 200 may be the server 120 shown in fig. 1. It is to be appreciated that the method 200 may also be performed at a client device (e.g., the client device 110 shown in fig. 1).
The neural network trained according to method 200 may be used to generate pose migration images and/or pose migration videos. Fig. 3 illustrates a structure of a neural network 300 of an embodiment of the present disclosure, as illustrated in fig. 3, the neural network 300 includes a cascaded image generation network 310 and an image restoration network 320.
Still referring to fig. 2, as shown in fig. 2, method 200 may include:
step S210, acquiring a sample source image and a sample reference image, wherein the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object;
step S220, inputting the sample source image and the sample reference image into an image generation network to obtain a predicted generated image output by the image generation network, wherein the predicted generated image comprises the sample source object, and the pose of the sample source object in the predicted generated image is consistent with the pose of the sample reference object in the sample reference image;
step S230, inputting the predicted generated image into an image restoration network to obtain a predicted repair image, output by the image restoration network, for the predicted generated image;
step S240, determining a loss value based on the sample reference image and the predicted repair image; and
step S250, parameters of the image restoration network are adjusted based on the loss value.
According to an embodiment of the present disclosure, a neural network includes an image generation network and an image restoration network. During training of the neural network, a sample source image and a sample reference image are input into the image generation network to obtain a predicted generated image output by the image generation network; the predicted generated image comprises the sample source object, and the pose of that sample source object is consistent with the pose of the sample reference object in the sample reference image, so that the pose of the reference object is migrated onto the source object, realizing automatic and efficient pose migration. The predicted generated image is input into the image restoration network, and parameters of the image restoration network are adjusted according to the difference between the predicted repair image output by the image restoration network and the real sample reference image, so that the image restoration network learns image quality. This improves the quality of the images output by the image restoration network, so that a clear and realistic pose migration result image (or result video) can be generated based on the neural network. A minimal sketch of one such training iteration is given below.
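For concreteness, the following PyTorch-style sketch shows one training iteration covering steps S220 to S250. It is an illustrative assumption only: the module names (image_gen_net, image_repair_net), the use of mean squared error for step S240, and the frozen generation network are not prescribed by the disclosure.

```python
import torch
import torch.nn.functional as F

def train_step(image_gen_net, image_repair_net, optimizer,
               sample_source, sample_reference):
    """One training iteration for a (sample source, sample reference) pair
    obtained in step S210. Both images are NCHW tensors of the same object."""
    with torch.no_grad():
        # Step S220: the predicted generated image shows the sample source
        # object in the pose of the sample reference object. The generation
        # network is assumed pre-trained and frozen here.
        predicted_generated = image_gen_net(sample_source, sample_reference)

    # Step S230: predicted repair image for the predicted generated image.
    predicted_repair = image_repair_net(predicted_generated)

    # Step S240: loss between the repair result and its ground truth, the
    # sample reference image (source and reference objects are the same).
    loss = F.mse_loss(predicted_repair, sample_reference)

    # Step S250: adjust only the restoration network's parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```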
The steps of method 200 are described in detail below.
In step S210, a sample source image and a sample reference image are acquired, wherein the sample source image includes a sample source object, and the sample reference image includes a sample reference object.
The sample source object and the sample reference object may be any object capable of assuming a pose, including but not limited to a real person, a cartoon animal, an anthropomorphic object, and the like. For the purpose of pose migration, the sample source object in the sample source image typically has a different pose than the sample reference object in the sample reference image. A pose may include, for example, a static action of the object.
According to some embodiments, the sample source object and the sample reference object may be the same object or different objects. Preferably, the sample source object and the sample reference object are the same object, so that in the subsequent steps S240 and S250, when a loss value is determined from the sample reference image and the predicted repair image and the parameters of the image restoration network are adjusted based on the loss value, the sample reference image can better serve as the ground truth of the predicted repair image, further improving the restoration effect. The following example explains why using the same object as both the sample source object and the sample reference object improves the restoration effect of the image restoration network.
For example, the sample source image a1 includes a sample source object o1 whose pose is p1, so the object and pose in the sample source image a1 may be represented as a two-dimensional vector (o1, p1). Similarly, the sample reference image a2 includes a sample reference object o2 whose pose is p2, so the object and pose in the sample reference image a2 may be represented as (o2, p2). The image generation network processes the sample source image a1 and the sample reference image a2 to generate a predicted generated image a3, which includes the sample source object o1 in the pose p2 of the sample reference object o2; the object and pose in the predicted generated image a3 may thus be represented as (o1, p2). The image restoration network processes the predicted generated image a3 to generate a predicted repair image a4 for it; the predicted repair image a4 and the predicted generated image a3 include the same object and pose, namely the sample source object o1 and the pose p2. The image restoration network restores the quality of the predicted generated image a3, so that the quality (e.g., color, sharpness, smoothness of edges, etc.) of the predicted repair image a4 is improved over the predicted generated image a3. The object and pose in the predicted repair image a4 may be represented as (o1, p2).
In order for the image restoration network to learn image quality, a loss value is determined from the predicted repair image a4 and its ground truth, and the parameters of the image restoration network are adjusted according to the loss value, so that the difference between the predicted repair image a4 output by the image restoration network and the ground truth becomes as small as possible. The ground truth of the predicted repair image a4 should be a real image that includes the same object and pose. Since the predicted generated image a3 is generated by the image generation network, it is not a real image and cannot serve as the ground truth of the predicted repair image a4; the ground truth can only be the sample source image a1 or the sample reference image a2. For the purpose of pose migration, the pose p1 in the sample source image a1 differs from the pose p2 in the sample reference image a2, and the pose in the predicted repair image a4 is p2, so the ground truth of the predicted repair image a4 should be the sample reference image a2. Further, by making the sample reference object o2 in the sample reference image a2 identical to the sample source object o1 in the predicted repair image a4, i.e., by making the sample source object and the sample reference object the same object, the sample reference image a2 is guaranteed to be a valid ground truth for the predicted repair image a4.
In the above embodiment, the sample source object and the sample reference object are the same object. It will be appreciated that in other embodiments, the sample source object and the sample reference object may be different objects. In that case, some processing may be applied to the sample reference image (for example, scaling to adapt to the sizes of the different objects), and the processed sample reference image is used as the ground truth of the predicted repair image.
As described above, to facilitate training of the image restoration network (i.e., adjusting its parameters), the sample source object and the sample reference object may be the same object, in which case, according to some embodiments, the sample source image and the sample reference image may be different image frames of a video of that same object. For example, the sample source image and the sample reference image may be obtained by extracting image frames (either randomly or according to a rule) from a single video of the same object. By repeatedly extracting image frames from a single video, multiple (sample source image, sample reference image) pairs can be obtained, and the neural network can be trained based on these image pairs.
For example, two image frames frame1 and frame2 may be randomly extracted from dance video B of dancer A and used as a sample source image and a sample reference image, respectively, to obtain a first image pair. Two more image frames, frame3 and frame4, may then be randomly extracted from the dance video B and used as a sample source image and a sample reference image, respectively, to obtain a second image pair. These two image pairs can be used to train the neural network. In both pairs, the sample source object in the sample source image and the sample reference object in the sample reference image are dancer A.
According to other embodiments, a sample video set may be constructed, the sample video set including a plurality of sample videos, each sample video corresponding to one object. The sample source image and the sample reference image may then be different image frames of a sample video of the same object in the sample video set. For example, the sample source image and the sample reference image may be obtained by extracting a sample video from the sample video set (randomly or according to a rule) and then extracting image frames from the extracted sample video (again, randomly or according to a rule). By repeatedly extracting sample videos from the sample video set and image frames from those sample videos, multiple (sample source image, sample reference image) pairs can be obtained, and the neural network can be trained based on these image pairs.
For example, suppose the sample video set includes three sample videos: dance video B1 of dancer A1, fighting video B2 of cartoon character A2, and runway show video B3 of model A3. The runway show video B3 is extracted from the sample video set, and two image frames frame1 and frame2 are extracted from it as a sample source image and a sample reference image, respectively, yielding a first image pair in which both the sample source object and the sample reference object are model A3. The fighting video B2 is then extracted from the sample video set, and two image frames frame3 and frame4 are extracted from it as a sample source image and a sample reference image, respectively, yielding a second image pair in which both the sample source object and the sample reference object are cartoon character A2. The fighting video B2 is extracted again, and two image frames frame5 and frame6 are extracted from it as a sample source image and a sample reference image, respectively, yielding a third image pair in which both objects are again cartoon character A2. These three image pairs can be used to train the neural network. A sketch of this pair-sampling procedure follows.
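A minimal sketch of the pair construction is given below; it assumes each sample video is already available as a list of frames, which the disclosure does not mandate.

```python
import random

def sample_training_pair(sample_video_set):
    """Draw one (sample source image, sample reference image) pair by first
    sampling a video from the set, then two distinct frames from that video,
    so that both images show the same object in different poses."""
    video = random.choice(sample_video_set)   # e.g. B1, B2 or B3 above
    source_frame, reference_frame = random.sample(video, 2)
    return source_frame, reference_frame

# Repeated calls yield many pairs, e.g.:
# pairs = [sample_training_pair(sample_video_set) for _ in range(10000)]
```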
According to this embodiment, the sample video set may include a large number (e.g., tens of thousands) of sample videos of different objects (e.g., real persons, cartoon animals, etc.) and different poses (e.g., dancing, sports, runway shows, fighting, etc.). Constructing training samples (i.e., sample source images and sample reference images) based on such a sample video set improves the richness of the training samples, so that the trained neural network generalizes well and achieves good pose migration results for a variety of objects and poses.
According to some embodiments, as shown in FIG. 4, a sample video set may be constructed according to steps S410-S430.
In step S410, a plurality of original videos are acquired. An original video may be an unprocessed video uploaded by a user, such as a dance video or a sports video.
In step S420, for each original video, object detection is performed on each image frame in the original video. According to some embodiments, object detection may be performed on an image frame by a preset object detection model. That is, the image frame is input into the preset object detection model, and the object detection model outputs whether an object is included in the image frame and, where an object is included, may further output the position of the object. Specifically, the object detection model may be, for example, a Faster R-CNN, YOLO, or Cascade R-CNN model, or the like.
In step S430, the image frames that do not include an object are removed from the original video to obtain a sample video.
For example, where the original video A includes 100 frames from frame1 to frame100, and the object detection of step S420 determines that no object is included in the image frames from frame20 to frame39, then the image frames from frame20 to frame39 are removed from the original video A. The remaining image frames frame1-frame19, frame40-frame100 make up the sample video.
By removing the image frames that do not include an object, every image frame of the sample video is guaranteed to include an object. Therefore, a sample source image and a sample reference image for training the neural network can be obtained by extracting image frames from any sample video, improving the efficiency of acquiring training samples and, in turn, the training efficiency of the neural network. A sketch of this filtering step follows.
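As an illustration of steps S420 and S430, the sketch below filters frames with a COCO-pretrained Faster R-CNN from torchvision standing in for the "preset object detection model"; the detector choice and the 0.7 score threshold are assumptions (a recent torchvision is assumed for the `weights` argument).

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in detector; any model from the Faster R-CNN / YOLO / Cascade R-CNN
# family that returns per-frame boxes and scores would serve equally well.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def build_sample_video(original_frames, score_threshold=0.7):
    """Steps S420-S430: run object detection on every image frame of the
    original video and keep only the frames in which an object is detected.
    Frames are assumed to be CHW float tensors with values in [0, 1]."""
    sample_video = []
    for frame in original_frames:
        detections = detector([frame])[0]   # boxes, labels, scores per frame
        if (detections["scores"] > score_threshold).any():
            sample_video.append(frame)      # frame contains an object: keep
    return sample_video                     # e.g. frame1-frame19, frame40-frame100
```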
It can be appreciated that the sample video set is not limited to being constructed by the above method; for example, sample videos may also be obtained directly by recording different objects.
In the above embodiments, the sample source image and the sample reference image may be different image frames of a video of the same object. It should be noted that the sample source image and the sample reference image may alternatively be different pictures obtained by photographing the same object.
According to some embodiments, as shown in fig. 4, the method for constructing a sample video set may further include step S440. In step S440, the duration of the sample video is adjusted to a preset duration.
According to some embodiments, in the case that the frame rate (i.e., the number of image frames included per second) of each sample video is the same, each sample video may be respectively frame-decimated so that each sample video includes the same number of image frames, thereby adjusting each sample video to the same duration, i.e., the preset duration. According to some embodiments, the preset time period may be set to a small value, such as 15-30 seconds.
According to this embodiment, the duration of each sample video is adjusted to the preset duration, so that all sample videos have the same duration. Extracting image frames from sample videos of equal duration as training samples (i.e., sample source images and sample reference images) allows the training samples to be uniformly distributed over different objects and different poses, improving the richness of the training samples, so that the trained neural network generalizes well and achieves good pose migration results for a variety of objects and poses.
According to some embodiments, as shown in fig. 4, the method for constructing a sample video set may further include step S450. In step S450, the size of the image frames of the sample video is adjusted to a preset size. The preset size may be 512×512, 256×256, etc. Adjusting the size of the image frames of each sample video to the preset size normalizes the image frame size, so that training samples (i.e., sample source images and sample reference images) extracted from the sample videos have the same size and need no further resizing, improving the training efficiency of the neural network. A preprocessing sketch covering steps S440 and S450 follows.
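Steps S440 and S450 together amount to the preprocessing sketched below; the 450-frame target (15 seconds at 30 fps) and the 256×256 size are illustrative values, not requirements.

```python
import torch
import torch.nn.functional as F

def normalize_sample_video(frames, target_num_frames=450, size=(256, 256)):
    """Step S440: decimate frames at evenly spaced indices so every sample
    video has the same frame count (hence, at a shared frame rate, the same
    preset duration). Step S450: resize every frame to the preset size."""
    indices = torch.linspace(0, len(frames) - 1, target_num_frames).long()
    decimated = [frames[i] for i in indices]
    return [
        F.interpolate(f.unsqueeze(0), size=size, mode="bilinear",
                      align_corners=False).squeeze(0)
        for f in decimated  # each frame is a CHW float tensor
    ]
```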
After the sample source image and the sample reference image are acquired in step S210, step S220 may be performed: the sample source image and the sample reference image are input into the image generation network, and a predicted generated image output by the image generation network is obtained. The predicted generated image includes the sample source object, and the pose of the sample source object in the predicted generated image is consistent with the pose of the sample reference object in the sample reference image.
The image generation network may employ, for example, a generative adversarial network (GAN) architecture. A generative adversarial network is an unsupervised learning model that includes a generator and a discriminator, which learn through an adversarial game with each other to produce the desired output. More specifically, in some embodiments, the image generation network may be a Liquid Warping GAN. In the case where the image generation network adopts a GAN structure, the sample source image and the sample reference image may be input to the GAN's generator, which outputs the predicted generated image. A toy sketch of the generator's input/output contract follows.
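To make the input/output contract concrete, here is a toy generator sketch. The real network (for example, Liquid Warping GAN) is far more elaborate, and its discriminator is omitted here; everything below is an assumption for exposition.

```python
import torch
import torch.nn as nn

class ToyPoseTransferGenerator(nn.Module):
    """Minimal stand-in for the generator of the image generation network:
    it consumes the source image and the reference image and emits the
    predicted generated image, i.e. the source object in the reference pose."""

    def __init__(self, channels=3, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, width, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=3, padding=1),
            nn.Tanh(),  # image-range output, as is conventional for GANs
        )

    def forward(self, source, reference):
        # Channel-concatenation is one simple way to condition on both inputs.
        return self.net(torch.cat([source, reference], dim=1))
```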
The image generation network may be pre-trained before the execution of step S220, so that the training efficiency of the subsequent image restoration network can be improved. The image generation network may be trained, for example, using a sample generation source image and a sample generation reference image, which are different image frames in a video for the same object.
According to some embodiments, similar to the acquisition of the sample source image and the sample reference image in step S210 described above, the sample generation source image and the sample generation reference image for training the image generation network may also be different image frames of a sample video of the same object in the sample video set described above. For example, the sample generation source image and the sample generation reference image may be obtained by extracting a sample video from the sample video set and then extracting image frames from the extracted sample video. By repeatedly extracting sample videos from the sample video set and image frames from those sample videos, multiple (sample generation source image, sample generation reference image) pairs can be obtained, and the image generation network can be trained based on these image pairs.
In some embodiments, the sample generation source image and the sample generation reference image used for training the image generation network may be identical to the sample source image and the sample reference image in step S210, or may be different from them.
In the case where the sample generation source image and the sample generation reference image used for training the image generation network are the same as those in step S210, the image generation network may be trained simultaneously with the image restoration network.
After the predicted generated image is obtained in step S220, step S230 may be performed: the predicted generated image is input into the image restoration network, and a predicted repair image, output by the image restoration network, for the predicted generated image is obtained.
According to some embodiments, the image restoration network may include a generative adversarial network structure. More specifically, in some embodiments, the image restoration network may be, for example, a pix2pixHD network including a GAN structure. After the predicted generated image is input into the image restoration network, the image restoration network processes it and outputs the predicted repair image. The image quality (e.g., color, sharpness, edge smoothness, etc.) of the predicted repair image is generally better than that of the predicted generated image.
After the predicted repair image for the predicted generated image output by the image restoration network is obtained in step S230, step S240 may be performed to determine a loss value based on the sample reference image and the predicted repair image.
In embodiments of the present disclosure, the sample reference image is taken as the ground truth corresponding to the predicted repair image, and a loss value is determined based on the sample reference image and the predicted repair image; the loss value measures the difference between the predicted repair image and its ground truth. In the subsequent step S250, this difference is continuously reduced (i.e., the loss value is shrunk) by adjusting the parameters of the image restoration network, so that the image restoration network learns image quality, improving the quality of the predicted repair images it outputs.
Specifically, the loss value may be calculated in various ways. According to some embodiments, the mean squared error (MSE) between the sample reference image and the predicted repair image may be used as the loss value. According to other embodiments, feature vectors of the sample reference image and the predicted repair image may be extracted separately, and the distance between the two feature vectors used as the loss value. It should be appreciated that the manner of calculating the loss value is not limited to the two embodiments listed above. Both variants are sketched below.
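Both variants of step S240 can be written compactly as below; the frozen VGG-16 backbone for the feature-vector variant is one common choice and purely an assumption here.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen feature extractor for the feature-distance variant (assumption:
# the disclosure requires only "feature vectors", not this backbone).
vgg_features = vgg16(weights="DEFAULT").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def repair_loss(predicted_repair, sample_reference, use_features=False):
    """Step S240: loss between the predicted repair image and its ground
    truth, the sample reference image."""
    if not use_features:
        # Variant 1: mean squared error in pixel space.
        return F.mse_loss(predicted_repair, sample_reference)
    # Variant 2: distance between the two images' feature vectors.
    return F.l1_loss(vgg_features(predicted_repair),
                     vgg_features(sample_reference))
```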
According to some embodiments, at least one of the image generation network and the image restoration network includes a generative adversarial network. For example, the image generation network and the image restoration network may each include a generative adversarial network; the image generation network may be, for example, a Liquid Warping GAN, and the image restoration network may be, for example, a pix2pixHD network including a GAN structure.
Based on the method 200, a neural network can be trained. Using this neural network, a pose migration image or pose migration video can be generated, realizing efficient, high-quality pose migration.
Fig. 5 illustrates a flowchart of a method 500 of generating an image using a neural network trained by the method 200, according to an embodiment of the disclosure. The method 500 may be performed at a server (e.g., the server 120 shown in fig. 1), i.e., the subject of execution of the steps of the method 500 may be the server 120 shown in fig. 1. It is to be appreciated that the method 500 may also be performed at a client device (e.g., the client device 110 shown in fig. 1).
As shown in fig. 5, method 500 may include:
step S510, inputting a source image and a reference image into an image generation network to obtain a generated image output by the image generation network, wherein the source image comprises a source object, the reference image comprises a reference object, the generated image comprises the source object, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image;
step S520, inputting the generated image into an image restoration network to obtain a repair image, output by the image restoration network, for the generated image; and
step S530, taking the repair image as a result image.
According to an embodiment of the present disclosure, a neural network includes an image generation network and an image restoration network. The image generation network can process a source image comprising a source object and a reference image comprising a reference object, and migrate the pose of the reference object onto the source object to obtain a generated image, realizing automatic and efficient pose migration. The image restoration network can restore the quality of the generated image to obtain a repair image, and the repair image is taken as the pose migration result image, improving the image quality of the result image so that it is clearer and more realistic, thereby realizing high-quality pose migration. The inference pipeline is sketched below.
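Inference with the trained network reduces to two forward passes, as in this sketch (module names are assumptions carried over from the training sketch above):

```python
import torch

@torch.no_grad()
def generate_result_image(image_gen_net, image_repair_net, source, reference):
    """Steps S510-S530 with a trained neural network."""
    generated = image_gen_net(source, reference)  # S510: reference pose
                                                  # transferred to source object
    repaired = image_repair_net(generated)        # S520: quality restoration
    return repaired                               # S530: repair image as result
```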
According to some embodiments, the source image in step S510 may be an image uploaded or specified by the user through a client application in a client device (e.g., client device 110 shown in fig. 1), for example, a photograph of the user himself or herself. The reference image may be a user-specified image containing an object whose pose the user wants to imitate. In this case, based on the method 500, the object pose in the image that the user wants to imitate can be migrated onto the user in the source image, generating a high-quality pose migration result image.
Figs. 6A-6D are schematic diagrams illustrating a process of generating an image according to the method 500 illustrated in fig. 5. Specifically, figs. 6A-6D show a source image, a reference image, a generated image, and a repair image (i.e., a result image), respectively. As shown in fig. 6A, the source image includes a source object 610 in a walking pose. As shown in fig. 6B, the reference image includes a reference object 620 in a basketball-playing pose. Inputting the source image shown in fig. 6A and the reference image shown in fig. 6B into the image generation network produces the generated image shown in fig. 6C, which includes the source object 610 whose pose is consistent with the pose of the reference object 620 in fig. 6B; that is, the pose of the reference object 620 has been migrated onto the source object 610, realizing pose migration. However, as shown in fig. 6C, the generated image is not realistic enough, the edges of the source object 610 are blurred, and the image quality is not high. Inputting the generated image shown in fig. 6C into the image restoration network produces the repair image shown in fig. 6D, i.e., the result image. Comparing fig. 6C with fig. 6D shows that the result image in fig. 6D is more realistic than the generated image in fig. 6C, with clearer edges and higher image quality, realizing high-quality pose migration.
Fig. 7 illustrates a flowchart of a method 700 of generating video using a neural network trained by the method 200, according to an embodiment of the disclosure. The method 700 may be performed at a server (e.g., the server 120 shown in fig. 1), i.e., the subject of execution of the steps of the method 700 may be the server 120 shown in fig. 1. It is to be appreciated that the method 700 may also be performed at a client device (e.g., the client device 110 shown in fig. 1).
As shown in fig. 7, method 700 may include:
step S710, acquiring a source image and a reference video, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object;
step S720, for each of the plurality of reference image frames, performing the following operations: inputting the source image and the reference image frame into an image generation network to obtain a generated image output by the image generation network, wherein the generated image comprises the source object, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image frame; and inputting the generated image into an image restoration network to obtain a repair image, output by the image restoration network, for the generated image; and
step S730, stitching the plurality of repair images respectively corresponding to the plurality of reference image frames to generate a result video.
According to an embodiment of the present disclosure, a neural network includes an image generation network and an image restoration network. The image generation network can process a source image comprising a source object and each reference image frame comprising a reference object, and migrate the pose of the reference object onto the source object, obtaining a generated image corresponding to each reference image frame and realizing automatic and efficient pose migration. The image restoration network can restore the quality of each generated image to obtain repair images corresponding to the reference image frames, and stitching the repair images corresponding to the plurality of reference image frames yields a result video, improving the quality of the result video so that it is clearer and more realistic, thereby realizing high-quality pose migration. A per-frame sketch of this pipeline is given below.
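The video case simply applies the image pipeline per reference frame and stitches the results, as sketched below; representing the result video as a stacked frame tensor (to be handed to any video writer) is an assumption.

```python
import torch

@torch.no_grad()
def generate_result_video(image_gen_net, image_repair_net, source, reference_frames):
    """Steps S710-S730: pose transfer for every reference image frame,
    then stitching the repair images in frame order."""
    repaired_frames = []
    for frame in reference_frames:                 # step S720
        generated = image_gen_net(source, frame)   # source object, frame's pose
        repaired_frames.append(image_repair_net(generated))
    # Step S730: stitch into the result video, here a (T, C, H, W) tensor.
    return torch.stack(repaired_frames)
```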
According to some embodiments, the source image in step S710 described above may be an image uploaded or specified by the user through a client application in a client device (e.g., the client device 110 shown in fig. 1), for example, a photograph of the user. The reference video may be a video specified by the user whose object poses the user intends to imitate. In this case, based on the method 700, the object poses in the video that the user wants to imitate can be migrated to the user in the source image, generating a high-quality pose migration result video.
According to another aspect of the present disclosure, there is also provided a training apparatus for a neural network. Fig. 8 shows a block diagram of a training apparatus 800 of a neural network according to an embodiment of the present disclosure. As shown in fig. 8, apparatus 800 may include a sample acquisition module 810, a prediction generation module 820, a prediction repair module 830, a loss calculation module 840, and a parameter adjustment module 850.
The sample acquisition module 810 may be configured to acquire a sample source image and a sample reference image, wherein the sample source image includes a sample source object and the sample reference image includes a sample reference object.
The prediction generation module 820 may be configured to input the sample source image and the sample reference image into the image generation network to obtain a prediction generation image output by the image generation network, wherein the prediction generation image includes the sample source object therein, and the pose of the sample source object in the prediction generation image is consistent with the pose of the sample reference object in the sample reference image.
The prediction repair module 830 may be configured to input the prediction generation image into the image restoration network and obtain a predicted repair image, output by the image restoration network, for the prediction generation image.
The loss calculation module 840 may be configured to determine a loss value based on the sample reference image and the predicted repair image.
The parameter adjustment module 850 may be configured to adjust parameters of the image restoration network based on the loss value.
According to an embodiment of the present disclosure, the neural network includes an image generation network and an image restoration network. During training of the neural network, a sample source image and a sample reference image are input into the image generation network to obtain a prediction generation image output by the image generation network. The prediction generation image includes the sample source object, and the pose of the sample source object is consistent with the pose of the sample reference object in the sample reference image, so that the pose of the reference object can be migrated to the source object, realizing automatic and efficient pose migration. The prediction generation image is input into the image restoration network, and parameters of the image restoration network are adjusted according to the difference (i.e., the loss value) between the predicted repair image output by the image restoration network and the real sample reference image, so that the image restoration network learns image quality. This improves the quality of the images output by the image restoration network, enabling clear and realistic pose migration result images (or result videos) to be generated based on the neural network.
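A minimal sketch of one such training step is given below, assuming PyTorch, the hypothetical modules named above, and an optimizer constructed over the restoration network's parameters only (the generation network is assumed to be pretrained and frozen, consistent with adjusting only the restoration network's parameters). The L1 loss is a stand-in assumption; the disclosure does not fix a particular loss function.

```python
import torch
import torch.nn.functional as F

def train_restoration_step(generation_net, restoration_net, optimizer,
                           sample_source: torch.Tensor,
                           sample_reference: torch.Tensor) -> float:
    """One training step mirroring modules 810-850."""
    with torch.no_grad():                                      # generation network stays frozen
        prediction_generated = generation_net(sample_source, sample_reference)
    predicted_repair = restoration_net(prediction_generated)   # prediction repair (module 830)
    loss = F.l1_loss(predicted_repair, sample_reference)       # loss value (module 840); L1 is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                           # adjust restoration parameters (module 850)
    return loss.item()
```

Here the optimizer would be created over the restoration network only, e.g. torch.optim.Adam(restoration_net.parameters()), so that the loss gradient adjusts the image restoration network and leaves the image generation network untouched.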
According to another aspect of the present disclosure, there is also provided an apparatus for generating an image using a neural network. Fig. 9 shows a block diagram of an apparatus 900 for generating an image using a neural network according to an embodiment of the present disclosure. As shown in fig. 9, apparatus 900 may include an image generation module 910, an image restoration module 920, and an image output module 930.
The image generation module 910 may be configured to input a source image and a reference image into the image generation network to obtain a generated image output by the image generation network, wherein the source image includes a source object, the reference image includes a reference object, the generated image includes the source object, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image.
The image restoration module 920 may be configured to input the generated image into an image restoration network, and obtain a restoration image for the generated image output by the image restoration network.
The image output module 930 may be configured to take the repair image as a resultant image.
According to an embodiment of the present disclosure, the neural network includes an image generation network and an image restoration network. The image generation network can process a source image including a source object and a reference image including a reference object, migrating the pose of the reference object to the source object to obtain a generated image, thereby realizing automatic and efficient pose migration. The image restoration network can perform quality restoration on the generated image to obtain a repair image, which is taken as the result image of the pose migration. This improves the image quality of the result image, making it clearer and more realistic, and realizes high-quality pose migration.
According to another aspect of the present disclosure, there is also provided an apparatus for generating video using a neural network. Fig. 10 shows a block diagram of an apparatus 1000 for generating video using a neural network according to an embodiment of the present disclosure. As shown in fig. 10, the apparatus 1000 may include an acquisition module 1010, a generation module 1020, and a video output module 1030.
The acquisition module 1010 may be configured to acquire a source image including a source object and a reference video including a plurality of reference image frames each including a reference object therein.
The generation module 1020 may include an image generation unit 1022 and an image restoration unit 1024, wherein the image generation unit 1022 may be configured to perform the following operations for each of the plurality of reference image frames: the source image and the reference image frame are input into an image generation network, a generated image output by the image generation network is obtained, wherein the source object is included in the generated image, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image frame.
The image restoration unit 1024 may be configured to input the generated image into the image restoration network, and obtain a restoration image for the generated image output by the image restoration network.
The video output module 1030 may be configured to splice the plurality of repair images respectively corresponding to the plurality of reference image frames, to generate a result video.
According to an embodiment of the present disclosure, the neural network includes an image generation network and an image restoration network. The image generation network can process a source image including a source object and a reference image frame including a reference object, migrating the pose of the reference object to the source object to obtain a generated image corresponding to each reference image frame, thereby realizing automatic and efficient pose migration. The image restoration network can perform quality restoration on each generated image to obtain a repair image corresponding to each reference image frame. The repair images corresponding to the reference image frames are spliced to obtain a result video; this improves the quality of the result video, makes it clearer and more realistic, and realizes high-quality pose migration.
It should be appreciated that the various modules of the apparatus 800 shown in fig. 8 may correspond to the various steps in the method 200 described with reference to fig. 2, the various modules of the apparatus 900 shown in fig. 9 may correspond to the various steps in the method 500 described with reference to fig. 5, and the various modules of the apparatus 1000 shown in fig. 10 may correspond to the various steps in the method 700 described with reference to fig. 7. Thus, the operations, features and advantages described above with respect to the methods 200, 500, 700 are equally applicable to the apparatus 800, 900, 1000 and modules/units comprised thereof. For brevity, certain operations, features and advantages are not described in detail herein.
Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the various modules discussed herein may be divided into multiple modules and/or at least some of the functions of the multiple modules may be combined into a single module. For example, the loss calculation module 840 and the parameter adjustment module 850 described above may be combined into a single module in some embodiments.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to fig. 8-10 may be implemented in hardware, or in hardware combined with software and/or firmware. For example, these modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the sample acquisition module 810, the prediction generation module 820, the prediction repair module 830, the loss calculation module 840, the parameter adjustment module 850, the image generation module 910, the image restoration module 920, the image output module 930, the acquisition module 1010, the generation module 1020 (including the image generation unit 1022 and the image restoration unit 1024), and the video output module 1030 may be implemented together in a system on chip (SoC). The SoC may include an integrated circuit chip including one or more components of a processor (e.g., a central processing unit (CPU), microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements a method according to the above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a method according to the above.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a method according to the above.
Referring to fig. 11, a block diagram of an electronic device 1100 that may serve as a server or a client of the present disclosure will now be described; it is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices may be different types of computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the electronic device 1100 may include at least one processor 1101, a working memory 1102, an input unit 1104, a display unit 1105, a speaker 1106, a storage unit 1107, a communication unit 1108, and other output units 1109 that are capable of communicating with each other through a system bus 1103.
The processor 1101 may be a single processing unit or multiple processing units, all of which may include single or multiple computing units or multiple cores. The processor 1101 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 1101 may be configured to obtain and execute computer-readable instructions stored in the working memory 1102, the storage unit 1107, or other computer-readable media, such as program code of the operating system 1102a, program code of the application program 1102b, and the like.
The working memory 1102 and the storage unit 1107 are examples of computer-readable storage media for storing instructions that are executed by the processor 1101 to implement the various functions described previously. The working memory 1102 may include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). The storage unit 1107 may include hard disk drives, solid state drives, removable media (including external and removable drives), memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. The working memory 1102 and the storage unit 1107 may both be referred to collectively herein as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that can be executed by the processor 1101 as a particular machine configured to implement the operations and functions described in the examples herein.
The input unit 1104 may be any type of device capable of inputting information to the electronic device 1100; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. An output unit may be any type of device capable of presenting information, and may include, but is not limited to, the display unit 1105, the speaker 1106, and the other output units 1109, which may in turn include, but are not limited to, a video/audio output terminal, a vibrator, and/or a printer. The communication unit 1108 allows the electronic device 1100 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The application program 1102b in the working memory 1102 may be loaded to perform the various methods and processes described above, e.g., steps S210 to S250 in fig. 2. For example, in some embodiments, the methods 200, 400, 500, and 700 described above may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1107. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1100 via the storage unit 1107 and/or the communication unit 1108. When the computer program is loaded and executed by the processor 1101, one or more steps of the methods 200, 400, 500, and 700 described above may be performed. Alternatively, in other embodiments, the processor 1101 may be configured to perform the methods 200, 400, 500, and 700 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatuses are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples, but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalents thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure, and various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (15)

1. A computer-implemented method of training a neural network, wherein the neural network comprises an image generation network and an image restoration network, the method comprising:
acquiring a sample source image and a sample reference image, wherein the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object;
inputting the sample source image and the sample reference image into the image generation network to obtain a prediction generation image output by the image generation network, wherein the prediction generation image comprises the sample source object, the posture of the sample source object in the prediction generation image is consistent with the posture of the sample reference object in the sample reference image, and the image generation network is obtained by training with a sample generation source image and a sample generation reference image, the sample generation source image being identical to the sample source image, and the sample generation reference image being identical to the sample reference image;
inputting the prediction generation image into the image restoration network to obtain a prediction restoration image which is output by the image restoration network for the prediction generation image, wherein the edge smoothness of the prediction restoration image is better than that of the prediction generation image;
determining a loss value based on the sample reference image and the predicted repair image; and
adjusting parameters of the image restoration network based on the loss value.
2. The method of claim 1, wherein the sample source image and the sample reference image are different image frames in a video for the same object.
3. The method of claim 1, further comprising: constructing a sample video set comprising a plurality of sample videos, each sample video corresponding to an object;
wherein the sample source image and the sample reference image are different image frames in a sample video for the same object in the sample video set.
4. The method of claim 3, wherein the constructing a sample video set comprises:
acquiring a plurality of original videos;
for each of the plurality of original videos, performing the following:
performing object detection on each image frame in the original video; and
removing, from the original video, the image frames which do not comprise the object, to obtain a sample video.
5. The method of claim 3 or 4, wherein the constructing a sample video set further comprises: adjusting the duration of the sample video to a preset duration.
6. The method of any of claims 3-4, wherein the constructing a sample video set further comprises: adjusting the size of the image frames of the sample video to a preset size.
7. The method of any of claims 1-4, wherein at least one of the image generation network and the image restoration network comprises a generative adversarial network.
8. The method of any of claims 1-4, wherein the sample-generated source image and the sample-generated reference image are different image frames in a video for the same object.
9. A method of generating an image using a neural network, wherein the neural network is trained in accordance with the training method of any one of claims 1-8, the neural network comprising an image generation network and an image restoration network, the method comprising:
inputting a source image and a reference image into the image generation network to obtain a generated image output by the image generation network, wherein the source image comprises a source object, the reference image comprises a reference object, the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image;
inputting the generated image into the image restoration network to obtain a restoration image which is output by the image restoration network and aims at the generated image; and
and taking the repair image as a result image.
10. A method of generating video using a neural network, wherein the neural network is trained in accordance with the training method of any one of claims 1-8, the neural network comprising an image generation network and an image restoration network, the method comprising:
acquiring a source image and a reference video, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object;
for each of the plurality of reference image frames, performing the following:
inputting the source image and the reference image frame into the image generation network to obtain a generated image output by the image generation network, wherein the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image frame; and
inputting the generated image into the image restoration network to obtain a restoration image which is output by the image restoration network and aims at the generated image; and
and splicing the multiple repair images corresponding to the multiple reference image frames respectively to generate a result video.
11. A training apparatus for a neural network, the neural network comprising an image generation network and an image restoration network, the apparatus comprising:
a sample acquisition module configured to acquire a sample source image and a sample reference image, wherein the sample source image comprises a sample source object, and the sample reference image comprises a sample reference object;
a prediction generation module configured to input the sample source image and the sample reference image into the image generation network and obtain a prediction generation image output by the image generation network, wherein the sample source object is included in the prediction generation image, the posture of the sample source object in the prediction generation image is consistent with the posture of the sample reference object in the sample reference image, and the image generation network is obtained by training with a sample generation source image and a sample generation reference image, the sample generation source image being the same as the sample source image, and the sample generation reference image being the same as the sample reference image;
a prediction restoration module configured to input the prediction generation image into the image restoration network and obtain a prediction restoration image, output by the image restoration network, for the prediction generation image, wherein the edge smoothness of the prediction restoration image is superior to that of the prediction generation image;
a loss calculation module configured to determine a loss value based on the sample reference image and the predicted repair image; and
a parameter adjustment module configured to adjust parameters of the image restoration network based on the loss value.
12. An apparatus for generating an image using a neural network, wherein the neural network is trained in accordance with the training method of any one of claims 1-8, the neural network comprising an image generation network and an image restoration network, the apparatus comprising:
an image generation module configured to input a source image and a reference image into the image generation network, and obtain a generated image output by the image generation network, wherein the source image comprises a source object, the reference image comprises a reference object, the generated image comprises the source object, and the pose of the source object in the generated image is consistent with the pose of the reference object in the reference image;
an image restoration module configured to input the generated image into the image restoration network and obtain a restoration image, output by the image restoration network, for the generated image; and
and an image output module configured to take the repair image as a result image.
13. An apparatus for generating video using a neural network, wherein the neural network is trained in accordance with the training method of any one of claims 1-8, the neural network comprising an image generation network and an image restoration network, the apparatus comprising:
an acquisition module configured to acquire a source image and a reference video, wherein the source image comprises a source object, the reference video comprises a plurality of reference image frames, and each reference image frame comprises a reference object;
the generating module comprises an image generating unit and an image restoring unit, wherein,
the image generation unit is configured to perform the following operations for each of the plurality of reference image frames: inputting the source image and the reference image frame into the image generation network to obtain a generated image output by the image generation network, wherein the generated image comprises the source object, and the posture of the source object in the generated image is consistent with the posture of the reference object in the reference image frame;
the image restoration unit is configured to input the generated image into the image restoration network and obtain a restoration image, output by the image restoration network, for the generated image; and
and the video output module is configured to splice a plurality of repair images corresponding to the plurality of reference image frames respectively to generate a result video.
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program which, when executed by the at least one processor, implements the method according to any one of claims 1-10.
15. A non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method according to any one of claims 1-10.
CN202110602135.5A 2021-05-31 2021-05-31 Training method of neural network, method and device for generating images and videos Active CN113326934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110602135.5A CN113326934B (en) 2021-05-31 2021-05-31 Training method of neural network, method and device for generating images and videos

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110602135.5A CN113326934B (en) 2021-05-31 2021-05-31 Training method of neural network, method and device for generating images and videos

Publications (2)

Publication Number Publication Date
CN113326934A CN113326934A (en) 2021-08-31
CN113326934B true CN113326934B (en) 2024-03-29

Family

ID=77422745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110602135.5A Active CN113326934B (en) 2021-05-31 2021-05-31 Training method of neural network, method and device for generating images and videos

Country Status (1)

Country Link
CN (1) CN113326934B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019227479A1 (en) * 2018-06-01 2019-12-05 华为技术有限公司 Method and apparatus for generating face rotation image
CN111783658A (en) * 2020-07-01 2020-10-16 河北工业大学 Two-stage expression animation generation method based on double generation countermeasure network
CN112232220A (en) * 2020-10-19 2021-01-15 戴姆勒股份公司 Method for generating human image, training method and device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liqian Ma et al., "Pose Guided Person Image Generation," arXiv:1705.09368v6, Jan. 28, 2018; Abstract, Section 3, Section 4.1, Fig. 2. *
Wen Liu et al., "Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis," arXiv:1909.12224v2, 2019; Abstract, Section 3, Fig. 1. *
Huang Youwen et al., "Pose-Guided Person Image Generation Incorporating a Feedback Mechanism" (in Chinese), Laser & Optoelectronics Progress, vol. 57, no. 14, Jul. 2020, pp. 141011-1 to 141011-11. *

Also Published As

Publication number Publication date
CN113326934A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
US11790209B2 (en) Recurrent neural networks for data item generation
US10534998B2 (en) Video deblurring using neural networks
WO2020177582A1 (en) Video synthesis method, model training method, device and storage medium
US20200372622A1 (en) Utilizing an image exposure transformation neural network to generate a long-exposure image from a single short-exposure image
US10255681B2 (en) Image matting using deep learning
CN111476871B (en) Method and device for generating video
WO2019242222A1 (en) Method and device for use in generating information
US20180341878A1 (en) Using artificial intelligence and machine learning to automatically share desired digital media
US10163043B2 (en) System and method for facilitating logo-recognition training of a recognition model
CN113240778B (en) Method, device, electronic equipment and storage medium for generating virtual image
CN111738910A (en) Image processing method and device, electronic equipment and storage medium
CN112527115B (en) User image generation method, related device and computer program product
CN112132847A (en) Model training method, image segmentation method, device, electronic device and medium
WO2022160222A1 (en) Defect detection method and apparatus, model training method and apparatus, and electronic device
CN114821734A (en) Method and device for driving expression of virtual character
CN112749685B (en) Video classification method, apparatus and medium
CN113313650A (en) Image quality enhancement method, device, equipment and medium
US20240126810A1 (en) Using interpolation to generate a video from static images
CN111523467B (en) Face tracking method and device
CN111126207A (en) Training method and device of age estimation model and electronic equipment
JP2023545052A (en) Image processing model training method and device, image processing method and device, electronic equipment, and computer program
CN114119935B (en) Image processing method and device
CN116894880A (en) Training method, training model, training device and electronic equipment for text-to-graphic model
CN113223121A (en) Video generation method and device, electronic equipment and storage medium
CN113326934B (en) Training method of neural network, method and device for generating images and videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant