CN114998814B - Target video generation method and device, computer equipment and storage medium


Info

Publication number
CN114998814B
CN114998814B (application number CN202210930311.2A)
Authority
CN
China
Prior art keywords
frame
video
predicted
current frame
initial
Prior art date
Legal status
Active
Application number
CN202210930311.2A
Other languages
Chinese (zh)
Other versions
CN114998814A (en)
Inventor
刘世超
Current Assignee
Guangzhou This Voice Network Technology Co ltd
Original Assignee
Guangzhou This Voice Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou This Voice Network Technology Co ltd
Priority to CN202210930311.2A
Publication of CN114998814A
Application granted
Publication of CN114998814B
Legal status: Active
Anticipated expiration


Classifications

    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06T 3/02 Affine transformations
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N 21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a target video generation method and apparatus, a computer device and a storage medium. The method comprises the following steps: receiving an initial video and a target picture; identifying key points in each image frame of the initial video and in the target picture; determining a current frame and acquiring an associated frame corresponding to the current frame; when the associated frame is occluded, acquiring a predicted frame corresponding to the associated frame; calculating an initial affine transformation matrix according to the current frame, the associated frames without occlusion and the predicted frames corresponding to the occluded associated frames; predicting according to the initial affine transformation matrix and key points in the target picture to obtain a predicted frame corresponding to the current frame; and generating a target video according to the predicted frame corresponding to each current frame. By adopting the method, the accuracy of the predicted frames and the continuity of the motion can both be ensured.

Description

Target video generation method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for generating a target video, a computer device, and a storage medium.
Background
Image motion generation brings more possibilities for rich visual interaction on the internet: given a driving video and a two-dimensional still picture, a motion generation algorithm outputs a corresponding video for the two-dimensional picture.
Conventionally, the motion in the driving video is recognized and applied to the two-dimensional still picture; however, occlusions may exist in the driving video, which can cause the motion in the generated video to be incoherent.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a target video generation method, apparatus, computer device and readable storage medium capable of ensuring the continuity of motion.
In a first aspect, the present application provides a target video generation method, where the method includes:
receiving an initial video and a target picture;
identifying keypoints in each image frame of the initial video and the target picture;
determining a current frame and acquiring an associated frame corresponding to the current frame;
when the associated frame is occluded, acquiring a predicted frame corresponding to the associated frame;
calculating an initial affine transformation matrix according to the current frame, the associated frames without occlusion and the predicted frames corresponding to the occluded associated frames;
predicting according to the initial affine transformation matrix and key points in the target picture to obtain a predicted frame corresponding to the current frame;
and generating a target video according to the prediction frame corresponding to each current frame.
In one embodiment, the method further comprises:
and when the associated frame is not occluded, calculating the initial affine transformation matrix from the current frame and the associated frame.
In one embodiment, the predicting according to the initial affine transformation matrix and the key point in the target picture to obtain a predicted frame corresponding to the current frame includes:
identifying background features in the target picture;
performing motion estimation according to the initial affine transformation matrix and key points in the target picture;
determining a region to be filled according to the motion estimation result;
and carrying out background filling on the area to be filled based on the background characteristics to obtain a predicted frame corresponding to the current frame.
In one embodiment, before obtaining the predicted frame corresponding to the associated frame when the associated frame has an occlusion, the method further includes:
judging whether the associated frame is occluded;
the judging whether the associated frame is occluded includes at least one of the following:
judging whether the associated frame is occluded according to the numbers of key points of the current frame and the associated frame; or
Extracting visual features of the current frame and the associated frame at different scales, and judging whether the associated frame is occluded according to the visual features.
In one embodiment, the obtaining the associated frame corresponding to the current frame includes:
calculating the similarity between the current frame and a preset number of adjacent frames;
when the similarity is larger than or equal to a threshold value, determining the adjacent frame of which the similarity is larger than or equal to the threshold value as an associated frame.
In one embodiment, after determining the current frame, the method further includes:
if the current frame is a first frame, constructing an initial affine transformation matrix according to key points in the first frame and key points in the target picture;
adjusting key points in the target picture according to the initial affine transformation matrix to obtain a predicted frame corresponding to the first frame;
and if the current frame is not the first frame, continuously acquiring the associated frame corresponding to the current frame.
In one embodiment, the predicted frame is predicted by a pre-trained prediction model; the training method of the prediction model comprises the following steps:
acquiring a sample video and a corresponding sample picture;
identifying each frame in the sample video and keypoints in the sample picture;
inputting each frame in the sample video and the key points in the sample picture into an initial model to obtain a sample prediction video;
calculating the similarity between the sample video and the sample prediction video;
and when the similarity between the sample video and the sample prediction video does not meet the requirement, adjusting the initial model until the similarity meets the requirement, so as to obtain the prediction model.
In a second aspect, the present application further provides a target video generating apparatus, including:
the receiving module is used for receiving the initial video and the target picture;
a first identification module for identifying each image frame of the initial video and key points in the target picture;
the relevant frame determining module is used for determining a current frame and acquiring a relevant frame corresponding to the current frame;
the predicted frame determining module is used for acquiring a predicted frame corresponding to the associated frame when the associated frame is occluded;
the initial affine transformation matrix generating module is used for calculating an initial affine transformation matrix according to the current frame, the associated frame without occlusion and the predicted frame corresponding to the associated frame with occlusion;
the prediction module is used for predicting according to the initial affine transformation matrix and key points in the target picture to obtain a prediction frame corresponding to the current frame;
and the generating module is used for generating a target video according to the prediction frame corresponding to each current frame.
In a third aspect, the present application further provides a computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method in any one of the above embodiments when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, which computer program, when executed by a processor, implements the steps of the method in any one of the above-described embodiments.
After the initial video and the target picture are received, the key points in each image frame of the initial video and in the target picture are identified, the associated frame corresponding to the current frame is determined, and whether the associated frame is occluded is judged. If the associated frame is occluded, the predicted frame corresponding to the current frame is predicted through the predicted frame corresponding to the associated frame: an initial affine transformation matrix is calculated according to the current frame, the associated frames without occlusion and the predicted frames corresponding to the occluded associated frames, and prediction is then performed according to the initial affine transformation matrix and the key points in the target picture to obtain the predicted frame corresponding to the current frame. The combination of all the predicted frames is the target video. Because each predicted frame is predicted from the current frame together with its associated frames, multiple image frames are integrated, and the frames used for prediction are confirmed to be free of occlusion, so the predicted frame is accurate and the continuity of the motion is ensured.
Drawings
FIG. 1 is a diagram of an application environment of a target video generation method in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a method for generating a target video in one embodiment;
FIG. 3 is a diagram illustrating the correspondence between image frames and video frames in one embodiment;
FIG. 4 is a complete flowchart of a video generation method in one embodiment;
FIG. 5 is a schematic diagram of a motion estimation network in one embodiment;
FIG. 6 is a block diagram showing the structure of a target video generating apparatus according to an embodiment;
FIG. 7 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to illustrate the present application and are not intended to limit it.
The target video generation method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on the cloud or other network server.
The terminal 102 sends the initial video and the target picture to the server 104. The server 104 identifies the key points in each image frame of the initial video and in the target picture; determines the current frame and acquires the associated frame corresponding to the current frame; when the associated frame is occluded, acquires the predicted frame corresponding to the associated frame; calculates an initial affine transformation matrix according to the current frame, the associated frames without occlusion and the predicted frames corresponding to the occluded associated frames; performs prediction according to the initial affine transformation matrix and the key points in the target picture to obtain the predicted frame corresponding to the current frame; and generates the target video according to the predicted frames corresponding to the current frames.
Therefore, after the initial video and the target picture are received, the key points in each image frame of the initial video and in the target picture are identified first, the associated frame corresponding to the current frame is determined, and whether the associated frame is occluded is judged. If the associated frame is occluded, the predicted frame corresponding to the current frame is predicted through the predicted frame corresponding to the associated frame: an initial affine transformation matrix is calculated according to the current frame, the associated frames without occlusion and the predicted frames corresponding to the occluded associated frames, and the predicted frame corresponding to the current frame is then obtained by prediction according to the initial affine transformation matrix and the key points in the target picture. The combination of all the predicted frames is the target video. Because the predicted frames are predicted from the associated frames together with the current frame, multiple image frames are integrated, and the frames used for prediction are confirmed to be free of occlusion, so the predicted frame is accurate and the continuity of the motion is ensured.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In one embodiment, as shown in fig. 2, a target video generation method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
s202: an initial video and a target picture are received.
Specifically, the initial video is the reference (driving) video, which mainly provides the motion, and the target picture is the corresponding still picture.
Alternatively, the initial video and the target picture may be determined according to a scene, and are not particularly limited herein.
S204: keypoints in each image frame of the initial video and in the target picture are identified.
Specifically, the key points are the key points of the object in each image frame of the initial video and in the target picture, for example a human figure, in which case the key points may be the skeleton key points of the object. In other embodiments, if the object in each image frame of the initial video and in the target picture is of another type, such as a vehicle, the key points may be key points of that type of object, which is not limited here.
The key point extraction network may extract the key points of every image frame of the initial video and of the target picture at one time; or it may first extract the key points of the target picture and of a preset number of image frames of the initial video, and then extract the key points of the remaining image frames in sequence as the computation proceeds; or it may determine, according to the number of threads that can run in parallel, how many image frames can be processed at a time and extract the key points of that many image frames in parallel.
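As an illustration of the batched extraction strategy above, the following Python sketch (hypothetical: the `extract_keypoints` function and the worker count are assumptions, not part of the original disclosure) processes the frames of the initial video in batches whose size matches the number of threads that can run in parallel:
```python
from concurrent.futures import ThreadPoolExecutor

def extract_keypoints(frame):
    # Stand-in for the key point extraction network; assumed to return an
    # (N, 2) array of key point coordinates for a single image frame.
    raise NotImplementedError

def extract_video_keypoints(frames, num_workers=4):
    """Extract key points for every frame, num_workers frames at a time."""
    keypoints = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for start in range(0, len(frames), num_workers):
            batch = frames[start:start + num_workers]
            # Each batch is processed in parallel, one frame per worker thread.
            keypoints.extend(pool.map(extract_keypoints, batch))
    return keypoints
```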
S206: and determining the current frame and acquiring an associated frame corresponding to the current frame.
Specifically, the current frame is the frame to be processed, that is, the frame of the initial video for which the corresponding predicted frame is to be determined; see FIG. 3, where the current frame is image frame A and the predicted frame generated from it is predicted frame A'.
The associated frames are frames close to the current frame, for example a preset number of frames adjacent to the current frame before and after it, where the preset number may be one or two in order to reduce the amount of calculation and is not limited here.
S208: and when the related frame has shielding, acquiring a predicted frame corresponding to the related frame.
Specifically, occlusion means that the object in the associated frame is blocked by another object; it can be determined from the number of key points or from feature maps of different scales. When the number of key points is used, the number of key points in each image frame of the initial video can be counted; if most image frames contain N key points, an image frame whose key point count equals N is considered unoccluded, so whether the associated frame is occluded can be determined from the relation between its key point count and N. In other embodiments, feature maps of different scales can be extracted: the low-level-scale feature maps pay more attention to overall abstract features, while the high-level-scale feature maps pay more attention to detail features such as texture, saturation and color, so object recognition is performed on the feature maps of different scales to determine whether the object is occluded. When the associated frame is occluded, the predicted frame corresponding to it is acquired, because the predicted frame contains no occlusion: it was itself predicted from earlier unoccluded associated frames or predicted frames, so accuracy is ensured.
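A minimal sketch of the key-point-count criterion described above, assuming the key points of every frame of the initial video have already been extracted (N is taken as the most frequent key point count over all frames):
```python
from collections import Counter

def is_occluded(frame_keypoints, all_frame_keypoints):
    """Return True if this frame has fewer key points than the modal count N."""
    counts = [len(kps) for kps in all_frame_keypoints]
    n_expected = Counter(counts).most_common(1)[0][0]  # N: most frequent count
    return len(frame_keypoints) < n_expected
```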
S210: and calculating to obtain an initial affine transformation matrix according to the current frame, the associated frames without occlusion and the predicted frames corresponding to the associated frames with occlusion.
Specifically, the initial affine transformation matrix is the transformation matrix of the motion derived from the initial video, where the current frame gives the target motion and the previous frames give the historical motion; both the current frame and the previous frames may or may not be occluded. For this reason, the current frame and the associated frames are used to calculate the initial affine transformation matrix.
In one optional embodiment, the server may first examine the current frame. If the current frame is not occluded, the initial affine transformation matrix is calculated from the unoccluded associated frames, the predicted frames corresponding to the occluded associated frames, and the current frame: for example, the target positions of the key points are determined from the current frame, the initial positions are determined from the unoccluded associated frames and the predicted frames corresponding to the occluded associated frames, and the initial affine transformation matrix is then determined from the initial positions and the target positions. If the current frame is occluded, the unoccluded part can still be calculated in the above manner; for the occluded part, prediction is performed with the unoccluded associated frames and the predicted frames corresponding to the occluded associated frames, for example, when the associated frames are the frames immediately before and after, the target positions of the occluded part can be obtained as a weighted average of the position information of the associated frames and the current frame. In other embodiments, the initial affine transformation matrix of the occluded part may also be calculated from the initial affine transformation matrix of the unoccluded part, which is not specifically limited here.
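A sketch, under the assumptions above, of how the initial affine transformation matrix could be fitted: the initial key point positions are taken as a weighted average over the unoccluded associated frames and the predicted frames of the occluded ones, the target positions come from the current frame, and a 2x3 affine matrix is fitted by least squares (the uniform weighting is an assumption):
```python
import numpy as np

def fit_affine(initial_pts, target_pts):
    """Least-squares 2x3 affine matrix mapping initial_pts (K, 2) onto target_pts (K, 2)."""
    ones = np.ones((len(initial_pts), 1))
    src = np.hstack([np.asarray(initial_pts, float), ones])   # (K, 3) homogeneous points
    affine, *_ = np.linalg.lstsq(src, np.asarray(target_pts, float), rcond=None)
    return affine.T                                           # (2, 3) affine matrix

def initial_affine_matrix(current_kps, associated_kps_list, weights=None):
    """Initial positions: weighted average over the associated (or predicted) frames."""
    stack = np.stack([np.asarray(k, float) for k in associated_kps_list])  # (F, K, 2)
    if weights is None:
        weights = np.full(len(associated_kps_list), 1.0 / len(associated_kps_list))
    initial_pts = np.tensordot(weights, stack, axes=1)        # (K, 2) weighted average
    return fit_affine(initial_pts, current_kps)
```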
S212: and predicting according to the initial affine transformation matrix and the key points in the target picture to obtain a predicted frame corresponding to the current frame.
Specifically, the key points in the target picture serve as the key points of the first frame, and for subsequent frames the key points of the preceding predicted frames can be referred to; the initial affine transformation matrix therefore represents the motion change between adjacent predicted frames, and the server performs prediction according to the initial affine transformation matrix and the key points in the target picture to obtain the predicted frame corresponding to the current frame.
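For illustration, applying such a 2x3 affine matrix to the key points of the target picture gives the key point positions of the predicted frame (a sketch consistent with the matrix layout assumed above):
```python
import numpy as np

def transform_keypoints(keypoints, affine_2x3):
    """Apply a 2x3 affine matrix to (K, 2) key point coordinates."""
    pts = np.asarray(keypoints, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])        # (K, 3)
    return homogeneous @ np.asarray(affine_2x3, dtype=float).T    # (K, 2) predicted positions
```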
S214: and generating a target video according to the predicted frame corresponding to each current frame.
Specifically, as shown in fig. 3, all the predicted frames are arranged in time sequence to obtain the target video, so that the purpose of changing the still picture according to the motion of the initial video to obtain the target video is achieved.
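A small sketch of assembling the target video (using OpenCV here as an assumption; the frame rate is arbitrary): the predicted frames are written out in time order.
```python
import cv2

def write_target_video(predicted_frames, path, fps=25.0):
    """Write the predicted frames, in time order, to a video file."""
    height, width = predicted_frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    for frame in predicted_frames:          # frames assumed to be BGR uint8 images
        writer.write(frame)
    writer.release()
```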
In the above embodiment, after the initial video and the target picture are received, the key points in each image frame of the initial video and in the target picture are identified first, the associated frame corresponding to the current frame is determined, and whether the associated frame is occluded is judged. If the associated frame is occluded, the predicted frame corresponding to the current frame is predicted through the predicted frame corresponding to the associated frame: an initial affine transformation matrix is calculated according to the current frame, the unoccluded associated frames and the predicted frames corresponding to the occluded associated frames, and the predicted frame corresponding to the current frame is then obtained by prediction according to the initial affine transformation matrix and the key points in the target picture. The combination of all the predicted frames is the target video.
In one embodiment, the method further comprises: when the associated frame is not occluded, calculating the initial affine transformation matrix from the current frame and the associated frame.
Specifically, when the associated frame is not occluded, the initial affine transformation matrix can be calculated directly from the current frame and the associated frame. One case is that the current frame is occluded: the unoccluded part is calculated normally with the associated frame, while the occluded part is first predicted from the associated frame, and the corresponding initial affine transformation matrix is then calculated from the predicted occluded part and the associated frame.
In the above embodiment, it is first determined whether the associated frame is occluded. If not, the initial affine transformation matrix is obtained from the associated frame and the current frame; if so, the predicted frame corresponding to the occluded associated frame is acquired. This ensures that no occlusion exists in the image frames participating in the calculation of the initial affine transformation matrix, which guarantees accuracy and the continuity of the motion.
In one embodiment, predicting according to the initial affine transformation matrix and the key points in the target picture to obtain a predicted frame corresponding to the current frame includes: identifying background features in the target picture; performing motion estimation according to the initial affine transformation matrix and key points in the target picture; determining a region to be filled according to the result of the motion estimation; and performing background filling on the area to be filled based on the background characteristics to obtain a prediction frame corresponding to the current frame.
Specifically, referring to FIG. 4, which is a complete flowchart of the video generation method in one embodiment: because the target picture has a background, when motion estimation is performed according to the initial affine transformation matrix and the key points in the target picture and the motion in the target picture, that is, the position of the object, is modified, a blank area appears where the object has moved away, so the blank area needs to be filled according to the background features of the target picture.
Therefore, in this embodiment, a cascaded (concatenated) tensor built from the detected key points is taken as the input, the background features of the target picture are fused in, and a comprehensive affine transformation matrix is estimated. Specifically, motion estimation can be performed according to the initial affine transformation matrix and the key points in the target picture; the region to be filled is determined from the motion estimation result, an affine transformation matrix corresponding to the region to be filled is generated from the background features of that region, and it is added to the initial affine transformation matrix to obtain the comprehensive affine transformation matrix.
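A crude sketch of the warp-then-fill idea (OpenCV-based, an assumption rather than the patented network): the target picture is warped by the affine matrix, the blank region uncovered by the moved object is detected, and that region is filled from the original background.
```python
import cv2
import numpy as np

def warp_and_fill(target_picture, affine_2x3):
    """Warp the picture by a 2x3 affine matrix and fill the uncovered area from the background."""
    h, w = target_picture.shape[:2]
    m = np.asarray(affine_2x3, dtype=np.float32)
    warped = cv2.warpAffine(target_picture, m, (w, h))
    # Region to be filled: pixels that received no content from the warp.
    coverage = cv2.warpAffine(np.ones((h, w), dtype=np.uint8), m, (w, h))
    to_fill = coverage == 0
    predicted = warped.copy()
    predicted[to_fill] = target_picture[to_fill]   # simple background filling
    return predicted
```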
It should be noted that, in the early stage of network training, because many details are missing, a large number of invalid predicted values, such as zeros, appear in the video prediction result area. These results are meaningless for training the whole network and therefore do not participate in backward gradient propagation, and the network is extremely prone to falling into a local optimum, especially after the softmax layer, resulting in poor quality of the generated result. To solve this problem, dropout regularization is added to the method, which avoids the problem to some extent and increases the robustness of the network.
In the later stage of network training, as the learned features become more richly correlated with the video frames, the dropout operation is removed, which saves computation and accelerates convergence while also yielding a better motion estimation result.
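A minimal PyTorch-style sketch of this schedule (the layer layout, dropout rate, and the point at which dropout is switched off are all assumptions): dropout is active in the early epochs and removed later.
```python
import torch.nn as nn

class MotionBlock(nn.Module):
    def __init__(self, channels=64, dropout_p=0.3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.dropout = nn.Dropout2d(p=dropout_p)   # regularizes the noisy early predictions
        self.use_dropout = True

    def disable_dropout(self):
        # Called in the later training stage to save computation and speed up convergence.
        self.use_dropout = False

    def forward(self, x):
        x = self.conv(x)
        return self.dropout(x) if self.use_dropout else x
```
For example, `disable_dropout()` might be called on every such block once training passes a chosen epoch threshold.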
In the above embodiment, not only the motion estimation of the foreground object but also the background feature are considered, so that the motion estimation and the background feature are combined to generate the predicted frame, on one hand, the motion continuity is ensured, and on the other hand, the relative integrity of the whole picture is ensured.
In one embodiment, before acquiring the predicted frame corresponding to the associated frame when the associated frame is occluded, the method further includes: judging whether the associated frame is occluded; the judging whether the associated frame is occluded includes at least one of the following: judging whether the associated frame is occluded according to the numbers of key points of the current frame and the associated frame; or extracting visual features of the current frame and the associated frame at different scales, and judging whether the associated frame is occluded according to the visual features.
Specifically, occlusion means that the object in the associated frame is blocked by another object; it can be determined from the number of key points or from feature maps of different scales. When the number of key points is used, the number of key points in each image frame of the initial video can be counted; if most image frames contain N key points, an image frame whose key point count equals N is considered unoccluded, so whether the associated frame is occluded can be determined from the relation between its key point count and N. In other embodiments, object recognition can be performed on extracted feature maps of different scales to determine whether the object is occluded, where the low-level-scale feature maps focus more on overall abstract features and the high-level-scale feature maps focus more on detail features such as texture, saturation and color.
Specifically, as shown in fig. 5, the extracted visual features can be fused at different scales by using a two-stage hourglass network with a residual structure. The network first estimates the occlusion areas produced by the motion in the video and generates an affine transformation matrix for the areas predicted to be missing. Because the low-level-scale feature maps focus more on overall abstract features while the high-level-scale feature maps focus more on detail features such as texture, saturation and color, this module processes and fuses the feature maps of different scales separately, so that the fused features carry both coarse-grained global information and fine-grained detail information.
In the embodiment, the occlusion judgment is carried out in different modes, and a foundation is laid for the generation accuracy of the subsequent prediction frame.
In one embodiment, obtaining an associated frame corresponding to a current frame includes: calculating the similarity between the current frame and a preset number of adjacent frames; when the similarity is greater than or equal to the threshold, determining the adjacent frame with the similarity greater than or equal to the threshold as the associated frame.
Specifically, the more associated frames there are, the greater the amount of calculation; moreover, in a real scene the user's scene may change, for example from one dance to another or from indoors to outdoors, in which case an associated frame has little reference value. To improve the reference value of the associated frames, the server therefore calculates the similarity between the current frame and a preset number of adjacent frames, and when the similarity is greater than or equal to the threshold, the adjacent frame whose similarity is greater than or equal to the threshold is determined as an associated frame. In this way adjacent video frames can be used and cross-frame connections established, and because the amount of effective information available for reference increases, problems such as occlusion in video motion prediction are handled effectively.
For example, suppose there are image frames A, B, C, D and E, and the current frame is image frame C. Without similarity calculation, the associated frames would be image frames A, B, D and E; to reduce the computational complexity, however, the server first calculates the similarity between image frame C and image frames A, B, D and E respectively, and deletes the image frames whose similarity is below the threshold, for example image frames B and E, leaving image frames A and D, so that the predicted frame corresponding to the current frame C is predicted from image frame A, image frame D and the current frame C. On the one hand adjacent video frames are used, and on the other hand cross-frame connections are established; because the amount of effective information available for reference increases, problems such as occlusion in video motion prediction are handled effectively.
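A simple sketch of the neighbour filtering in this example (the similarity measure, here a normalized correlation of the flattened frames, and the threshold value are assumptions):
```python
import numpy as np

def frame_similarity(a, b):
    """Normalized correlation between two same-sized frames, in [-1, 1]."""
    a = a.astype(float).ravel() - a.mean()
    b = b.astype(float).ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_associated_frames(frames, current_idx, num_neighbors=2, threshold=0.8):
    """Keep only the neighbouring frames whose similarity to the current frame meets the threshold."""
    lo = max(0, current_idx - num_neighbors)
    hi = min(len(frames), current_idx + num_neighbors + 1)
    return [i for i in range(lo, hi)
            if i != current_idx
            and frame_similarity(frames[current_idx], frames[i]) >= threshold]
```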
In one embodiment, after determining the current frame, the method further includes: if the current frame is a first frame, constructing an initial affine transformation matrix according to key points in the first frame and key points in the target picture; adjusting key points in the target picture according to the initial affine transformation matrix to obtain a predicted frame corresponding to the first frame; and if the current frame is not the first frame, continuously acquiring the associated frame corresponding to the current frame.
Specifically, the step of initializing in this embodiment is to establish a relationship between a key point in the target picture and a key point in the first frame, so as to construct an initial affine transformation matrix according to the key point in the first frame and the key point in the target picture; and adjusting key points in the target picture according to the initial affine transformation matrix to obtain a predicted frame corresponding to the first frame, so that the key points of the target picture are adjusted to initialize the target picture.
Optionally, the server may further perform processing in combination with the background feature during initialization to ensure accuracy of initialization, for example, during initialization, a relationship between a key point in the first frame and a key point in the target picture is established first to perform motion estimation, then a blank area is determined according to the motion estimation, and then the blank area is filled according to the background feature.
In one embodiment, the prediction frame is obtained by predicting through a prediction model obtained by training in advance; the training method of the prediction model comprises the following steps: acquiring a sample video and a corresponding sample picture; identifying each frame in the sample video and key points in the sample picture; inputting each frame in the sample video and the key points in the sample picture into the initial model to obtain a sample prediction video; calculating the similarity of the sample video and the prediction sample video; and when the similarity of the sample video and the prediction sample video does not meet the requirement, adjusting the initial model until the similarity of the sample video and the prediction sample video meets the requirement, and obtaining the prediction model.
Specifically, the predicted frame in this embodiment may be obtained with a pre-trained prediction model. The prediction model is trained on a sample video and a corresponding sample picture: for the key point prediction task, an equivariance loss can be used, and for the motion generation task, the loss between the input driving video and the generated result video can be calculated. The goal is to make this loss as small as possible; the smaller the loss, the closer the generated result video is to the original input video and the more accurate the generated result. The specific formula is as follows:
[formula image not reproduced in the text]
where the two matrices in the formula denote the driving video feature matrix and the predicted video result matrix, respectively.
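Since the formula image is not reproduced, one plausible form of this video loss, written here only as an assumed reconstruction consistent with the description (an elementwise norm between the two matrices), is:
$$L_{video} = \lVert F_{d} - F_{p} \rVert_{1}$$
where $F_{d}$ denotes the driving video feature matrix and $F_{p}$ the predicted video result matrix; the smaller $L_{video}$ is, the closer the generated video is to the input driving video.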
Therefore, in this embodiment, the complete loss function is obtained from the equivariance loss of the key point prediction task together with the loss of the video.
To help those skilled in the art understand the present application, a complete embodiment is provided. The server acquires the initial video and the target picture and then identifies the key points of each image frame in the initial video and of the target picture. The target picture is initialized according to the key points in the first frame of the initial video and the key points of the target picture, so that the target picture is aligned with the first frame of the initial video; optionally, the background features of the target picture are also combined during initialization to fill the blank positions left after alignment.
The subsequent video is then generated: the current frame in the initial video is acquired, together with the associated frames corresponding to it. Optionally, a preset number of image frames adjacent to the current frame before and after it are acquired, their similarity to the current frame is calculated, and the image frames whose similarity meets the threshold requirement are taken as the associated frames.
The usual processing is to match the key points of the current frame and the previous frame to generate an affine transformation matrix, but because occlusion may exist in the current frame and the previous frame, associated frames are introduced. Whether an associated frame is occluded is judged first; if it is occluded, the predicted frame corresponding to it is acquired, and the initial affine transformation matrix is then generated from that predicted frame, the unoccluded associated frames and the current frame, so that motion estimation can be performed according to the initial affine transformation matrix. Specifically, a weighted average can be taken over the positions of the key points in the associated frames, and this average can then be averaged with the positions of the key points in the current frame to obtain the target positions.
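A sketch of the weighted averaging just described, in which the target positions of the occluded key points are obtained by averaging the associated frames' positions and then blending that average with the current frame's positions (the 0.5 blend weight is an assumption):
```python
import numpy as np

def occluded_target_positions(current_kps, associated_kps_list, current_weight=0.5):
    """Estimate target key point positions for the occluded part of the current frame."""
    assoc_mean = np.mean(np.stack([np.asarray(k, float) for k in associated_kps_list]), axis=0)
    current = np.asarray(current_kps, dtype=float)
    # Weighted average of the associated-frame positions and the current-frame positions.
    return current_weight * current + (1.0 - current_weight) * assoc_mean
```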
After motion estimation, background features are also combined to fill in blank positions after motion change.
In the above embodiment, after the initial video and the target picture are received, the key points in each image frame of the initial video and in the target picture are identified first, the associated frame corresponding to the current frame is determined, and whether the associated frame is occluded is judged. If the associated frame is occluded, the predicted frame corresponding to the current frame is predicted through the predicted frame corresponding to the associated frame: an initial affine transformation matrix is calculated according to the current frame, the unoccluded associated frames and the predicted frames corresponding to the occluded associated frames, and the predicted frame corresponding to the current frame is then obtained by prediction according to the initial affine transformation matrix and the key points in the target picture. The combination of all the predicted frames is the target video. Because the predicted frames are predicted from the associated frames together with the current frame, multiple image frames are integrated, and the frames used for prediction are confirmed to be free of occlusion, so the predicted frame is accurate and the continuity of the motion is ensured.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and they are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a target video generation apparatus for implementing the above-mentioned target video generation method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the above method, so specific limitations in one or more embodiments of the target video generating apparatus provided below may refer to the limitations in the above target video generating method, and details are not described herein again.
In one embodiment, as shown in fig. 6, there is provided a target video generating apparatus including: receiving module 601, first identifying module 602, associated frame determining module 603, predicted frame determining module 604, initial affine transformation matrix generating module 605, predicting module 606 and generating module 607, wherein:
a receiving module 601, configured to receive an initial video and a target picture;
a first identification module 602, configured to identify each image frame of the initial video and a key point in the target picture;
an associated frame determining module 603, configured to determine a current frame and obtain an associated frame corresponding to the current frame;
a predicted frame determining module 604, configured to obtain a predicted frame corresponding to the associated frame when the associated frame is occluded;
an initial affine transformation matrix generating module 605, configured to calculate an initial affine transformation matrix according to the current frame, the associated frame without occlusion, and the predicted frame corresponding to the associated frame with occlusion;
the prediction module 606 is configured to perform prediction according to the initial affine transformation matrix and the key points in the target picture to obtain a prediction frame corresponding to the current frame;
the generating module 607 is configured to generate a target video according to the predicted frame corresponding to each current frame.
In one embodiment, the initial affine transformation matrix generating module 605 is further configured to calculate an initial affine transformation matrix from the current frame and the associated frame when there is no occlusion in the associated frame.
In one embodiment, the prediction module 606 includes:
the background feature identification unit is used for identifying background features in the target picture;
the motion estimation unit is used for performing motion estimation according to the initial affine transformation matrix and key points in the target picture;
the device comprises a to-be-filled region determining unit, a motion estimation unit and a motion estimation unit, wherein the to-be-filled region determining unit is used for determining a to-be-filled region according to the motion estimation result;
and the filling unit is used for filling the background of the area to be filled based on the background characteristics to obtain a predicted frame corresponding to the current frame.
In one embodiment, the video generating apparatus further includes:
the judging module is used for judging whether the associated frame is occluded; specifically, whether the associated frame is occluded is judged according to the numbers of key points of the current frame and the associated frame; or visual features of the current frame and the associated frame are extracted at different scales, and whether the associated frame is occluded is judged according to the visual features.
In one embodiment, the association frame determining module 603 includes:
the similarity calculation unit is used for calculating the similarity between the current frame and a preset number of adjacent frames;
and the associated frame calculating unit is used for determining the adjacent frame with the similarity larger than or equal to the threshold as the associated frame when the similarity is larger than or equal to the threshold.
In one embodiment, the video generating apparatus further includes:
the construction module is used for constructing an initial affine transformation matrix according to key points in the first frame and key points in the target picture if the current frame is the first frame;
a predicted frame generation module 607, configured to adjust a key point in the target picture according to the initial affine transformation matrix to obtain a predicted frame corresponding to the first frame; and if the current frame is not the first frame, continuously acquiring the associated frame corresponding to the current frame.
In one embodiment, the prediction frame is obtained by predicting through a prediction model obtained by training in advance; the above video generation apparatus further includes:
the system comprises a sample acquisition module, a video acquisition module and a video processing module, wherein the sample acquisition module is used for acquiring a sample video and a corresponding sample picture;
the second identification module is used for identifying each frame in the sample video and key points in the sample picture;
the prediction module 606 is configured to input each frame in the sample video and the key point in the sample picture into the initial model to obtain a sample prediction video;
the training module is used for calculating the similarity of the sample video and the prediction sample video; and when the similarity of the sample video and the prediction sample video does not meet the requirement, adjusting the initial model until the similarity of the sample video and the prediction sample video meets the requirement, and obtaining the prediction model.
The respective modules in the above target video generating apparatus may be wholly or partially implemented by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the initial video and the target picture. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a target video generation method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as a particular computing device may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, and the processor, when executing the computer program, performs the steps of: receiving an initial video and a target picture; identifying key points in each image frame of the initial video and the target picture; determining a current frame and acquiring an associated frame corresponding to the current frame; when the associated frame is occluded, acquiring a predicted frame corresponding to the associated frame; calculating an initial affine transformation matrix according to the current frame, the associated frames without occlusion and the predicted frames corresponding to the occluded associated frames; predicting according to the initial affine transformation matrix and key points in the target picture to obtain a predicted frame corresponding to the current frame; and generating a target video according to the predicted frame corresponding to each current frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the associated frame is not occluded, calculating the initial affine transformation matrix from the current frame and the associated frame.
In one embodiment, the predicting according to the initial affine transformation matrix and the key points in the target picture to obtain a predicted frame corresponding to the current frame, implemented when the processor executes the computer program, includes: identifying background features in the target picture; performing motion estimation according to the initial affine transformation matrix and the key points in the target picture; determining a region to be filled according to the result of the motion estimation; and carrying out background filling on the region to be filled based on the background features to obtain a predicted frame corresponding to the current frame.
In one embodiment, before acquiring the predicted frame corresponding to the associated frame when the associated frame is occluded, the processor, when executing the computer program, further performs: judging whether the associated frame is occluded; the judging whether the associated frame is occluded, implemented when the processor executes the computer program, includes at least one of the following: judging whether the associated frame is occluded according to the numbers of key points of the current frame and the associated frame; or extracting visual features of the current frame and the associated frame at different scales, and judging whether the associated frame is occluded according to the visual features.
In one embodiment, the obtaining of the associated frame corresponding to the current frame, as implemented by the processor when executing the computer program, comprises: calculating the similarity between the current frame and the adjacent frames with the preset number; when the similarity is greater than or equal to the threshold, determining the adjacent frame with the similarity greater than or equal to the threshold as the associated frame.
In one embodiment, after determining the current frame, the processor, implemented when executing the computer program, further comprises: if the current frame is a first frame, constructing an initial affine transformation matrix according to key points in the first frame and key points in the target picture; adjusting key points in the target picture according to the initial affine transformation matrix to obtain a predicted frame corresponding to the first frame; and if the current frame is not the first frame, continuously acquiring the associated frame corresponding to the current frame.
In one embodiment, the predicted frame involved in the execution of the computer program by the processor is predicted by a pre-trained prediction model; the training method of the prediction model realized when the processor executes the computer program comprises the following steps: acquiring a sample video and a corresponding sample picture; identifying each frame in the sample video and key points in the sample picture; inputting each frame in the sample video and the key points in the sample picture into the initial model to obtain a sample prediction video; calculating the similarity of the sample video and the prediction sample video; and when the similarity of the sample video and the prediction sample video does not meet the requirement, adjusting the initial model until the similarity of the sample video and the prediction sample video meets the requirement, and obtaining the prediction model.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, performs the steps of: receiving an initial video and a target picture; identifying key points in each image frame of the initial video and the target picture; determining a current frame and acquiring an associated frame corresponding to the current frame; when the associated frame is occluded, acquiring a predicted frame corresponding to the associated frame; calculating an initial affine transformation matrix according to the current frame, the associated frames without occlusion and the predicted frames corresponding to the occluded associated frames; predicting according to the initial affine transformation matrix and key points in the target picture to obtain a predicted frame corresponding to the current frame; and generating a target video according to the predicted frame corresponding to each current frame.
In one embodiment, the computer program when executed by the processor further performs the steps of: and when the associated frame has no occlusion, calculating to obtain an initial affine transformation matrix through the current frame and the associated frame.
In one embodiment, the predicting according to the initial affine transformation matrix and the key point in the target picture when the computer program is executed by the processor, and obtaining a predicted frame corresponding to the current frame includes: identifying background features in the target picture; performing action estimation according to the initial affine transformation matrix and the key points in the target picture; determining a region to be filled according to the result of the motion estimation; and performing background filling on the area to be filled based on the background characteristics to obtain a prediction frame corresponding to the current frame.
In one embodiment, before the obtaining of the predicted frame corresponding to the associated frame when the associated frame is occluded, the computer program, when executed by a processor, further implements: judging whether the associated frame is occluded; the judging of whether the associated frame is occluded, as implemented when the processor executes the computer program, comprises at least one of the following: judging whether the associated frame is occluded according to the numbers of key points of the current frame and the associated frame; or extracting visual features of the current frame and the associated frame on different scales, and judging whether the associated frame is occluded according to the visual features.
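Both criteria can be implemented directly: flag occlusion when the associated frame retains too few key points relative to the current frame, or when visual features diverge at any of several scales. In the sketch below the ratio, the scales and the threshold are placeholders chosen for the example, not values taken from the application.

```python
import cv2
import numpy as np

def is_occluded(current_kp, assoc_kp, current_img=None, assoc_img=None,
                keypoint_ratio=0.8, feature_threshold=0.85):
    """Judge occlusion from key point counts and, optionally, from
    multi-scale visual similarity between the two frames."""
    if len(assoc_kp) < keypoint_ratio * len(current_kp):
        return True
    if current_img is not None and assoc_img is not None:
        for scale in (1.0, 0.5, 0.25):           # compare on different scales
            a = cv2.resize(current_img, None, fx=scale, fy=scale).astype(np.float64).ravel()
            b = cv2.resize(assoc_img, None, fx=scale, fy=scale).astype(np.float64).ravel()
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            if denom and (a @ b) / denom < feature_threshold:
                return True
    return False
```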
In one embodiment, the computer program, when executed by a processor, implements obtaining an associated frame corresponding to a current frame, comprising: calculating the similarity between the current frame and a preset number of adjacent frames; and when the similarity is greater than or equal to a threshold, determining the adjacent frame whose similarity is greater than or equal to the threshold as the associated frame.
In one embodiment, the computer program, when executed by the processor, further comprises, after determining the current frame: if the current frame is a first frame, constructing an initial affine transformation matrix according to key points in the first frame and key points in the target picture; adjusting key points in the target picture according to the initial affine transformation matrix to obtain a predicted frame corresponding to the first frame; and if the current frame is not the first frame, continuing to acquire the associated frame corresponding to the current frame.
In one embodiment, the predicted frame involved when the computer program is executed by the processor is predicted by a pre-trained prediction model; the training method of the prediction model, implemented when the computer program is executed by a processor, comprises the following steps: acquiring a sample video and a corresponding sample picture; identifying each frame in the sample video and key points in the sample picture; inputting each frame in the sample video and the key points in the sample picture into an initial model to obtain a sample prediction video; calculating the similarity of the sample video and the sample prediction video; and when the similarity of the sample video and the sample prediction video does not meet the requirement, adjusting the initial model until the similarity meets the requirement, to obtain the prediction model.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of: receiving an initial video and a target picture; identifying key points in each image frame of the initial video and the target picture; determining a current frame and acquiring an associated frame corresponding to the current frame; when the associated frame is occluded, acquiring a predicted frame corresponding to the associated frame; calculating to obtain an initial affine transformation matrix according to the current frame, the associated frame without occlusion and the predicted frame corresponding to the associated frame with occlusion; predicting according to the initial affine transformation matrix and key points in the target picture to obtain a predicted frame corresponding to the current frame; and generating a target video according to the predicted frame corresponding to each current frame.
In one embodiment, the computer program when executed by the processor further performs the steps of: and when the associated frame has no occlusion, calculating to obtain an initial affine transformation matrix through the current frame and the associated frame.
In one embodiment, the predicting according to the initial affine transformation matrix and the key points in the target picture, when the computer program is executed by the processor, to obtain a predicted frame corresponding to the current frame includes: identifying background features in the target picture; performing motion estimation according to the initial affine transformation matrix and the key points in the target picture; determining a region to be filled according to the result of the motion estimation; and performing background filling on the region to be filled based on the background features to obtain a predicted frame corresponding to the current frame.
In one embodiment, before the obtaining of the predicted frame corresponding to the associated frame when the associated frame is occluded, the computer program, when executed by a processor, further implements: judging whether the associated frame is occluded; the judging of whether the associated frame is occluded, as implemented when the processor executes the computer program, includes at least one of: judging whether the associated frame is occluded according to the numbers of key points of the current frame and the associated frame; or extracting visual features of the current frame and the associated frame on different scales, and judging whether the associated frame is occluded according to the visual features.
In one embodiment, the computer program, when executed by a processor, implements obtaining an associated frame corresponding to a current frame, comprising: calculating the similarity between the current frame and a preset number of adjacent frames; and when the similarity is greater than or equal to a threshold, determining the adjacent frame whose similarity is greater than or equal to the threshold as the associated frame.
In one embodiment, the computer program, when executed by a processor, further comprises, after determining the current frame: if the current frame is a first frame, constructing an initial affine transformation matrix according to key points in the first frame and key points in the target picture; adjusting key points in the target picture according to the initial affine transformation matrix to obtain a predicted frame corresponding to the first frame; and if the current frame is not the first frame, continuing to acquire the associated frame corresponding to the current frame.
In one embodiment, the predicted frame involved when the computer program is executed by the processor is predicted by a pre-trained prediction model; the training method of the prediction model, implemented when the computer program is executed by the processor, comprises the following steps: acquiring a sample video and a corresponding sample picture; identifying each frame in the sample video and key points in the sample picture; inputting each frame in the sample video and the key points in the sample picture into an initial model to obtain a sample prediction video; calculating the similarity of the sample video and the sample prediction video; and when the similarity of the sample video and the sample prediction video does not meet the requirement, adjusting the initial model until the similarity meets the requirement, to obtain the prediction model.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, databases, or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. The non-volatile memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, a high-density embedded non-volatile memory, a Resistive Random Access Memory (ReRAM), a Magnetic Random Access Memory (MRAM), a Ferroelectric Random Access Memory (FRAM), a Phase Change Memory (PCM), a graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum-computing-based data processing logic devices, etc., without limitation.
For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered as being within the scope of the present disclosure.
The above-mentioned embodiments express only several implementations of the present application, and while their description is specific and detailed, it should not be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of target video generation, the method comprising:
receiving an initial video and a target picture;
identifying keypoints in each image frame of the initial video and the target picture;
determining a current frame and acquiring an associated frame corresponding to the current frame;
when the associated frame is occluded, acquiring a predicted frame corresponding to the associated frame;
calculating to obtain an initial affine transformation matrix according to the current frame, the associated frame without occlusion and a predicted frame corresponding to the associated frame with occlusion, wherein the predicted frame corresponding to the associated frame with occlusion is predicted according to a previous associated frame without occlusion or a previous predicted frame;
predicting according to the initial affine transformation matrix and key points in the target picture to obtain a predicted frame corresponding to the current frame;
generating a target video according to the predicted frame corresponding to each current frame;
the predicting according to the initial affine transformation matrix and the key points in the target picture to obtain a predicted frame corresponding to the current frame comprises:
identifying background features in the target picture;
performing motion estimation according to the initial affine transformation matrix and key points in the target picture;
determining a region to be filled according to the motion estimation result;
performing background filling on the region to be filled based on the background features to obtain a prediction frame corresponding to the current frame;
when the associated frame is occluded, before the obtaining of the predicted frame corresponding to the associated frame, the method further includes:
judging whether the associated frame is occluded;
the judging whether the associated frame is occluded includes at least one of the following:
judging whether the associated frame is occluded according to the number of the key points of the current frame and the associated frame; or
extracting visual features of the current frame and the associated frame on different scales; and judging whether the associated frame is occluded according to the visual features.
2. The method of claim 1, further comprising:
and when the associated frame is not occluded, calculating to obtain an initial affine transformation matrix through the current frame and the associated frame.
3. The method of claim 1, wherein obtaining the associated frame corresponding to the current frame comprises:
calculating the similarity between the current frame and a preset number of adjacent frames;
when the similarity is larger than or equal to a threshold value, determining the adjacent frame of which the similarity is larger than or equal to the threshold value as an associated frame.
4. The method of claim 1, wherein after determining the current frame, further comprising:
if the current frame is a first frame, constructing an initial affine transformation matrix according to key points in the first frame and key points in the target picture;
adjusting key points in the target picture according to the initial affine transformation matrix to obtain a predicted frame corresponding to the first frame;
and if the current frame is not the first frame, continuing to acquire the associated frame corresponding to the current frame.
5. The method according to claim 1, wherein the predicted frame is predicted by a pre-trained prediction model; the training method of the prediction model comprises the following steps:
acquiring a sample video and a corresponding sample picture;
identifying each frame in the sample video and keypoints in the sample picture;
inputting each frame in the sample video and the key points in the sample pictures into an initial model to obtain a sample prediction video;
calculating the similarity of the sample video and the sample prediction video;
and when the similarity of the sample video and the sample prediction video does not meet the requirement, adjusting the initial model until the similarity of the sample video and the sample prediction video meets the requirement, and obtaining a prediction model.
6. A target video generation apparatus, the apparatus comprising:
the receiving module is used for receiving the initial video and the target picture;
a first identification module for identifying each image frame of the initial video and a key point in the target picture;
the relevant frame determining module is used for determining a current frame and acquiring a relevant frame corresponding to the current frame;
the predicted frame determining module is used for acquiring a predicted frame corresponding to the associated frame when the associated frame is occluded;
an initial affine transformation matrix generating module, configured to calculate an initial affine transformation matrix according to the current frame, the associated frame without occlusion, and a predicted frame corresponding to the associated frame with occlusion, where the predicted frame corresponding to the associated frame with occlusion is predicted according to a previous associated frame without occlusion or a previous predicted frame;
the prediction module is used for predicting according to the initial affine transformation matrix and the key points in the target picture to obtain a prediction frame corresponding to the current frame;
the generating module is used for generating a target video according to the prediction frame corresponding to each current frame;
the prediction module comprises:
the background feature identification unit is used for identifying the background features in the target picture;
the motion estimation unit is used for performing motion estimation according to the initial affine transformation matrix and key points in the target picture;
a region to be filled determining unit, configured to determine a region to be filled according to a result of the motion estimation;
the filling unit is used for carrying out background filling on the area to be filled based on the background characteristics to obtain a prediction frame corresponding to the current frame;
the video generation apparatus further includes:
the judging module is used for judging whether the associated frame is occluded; the judging module judges whether the associated frame is occluded through at least one of the following methods: judging whether the associated frame is occluded according to the number of the key points of the current frame and the associated frame; or extracting visual features of the current frame and the associated frame on different scales, and judging whether the associated frame is occluded according to the visual features.
7. The apparatus of claim 6, wherein the initial affine transformation matrix generating module is further configured to calculate an initial affine transformation matrix from the current frame and the associated frame when there is no occlusion in the associated frame.
8. The apparatus of claim 6, wherein the association frame determination module comprises:
the similarity calculation unit is used for calculating the similarity between the current frame and a preset number of adjacent frames;
and the associated frame calculating unit is used for determining the adjacent frame with the similarity larger than or equal to the threshold as the associated frame when the similarity is larger than or equal to the threshold.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202210930311.2A 2022-08-04 2022-08-04 Target video generation method and device, computer equipment and storage medium Active CN114998814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210930311.2A CN114998814B (en) 2022-08-04 2022-08-04 Target video generation method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210930311.2A CN114998814B (en) 2022-08-04 2022-08-04 Target video generation method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114998814A CN114998814A (en) 2022-09-02
CN114998814B true CN114998814B (en) 2022-11-15

Family

ID=83023258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210930311.2A Active CN114998814B (en) 2022-08-04 2022-08-04 Target video generation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114998814B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689781A (en) * 2023-12-18 2024-03-12 北京开普云信息科技有限公司 Method, device, storage medium and equipment for driving action of target object

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107094234A (en) * 2017-06-29 2017-08-25 浙江宇视科技有限公司 A kind of shooting area occlusion method and device applied to dollying terminal

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533700B (en) * 2019-08-30 2023-08-29 腾讯科技(深圳)有限公司 Object tracking method and device, storage medium and electronic device
CN111797753B (en) * 2020-06-29 2024-02-27 北京灵汐科技有限公司 Training of image driving model, image generation method, device, equipment and medium
CN113449696B (en) * 2021-08-27 2021-12-07 北京市商汤科技开发有限公司 Attitude estimation method and device, computer equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107094234A (en) * 2017-06-29 2017-08-25 浙江宇视科技有限公司 A kind of shooting area occlusion method and device applied to dollying terminal

Also Published As

Publication number Publication date
CN114998814A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
US20210158023A1 (en) System and Method for Generating Image Landmarks
CN111627045B (en) Multi-pedestrian online tracking method, device and equipment under single lens and storage medium
Ren et al. Deep Robust Single Image Depth Estimation Neural Network Using Scene Understanding.
WO2022022154A1 (en) Facial image processing method and apparatus, and device and storage medium
CN114339409B (en) Video processing method, device, computer equipment and storage medium
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN114331829A (en) Countermeasure sample generation method, device, equipment and readable storage medium
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
CN112101344B (en) Video text tracking method and device
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
CN114998814B (en) Target video generation method and device, computer equipment and storage medium
CN114419406A (en) Image change detection method, training method, device and computer equipment
Xie et al. pmbqa: Projection-based blind point cloud quality assessment via multimodal learning
CN117037244A (en) Face security detection method, device, computer equipment and storage medium
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
CN115439367A (en) Image enhancement method and device, electronic equipment and storage medium
CN111709945B (en) Video copy detection method based on depth local features
CN115115972A (en) Video processing method, video processing apparatus, computer device, medium, and program product
CN113569809A (en) Image processing method, device and computer readable storage medium
Sun et al. Real-time memory efficient large-pose face alignment via deep evolutionary network
CN117241065B (en) Video plug-in frame image generation method, device, computer equipment and storage medium
Raj Learning Augmentation Policy Schedules for Unsuperivsed Depth Estimation
Feng et al. Optimized S2E Attention Block based Convolutional Network for Human Pose Estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant