CN116580707A - Method and device for generating action video based on voice - Google Patents

Method and device for generating action video based on voice Download PDF

Info

Publication number
CN116580707A
Authority
CN
China
Prior art keywords
action
key
frames
frame sequence
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310558500.6A
Other languages
Chinese (zh)
Inventor
许靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202310558500.6A priority Critical patent/CN116580707A/en
Publication of CN116580707A publication Critical patent/CN116580707A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Social Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a method, apparatus, electronic device, and storage medium for generating an action video based on speech. The method comprises: extracting audio features from an input speech signal; inputting the extracted audio features into a trained key action prediction model to obtain semantic categories of predicted key actions corresponding to speech data frames of the speech signal; obtaining, from a video training set, a first key action frame sequence matching the speech data frames of the speech signal based on the semantic categories of the predicted key actions; inputting the obtained first key action frame sequence into a trained gesture completion model to obtain a completed action frame sequence; and generating an action video based on the completed action frame sequence. The method avoids directly regressing the uncertain mapping between speech and human body actions, so that the generated actions have good realism.

Description

Method and device for generating action video based on voice
Technical Field
The present disclosure relates to the field of computer technology, and more particularly, to a method, apparatus, electronic device, and storage medium for generating motion video based on speech, and a method, apparatus, electronic device, and storage medium for training a model for generating motion video based on speech.
Background
The aim of virtual human technology is to construct, by technical means, a virtual human capable of replacing a real person, bringing consumers a viewing experience comparable to that of a real person in live streaming, broadcasting, and other scenarios. Typical application scenarios include livestream selling, 7×24-hour unmanned livestreaming, short video production, and the like. The general form of this technique is to take a segment of speech as input and generate a matching speaker video. A person often makes natural limb movements while speaking. These limb actions help organize and present the language content and give the viewer a better listening experience.
Related-art methods for generating actions from speech use a convolutional neural network or a sequence model, take speech features as input, directly regress the speaker's body pose (generally represented as the positions of limb key points), and train a deterministic regression model with a mean squared error supervision signal. However, because the association between limb motion and speech is non-deterministic, the same segment of speech may correspond to multiple different action sequences, and different speech may correspond to the same action sequence. Accordingly, the related-art schemes mainly have the following problems: 1. the generated limb actions have poor realism, e.g., severe deformation of the limbs, poor temporal stability, and severe jitter of the actions; 2. the generated actions match the semantics of the speech poorly.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device and storage medium for generating motion video based on speech, a method and apparatus for training a model for generating motion video based on speech, an electronic device and storage medium.
According to a first aspect of an embodiment of the present disclosure, there is provided a training method of an action video generation model, wherein the action video generation model includes a key action prediction model and a posture completion model, the method including: extracting audio features of speech data frames of a speaking video in a video training set, obtaining semantic categories of predicted key actions from the key action prediction model based on the extracted audio features, and adjusting the key action prediction model according to differences between the semantic categories of the predicted key actions and the semantic categories of the key actions marked for the speech data frames to obtain a trained key action prediction model, wherein the semantic categories of the key actions are predefined for expressing specific semantics of actions; acquiring a first action frame sequence representing actions of a speaker in the speaking video, inputting key action frames and partial non-key action frames corresponding to the key actions in the first action frame sequence into the gesture completion model to output a second action frame sequence after completion, and adjusting the gesture completion model based on the difference between the second action frame sequence and the first action frame sequence to obtain a trained gesture completion model; and obtaining the action video generation model based on the trained key action prediction model and the trained gesture completion model.
According to a first aspect of the embodiments of the present disclosure, the speaking video is marked with a semantic category of the key action and a start frame number and an end frame number of the key action, wherein the inputting the key action frame and a part of non-key action frames corresponding to the key action in the first action frame sequence into the gesture completion model includes: based on the starting frame sequence number and the ending frame sequence number of the key action, randomly shielding a first part of frames in the part of non-key action frames from the first action frame sequence, and inputting the rest of second part of frames in the part of non-key action frames and the key action frames into a gesture completion model.
According to a first aspect of embodiments of the present disclosure, the obtaining the semantic category of the predicted key action from the key action prediction model based on the extracted audio features includes: and acquiring a semantic category predicted by the key action prediction model for a previous voice data frame of the voice data frame, and inputting the extracted audio features and the predicted semantic category of the previous voice data frame into the key action prediction model to output the semantic category of the predicted key action corresponding to the voice data frame, wherein the previous voice data frame is a voice data frame positioned before the voice data frame in the voice data frame sequence of the speaking video.
According to a first aspect of embodiments of the present disclosure, the key action prediction model is a semantic recognition model with an attention mechanism, wherein the adjusting the key action prediction model according to a difference between a semantic class of a predicted key action and a semantic class of a key action for the speech data frame marker comprises: parameters of the critical action prediction model are adjusted by a cross entropy loss function constructed based on the semantic class of the predicted critical action and the semantic class of the critical action for the speech data frame markers.
According to a first aspect of embodiments of the present disclosure, the obtaining a first sequence of action frames representing actions of a speaker in the speaking video includes: and acquiring the coordinates of key points of a human body in video frames in the speaking video, and generating the first action frame sequence based on a sequence representing the acquired coordinates of the key points of the video frames.
According to a first aspect of the embodiments of the present disclosure, the gesture completion model is a fully convolutional network of the type used for semantics-based image segmentation, wherein adjusting the gesture completion model based on the difference between the output second action frame sequence and the first action frame sequence includes: constructing an absolute difference loss function using action frames of the first action frame sequence and corresponding predicted action frames in the second action frame sequence, and constructing a differential loss function based on the difference between two adjacent action frames in the second action frame sequence; and adjusting parameters of the gesture completion model using the absolute difference loss function and the differential loss function.
According to a first aspect of embodiments of the present disclosure, the constructing an absolute difference loss function using motion frames based on the first sequence of motion frames and corresponding predicted motion frames in the second sequence of motion frames comprises: constructing a first loss function term based on absolute differences between partial action frames in the first action frame sequence and corresponding predicted action frames in the second action frame sequence, and giving a preset weight to the first loss function term, wherein the partial action frames are a preset number of action frames in the boundary range of key action frames and non-key action frames; constructing a second loss function term based on absolute differences between the rest of the motion frames in the first motion sequence except the part of motion frames and corresponding predicted motion frames in the second motion sequence; the absolute difference loss function is obtained based on a first loss function term and the second loss function term weighted by the predetermined weight.
According to a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for an action video generation model, wherein the model includes a key action prediction model and a posture completion model, the training apparatus including: a key action prediction model training unit configured to extract audio features of a speech data frame of a speaking video in a video training set, obtain a semantic class of a predicted key action from the key action prediction model based on the extracted audio features, and adjust the key action prediction model according to a difference between the semantic class of the predicted key action and the semantic class of the key action marked for the speech data frame to obtain a trained key action prediction model, wherein the semantic class of the key action is predefined for expressing a specific semantic of an action; the gesture completion model training unit is configured to acquire a first action frame sequence representing actions of a speaker in the speaking video, input key action frames and partial non-key action frames corresponding to the key actions in the first action frame sequence into the gesture completion model to output a second action frame sequence after completion, and adjust the gesture completion model based on differences between the second action frame sequence and the first action frame sequence to obtain a trained gesture completion model; and the action video generation model unit is configured to obtain the action video generation model based on the trained key action prediction model and the trained gesture complement model.
According to a second aspect of embodiments of the present disclosure, the spoken video is labeled with a semantic category of the key action and a start frame number and an end frame number of the key action, wherein the pose completion model training unit is configured to: based on the starting frame sequence number and the ending frame sequence number of the key action, randomly shielding a first part of frames in the part of non-key action frames from the first action frame sequence, and inputting the rest of second part of frames in the part of non-key action frames and the key action frames into a gesture completion model.
According to a second aspect of embodiments of the present disclosure, the key-action prediction model training unit is configured to obtain a semantic category predicted by the key-action prediction model for a preceding speech data frame of the speech data frame, and input the extracted audio features and the predicted semantic category of the preceding speech data frame into the key-action prediction model to output the semantic category of the predicted key-action corresponding to the speech data frame, wherein the preceding speech data frame is a speech data frame preceding the speech data frame in the sequence of speech data frames of the speech video.
According to a second aspect of embodiments of the present disclosure, the critical action prediction model is a semantic recognition model with an attention mechanism, wherein the critical action prediction training unit is configured to: parameters of the critical action prediction model are adjusted by a cross entropy loss function constructed based on the semantic class of the predicted critical action and the semantic class of the critical action for the speech data frame markers.
According to a second aspect of embodiments of the present disclosure, the gesture completion model training unit is configured to obtain key point coordinates of a human body in a video frame in the spoken video, and to generate the first sequence of action frames based on a sequence representing the key point coordinates of the obtained video frame.
According to a second aspect of the embodiments of the present disclosure, the gesture completion model is a fully convolutional network of the type used for semantics-based image segmentation, and the gesture completion model training unit is configured to construct an absolute difference loss function using action frames of the first action frame sequence and corresponding predicted action frames in the second action frame sequence, construct a differential loss function based on the differences between two adjacent action frames in the second action frame sequence, and adjust parameters of the gesture completion model using the absolute difference loss function and the differential loss function.
According to a second aspect of embodiments of the present disclosure, the pose completion model training unit is configured to construct a first loss function term based on an absolute difference between a partial motion frame in the first motion frame sequence and a corresponding predicted motion frame in the second motion frame sequence, and to assign a predetermined weight to the first loss function term, wherein the partial motion frame is a predetermined number of motion frames within a boundary range of a critical motion frame and a non-critical motion frame; constructing a second loss function term based on absolute differences between the rest of the motion frames in the first motion sequence except the part of motion frames and corresponding predicted motion frames in the second motion sequence; the absolute difference loss function is obtained based on a first loss function term and the second loss function term weighted by the predetermined weight.
According to a third aspect of embodiments of the present disclosure, there is provided a method of generating motion video based on speech, comprising: extracting audio features from an input speech signal; inputting the extracted audio features into a key action prediction model trained by the method according to the first aspect of the embodiments of the present disclosure, so as to obtain semantic categories of predicted key actions corresponding to voice data frames of the voice signal; obtaining a first key action frame sequence matched with a voice data frame of a voice signal from a video training set based on the semantic category of the predicted key action; inputting the obtained first key action frame sequence into a gesture complement model trained by the method according to the first aspect of the embodiment of the disclosure to obtain a complement action frame sequence; and generating an action video based on the complemented action frame sequence.
According to a third aspect of embodiments of the present disclosure, deriving a sequence of key action frames matching a speech data frame of a speech signal from a video training set based on a semantic class of a predicted key action comprises: based on the semantic category of the voice data frame output by the key action prediction model, obtaining the semantic category of the predicted key action, and a starting frame sequence number and an ending frame sequence number; retrieving a plurality of candidate key action frame sequences of the same category as the semantic category of the predicted key action from the video training set; determining the length of a predicted key action frame sequence according to the starting frame sequence and the ending frame sequence of the predicted key action, and selecting a candidate key action frame sequence which is most matched with the length of the key action frame sequence from the candidate key action frame sequences; and interpolating action frames in the candidate key action frame sequence to obtain a first key action frame sequence with the same length as the predicted key action frame sequence.
According to a fourth aspect of embodiments of the present disclosure, there is provided an apparatus for generating motion video based on speech, comprising: a feature extraction unit configured to extract an audio feature from an input speech signal; a key action prediction unit configured to input the extracted audio features into the trained key action prediction model according to the first aspect of the embodiments of the present disclosure, to obtain semantic categories of predicted key actions corresponding to speech data frames of a speech signal; a matching unit configured to obtain a first key action frame sequence matching a speech data frame of the speech signal from the video training set based on a semantic class of the predicted key action; a gesture completion unit configured to input the obtained first key action frame sequence into a gesture completion model trained according to the method described in the first aspect of the embodiments of the present disclosure, so as to obtain a completed action frame sequence; and a video generation unit configured to generate an action video based on the completed action frame sequence.
According to a fourth aspect of embodiments of the present disclosure, the matching unit is configured to: based on the semantic category of the voice data frame output by the key action prediction model, obtaining the semantic category of the predicted key action, and a starting frame sequence number and an ending frame sequence number; retrieving a plurality of candidate key action frame sequences of the same category as the semantic category of the predicted key action from the video training set; determining the length of a predicted key action frame sequence according to the starting frame sequence and the ending frame sequence of the predicted key action, and selecting a candidate key action frame sequence which is most matched with the length of the key action frame sequence from the candidate key action frame sequences; and interpolating action frames in the candidate key action frame sequence to obtain a first key action frame sequence with the same length as the predicted key action frame sequence.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method according to the first aspect of the embodiments of the present disclosure and/or the method according to the third aspect of the embodiments of the present disclosure.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method according to the first aspect of the embodiments of the present disclosure and/or the method according to the third aspect of the embodiments of the present disclosure.
According to a seventh aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect of the disclosed embodiments and/or the method according to the third aspect of the disclosed embodiments.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the semantic category of human body actions is used as intermediate representation, the association between the human body actions and the input voice is constructed, the uncertainty mapping relation between the direct return voice and the human body actions is avoided, and the actions among key actions are complemented by taking the action sequence of the actual speaking video training set as guidance, so that the generated actions have good sense of reality, and the time sequence consistency and the matching property with voice of the limb actions of the speaker of the generated action video can be effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a basic structural diagram illustrating a model used in a speech-based video generation method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a method of training a model for speech-based generation of motion video in accordance with an exemplary embodiment of the present disclosure;
FIG. 3 is a labeled schematic diagram illustrating training samples for training a model based on speech generated motion video according to an example embodiment of the disclosure;
FIG. 4 is a schematic diagram illustrating a critical action prediction model according to an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a pose completion model according to an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a sequence of action frames according to an exemplary embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating a method of generating video based on speech according to an exemplary embodiment of the present disclosure;
Fig. 8 is a block diagram illustrating a device for generating video based on voice according to an exemplary embodiment of the present disclosure;
FIG. 9 is a block diagram illustrating an apparatus for training a model for speech-based generation of motion video in accordance with an exemplary embodiment of the present disclosure;
fig. 10 is a schematic diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The embodiments described in the examples below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" refers to a case where three types of juxtaposition including "any one of the items", "a combination of any of the items", "an entirety of the items" are included. For example, "including at least one of a and B" includes three cases side by side as follows: (1) comprises A; (2) comprising B; (3) includes A and B. For example, "at least one of the first and second steps is executed", that is, three cases are juxtaposed as follows: (1) performing step one; (2) executing the second step; (3) executing the first step and the second step.
Fig. 1 illustrates a basic structural diagram of a model used for a speech-based video generation method according to an exemplary embodiment of the present disclosure.
It should be appreciated that the method of generating video based on speech according to exemplary embodiments of the present disclosure may be implemented on a terminal device such as a cell phone, tablet, desktop, laptop, handheld computer, notebook, netbook, personal digital assistant (personal digital assistant, PDA), augmented reality (augmented reality, AR)/Virtual Reality (VR) device. Various video playing applications such as a short video application, a live application, a social application, a video conference application, an online education application, etc. may be run on the terminal, and the method of generating a video based on voice is implemented in the application. In addition, the method according to the exemplary embodiments of the present disclosure may be performed on a server, so as to connect with the server through a network and transmit a voice signal to the server when an application is run on a terminal device, and obtain a video generated from the voice signal from the server.
As shown in fig. 1, a model used for a speech-based video generation method according to an exemplary embodiment of the present disclosure includes two parts: a key motion prediction model and a posture completion model.
Considering the uncertainty of the mapping from speech to the speaker's body pose, the scheme of the present disclosure divides the generation of the speaker's body pose from speech into two stages: in the first stage, the key action prediction model predicts the semantic categories of the speaker's key actions (e.g., limb gestures) from the speech, which avoids directly fitting the speaker's body pose from the speech; in the second stage, the gesture completion model generates all of the speaker's actions according to the action category information predicted in the first stage.
As shown in fig. 1, the key action prediction model can identify the semantic categories of three key actions for a segment of speech signal. According to an exemplary embodiment of the present disclosure, the semantic categories of key actions are predefined for expressing the specific semantics of actions. A speaker's limb actions can usually express specific semantics, so semantic categories of key actions can be identified for the speech signal. For example, the motions of the speaker's left and right hands may be identified, each left-hand or right-hand motion recorded as a key action, and the semantic category of the key action determined in combination with the semantics expressed by the speaker's speech. The semantic categories may be predetermined, for example categories such as expressing an invitation, expressing a number, or expressing a direction. The defined semantic categories may form a semantic category library, and the speaker's key actions can then be mapped to the corresponding categories in this library. Next, the gesture completion model can complete the actions of the blank (masked) portions between the key actions to form a complete and fluent action sequence. A method for training the key action prediction model and the gesture completion model will be described with reference to fig. 2 to 6.
Fig. 2 is a flowchart illustrating a method for training a speech-based generation of action video according to an exemplary embodiment of the present disclosure.
First, in step S201, audio features of speech data frames of a speaking video in a video training set are extracted, semantic categories of predicted key actions are obtained from a key action prediction model based on the extracted audio features, and the key action prediction model is adjusted according to differences between the semantic categories of the predicted key actions and the semantic categories of the key actions marked for the speech data frames.
Here, the video training set consists of real-shot speaking videos (e.g., lecture videos). For a speaking video, the semantic category of each key action (e.g., limb action) that the speaker makes in the video, as well as the sequence numbers of the start and end frames of the key action, may be annotated.
Fig. 3 illustrates a screenshot of one key action in a speaking video of a training sample and the corresponding annotation information, according to an exemplary embodiment of the present disclosure. As shown in fig. 3, the speaker in the video raises the right hand, and from the speaker's speaking content it can be determined that the speaker is greeting someone. Thus, the sequence numbers of the start and end frames of the right-hand action can be determined from the right-hand motion curve at the lower right of fig. 3, and this action can be labeled "A3:5-65", i.e., the action corresponds to semantic category A3 (greeting), with the 5th frame as its start frame and the 65th frame as its end frame.
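As a non-limiting illustration of this annotation format and of a predefined semantic category library, the following Python sketch parses a label of the form "A3:5-65" and looks up its category name. The concrete code-to-name mapping and the helper names are assumptions made only for illustration; apart from A3 (greeting) and categories such as expressing an invitation, a number, or a direction, they are not defined by this disclosure.

```python
from typing import NamedTuple

# Hypothetical semantic category library; labels other than "greeting" (A3)
# are illustrative assumptions based on the examples given above.
SEMANTIC_CATEGORIES = {
    "A1": "express invitation",
    "A2": "express number",
    "A3": "greeting",
    "A4": "express direction",
}

class KeyActionAnnotation(NamedTuple):
    category_code: str   # e.g. "A3"
    start_frame: int     # sequence number of the start frame
    end_frame: int       # sequence number of the end frame

def parse_annotation(label: str) -> KeyActionAnnotation:
    """Parse a label such as 'A3:5-65' into its category code and frame range."""
    code, frame_range = label.split(":")
    start, end = frame_range.split("-")
    return KeyActionAnnotation(code, int(start), int(end))

ann = parse_annotation("A3:5-65")
print(ann, SEMANTIC_CATEGORIES[ann.category_code])  # ('A3', 5, 65) greeting
```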
According to an exemplary embodiment of the present disclosure, for the audio signal of a speaking video in the video training set, specific audio features may be extracted as the input of the key action prediction model. Various audio feature extraction methods may be employed, such as Mel-frequency cepstral coefficients (MFCC) or intermediate-layer features of a speech recognition model (e.g., DeepSpeech), without limitation.
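As a non-limiting sketch of the MFCC option mentioned above, the following Python code computes one MFCC vector per video frame. The use of the librosa library, the 16 kHz sampling rate, and the hop length chosen to align audio frames with video frames are assumptions for illustration rather than part of the disclosure.

```python
import librosa
import numpy as np

def extract_mfcc_features(wav_path: str, video_fps: int = 25, n_mfcc: int = 13) -> np.ndarray:
    """Extract one MFCC feature vector per video frame (returns shape [T, n_mfcc])."""
    audio, sr = librosa.load(wav_path, sr=16000)
    hop_length = sr // video_fps  # one audio hop per video frame
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    return mfcc.T  # [T, n_mfcc], one row per frame
```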
According to an exemplary embodiment of the present disclosure, obtaining the semantic category of the predicted key action from the key action prediction model based on the extracted audio features includes: obtaining the semantic category predicted by the key action prediction model for the speech data frame preceding the current speech data frame, and inputting the extracted audio features together with that predicted semantic category into the key action prediction model to output the semantic category of the predicted key action corresponding to the current speech data frame, wherein the preceding speech data frame is the speech data frame located before the current one in the speech data frame sequence of the speaking video. By feeding the information of the previous frame into the prediction model, the input of the model is enriched, and a more accurate output for the current frame can be obtained by using the previous frame as a reference.
The key action prediction model according to an exemplary embodiment of the present disclosure may be a semantic recognition model with an attention mechanism. For example, the key action prediction model may employ an autoregressive sequence-to-sequence (seq2seq) structure based on a Transformer network. The Transformer network structure gives the model the ability to learn over long sequences, and the autoregressive seq2seq structure improves the temporal consistency of the model. Thus, the input of the key action prediction model according to an exemplary embodiment of the present disclosure consists of the extracted audio features of the current audio frame and the model output of the previous time step (i.e., the previous audio frame), and the output is the action semantic category for each time step.
As shown in fig. 4, the input of the key action prediction model with the Transformer structure is the semantic category c_{i-1} of the frame preceding the current speech frame i and the audio feature a_i of the current speech frame, i = 0, 1, 2, ..., T, where T denotes the number of frames of the audio signal; the output is the semantic category c_i for each audio frame i.
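As a non-limiting sketch of such an autoregressive classifier, the following PyTorch code takes a_i and c_{i-1} per frame and outputs per-frame category logits. The layer sizes, and the use of a causally masked TransformerEncoder standing in for a full seq2seq decoder, are assumptions made for brevity, not the exact architecture of the disclosure.

```python
import torch
import torch.nn as nn

class KeyActionPredictor(nn.Module):
    """Minimal sketch: per frame i, consume audio feature a_i and previous
    category c_{i-1}, emit logits over the M semantic categories."""

    def __init__(self, audio_dim: int, num_classes: int, d_model: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.class_embed = nn.Embedding(num_classes, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, audio_feat: torch.Tensor, prev_class: torch.Tensor) -> torch.Tensor:
        # audio_feat: [B, T, audio_dim]; prev_class: [B, T] category ids of frame i-1
        x = self.audio_proj(audio_feat) + self.class_embed(prev_class)
        T = x.size(1)
        # causal mask: frame i may only attend to frames <= i
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.encoder(x, mask=causal)
        return self.head(h)  # [B, T, num_classes] logits for c_i
```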
According to an exemplary embodiment of the present disclosure, in step S201, adjusting the key action prediction model according to the difference between the semantic category of the predicted key action and the semantic category of the key action marked for the speech data frame may include: adjusting parameters of the key action prediction model by a cross-entropy loss function constructed based on the semantic categories of the predicted key actions and the semantic categories of the key actions marked for the speech data frames, e.g., the standard multi-class cross-entropy:

L_ce = -(1/N) · Σ_{i=1..N} Σ_{c=1..M} y_ic · log(p_ic)

where N denotes the number of samples, i.e., the number of speech frames; M is the number of semantic categories; y_ic is an indicator (0 or 1, equal to 1 when the category of sample i is c, and 0 otherwise); and p_ic is the confidence the model gives to sample i belonging to category c. By constructing the cross-entropy loss function for the semantic classification of the key action prediction model in the above manner from the training samples (i.e., the key actions marked for the speech data frames) and the prediction results for the training samples (i.e., the semantic categories of the predicted key actions), the key action prediction model can converge quickly to the expected target, improving training efficiency.
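A minimal sketch of such a cross-entropy training step is given below, assuming a model with the interface of the earlier predictor sketch; the tensor shapes and the flattening of the time dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()

def train_step(model: nn.Module,
               optimizer: torch.optim.Optimizer,
               audio_feat: torch.Tensor,    # [B, T, D] audio features
               prev_class: torch.Tensor,    # [B, T] category ids of frame i-1
               target_class: torch.Tensor   # [B, T] annotated category of frame i
               ) -> float:
    """One optimization step of the key action prediction model."""
    logits = model(audio_feat, prev_class)                  # [B, T, M]
    loss = ce_loss(logits.reshape(-1, logits.size(-1)),     # [B*T, M]
                   target_class.reshape(-1))                # [B*T]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```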
After the key action prediction model is trained, step S203 may be performed to train the gesture completion model according to an exemplary embodiment of the present disclosure. Specifically, a first action frame sequence representing the actions of the speaker in the speaking video may be obtained, the key action frames corresponding to the key actions and part of the non-key action frames in the first action frame sequence may be input into the gesture completion model to output a completed second action frame sequence, and the gesture completion model may be adjusted based on the difference between the output second action frame sequence and the first action frame sequence to obtain a trained gesture completion model. It should be appreciated that steps S201 and S203 may be performed simultaneously or sequentially; that is, the gesture completion model may also be trained first and the key action prediction model trained afterwards.

According to an exemplary embodiment of the present disclosure, the gesture completion model may be a fully convolutional network of the type used for semantics-based image segmentation. Fig. 5 is a schematic diagram illustrating a gesture completion model according to an exemplary embodiment of the present disclosure. As shown in fig. 5, a 2D UNet fully convolutional network may be employed.

According to an exemplary embodiment of the present disclosure, the speaking video is labeled with the semantic category of the key action and the start and end frame sequence numbers of the key action. Accordingly, in step S203, a first part of the non-key action frames may be randomly masked out of the first action frame sequence based on the start and end frame sequence numbers of the key action, and the remaining second part of the non-key action frames, together with the key action frames, may be input into the gesture completion model for training. That is, the convolutional network of the gesture completion model may be trained as follows: according to the annotation information of the key-action start and end frames in the training set, action frames outside the key action sequences are first randomly masked and used as the model input, and the model outputs a complete action sequence in which the masked parts are filled in. By taking the key action frames as the main body and part of the non-key action frames as references connecting the key action frames, the trained gesture completion model can generate transition actions between the key actions while reproducing the key actions, so that the generated action video is more natural and fluent.
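A minimal sketch of this random masking step is shown below, assuming that the action frames are represented as a [T, K, 2] array of keypoint coordinates (as described in the next paragraph); the mask ratio and the convention of zeroing hidden frames are illustrative assumptions.

```python
import numpy as np
from typing import Optional, Tuple

def mask_non_key_frames(frames: np.ndarray,        # [T, K, 2] keypoint coordinates
                        key_start: int,
                        key_end: int,
                        mask_ratio: float = 0.5,
                        rng: Optional[np.random.Generator] = None
                        ) -> Tuple[np.ndarray, np.ndarray]:
    """Randomly hide a first part of the non-key action frames.

    Returns the masked sequence (hidden frames zeroed out) and a boolean mask
    that is True where a frame was hidden from the completion model.
    """
    rng = rng or np.random.default_rng()
    T = frames.shape[0]
    non_key = np.ones(T, dtype=bool)
    non_key[key_start:key_end + 1] = False                 # key action frames stay visible
    candidates = np.flatnonzero(non_key)
    hidden = rng.choice(candidates, size=int(mask_ratio * len(candidates)), replace=False)
    mask = np.zeros(T, dtype=bool)
    mask[hidden] = True
    masked = frames.copy()
    masked[mask] = 0.0                                     # blank out the hidden frames
    return masked, mask
```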
According to an exemplary embodiment of the present disclosure, the actions in the video frames of the video training set may be represented in the form of human keypoint coordinates, so that a sequence of action video frames can be represented as a sequence of action frames in numerical form given by the keypoint coordinates. 2D human keypoint coordinates may be used as the numerical representation of the speaker's action frame sequence; other representations, such as 3D human keypoint coordinates, may also be adopted without limitation. Fig. 6 shows a schematic diagram of a sequence of action frames according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the text, audio, keypoint-coordinate representation, and video frames corresponding to the speaking video are shown from top to bottom, respectively, and correspond to one another frame by frame. With such a numerical representation based on keypoint coordinates, real actions can be rendered from a simple numerical form in the subsequent video rendering, i.e., the action frame sequence in digital form is converted into the actions of a real or virtual human body in the video, so that action video generation can be realized simply and efficiently.
According to an exemplary embodiment of the present disclosure, parameters of the gesture completion model may be adjusted using an absolute difference loss function (e.g., an L1 loss) constructed from the original action frames of the first action frame sequence and the corresponding predicted action frames of the second action frame sequence, together with a differential loss function based on adjacent action frames of the second action frame sequence. A first loss function term is constructed based on the absolute differences between part of the action frames in the first action frame sequence and the corresponding predicted action frames in the second action frame sequence, and is given a predetermined weight, where the part of the action frames is a predetermined number of action frames within the boundary range between key action frames and non-key action frames; a second loss function term is constructed based on the absolute differences between the remaining action frames of the first action frame sequence and the corresponding predicted action frames in the second action frame sequence; and the absolute difference loss function is obtained from the first loss function term, weighted by the predetermined weight, and the second loss function term. For example, assuming that the key action of a speaking video is labeled as running from the 5th frame to the 65th frame, the frames with sequence numbers 0 to 10 and the frames with sequence numbers 60 to 70 (hereinafter referred to as original action frames) may be set as boundary ranges, and the predetermined weight is then applied to the loss function terms of the original action frames within these boundary ranges and their corresponding predicted action frames.
For example, the loss function of the gesture completion model consists of an L1 reconstruction loss L_recon, constructed from the model output and the unmasked original action frame sequence, and a temporal differential loss L_st between adjacent frames, as shown in the following equations:

L_recon = λ · ||M_gt − M_pred||_1

L_st = ||M_i − M_{i-1}||_1

where M_gt denotes the original action frame sequence, M_pred denotes the predicted action frame sequence, λ denotes the weight applied to the predetermined number of action frames lying within the boundary range between key action frames and non-key action frames, M_i denotes a frame in the predicted action frame sequence, and i denotes the frame sequence number. The weight in the absolute difference loss strengthens the loss at the junction between masked and unmasked actions, so that the model can better complete the masked partial actions, while the differential loss keeps the model output temporally continuous.
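A minimal sketch of these two losses is given below; representing λ as a per-frame weight vector (equal to λ on the boundary frames and 1 elsewhere) is an assumption about how the weighting is applied, made for illustration only.

```python
import torch

def completion_losses(pred: torch.Tensor,          # [T, K, 2] predicted keypoints (M_pred)
                      target: torch.Tensor,        # [T, K, 2] original keypoints  (M_gt)
                      frame_weight: torch.Tensor   # [T] = lambda on boundary frames, 1 elsewhere
                      ) -> tuple:
    """Compute L_recon and L_st as described above."""
    # L_recon: weighted absolute (L1) difference between prediction and original
    l_recon = (frame_weight[:, None, None] * (target - pred).abs()).mean()
    # L_st: differential loss between temporally adjacent predicted frames
    l_st = (pred[1:] - pred[:-1]).abs().mean()
    return l_recon, l_st
```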
After the key action prediction model and the gesture completion model are trained in the manner described above, a model for generating an action video based on speech may be obtained in step S205 from the trained key action prediction model and the trained gesture completion model. The key action prediction model uses the semantic category of the human action as an intermediate representation and constructs the association between human actions and the input speech, avoiding directly regressing the uncertain mapping between speech and human actions. The gesture completion model completes the actions between key actions with the action sequences of the real speaking-video training set as guidance, so that the generated actions have good realism.
Fig. 7 is a flowchart illustrating a method of generating an action video based on speech according to an exemplary embodiment of the present disclosure. The method can be divided into a prediction phase and a video generation phase. In the prediction stage, a speech signal is input into the model, and the generated motion sequence is output. Based on the generated motion sequence, an actual/virtual human motion video may be generated from the motion sequence by a video rendering technique.
As shown in fig. 7, in step S701, audio features are extracted from the input speech signal. Here, audio features may be extracted, for example, as Mel-frequency cepstral coefficients (MFCC) or intermediate-layer features of a speech recognition model (e.g., DeepSpeech).
In step S703, the extracted audio features are input into a key action prediction model trained according to the method described in the exemplary embodiment of the present disclosure to obtain semantic categories of predicted key actions corresponding to the speech data frames of the speech signal.
In step S705, a first sequence of key action frames matching the speech data frames of the speech signal is obtained from the video training set based on the semantic category of the predicted key action.
In step S707, the obtained first key action frame sequence is input into a pose completion model trained according to the method of the exemplary embodiment of the present disclosure, to obtain a completed action frame sequence.
In step S709, an action video is generated based on the completed action frame sequence.
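Taken together, steps S701 to S709 can be wired into a single generation routine. The following is a hedged Python sketch only; the component functions passed in are assumed interfaces (e.g., a feature extractor, the trained models, and a renderer), not APIs defined by the disclosure.

```python
from typing import Any, Callable, Sequence

def generate_action_video(wav_path: str,
                          output_path: str,
                          extract_features: Callable[[str], Any],
                          predict_key_actions: Callable[[Any], Sequence],
                          retrieve_key_frames: Callable[[Sequence], Any],
                          complete_pose: Callable[[Any], Any],
                          render_video: Callable[[Any, str], None]) -> None:
    """Wire steps S701-S709 together from injected components."""
    audio_feat = extract_features(wav_path)          # S701: extract audio features
    key_actions = predict_key_actions(audio_feat)    # S703: per-frame semantic categories
    key_frames = retrieve_key_frames(key_actions)    # S705: match key action frames from the training set
    full_sequence = complete_pose(key_frames)        # S707: gesture completion
    render_video(full_sequence, output_path)         # S709: render keypoints into an action video
```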
According to an exemplary embodiment of the present disclosure, obtaining, in step S705, the first key action frame sequence matching the speech data frames of the speech signal from the video training set based on the semantic category of the predicted key action may include: obtaining the semantic category of the predicted key action and its start and end frame sequence numbers based on the semantic categories output by the key action prediction model for the speech data frames; retrieving, from the video training set, a plurality of candidate key action frame sequences of the same category as the semantic category of the predicted key action; determining the length of the predicted key action frame sequence from the start and end frame sequence numbers of the predicted key action, and selecting, from the candidate key action frame sequences, the candidate whose length best matches that of the predicted key action frame sequence; and interpolating the action frames of the selected candidate to obtain a first key action frame sequence of the same length as the predicted key action frame sequence. This operation produces an action frame sequence, matched from the video training set, that corresponds to the predicted key actions, so that it can serve as the input of the gesture completion model for subsequent gesture completion.
For example, if the semantic category of the key action corresponding to the voice signal is identified as "greeting" from the voice signal of "good night" and the start frame number of the action is 5 and the end frame number is 60, all the key action frame sequences corresponding to "greeting" can be retrieved from the training video set, and one key action frame sequence a closest to the length of 55 frames can be found. Assuming that the length of the sequence a is 50 frames, a sequence of action frames a' of 55 frames can be obtained by interpolation as input to the pose completion model.
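A minimal sketch of this length matching and interpolation (e.g., resampling a 50-frame candidate to 55 frames) is given below; the use of per-coordinate linear interpolation is an assumption, since the disclosure only states that the action frames are interpolated.

```python
import numpy as np

def resample_sequence(candidate: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly interpolate a candidate key action frame sequence
    ([L, K, 2] keypoint coordinates) to target_len frames (e.g. 50 -> 55)."""
    L = candidate.shape[0]
    src_t = np.linspace(0.0, 1.0, L)
    dst_t = np.linspace(0.0, 1.0, target_len)
    flat = candidate.reshape(L, -1)                                   # [L, K*2]
    cols = [np.interp(dst_t, src_t, flat[:, d]) for d in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape(target_len, *candidate.shape[1:])

def select_first_key_action_sequence(candidates: list, target_len: int) -> np.ndarray:
    """Pick the candidate whose length best matches the predicted key action
    length, then interpolate it to exactly target_len frames."""
    best = min(candidates, key=lambda c: abs(c.shape[0] - target_len))
    return resample_sequence(best, target_len)
```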
Through the above process, a complete sequence of actions can be generated from the speech signal. This sequence of actions may be in the form of a numerical representation of 2D/3D human keypoint coordinates as described previously.
Fig. 8 is a block diagram illustrating an apparatus for generating motion video based on voice according to an exemplary embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 may include: a feature extraction unit 810, a key motion prediction unit 820, a matching unit 830, a gesture completion unit 840, and a video generation unit 850.
The feature extraction unit 810 is configured to extract audio features from an input speech signal.
The key action prediction unit 820 is configured to input the extracted audio features into the key action prediction model trained by the method as described above to obtain semantic categories of predicted key actions corresponding to frames of speech data of the speech signal.
The matching unit 830 is configured to obtain a first key-action frame sequence from the video training set that matches the speech data frames of the speech signal based on the semantic category of the predicted key-action.
The gesture completion unit 840 is configured to input the obtained first key action frame sequence into a gesture completion model trained by the method as described above, to obtain a completed action frame sequence.
The video generation unit 850 is configured to generate an action video based on the completed sequence of action frames.
According to an exemplary embodiment of the present disclosure, the matching unit 830 is configured to: based on the semantic category of the voice data frame output by the key action prediction model, obtaining the semantic category of the predicted key action, a starting frame sequence number and an ending frame sequence number, searching a plurality of candidate key action frame sequences with the same category as the semantic category of the predicted key action from a video training set, determining the length of the predicted key action frame sequence according to the starting frame sequence number and the ending frame sequence number of the predicted key action, selecting a candidate key action frame sequence which is most matched with the length of the key action frame sequence from the candidate key action frame sequences, and interpolating action frames in the candidate key action frame sequences to obtain a first key action frame sequence with the same length as the predicted key action frame sequence.
The process of generating motion video based on speech has been described above with reference to fig. 7 and will not be repeated here.
Fig. 9 is a block diagram illustrating a training apparatus of an action video generation model according to an exemplary embodiment of the present disclosure.
As shown in fig. 9, the training apparatus of the motion video generation model according to the exemplary embodiment of the present disclosure includes a key motion prediction model training unit 910, a posture completion model training unit 920, and a motion video generation model unit 930.
According to an exemplary embodiment of the present disclosure, the key-action prediction model training unit 910 is configured to extract audio features of speech data frames of a speaking video in a video training set, obtain semantic categories of predicted key actions from the key-action prediction model based on the extracted audio features, and adjust the key-action prediction model according to differences between the semantic categories of the predicted key actions and the semantic categories of the key actions tagged for the speech data frames to obtain a trained key-action prediction model, wherein the semantic categories of the key actions are predefined for expressing specific semantics of an action.
According to an exemplary embodiment of the present disclosure, the gesture completion model training unit 920 is configured to obtain a first motion frame sequence representing a motion of a speaker in the speaking video, input key motion frames and part of non-key motion frames corresponding to the key motion in the first motion frame sequence into the gesture completion model to output a completed second motion frame sequence, and adjust the gesture completion model based on a difference between the second motion frame sequence and the first motion frame sequence to obtain a trained gesture completion model.
According to an exemplary embodiment of the present disclosure, the motion video generation model unit 930 is configured to obtain the motion video generation model based on the trained key motion prediction model and the trained pose completion model.
According to an embodiment of the present disclosure, the speaking video is labeled with the semantic category of the key action and the start frame number and the end frame number of the key action, wherein the pose completion model training unit 920 is configured to: based on the starting frame sequence number and the ending frame sequence number of the key action, randomly shielding a first part of frames in the part of non-key action frames from the first action frame sequence, and inputting the rest of second part of frames in the part of non-key action frames and the key action frames into a gesture completion model.
According to an embodiment of the present disclosure, the key action prediction model training unit 910 is configured to obtain a semantic category predicted by the key action prediction model for a previous speech data frame of the speech data frame, and input the extracted audio feature and the predicted semantic category of the previous speech data frame to the key action prediction model to output a semantic category of a predicted key action corresponding to the speech data frame, wherein the previous speech data frame is a speech data frame preceding the speech data frame in a sequence of speech data frames of the speech video.
According to an embodiment of the present disclosure, the key action prediction model is a semantic recognition model with an attention mechanism, wherein the key action prediction training unit 910 is configured to: parameters of the critical action prediction model are adjusted by a cross entropy loss function constructed based on the semantic class of the predicted critical action and the semantic class of the critical action for the speech data frame markers.
According to an embodiment of the present disclosure, the pose completion model training unit 920 is configured to obtain key point coordinates of a human body in video frames in the spoken video and to generate the first sequence of action frames based on a sequence representing the obtained key point coordinates of the video frames.
According to an embodiment of the present disclosure, the gesture completion model is a full convolution network performing image segmentation processing based on semantics, and the gesture completion model training unit 920 is configured to construct an absolute difference loss function using motion frames based on the first motion frame sequence and corresponding predicted motion frames in the second motion frame sequence, and to construct a differential loss function based on differences between two neighboring motion frames in the second motion frame sequence; parameters of the attitude completion model are adjusted using the absolute difference loss function and the differential loss function.
According to an embodiment of the present disclosure, the pose completion model training unit 920 is configured to construct a first loss function term based on absolute differences between partial action frames in the first action frame sequence and corresponding predicted action frames in the second action frame sequence, and assign a predetermined weight to the first loss function term, wherein the partial action frames are a predetermined number of action frames within a boundary range between key action frames and non-key action frames; construct a second loss function term based on absolute differences between the remaining action frames in the first action frame sequence other than the partial action frames and corresponding predicted action frames in the second action frame sequence; and obtain the absolute difference loss function based on the first loss function term weighted by the predetermined weight and the second loss function term. The training method has been described in detail above with reference to fig. 2-6 and is not repeated here.
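The two loss terms described above can be illustrated with the following sketch, in which the boundary weight value, the boundary mask construction, and the way per-frame differences are aggregated are assumptions chosen for clarity rather than values given by the disclosure.

```python
import torch

def completion_losses(pred_seq, gt_seq, boundary_mask, boundary_weight=5.0):
    """pred_seq, gt_seq: (T, K, 2) predicted and ground-truth action frame sequences.
    boundary_mask: (T,) bool tensor, True for the predetermined number of frames
    around each boundary between key action frames and non-key action frames."""
    per_frame_abs = (pred_seq - gt_seq).abs().mean(dim=(1, 2))   # (T,) absolute differences

    first_term = per_frame_abs[boundary_mask].sum()              # frames near the boundary
    second_term = per_frame_abs[~boundary_mask].sum()            # remaining frames
    abs_diff_loss = boundary_weight * first_term + second_term   # weighted first term

    # Differential loss: difference between adjacent predicted action frames,
    # encouraging temporally smooth completed motion.
    diff_loss = (pred_seq[1:] - pred_seq[:-1]).abs().mean()

    return abs_diff_loss, diff_loss
```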
Fig. 10 is a block diagram of an electronic device. The electronic device 1000 may be, for example, a smart phone, a tablet computer, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The electronic device 1000 may also be called a user device, a portable terminal, a laptop terminal, a desktop terminal, or other names.
Generally, the electronic device 1000 includes: a processor 1001 and a memory 1002.
The processor 1001 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1001 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1001 may integrate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1001 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include a high-speed random access memory and a non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 1002 is used to store at least one instruction, which is executed by the processor 1001 to implement the training method and/or video generation method provided by the method embodiments of the present disclosure as shown in fig. 2-7.
In some embodiments, the electronic device 1000 may optionally further include a peripheral interface 1003 and at least one peripheral. The processor 1001, the memory 1002, and the peripheral interface 1003 may be connected by a bus or signal line. Each peripheral may be connected to the peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripherals include at least one of a radio frequency circuit 1004, a touch display 1005, a camera assembly 1006, an audio circuit 1007, a positioning component 1008, and a power supply 1009.
The peripheral interface 1003 may be used to connect at least one I/O (Input/Output)-related peripheral to the processor 1001 and the memory 1002. In some embodiments, the processor 1001, the memory 1002, and the peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1004 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1004 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 1004 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may also include NFC (Near Field Communication) related circuitry, which is not limited by the present disclosure.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch screen, the display screen 1005 also has the ability to capture touch signals on or above its surface. The touch signal may be input to the processor 1001 as a control signal for processing. At this time, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1005, disposed on the front panel of the electronic device 1000; in other embodiments, there may be at least two display screens 1005, respectively disposed on different surfaces of the electronic device 1000 or in a folded design; in still other embodiments, the display screen 1005 may be a flexible display disposed on a curved surface or a folded surface of the electronic device 1000. The display screen 1005 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display screen 1005 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the electronic device and the rear camera is disposed on its rear surface. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1006 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 1001 for processing, or input them to the radio frequency circuit 1004 for voice communication. For stereo acquisition or noise reduction, there may be multiple microphones, respectively disposed at different portions of the electronic device 1000. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal not only into sound waves audible to humans but also into sound waves inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The positioning component 1008 is used to locate the current geographic location of the electronic device 1000 to enable navigation or LBS (Location Based Service). The positioning component 1008 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 1009 is used to power the various components in the electronic device 1000. The power supply 1009 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1009 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast-charging technology.
In some embodiments, the electronic device 1000 also includes one or more sensors 1010. The one or more sensors 1010 include, but are not limited to: acceleration sensor 1011, gyroscope sensor 1012, pressure sensor 1013, fingerprint sensor 1014, optical sensor 1015, and proximity sensor 1016.
The acceleration sensor 1011 can detect the magnitudes of acceleration on the three coordinate axes of the coordinate system established with the electronic device 1000. For example, the acceleration sensor 1011 may be used to detect the components of gravitational acceleration along the three coordinate axes. The processor 1001 may control the touch display 1005 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1011. The acceleration sensor 1011 may also be used to acquire game or user motion data.
The gyroscope sensor 1012 may detect the body direction and rotation angle of the electronic device 1000, and may cooperate with the acceleration sensor 1011 to collect the user's 3D actions on the electronic device 1000. Based on the data collected by the gyroscope sensor 1012, the processor 1001 may implement the following functions: motion sensing (e.g., changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1013 may be disposed on a side frame of the electronic device 1000 and/or a lower layer of the touch display 1005. When the pressure sensor 1013 is disposed on a side frame of the electronic device 1000, it can detect the user's grip signal on the electronic device 1000, and the processor 1001 performs left- and right-hand recognition or quick operations according to the grip signal collected by the pressure sensor 1013. When the pressure sensor 1013 is disposed at the lower layer of the touch display 1005, the processor 1001 controls the operability controls on the UI according to the user's pressure operation on the touch display 1005. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1014 is used to collect the user's fingerprint, and the processor 1001 identifies the user's identity based on the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user's identity based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1014 may be disposed on the front, back, or side of the electronic device 1000. When a physical key or vendor logo is provided on the electronic device 1000, the fingerprint sensor 1014 may be integrated with the physical key or vendor logo.
The optical sensor 1015 is used to collect the ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the touch display 1005 based on the ambient light intensity collected by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the touch display 1005 is turned up; when the ambient light intensity is low, the display brightness of the touch display 1005 is turned down. In another embodiment, the processor 1001 may dynamically adjust the shooting parameters of the camera assembly 1006 according to the ambient light intensity collected by the optical sensor 1015.
The proximity sensor 1016, also referred to as a distance sensor, is typically disposed on the front panel of the electronic device 1000. The proximity sensor 1016 is used to capture the distance between the user and the front of the electronic device 1000. In one embodiment, when the proximity sensor 1016 detects that the distance between the user and the front of the electronic device 1000 gradually decreases, the processor 1001 controls the touch display 1005 to switch from the screen-on state to the screen-off state; when the proximity sensor 1016 detects that the distance between the user and the front of the electronic device 1000 gradually increases, the processor 1001 controls the touch display 1005 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the structure shown in fig. 10 does not limit the electronic device 1000, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the training method and/or the video generation method provided according to the method embodiments of the present disclosure as shown in fig. 2-7. Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid state drives (SSD), card memory (such as multimedia cards, Secure Digital (SD) cards, or eXtreme Digital (XD) cards), magnetic tape, floppy disks, magneto-optical data storage devices, hard disks, solid state disks, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server; further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
In accordance with embodiments of the present disclosure, a computer program product may also be provided, instructions in which are executable by a processor of a computer device to perform the training method and/or the video generation method provided by the method embodiments shown in fig. 2-7.
According to the above voice-based action video generation method, the semantic category of a human action is used as an intermediate representation to construct the association between human actions and the input voice, which avoids the uncertain mapping that results from directly regressing from voice to human actions; with the action sequences of an actual speaking-video training set as guidance, the actions between key actions are completed so that the generated actions are highly realistic, thereby effectively improving the temporal consistency of the speaker's limb actions in the generated action video and their matching with the voice.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A training method of an action video generation model, wherein the action video generation model comprises a key action prediction model and a pose completion model, the method comprising:
extracting audio features of speech data frames of a speaking video in a video training set, obtaining semantic categories of predicted key actions from the key action prediction model based on the extracted audio features, and adjusting the key action prediction model according to differences between the semantic categories of the predicted key actions and the semantic categories of the key actions labeled for the speech data frames to obtain a trained key action prediction model, wherein the semantic categories of the key actions are predefined for expressing specific semantics of actions;
acquiring a first action frame sequence representing actions of a speaker in the speaking video, inputting key action frames and a part of non-key action frames corresponding to the key actions in the first action frame sequence into the pose completion model to output a completed second action frame sequence, and adjusting the pose completion model based on the difference between the second action frame sequence and the first action frame sequence to obtain a trained pose completion model; and
obtaining the action video generation model based on the trained key action prediction model and the trained pose completion model.
2. The method of claim 1, wherein the speaking video is labeled with the semantic category of the key action and the start frame number and the end frame number of the key action,
wherein the inputting of the key action frames and the part of non-key action frames corresponding to the key actions in the first action frame sequence into the pose completion model comprises:
based on the start frame number and the end frame number of the key action, randomly masking a first part of frames among the part of non-key action frames from the first action frame sequence, and inputting the remaining second part of frames among the part of non-key action frames, together with the key action frames, into the pose completion model.
3. The method of claim 1, wherein the obtaining semantic categories of predicted key actions from the key action prediction model based on the extracted audio features comprises:
acquiring the semantic category predicted by the key action prediction model for a previous speech data frame of the speech data frame, and inputting the extracted audio features and the predicted semantic category of the previous speech data frame into the key action prediction model to output the semantic category of the predicted key action corresponding to the speech data frame, wherein the previous speech data frame is a speech data frame preceding the speech data frame in the speech data frame sequence of the speaking video.
4. The method of claim 1, wherein the key action prediction model is a semantic recognition model with an attention mechanism,
wherein the adjusting the key action prediction model according to the difference between the semantic category of the predicted key action and the semantic category of the key action labeled for the speech data frame comprises:
adjusting parameters of the key action prediction model by a cross entropy loss function constructed based on the semantic category of the predicted key action and the semantic category of the key action labeled for the speech data frame.
5. The method of claim 1, wherein the acquiring a first action frame sequence representing actions of a speaker in the speaking video comprises:
acquiring key point coordinates of a human body in video frames of the speaking video, and generating the first action frame sequence based on the sequence of key point coordinates acquired for the video frames.
6. The method of claim 1, wherein the pose completion model is a full convolution network that performs semantics-based image segmentation processing, and wherein the adjusting the pose completion model based on the difference between the output second action frame sequence and the first action frame sequence comprises:
constructing an absolute difference loss function based on action frames in the first action frame sequence and corresponding predicted action frames in the second action frame sequence, and constructing a differential loss function based on the difference between two adjacent action frames in the second action frame sequence; and
adjusting parameters of the pose completion model using the absolute difference loss function and the differential loss function.
7. The method of claim 6, wherein the constructing an absolute difference loss function based on action frames in the first action frame sequence and corresponding predicted action frames in the second action frame sequence comprises:
constructing a first loss function term based on absolute differences between partial action frames in the first action frame sequence and corresponding predicted action frames in the second action frame sequence, and assigning a predetermined weight to the first loss function term, wherein the partial action frames are a predetermined number of action frames within a boundary range between key action frames and non-key action frames;
constructing a second loss function term based on absolute differences between the remaining action frames in the first action frame sequence other than the partial action frames and corresponding predicted action frames in the second action frame sequence; and
obtaining the absolute difference loss function based on the first loss function term weighted by the predetermined weight and the second loss function term.
8. A method for generating an action video based on speech, comprising:
extracting audio features from an input speech signal;
inputting the extracted audio features into a key action prediction model trained by the method according to any one of claims 1-7 to obtain semantic categories of predicted key actions corresponding to speech data frames of the speech signal;
obtaining a first key action frame sequence matching a speech data frame of the speech signal from a video training set based on the semantic category of the predicted key action;
inputting the first key action frame sequence into a pose completion model trained by the method according to any one of claims 1-7 to obtain a completed action frame sequence; and
generating an action video based on the completed action frame sequence.
9. The method of claim 8, wherein the obtaining a first key action frame sequence matching a speech data frame of the speech signal from a video training set based on the semantic category of the predicted key action comprises:
obtaining, based on the output of the key action prediction model for the speech data frame, the semantic category of the predicted key action as well as a start frame number and an end frame number;
retrieving, from the video training set, a plurality of candidate key action frame sequences of the same category as the semantic category of the predicted key action;
determining the length of the predicted key action frame sequence according to the start frame number and the end frame number of the predicted key action, and selecting, from the candidate key action frame sequences of the video training set, the candidate key action frame sequence whose length best matches the length of the predicted key action frame sequence; and
interpolating action frames in the selected candidate key action frame sequence to obtain the first key action frame sequence with the same length as the predicted key action frame sequence.
10. A training device for an action video generation model, wherein the action video generation model comprises a key action prediction model and a pose completion model, the training device comprising:
a key action prediction model training unit configured to extract audio features of a speech data frame of a speaking video in a video training set, obtain the semantic category of a predicted key action from the key action prediction model based on the extracted audio features, and adjust the key action prediction model according to the difference between the semantic category of the predicted key action and the semantic category of the key action labeled for the speech data frame to obtain a trained key action prediction model, wherein the semantic category of the key action is predefined for expressing specific semantics of an action;
a pose completion model training unit configured to acquire a first action frame sequence representing actions of a speaker in the speaking video, input key action frames and a part of non-key action frames corresponding to the key actions in the first action frame sequence into the pose completion model to output a completed second action frame sequence, and adjust the pose completion model based on the difference between the second action frame sequence and the first action frame sequence to obtain a trained pose completion model; and
an action video generation model unit configured to obtain the action video generation model based on the trained key action prediction model and the trained pose completion model.
11. An apparatus for generating an action video based on speech, comprising:
a feature extraction unit configured to extract audio features from an input speech signal;
a key action prediction unit configured to input the extracted audio features into a key action prediction model trained by the method according to any one of claims 1-7, so as to obtain semantic categories of predicted key actions corresponding to speech data frames of the speech signal;
a matching unit configured to obtain a first key action frame sequence matching a speech data frame of the speech signal from a video training set based on the semantic category of the predicted key action;
a pose completion unit configured to input the first key action frame sequence into a pose completion model trained by the method according to any one of claims 1-7, so as to obtain a completed action frame sequence; and
a video generation unit configured to generate an action video based on the completed action frame sequence.
12. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1-7 and/or the method of any one of claims 8-9.
13. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1-7 and/or the method of any one of claims 8-9.
CN202310558500.6A 2023-05-17 2023-05-17 Method and device for generating action video based on voice Pending CN116580707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310558500.6A CN116580707A (en) 2023-05-17 2023-05-17 Method and device for generating action video based on voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310558500.6A CN116580707A (en) 2023-05-17 2023-05-17 Method and device for generating action video based on voice

Publications (1)

Publication Number Publication Date
CN116580707A true CN116580707A (en) 2023-08-11

Family

ID=87535463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310558500.6A Pending CN116580707A (en) 2023-05-17 2023-05-17 Method and device for generating action video based on voice

Country Status (1)

Country Link
CN (1) CN116580707A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117156081A (en) * 2023-10-30 2023-12-01 中国科学院自动化研究所 Method and device for generating editing frame of speaker video, electronic equipment and medium
CN117156081B (en) * 2023-10-30 2024-03-01 中国科学院自动化研究所 Method and device for generating editing frame of speaker video, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN110263213B (en) Video pushing method, device, computer equipment and storage medium
CN109189879B (en) Electronic book display method and device
CN111524501B (en) Voice playing method, device, computer equipment and computer readable storage medium
CN110322760B (en) Voice data generation method, device, terminal and storage medium
CN111897996B (en) Topic label recommendation method, device, equipment and storage medium
CN108270794B (en) Content distribution method, device and readable medium
CN110572716B (en) Multimedia data playing method, device and storage medium
CN110162604B (en) Statement generation method, device, equipment and storage medium
CN111737573A (en) Resource recommendation method, device, equipment and storage medium
WO2022057435A1 (en) Search-based question answering method, and storage medium
CN108922531B (en) Slot position identification method and device, electronic equipment and storage medium
CN111611490A (en) Resource searching method, device, equipment and storage medium
WO2022134634A1 (en) Video processing method and electronic device
CN111324699A (en) Semantic matching method and device, electronic equipment and storage medium
CN111581958A (en) Conversation state determining method and device, computer equipment and storage medium
CN111291200A (en) Multimedia resource display method and device, computer equipment and storage medium
CN110555102A (en) media title recognition method, device and storage medium
KR20190134975A (en) Augmented realtity device for rendering a list of apps or skills of artificial intelligence system and method of operating the same
CN113918767A (en) Video clip positioning method, device, equipment and storage medium
KR102646344B1 (en) Electronic device for image synthetic and operating thereof
CN116580707A (en) Method and device for generating action video based on voice
CN113763931B (en) Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN109547847A (en) Add the method, apparatus and computer readable storage medium of video information
CN109829067B (en) Audio data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination