CN118250529A - Voice-driven 2D digital human video generation method and readable storage medium - Google Patents


Info

Publication number
CN118250529A
Authority
CN
China
Prior art keywords
video
gesture
head
sequence
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410658653.2A
Other languages
Chinese (zh)
Inventor
陈靖涵
张鹏飞
苏江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd filed Critical DMAI Guangzhou Co Ltd
Priority to CN202410658653.2A priority Critical patent/CN118250529A/en
Publication of CN118250529A publication Critical patent/CN118250529A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a voice-driven 2D digital human video generation method and a readable storage medium, comprising the following steps: first, a target voice is acquired and its audio data is processed to obtain a corresponding 3D gesture sequence. Next, a head motion video is generated from the target voice and the 3D gesture sequence. At the same time, a body motion video is generated from a user image and the 3D gesture sequence. Finally, the head motion video and the body motion video are fused to form a smooth 2D digital human video. This design accurately matches the voice to the digital human's motions, improves the realism and naturalness of the digital human video, and provides an efficient and convenient solution for related applications.

Description

Voice-driven 2D digital human video generation method and readable storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a voice-driven 2D digital human video generation method and a readable storage medium.
Background
In the digital age, the voice-driven 2D digital human video generation technology has wide application prospects in the fields of virtual character performance, game interaction, online education and the like. However, the prior art still faces challenges in generating digital human video with a high degree of realism and naturalness. In particular, how to match the target voice with the head and body movements of the digital person accurately while maintaining the smoothness and consistency of the video becomes a technical problem to be solved currently.
Disclosure of Invention
The invention aims to provide a voice-driven 2D digital human video generation method and a readable storage medium.
In a first aspect, an embodiment of the present invention provides a method for generating a voice-driven 2D digital human video, including:
Acquiring target voice;
acquiring a 3D gesture sequence corresponding to the audio data according to the audio data of the target voice;
obtaining a head action video corresponding to the target voice according to the target voice and the 3D gesture sequence;
according to the user image and the 3D gesture sequence, a body action video is obtained;
and according to the head action video and the body action video, fusing to obtain a 2D digital human video.
In a possible implementation manner, the obtaining, according to the audio data of the target voice, a 3D gesture sequence corresponding to the audio data includes:
encoding the audio data through an audio encoder to obtain an audio feature vector sequence corresponding to the audio data;
inputting the audio feature vector sequence into a gesture generator formed by a plurality of gated recurrent units to obtain the 3D gesture sequence.
In one possible implementation manner, the audio encoder includes a first-order encoding unit and a higher-order encoding unit, the higher-order encoding unit is constructed by a plurality of linear network layers and a nonlinear activation function, the audio encoder encodes the audio data to obtain an audio feature vector sequence corresponding to the audio data, and the method includes:
encoding the audio data through the primary encoding unit to obtain a plurality of initial audio feature vectors, wherein each initial audio feature vector comprises a mel frequency spectrum logarithmic amplitude and energy of a current frame;
performing feature transformation on the plurality of initial audio feature vectors through the high-order coding unit to obtain the audio feature vector sequence, wherein the audio feature vector sequence comprises high-order audio feature vectors corresponding to each frame of audio data;
Inputting the audio feature vector sequence into a gesture generator formed by a plurality of gated recurrent units to obtain the 3D gesture sequence, wherein the method comprises the following steps of:
Inputting a high-order audio feature vector corresponding to an i-th frame and gesture data corresponding to an (i-1)-th frame into the pre-trained gesture generator to obtain gesture data corresponding to the i-th frame, wherein i is a positive integer, and when i=1 the gesture data corresponding to the (i-1)-th frame is randomly initialized gesture data, and the gesture data comprises absolute positions of human skeleton key points and rotation values of the human skeleton key points of the current frame;
And integrating the acquired multiple gesture data to obtain the 3D gesture sequence, wherein the 3D gesture sequence corresponds to the elements of the audio feature vector sequence one by one.
In one possible implementation manner, the obtaining, according to the target voice and the 3D gesture sequence, a head motion video corresponding to the target voice includes:
encoding the audio data through an audio encoder to obtain an audio feature vector sequence corresponding to the audio data;
Extracting a 3D head gesture sequence from the 3D gesture sequence, and inputting the 3D head gesture sequence into a light processor to obtain 3D position information corresponding to the 3D head gesture sequence, wherein the 3D position information comprises the 3D positions of the sampling points along the light directions corresponding to each frame of 3D head gesture in the 3D head gesture sequence;
Inputting the 3D position information into a position encoder to obtain a position encoding feature sequence;
Inputting the position encoding feature sequence, the audio feature vector sequence and the light directions of the sampling points into a pre-trained neural radiance field to obtain color and density information of each sampling point;
And performing volume rendering on the color and density information of each sampling point to obtain the head action video, wherein the head action video comprises expression and lip movement.
In one possible embodiment, the neural radiance field is trained with the loss function L = Σ_p w_p · ‖I_p − Î_p‖², where I refers to a frame in the training samples, Î refers to the corresponding frame generated by the neural radiance field, w refers to the weight coefficient, and the sum runs over the pixels p of the frame.
In a possible implementation manner, the obtaining a body action video according to the user image and the 3D gesture sequence includes:
And inputting the user image and the 3D gesture sequence into a preset diffusion model to obtain the body action video.
In a possible implementation manner, the fusing the head motion video and the body motion video to obtain a 2D digital human video includes:
dividing the body action video into a head action video to be determined and a body action video to be determined based on a preset portrait segmentation algorithm;
Inputting the head action video and the undetermined body action video into a fusion device trained by a discriminator, so that the head action video replaces the undetermined head action video to obtain the 2D digital human video.
In one possible embodiment, the loss function of the discriminator is:
L_D = log D(x) + log(1 − D(G(z)))
where x represents a real sample, D(·) represents the output of the discriminator, G(z) represents the output of the fusion device, z represents the head motion video and the pending body motion video input to the fusion device, and a larger L_D indicates that the discriminator judges more accurately;
the loss function of the fusion device is:
L_G = log(1 − D(G(z))) + λ · ‖h − ĥ‖₁
where D(·) represents the output of the discriminator, G(z) represents the output of the fusion device, z represents the head motion video and the pending body motion video input to the fusion device, λ represents a hyper-parameter, h represents the head region of the head motion video, ĥ represents the head region of the video generated by the fusion device, and a smaller L_G indicates a more accurate fusion device output.
In a second aspect, an embodiment of the present invention provides a voice-driven 2D digital human video generating apparatus, including:
The acquisition module is used for acquiring target voice;
The processing module is used for acquiring a 3D gesture sequence corresponding to the audio data according to the audio data of the target voice; obtaining a head action video corresponding to the target voice according to the target voice and the 3D gesture sequence; according to the user image and the 3D gesture sequence, a body action video is obtained;
And the generation module is used for obtaining the 2D digital human video through fusion according to the head action video and the body action video.
In a third aspect, an embodiment of the present invention provides a readable storage medium, where the readable storage medium includes a computer program, where the computer program controls a computer device where the readable storage medium is located to execute a voice-driven 2D digital human video generating method in at least one possible implementation manner of the first aspect.
Compared with the prior art, the invention has the beneficial effects that: the invention discloses a voice-driven 2D digital human video generation method and a readable storage medium, comprising the following steps: first, a target voice is acquired and its audio data is processed to obtain a corresponding 3D gesture sequence. Next, a head motion video is generated from the target voice and the 3D gesture sequence. At the same time, a body motion video is generated from a user image and the 3D gesture sequence. Finally, the head motion video and the body motion video are fused to form a smooth 2D digital human video. This design accurately matches the voice to the digital human's motions, improves the realism and naturalness of the digital human video, and provides an efficient and convenient solution for related applications.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described. It is appreciated that the following drawings depict only certain embodiments of the invention and are therefore not to be considered limiting of its scope. Other relevant drawings may be made by those of ordinary skill in the art without undue burden from these drawings.
Fig. 1 is a schematic flow chart of steps of a voice-driven 2D digital human video generating method according to an embodiment of the present invention;
Fig. 2 is a schematic block diagram of a voice-driven 2D digital human video generating apparatus according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The following describes specific embodiments of the present invention in detail with reference to the drawings.
In order to solve the foregoing technical problems in the background art, fig. 1 is a schematic flow chart of a voice-driven 2D digital human video generating method according to an embodiment of the present disclosure, and the following describes the voice-driven 2D digital human video generating method in detail.
Step S201, obtaining target voice;
Step S202, according to the audio data of the target voice, a 3D gesture sequence corresponding to the audio data is obtained;
Step S203, according to the target voice and the 3D gesture sequence, obtaining a head action video corresponding to the target voice;
step S204, according to the user image and the 3D gesture sequence, a body action video is obtained;
step S205, according to the head action video and the body action video, 2D digital human videos are obtained through fusion.
In an embodiment of the present invention, the server receives an audio file uploaded by a user, the audio file containing target speech that the user wants to express to a 2D digital person. For example, the user has uploaded a piece of audio that reads poetry by himself, which is the target speech that the server is to process next. The server analyzes the target voice by utilizing a built-in voice recognition and gesture estimation module. The server first recognizes each syllable and intonation in the speech, and then predicts the corresponding 3D gesture sequence based on these speech features. This 3D gesture sequence describes the head and body actions that the digital person should take when expressing the piece of speech. For example, when a "happy" intonation is identified, the server may generate a smiling expression gesture; when a "sad" intonation is identified, a low head jettison gesture may be generated. And the server renders and generates a head action video synchronous with the target voice according to the 3D gesture sequence generated in the last step and by combining a preset digital human head model. In this process, the server will ensure that the mouth shape, facial expression and head movements of the digital person remain consistent with the content of the target voice. For example, when the target voice speaks "hello," the mouth shape of the digital person accurately simulates the mouth motion that utters the word "hello. The user provides a photograph of himself or a whole body image to the server as a body template for the digital person. The server generates a digital human body motion video using the template and the 3D pose sequence. In the process, the server adjusts the posture of the digital human body template according to the body action data in the 3D posture sequence so as to match the emotion and rhythm of the target voice expression. For example, if the target voice is a section of a live speech, the generated body motion video will show a digital person waving an arm, standing straight, and a live gesture. And finally, the server performs fusion processing on the head action video and the body action video to generate a complete 2D digital human video. In the process, the server can ensure that the connection of the head and the body actions is natural and smooth, and the condition of dislocation or incompatibility can not occur. The generated 2D digital human video not only maintains the original emotion and rhythm of the voice of the user, but also enhances the visual expression effect through the vivid expression of the digital human. For example, the resulting 2D digital person video may be a scene of a digital person speaking on a podium, with both head expression and body motion highly consistent and vivid with the original speech content.
In the embodiment of the present invention, the foregoing step S202 may be implemented by the following example execution.
Encoding the audio data through an audio encoder to obtain an audio feature vector sequence corresponding to the audio data;
inputting the audio feature vector sequence into a gesture generator formed by a plurality of gated recurrent units to obtain the 3D gesture sequence.
In the embodiment of the present invention, the server first processes the audio data by using the built-in audio encoder after receiving the target voice uploaded by the user. The function of the audio encoder is to convert the original audio signal into a more easily handled mathematical representation, i.e. a sequence of audio feature vectors. This process is similar to converting an article into a series of keywords for subsequent analysis and understanding. For example, the target speech received by the server is a piece of audio related to weather forecast, and the audio encoder analyzes the piece of audio piece by piece, extracts key sound features therein, such as pitch, intensity, timbre, and rhythm and intonation of the speech, and converts the features into a series of audio feature vectors. These vectors capture essential information in the audio, providing accurate input for subsequent gesture generation. The server then inputs the audio feature vector sequence obtained in the previous step into a gesture generator consisting of a plurality of gate-controlled loop units (GRUs). The pose generator is a deep learning model that learns, via a large amount of training data, how to predict a corresponding 3D pose sequence from the sequence of audio feature vectors. The gated loop unit is a variant of a loop neural network that is particularly suitable for processing sequence data, such as audio, text, etc. In the gesture generator, a plurality of gating circulation units are connected in series to form a deep network structure. When the sequence of audio feature vectors is input into the network, each gating loop unit calculates the state and output of the next unit based on the current input and the output of the previous unit. In this way, the gesture generator is able to gradually generate a 3D gesture sequence corresponding thereto from the audio feature vector sequence. Taking the audio of weather forecast as an example, the gesture generator predicts the head and body gestures that the digital person should take when expressing the sentence according to the information in the audio feature vector sequence, such as the voice content of 'today weather is clear'. The poses are encoded into a 3D sequence of poses containing information about the position, rotation angle and speed of motion of the digital person at different points in time. This 3D pose sequence provides an important data basis for the subsequent generation of head motion video and body motion video.
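To make the data flow concrete, the following is a minimal sketch (not taken from the patent) of an audio-feature-to-pose pipeline in PyTorch; the GRU stack mirrors the described gesture generator, while the layer sizes, joint count and class names are illustrative assumptions.

```python
# Illustrative sketch (not the patent's implementation): per-frame audio
# features go through stacked GRU layers, which emit one 3D pose per frame.
import torch
import torch.nn as nn

class PoseGenerator(nn.Module):
    def __init__(self, audio_dim=256, hidden_dim=512, pose_dim=24 * 7, num_layers=2):
        super().__init__()
        # The GRU keeps temporal consistency between consecutive frames.
        self.gru = nn.GRU(audio_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        # Project each hidden state to a pose vector, e.g. 24 joints x
        # (3 position values + 4 quaternion values) = 168 numbers per frame.
        self.head = nn.Linear(hidden_dim, pose_dim)

    def forward(self, audio_features):
        # audio_features: (batch, n_frames, audio_dim)
        hidden_states, _ = self.gru(audio_features)
        return self.head(hidden_states)          # (batch, n_frames, pose_dim)

# Usage: one pose vector per audio frame, same sequence length as the input.
features = torch.randn(1, 250, 256)              # e.g. 10 s of audio at 25 fps
poses = PoseGenerator()(features)                # -> (1, 250, 168)
```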
In the embodiment of the present invention, the audio encoder includes a first-order encoding unit and a higher-order encoding unit, where the higher-order encoding unit is constructed by multiple linear network layers and a nonlinear activation function, and the foregoing step of encoding the audio data by the audio encoder to obtain an audio feature vector sequence corresponding to the audio data may be implemented by the following example implementation.
Encoding the audio data through the primary encoding unit to obtain a plurality of initial audio feature vectors, wherein each initial audio feature vector comprises a mel frequency spectrum logarithmic amplitude and energy of a current frame;
performing feature transformation on the plurality of initial audio feature vectors through the high-order coding unit to obtain the audio feature vector sequence, wherein the audio feature vector sequence comprises high-order audio feature vectors corresponding to each frame of audio data;
Inputting the audio feature vector sequence into a gesture generator formed by a plurality of gated recurrent units to obtain the 3D gesture sequence, wherein the method comprises the following steps of:
Inputting a high-order audio feature vector corresponding to an i-th frame and gesture data corresponding to an (i-1)-th frame into the pre-trained gesture generator to obtain gesture data corresponding to the i-th frame, wherein i is a positive integer, and when i=1 the gesture data corresponding to the (i-1)-th frame is randomly initialized gesture data, and the gesture data comprises absolute positions of human skeleton key points and rotation values of the human skeleton key points of the current frame;
And integrating the acquired multiple gesture data to obtain the 3D gesture sequence, wherein the 3D gesture sequence corresponds to the elements of the audio feature vector sequence one by one.
In an embodiment of the present invention, the server first processes the uploaded audio data using a primary encoding unit. The first-order coding unit is mainly responsible for extracting basic acoustic features from the audio. In this process, the server frames the audio and calculates the log mel-spectrum amplitude and energy for each frame. These features capture the spectral structure and energy distribution of the audio, providing the basis for the subsequent encoding process. For example, assuming that the uploaded audio is a singing performance, the initial stage encoding unit analyzes the singer's voice frame by frame, and extracts the log-mel-spectrum amplitude of each frame, which reflects the spectral characteristics of the singer when speaking, such as pitch and timbre. Meanwhile, the energy characteristics reflect the intensity and dynamic change of singer pronunciation. The server then transmits the plurality of initial audio feature vectors extracted by the first-order encoding unit to the high-order encoding unit for further feature conversion. The high-order coding unit is constructed by a plurality of linear network layers and nonlinear activation functions, and can abstract and learn the representation of the initial characteristics in a deep level. Taking singing performance as an example, the high-order coding unit can perform complex transformation and combination on the initial audio feature vector to capture deeper information in the audio, such as pronunciation habits, emotion expressions and the like of the singer. Through the processing of the high-order coding unit, the server obtains a series of high-order audio feature vectors, and the vectors are more compact and have expressive force, so that powerful input is provided for subsequent gesture generation. The server now inputs the high-order audio feature vector sequence into a gesture generator that is composed based on a plurality of gating loop units (GRUs). This gesture generator has been pre-trained and is capable of predicting a corresponding 3D gesture sequence from the input audio feature vector sequence. In particular operation, the server processes audio data frame by frame. For the higher order audio feature vector corresponding to the i-th frame of audio data, the server inputs it into the pose generator along with the pose data of the previous frame (i.e., i-1 st frame). When processing the first frame (i.e., i=1), the server will use the randomly initialized pose data as input since no pose data of the previous frame is available for reference. The GRU network in the gesture generator can be combined with the audio characteristics of the current frame and the gesture data of the previous frame to predict the gesture data of the current frame. These data include detailed information such as absolute positions and rotation values of human skeletal keypoints, which together form a 3D pose representation of the current frame. As the process proceeds from frame to frame, the server accumulates a complete 3D pose sequence. This sequence is in one-to-one correspondence in time with the original audio feature vector sequence, ensuring synchronicity between audio and pose. For example, in the context of a singing performance, as the singer's singing voice fluctuates, the gesture generator predicts the physical actions and expression changes that match it. These predicted pose data are not only closely tied to the audio content, but also visually present realistic performance effects.
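As a sketch of the two-stage encoding and the frame-by-frame generation described above, the snippet below computes per-frame log-mel amplitudes plus energy as first-order features, lifts them with a small MLP as the higher-order encoder, and then runs an autoregressive GRU step that feeds each predicted pose back in for the next frame. The library calls are standard librosa/PyTorch; the dimensions, the random initial pose and the helper names are assumptions for illustration.

```python
# Illustrative sketch, not the patent's exact implementation.
import numpy as np
import librosa
import torch
import torch.nn as nn

def first_order_features(wav, sr, n_mels=80, hop=512):
    """Per-frame log-mel spectrum magnitudes plus a per-frame energy term."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels, hop_length=hop)
    log_mel = librosa.power_to_db(mel)                        # (n_mels, n_frames)
    frames = librosa.util.frame(wav, frame_length=2 * hop, hop_length=hop)
    energy = np.log1p((frames ** 2).sum(axis=0))              # (n_frames',)
    n = min(log_mel.shape[1], energy.shape[0])
    return np.concatenate([log_mel[:, :n].T, energy[:n, None]], axis=1)  # (n, n_mels + 1)

higher_order = nn.Sequential(nn.Linear(81, 256), nn.ReLU(),
                             nn.Linear(256, 256), nn.ReLU())  # "high-order" encoder

cell = nn.GRUCell(256 + 168, 512)     # audio feature + previous pose -> hidden state
proj = nn.Linear(512, 168)            # hidden state -> pose of the current frame

wav, sr = np.random.randn(16000 * 4).astype(np.float32), 16000   # stand-in 4 s clip
feats = higher_order(torch.from_numpy(first_order_features(wav, sr)).float())

h = torch.zeros(1, 512)
pose = torch.randn(1, 168) * 0.01     # frame 1 has no predecessor: random init
poses = []
for i in range(feats.shape[0]):       # one step per audio frame
    h = cell(torch.cat([feats[i:i + 1], pose], dim=1), h)
    pose = proj(h)
    poses.append(pose)
pose_sequence = torch.stack(poses, dim=1)   # (1, n_frames, 168): one 3D pose per frame
```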
The aforementioned step S203 may be implemented by the following example execution.
Encoding the audio data through an audio encoder to obtain an audio feature vector sequence corresponding to the audio data;
Extracting a 3D head gesture sequence from the 3D gesture sequence, and inputting the 3D head gesture sequence into a light processor to obtain 3D position information corresponding to the 3D head gesture sequence, wherein the 3D position information comprises the 3D positions of the sampling points along the light directions corresponding to each frame of 3D head gesture in the 3D head gesture sequence;
Inputting the 3D position information into a position encoder to obtain a position encoding feature sequence;
Inputting the position encoding feature sequence, the audio feature vector sequence and the light directions of the sampling points into a pre-trained neural radiance field to obtain color and density information of each sampling point;
And performing volume rendering on the color and density information of each sampling point to obtain the head action video, wherein the head action video comprises expression and lip movement.
In an embodiment of the present invention, the server first encodes the uploaded target speech using an audio encoder. This process is similar to the previous description in that the audio encoder converts the original audio data into a sequence of audio feature vectors that capture key sound characteristics in the audio, providing important input information for the subsequent generation of head motion video. For example, assuming that the target speech is a piece of audio telling a story, the audio encoder analyzes the piece of audio piece by piece, extracts key features such as speech speed, intonation, emotion, etc., of the telling person, and converts these features into a sequence of audio feature vectors. Next, the server will extract a 3D head pose sequence related to the head motion from the previously generated 3D pose sequence. This sequence describes the motion trajectory and posture change of the head when the digital person expresses the target voice. The server then inputs the 3D head pose sequence into the ray processor. The task of the ray processor is to calculate the 3D position information of the ray direction of the sampling point corresponding to each 3D head pose. This information is critical for the subsequent generation of head motion video with realistic lighting effects. Taking the audio of a story as an example, when the seminar describes a scene, his head may turn or tilt as the story progresses. The light processor calculates corresponding light directions according to the head movements, so as to ensure that the light and shadow changes in the generated head movement video are consistent with the real scene. The server then inputs the 3D position information output by the light processor into a position encoder. The role of the position encoder is to convert these 3D position information into a form that is easier to handle by the neural network, i.e. the position-encoded signature sequence. The feature sequence captures the spatial position relation of head motion and provides key input for the subsequent nerve radiation field treatment. The server now inputs the position-coded feature sequence, the audio feature vector sequence, and the ray direction of the sample points together into the pre-trained neural radiation field. The neural radiation field is a deep learning model that predicts the color and density information for each sample point based on the input eigenvectors and ray directions. This information is a key element in generating high quality head action video. Taking audio of the telling story as an example, when the teller describes a clear sky, the neural radiation field can predict bright blue as background color, and assign corresponding color and density values to each sampling point according to head motion and light direction. And finally, the server performs rendering processing on the color and density information output by the nerve radiation field by using a volume rendering technology to generate a final head action video. The video not only contains the head motion track and posture change of the digital person, but also presents vivid expression and lip movement effect. Taking the audio of the telling story as an example, after the volume rendering processing, the server generates a high-quality head action video, wherein the head of the digital person rotates and inclines along with the progress of the story, and brings the immersive viewing experience to the audience along with vivid expression and accurate lip movement effect.
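To illustrate the rendering path described above, here is a compact sketch of a frequency positional encoding, an audio-conditioned radiance-field MLP, and the standard volume-rendering accumulation of per-sample colour and density into a pixel; the network sizes, the conditioning scheme and the function names are assumptions rather than the patent's exact design.

```python
# Illustrative sketch (assumptions throughout), not the patent's exact network.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=10):
    # x: (..., 3) -> (..., 3 * 2 * num_freqs) frequency features
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
    xb = x[..., None] * freqs                              # (..., 3, num_freqs)
    return torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1).flatten(-2)

class AudioConditionedField(nn.Module):
    def __init__(self, pos_dim=60, dir_dim=3, audio_dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + dir_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                          # RGB + density per sample

    def forward(self, pos_enc, ray_dir, audio_feat):
        out = self.mlp(torch.cat([pos_enc, ray_dir, audio_feat], dim=-1))
        rgb, sigma = torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])
        return rgb, sigma

def volume_render(rgb, sigma, deltas):
    # rgb: (n_samples, 3); sigma, deltas: (n_samples,) distances between samples
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)             # final pixel colour
```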
In an embodiment of the present invention, the neural radiance field is trained with the loss function L = Σ_p w_p · ‖I_p − Î_p‖², where I refers to a frame in the training samples, Î refers to the corresponding frame generated by the neural radiance field, and w refers to the weight coefficient.
In the embodiment of the present invention, the weight coefficient is defined in such a way that the area value belonging to the background is 0, the area value belonging to the head is 1, and the area value belonging to the mouth is 2. In summary, the head motion generation module inputs as speech and 3D gestures, and outputs as head motion video.
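A minimal sketch of this weighted per-pixel reconstruction loss, assuming the 0/1/2 weights come from a portrait-segmentation mask and assuming a squared-error form:

```python
# Illustrative sketch: region-weighted reconstruction loss for the head NeRF.
import torch

def head_nerf_loss(generated, target, seg_mask):
    # generated/target: (H, W, 3); seg_mask: (H, W) with {0: background, 1: head, 2: mouth}
    weights = seg_mask.float()                       # 0 / 1 / 2 weighting per region
    per_pixel = ((generated - target) ** 2).sum(dim=-1)
    return (weights * per_pixel).mean()
```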
In the embodiment of the present invention, the foregoing step S204 may be implemented by the following example execution.
And inputting the user image and the 3D gesture sequence into a preset diffusion model to obtain the body action video.
In the embodiment of the present invention, the server, after receiving the user image and the 3D gesture sequence uploaded by the user, inputs the data into a preset diffusion model. This diffusion model is a deep learning model that has learned through extensive training data how to generate coherent and natural body motion videos from static user images and dynamic 3D pose sequences. For example, suppose that the user has uploaded a full-body photograph of himself as the user image, while providing a 3D gesture sequence describing the motion profile of a dance motion. The server will first enter this photo into the diffusion model along with the 3D pose sequence. Inside the diffusion model, the model will first analyze the user's image to extract important elements such as the user's physical features, clothing patterns, and background information. The model then combines dynamic information in the 3D gesture sequence, such as the position change and rotation values of skeletal key points, to predict the body shape and motion trajectory of the user while performing dance movements. The diffusion model will then use this predictive information to generate each frame of body motion video. In this process, the model ensures that the generated video frames remain visually consistent with the user image while exhibiting dynamic changes in the 3D pose sequence. To achieve this, the model may employ advanced image processing techniques such as image fusion, motion blur, and shadow rendering to enhance the consistency and realism of the video. Finally, the server outputs a complete body movement video showing the general movement of the user when performing dance movements. The user may view this video to check whether the generated dance motion meets his own expectations and make subsequent adjustments and optimizations accordingly. It should be noted that the above process is automatically completed in the server, and the user only needs to upload corresponding data and wait for a period of time to obtain the customized body action video. The technology provides a brand new and highly personalized video generation mode for users, and has wide application prospect and market potential.
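The interface below is purely hypothetical (the embodiment references a DreamPose-style diffusion model but exposes no API); it only illustrates the data flow of this step: one reference user image plus one pose per frame in, one rendered video frame per pose out.

```python
# Hypothetical interface sketch only; every name below is an assumption.
from typing import Sequence
import numpy as np

class PoseConditionedDiffusion:
    """Stand-in for a pretrained pose-conditioned image diffusion model."""
    def generate_frame(self, reference_image: np.ndarray, pose: np.ndarray) -> np.ndarray:
        raise NotImplementedError("load real pretrained weights here")

def render_body_video(model: PoseConditionedDiffusion,
                      user_image: np.ndarray,
                      pose_sequence: Sequence[np.ndarray]) -> list[np.ndarray]:
    # One generated frame per 3D pose, so the frame count matches the pose
    # sequence (and therefore the audio) exactly.
    return [model.generate_frame(user_image, pose) for pose in pose_sequence]
```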
In the embodiment of the present invention, the aforementioned step S205 may be implemented by the following example execution.
Dividing the body action video into a head action video to be determined and a body action video to be determined based on a preset portrait segmentation algorithm;
Inputting the head action video and the undetermined body action video into a fusion device trained by a discriminator, so that the head action video replaces the undetermined head action video to obtain the 2D digital human video.
In the embodiment of the invention, the server first processes the body motion video with a preset portrait segmentation algorithm. The portrait segmentation algorithm identifies the portrait region in the video and separates it from the background. In this process, the server divides the body motion video into two parts: a pending head motion video and a pending body motion video. For example, suppose the body motion video is a dance performance showing the dancer's overall movements. Through the portrait segmentation algorithm, the server accurately identifies the dancer's head region and treats it as the pending head motion video, while the rest of the dancer's body is treated as the pending body motion video. Next, the server inputs the head motion video and the pending body motion video into a fusion device that has been trained against a discriminator. Through this training, the fusion device can seamlessly merge head motions and body motions into a coherent 2D digital human video. In the process, the fusion device uses the guidance provided by the discriminator to judge how well the head motion video matches the pending body motion video, and fuses the two videos accordingly. Specifically, the fusion device replaces the pending head region with the corresponding region of the head motion video while retaining the other body parts of the pending body motion video. Taking the dance performance video as an example, assume the head motion video shows the dancer's head making various expressions, and the pending body motion video shows the dancer's body performing the dance. After the fusion device's processing, the server generates a new 2D digital human video in which the dancer's head movements and body movements are smoothly combined, so the performance appears vivid and coherent. Finally, the server outputs the fused 2D digital human video. The video not only preserves the body dynamics of the original body motion video, but also incorporates the rich expressions and accurate lip movements of the head motion video, providing the user with a highly realistic virtual character performance. The user can watch this video to enjoy the digital human's performance or share it with others for presentation and communication.
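As a sketch of the segmentation-and-replacement step (before the fusion device repairs the seams), the following assumes a boolean head mask produced by the portrait segmentation algorithm; names and shapes are illustrative.

```python
# Illustrative sketch: naive head-region replacement prior to fusion.
import numpy as np

def rough_head_swap(body_frame: np.ndarray,
                    head_frame: np.ndarray,
                    head_mask: np.ndarray) -> np.ndarray:
    # body_frame/head_frame: (H, W, 3) uint8; head_mask: (H, W) bool, True on the head.
    composite = body_frame.copy()
    composite[head_mask] = head_frame[head_mask]   # naive paste; seams remain
    return composite                               # the fusion device then cleans the seams
```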
The loss function of the discriminator in the embodiment of the invention is as follows:
L_D = log D(x) + log(1 − D(G(z)))
where x represents a real sample, D(·) represents the output of the discriminator, G(z) represents the output of the fusion device, z represents the head motion video and the pending body motion video input to the fusion device, and a larger L_D indicates that the discriminator judges more accurately;
the loss function of the fusion device is as follows:
L_G = log(1 − D(G(z))) + λ · ‖h − ĥ‖₁
where D(·) represents the output of the discriminator, G(z) represents the output of the fusion device, z represents the head motion video and the pending body motion video input to the fusion device, λ represents a hyper-parameter, h represents the head region of the head motion video, ĥ represents the head region of the video generated by the fusion device, and a smaller L_G indicates a more accurate fusion device output.
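A sketch of these two objectives in PyTorch, under the assumption of standard log-based adversarial terms and an L1 head-region term (the exact expressions in the patent's figures are not reproduced on this page):

```python
# Illustrative sketch of the adversarial objectives as reconstructed above.
import torch

def discriminator_objective(d_real, d_fake):
    # d_real = D(x), d_fake = D(G(z)); larger means the discriminator is more accurate.
    return torch.log(d_real + 1e-8).mean() + torch.log(1.0 - d_fake + 1e-8).mean()

def fuser_loss(d_fake, fused_head, target_head, lam=0.2):
    # Adversarial term plus an L1 constraint that keeps the fused head region close
    # to the head motion video, so only the splicing artefacts get repainted.
    adv = torch.log(1.0 - d_fake + 1e-8).mean()
    return adv + lam * torch.abs(fused_head - target_head).mean()
```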
In order to more clearly describe the solutions provided by the embodiments of the present application, a more complete implementation is provided below.
In the embodiment of the invention, a section of voice is taken as input and passes through four modules, namely a 3D gesture generation module, a head motion generation module, a body motion generation module and a fusion module, and the final output is a 2D digital human video whose duration and frame rate are consistent with those of the voice. The specific implementation steps are as follows:
Step 1, a section of voice with any duration is obtained, and the voice can be the voice of a user, or can be the voice generated by a Text-To-Speech (TTS) algorithm.
Step 2, inputting the voice in step 1 into a 3D gesture generating module, which generates a 3D gesture sequence conforming to the rhythm according to the audio, wherein the 3D gesture comprises the absolute position of the key points of the human skeleton and the rotation value (usually expressed in the form of quaternion) of the key points of the human skeleton.
And 3, inputting the 3D gesture sequence obtained in the step 2 and the voice obtained in the step 1 into a head motion generating module to obtain a head motion video, wherein the head motion video refers to a video of a part above a neck, comprises facial expressions and lip movements, and can be understood as a digital human video which is talking and only comprises a head, and the lip movements are matched with the voice.
And 4, inputting the 3D gesture sequence obtained in the step 2 and the preset user image into a body motion generation module to generate a body motion video, wherein the limb motion in the body motion video is consistent with the motion represented by the 3D gesture sequence, but the expression and lips are unchanged.
And 5, inputting the head motion video obtained in the step 3 and the body motion video obtained in the step 4 into a fusion optimization module to obtain a final 2D digital human video, wherein the video comprises body motion, head motion, expression and lip motion.
In order to implement the above solution, the present invention includes four modules, each of which is functionally independent, and details related to each module will be described below.
A 3D gesture generation module. The function of this module is to generate, from the voice, a 3D gesture sequence P = {p_1, p_2, ..., p_n} that matches the rhythm of the voice, wherein each 3D gesture includes the absolute positions of human skeleton key points and the rotation values of the human skeleton key points; the result is used by the subsequent modules. One embodiment of this module is shown in fig. 2 and mainly comprises an audio encoder and a gesture generator. The function of the audio encoder is to convert the original audio into feature vectors that the model can understand. The number of audio frames n generally depends on the sampling rate of the audio and the number of samples per frame. Given a piece of audio, the audio encoder first obtains the features of the audio itself, called the initial feature vectors a_i, where i = 1, 2, ..., n. Each initial feature vector a_i consists of two parts: the logarithmic amplitude of the mel spectrum of the current frame and the energy of the current frame; the calculation of these two parts is standard and is not repeated here. The next step is to convert the initial feature vectors a_i into a more abstract representation, called the feature vectors f_i, with the aim of producing feature vectors that the subsequent gesture generator can understand more easily. One embodiment of this transformation is a neural network model composed of multiple linear network layers plus nonlinear activation functions. The audio encoder finally outputs a feature vector sequence F = {f_1, f_2, ..., f_n} of length n. The gesture generator is used to generate the 3D gesture sequence, and it must ensure that the generated 3D gesture sequence remains rhythmically consistent with the input audio. One embodiment of the gesture generator is a neural network model consisting of a plurality of gated recurrent units (Gated Recurrent Units, GRU), which are used here because of their good temporal modelling, ensuring consistency between the gestures of consecutive frames. The generation of the i-th frame gesture depends on the audio feature vector f_i and on the gesture generated for the (i-1)-th frame; when the first frame gesture is generated, since there is no preceding gesture as input, one embodiment randomly initializes a frame gesture as the initial input. The gesture generation module finally produces a 3D gesture sequence P = {p_1, p_2, ..., p_n} with n frames. Next, how to construct the training samples of the 3D gesture generation module is described. Each training sample consists of a piece of audio and a corresponding 3D gesture sequence; the frame count and frame rate of the audio and the gesture sequence must be consistent, and the rhythm of the 3D gesture sequence must match the rhythm of the audio, which requires manual intervention and auxiliary annotation. During training, an L1 loss function is used to compute the loss between the output and the target 3D gesture sequence, and the neural network parameters are optimized step by step. In summary, the 3D gesture generation module takes the voice as input and outputs the 3D gesture sequence P.
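A minimal training-step sketch for this module, assuming paired (audio features, annotated 3D pose sequence) samples and the L1 loss mentioned above; the stand-in GRU maps features straight to pose vectors, which simplifies the real generator.

```python
# Illustrative sketch: one L1 training step for a stand-in pose generator.
import torch
import torch.nn as nn

generator = nn.GRU(256, 168, num_layers=2, batch_first=True)   # stand-in for the real model
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)

audio_feats = torch.randn(1, 250, 256)      # encoded audio, n = 250 frames
target_poses = torch.randn(1, 250, 168)     # annotated 3D gesture sequence, same n

predicted, _ = generator(audio_feats)
loss = nn.functional.l1_loss(predicted, target_poses)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```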
A head motion generation module. The function of this module is to output a head motion video corresponding to the speech. The head motion video refers to video of the part above the neck, including facial expression and lip movement; it can be understood as a talking digital human video that contains only the head, with the lip movement matched to the voice. The head motion generation module may include an audio encoder, a ray processor, a position encoder, a neural radiance field, and volume rendering. The audio encoder here is similar to the audio encoder in the 3D gesture generation module; its main function is to convert the original audio into feature vectors. The generation of video is mainly accomplished by the ray processor, the neural radiance field (Neural Radiance Fields, NeRF), and volume rendering, which are described here. The neural radiance field is a computer vision technique: a trained neural radiance field constructs an implicit mapping from an object in three-dimensional space to a two-dimensional image. Given the camera position and camera view angle as input, the neural radiance field outputs the color and density of each point on each ray under that view angle, and the pixel value of each point on the 2D image is finally obtained through volume rendering; this is a relatively mature technique in the field of computer vision. The determination of each ray and the sampling of points on the ray are completed by the ray processor. One implementation of the ray processor follows the ray marching algorithm, which is also relatively mature: it takes the camera position and camera view angle as input, and outputs the ray direction corresponding to each pixel on the imaging 2D plane and the 3D positions of the sampling points on each ray. The position encoder encodes the 3D positions output by the ray processor, mapping them into a high-dimensional space, which is beneficial for training the neural radiance field. One implementation of the position encoder is multiple linear network layers plus a nonlinear activation function. The encoded 3D positions of the sampling points and the encoded audio are combined into a feature vector, which is input into the neural radiance field together with the ray direction of the current sampling point to obtain color and density information; the color and density information of all sampling points can be obtained in the same manner, and the pixel value of each point on the 2D image is then obtained through volume rendering. It should be noted that during inference the input 3D gesture sequence comes from the 3D gesture generation module and the head gesture in the 3D gesture is used, while during training the head gesture in the training sample is used; how to construct the training samples is introduced below. The head gesture can reflect the position and view angle of the camera: although in theory the generated video should be the result of a fixed camera position and a fixed view angle, motion is relative, and the motion of the head in the camera frame can equivalently be understood as the head being stationary while the camera moves. For example, turning the head to the right can be converted into turning the camera to the left, so as long as the head gesture is acquired, including the position and rotation information of the head, the position and view angle of the camera are also obtained.
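To make the ray-processor step concrete, here is a sketch (assumptions throughout) that builds one ray per pixel from a camera pose derived from the head pose and places evenly spaced sample points along each ray for the radiance field to query.

```python
# Illustrative sketch of ray generation and ray sampling; names and the simple
# pinhole model are assumptions, not the patent's exact ray processor.
import torch

def generate_rays(height, width, focal, cam_to_world):
    # cam_to_world: (4, 4) pose matrix recovered from the estimated head pose.
    j, i = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                          torch.arange(width, dtype=torch.float32), indexing="ij")
    dirs = torch.stack([(i - width * 0.5) / focal,
                        -(j - height * 0.5) / focal,
                        -torch.ones_like(i)], dim=-1)       # camera-space directions
    ray_dirs = dirs @ cam_to_world[:3, :3].T                # rotate into world space
    ray_origins = cam_to_world[:3, 3].expand_as(ray_dirs)
    return ray_origins, ray_dirs

def sample_along_rays(ray_origins, ray_dirs, near=0.1, far=1.0, n_samples=64):
    t = torch.linspace(near, far, n_samples)                # (n_samples,)
    points = ray_origins[..., None, :] + ray_dirs[..., None, :] * t[:, None]
    return points                                           # (H, W, n_samples, 3)
```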
When a training sample is constructed, a video containing only the head of the target person is needed, generally longer than 1 minute in duration. A portrait segmentation algorithm such as BiSeNet is used to separate the head from the background, since only the generation of the head is of concern during training; the mouth region can also be obtained through the portrait segmentation algorithm for subsequent optimization. The pose of the head, including its 3D position and rotation information, also needs to be estimated; one embodiment employs a head pose estimation algorithm such as OpenFace. The per-frame loss function during training can be defined as:
L = Σ_p w_p · ‖I_p − Î_p‖²
where I refers to a frame in the training sample, Î refers to the corresponding frame generated by the neural radiance field, and w is a weight coefficient of the same size as the frame; the weight is defined so that pixels belonging to the background have value 0, pixels belonging to the head have value 1, and pixels belonging to the mouth have value 2. In summary, the head motion generation module takes the speech and the 3D gestures as input and outputs the head motion video.
A body motion generation module. The function of this module is to generate body motion video from the 3D pose sequence and the user image. Wherein the user image may be a frame of an image containing the whole body extracted from the training sample of the head motion generation module. One embodiment of the module may use DreamPose algorithm, a diffusion model-based method, capable of generating more stable body motion. The DreamPose algorithm inputs the 3D gesture sequence and the whole body image of the user, can generate a motion video consistent with the number of frames of the gesture sequence, and the generated motion is the same as the motion represented by the 3D gesture, and the body motion video also comprises a head, but the head does not have any expression or lip movement. In summary, the body motion generation module inputs as a user image and a 3D gesture sequence, and outputs as a body motion video.
A fusion optimization module. The function of this module is to fuse the body motion video generated by the body motion module and the head motion video generated by the head motion module into a final 2D digital human video. The fusion optimization module may include a head detection module, a fusion device, and a discriminator. The head detection module detects the head region in the body motion video and in the head motion video and outputs it in the form of a mask; one embodiment can use an image segmentation algorithm, such as BiSeNet, to identify the head region. The head region of the body motion video is replaced by the head region of the head motion video, yielding a 2D digital human video to be optimized. Directly replacing the head region inevitably produces unnatural results, for example missing or misaligned pixels, so a fusion device and a discriminator are added to optimize the 2D digital human video to be optimized. Together, the fusion device and the discriminator work like a generative adversarial network (GAN): during inference the fusion device takes the 2D digital human video to be optimized as input, generates more realistic and natural imagery in the flawed parts of the replaced region while keeping other regions unchanged, and outputs the optimized 2D digital human video; during training the optimized output is handed to the discriminator to judge whether it is realistic enough. One implementation of the fusion device uses a U-Net network structure, whose input and output resolutions are consistent and which is commonly used for image enhancement; one embodiment of the discriminator is a stack of two-dimensional convolution layers plus nonlinear activation functions. During training, only the replaced region, namely the head region, needs to be cropped and input into the fusion device network; the input of the discriminator comprises real samples and fake samples, where the fake samples are outputs of the fusion device and the real samples come from the target user video used when training the head motion generation module. In order to maintain frame-to-frame continuity while combining the conventional adversarial loss, the loss function of the discriminator here may be defined as:
L_D = log D(x) + log(1 − D(G(z)))
where x represents a real sample, D(·) represents the output of the discriminator, typically a probability value that is closer to 1 the more realistic the input appears, G(z) represents the output of the fusion device, i.e. the fake sample, and z represents the image input to the fusion device; the larger L_D, the better. The loss function of the fusion device can be defined as:
L_G = log(1 − D(G(z))) + λ · ‖h − ĥ‖₁
where D(·) represents the output of the discriminator, G(z) represents the output of the fusion device, i.e. the fake sample, z represents the image input to the fusion device, λ represents a custom hyper-parameter that may be set, for example, to 0.2, h represents the head region of the head motion video, and ĥ represents the head region of the video generated by the fusion device; the smaller L_G, the better. The second term is added so that the fusion device focuses on repairing the flaws caused by the splicing without disrupting the continuity of the original video. In summary, the fusion optimization module takes the body motion video and the head motion video as input and outputs the final 2D digital human video.
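A sketch of the discriminator described above, i.e. a stack of strided 2D convolutions with nonlinear activations applied to the cropped head region; the depth, channel counts and the final pooling into a probability-like score are assumptions.

```python
# Illustrative sketch of a convolutional discriminator for the head region.
import torch
import torch.nn as nn

class HeadRegionDiscriminator(nn.Module):
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, 1, 4, stride=1, padding=1))

    def forward(self, x):
        # x: cropped head region, (batch, 3, H, W); output is squashed to a
        # probability-like score, closer to 1 meaning "more realistic".
        return torch.sigmoid(self.net(x)).mean(dim=(1, 2, 3))
```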
The four modules together form a complete voice-driven 2D digital human video generation method.
So designed, a complete 2D digital human video containing both head and body movements can be generated from speech. Many current methods focus only on generating head movements or mouth movements, and even when body movements are involved they are not substantially tied to the input speech; with this method, both the body movements and the head movements remain consistent with the speech. Labor cost can also be reduced. The traditional way of producing a talking video of a target user requires preparing a venue and a script, recording multiple takes, and editing the footage, whereas this method can generate the video directly from the user's voice alone, greatly saving time and labor cost. In actual use, the user only needs to input a section of voice to directly obtain the 2D digital human video of the target user, which is very convenient and fast.
Referring to fig. 2 in combination, fig. 2 is a voice-driven 2D digital human video generating apparatus 110 according to an embodiment of the present invention, including:
an acquisition module 1101, configured to acquire a target voice;
A processing module 1102, configured to obtain a 3D gesture sequence corresponding to the audio data according to the audio data of the target voice; obtaining a head action video corresponding to the target voice according to the target voice and the 3D gesture sequence; according to the user image and the 3D gesture sequence, a body action video is obtained;
And the generating module 1103 is configured to fuse the head motion video and the body motion video to obtain a 2D digital personal video.
It should be noted that, the implementation principle of the foregoing voice-driven 2D digital human video generating device 110 may refer to the implementation principle of the foregoing voice-driven 2D digital human video generating method, which is not described herein again. It should be understood that the division of the modules of the above apparatus is merely a division of a logic function, and may be fully or partially integrated into a physical entity or may be physically separated when actually implemented. And these modules may all be implemented in software in the form of calls by the processing element; or can be realized in hardware; the method can also be realized in a form of calling software by a processing element, and the method can be realized in a form of hardware by a part of modules. For example, the voice-driven 2D digital human video generating device 110 may be a processing element that is set up separately, may be implemented as an integrated chip of the device, may be stored in a memory of the device in the form of program codes, and may be called up by a processing element of the device to execute the functions of the voice-driven 2D digital human video generating device 110. The implementation of the other modules is similar. In addition, all or part of the modules can be integrated together or can be independently implemented. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in a software form.
For example, the modules above may be one or more integrated circuits configured to implement the methods above, such as: one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA), etc. For another example, when a module above is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
An embodiment of the present invention provides a computer device 100, where the computer device 100 includes a processor and a nonvolatile memory storing computer instructions, and when the computer instructions are executed by the processor, the computer device 100 executes the foregoing voice-driven 2D digital human video generating apparatus 110. As shown in fig. 3, fig. 3 is a block diagram of a computer device 100 according to an embodiment of the present invention. The computer device 100 comprises a voice-driven 2D digital human video generating means 110, a memory 111, a processor 112 and a communication unit 113.
For data transmission or interaction, the memory 111, the processor 112 and the communication unit 113 are electrically connected to each other directly or indirectly. For example, the elements may be electrically connected to each other via one or more communication buses or signal lines. The voice-driven 2D digital human video generating apparatus 110 includes at least one software function module that may be stored in the memory 111 in the form of software or firmware (firmware) or cured in an Operating System (OS) of the computer device 100. The processor 112 is configured to execute the voice-driven 2D digital human video generating device 110 stored in the memory 111, for example, software functional modules and computer programs included in the voice-driven 2D digital human video generating device 110.
An embodiment of the present invention provides a readable storage medium, which includes a computer program. When the computer program runs, it controls the computer device where the readable storage medium is located to execute the foregoing voice-driven 2D digital human video generation method.
The foregoing description, for purpose of explanation, has been presented with reference to particular embodiments. The illustrative discussions above are not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, to thereby enable others skilled in the art to best utilize the disclosure and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A method for voice-driven 2D digital human video generation, comprising:
acquiring a target voice;
acquiring a 3D gesture sequence corresponding to the audio data according to the audio data of the target voice;
obtaining a head action video corresponding to the target voice according to the target voice and the 3D gesture sequence;
obtaining a body action video according to a user image and the 3D gesture sequence; and
fusing the head action video and the body action video to obtain a 2D digital human video.
2. The method according to claim 1, wherein the obtaining the 3D gesture sequence corresponding to the audio data according to the audio data of the target voice includes:
encoding the audio data through an audio encoder to obtain an audio feature vector sequence corresponding to the audio data;
inputting the audio feature vector sequence into a gesture generator formed by a plurality of gated recurrent units to obtain the 3D gesture sequence.
3. The method of claim 2, wherein the audio encoder comprises a first-order encoding unit and a high-order encoding unit, the high-order encoding unit being constructed from a plurality of linear network layers and a nonlinear activation function, and the encoding the audio data through the audio encoder to obtain the audio feature vector sequence corresponding to the audio data comprises:
encoding the audio data through the first-order encoding unit to obtain a plurality of initial audio feature vectors, wherein each initial audio feature vector comprises the log-magnitude mel spectrum and the energy of the current frame; and
performing feature transformation on the plurality of initial audio feature vectors through the high-order encoding unit to obtain the audio feature vector sequence, wherein the audio feature vector sequence comprises a high-order audio feature vector corresponding to each frame of the audio data;
and the inputting the audio feature vector sequence into the gesture generator formed by a plurality of gated recurrent units to obtain the 3D gesture sequence comprises:
inputting the high-order audio feature vector corresponding to an i-th frame and the gesture data corresponding to an (i-1)-th frame into the pre-trained gesture generator to obtain the gesture data corresponding to the i-th frame, wherein i is a positive integer, and when i=1, the gesture data corresponding to the (i-1)-th frame is randomly initialized gesture data, the gesture data comprising absolute positions of human skeleton key points and rotation values of the human skeleton key points of the current frame; and
integrating the acquired pieces of gesture data to obtain the 3D gesture sequence, wherein the elements of the 3D gesture sequence correspond one-to-one to the elements of the audio feature vector sequence.
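For illustration only, a minimal PyTorch sketch of the audio encoding and GRU-based gesture generation described in claims 2-3 might look as follows; the layer sizes, the 81-dimensional log-mel-plus-energy feature, and the 165-dimensional flattened pose are assumptions for the example, not values taken from the patent.

```python
import torch
import torch.nn as nn

class HighOrderEncoder(nn.Module):
    """High-order encoding unit: stacked linear layers with a nonlinearity (claim 3)."""
    def __init__(self, in_dim=81, hidden=256, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, out_dim), nn.LeakyReLU(),
        )

    def forward(self, feats):            # feats: (T, in_dim) log-mel + energy per frame
        return self.net(feats)           # (T, out_dim) high-order audio feature sequence

class GRUPoseGenerator(nn.Module):
    """Gesture generator built from gated recurrent units.

    Step i consumes the i-th high-order audio feature and the (i-1)-th pose and
    emits the i-th pose (flattened keypoint positions and rotations).
    """
    def __init__(self, feat_dim=256, pose_dim=165, hidden=512, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim + pose_dim, hidden, num_layers=num_layers)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, feats, init_pose):
        poses, prev, h = [], init_pose, None
        for t in range(feats.shape[0]):
            step = torch.cat([feats[t], prev], dim=-1).view(1, 1, -1)
            y, h = self.gru(step, h)
            prev = self.out(y).view(-1)
            poses.append(prev)
        return torch.stack(poses)        # (T, pose_dim) 3D gesture sequence

# Usage: the first step starts from a randomly initialized pose, as in claim 3.
encoder, generator = HighOrderEncoder(), GRUPoseGenerator()
initial_features = torch.randn(50, 81)   # 50 frames of first-order features
pose_sequence = generator(encoder(initial_features), torch.randn(165))
```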
4. The method according to claim 1, wherein the obtaining a head action video corresponding to the target voice according to the target voice and the 3D gesture sequence comprises:
encoding the audio data through an audio encoder to obtain an audio feature vector sequence corresponding to the audio data;
extracting a 3D head gesture sequence from the 3D gesture sequence, and inputting the 3D head gesture sequence into a ray processor to obtain 3D position information corresponding to the 3D head gesture sequence, wherein the 3D position information comprises the 3D positions, along the ray directions, of the sampling points corresponding to each frame of 3D head gesture in the 3D head gesture sequence;
inputting the 3D position information into a position encoder to obtain a position encoding feature sequence;
inputting the position encoding feature sequence, the audio feature vector sequence, and the ray directions of the sampling points into a pre-trained neural radiance field to obtain color and density information of each sampling point; and
performing volume rendering on the color and density information of each sampling point to obtain the head action video, wherein the head action video comprises expressions and lip movements.
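As context for the volume-rendering step in claim 4, the sketch below shows the standard NeRF-style compositing of per-sample colors and densities along one camera ray; the sample count and spacing are illustrative assumptions, not parameters from the patent.

```python
import torch

def volume_render(colors, densities, deltas):
    """Composite per-sample colors/densities along one ray into a pixel colour.

    colors: (N, 3) RGB at each sample point; densities: (N,); deltas: (N,)
    distances between consecutive samples along the ray.
    """
    alpha = 1.0 - torch.exp(-densities * deltas)                 # per-segment opacity
    # Transmittance: how much light survives up to (but excluding) sample i.
    survive = torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1]
    transmittance = torch.cumprod(survive, dim=0)
    weights = alpha * transmittance
    return (weights.unsqueeze(-1) * colors).sum(dim=0)           # rendered RGB value

# Example: 64 samples along a single ray of one head-gesture frame.
pixel_rgb = volume_render(torch.rand(64, 3), torch.rand(64), torch.full((64,), 0.02))
```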
5. The method of claim 4, wherein the neural radiance field is trained with a loss function of the form $\mathcal{L} = \lambda \lVert I - \hat{I} \rVert^{2}$, wherein $I$ refers to a frame in the training samples, $\hat{I}$ refers to the corresponding frame generated by the neural radiance field, and $\lambda$ refers to the weight coefficient.
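The published text renders the loss formula of claim 5 as an image placeholder, so its exact form is not recoverable; a plausible reading consistent with the named symbols is a weight-scaled per-frame reconstruction error, sketched below (the squared-error form is an assumption).

```python
import torch

def nerf_reconstruction_loss(rendered_frame, target_frame, weight=1.0):
    """Weight-scaled mean squared error between the rendered and training frames."""
    return weight * torch.mean((rendered_frame - target_frame) ** 2)

loss = nerf_reconstruction_loss(torch.rand(3, 256, 256), torch.rand(3, 256, 256), weight=0.5)
```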
6. The method of claim 1, wherein the obtaining a body action video according to the user image and the 3D gesture sequence comprises:
inputting the user image and the 3D gesture sequence into a preset diffusion model to obtain the body action video.
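Claim 6 leaves the diffusion model itself as a "preset" component; the toy DDPM-style sketch below only illustrates how a denoiser conditioned on the user image and one 3D pose could be iterated to produce a single body-video frame. The network architecture, noise schedule, and tensor shapes are assumptions, not the patent's model.

```python
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Toy noise predictor conditioned on the user image and one 3D pose.

    A real denoiser would also embed the timestep t; it is ignored here for brevity.
    """
    def __init__(self, img_ch=3, pose_dim=165, size=64):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, size * size)
        self.net = nn.Conv2d(img_ch * 2 + 1, img_ch, kernel_size=3, padding=1)

    def forward(self, x_t, user_image, pose, t):
        pose_map = self.pose_proj(pose).view(1, 1, *x_t.shape[-2:])
        return self.net(torch.cat([x_t, user_image, pose_map], dim=1))

@torch.no_grad()
def sample_frame(model, user_image, pose, steps=50):
    """DDPM-style reverse process: iteratively denoise from pure noise."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(user_image)
    for t in reversed(range(steps)):
        eps = model(x, user_image, pose, t)
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)   # re-inject noise
    return x                                                     # one body-video frame

frame = sample_frame(CondDenoiser(), torch.rand(1, 3, 64, 64), torch.randn(165))
```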
7. The method of claim 1, wherein the fusing, according to the head action video and the body action video, to obtain the 2D digital human video comprises:
dividing the body action video into a pending head action video and a pending body action video based on a preset portrait segmentation algorithm; and
inputting the head action video and the pending body action video into a fusion device trained with a discriminator, so that the head action video replaces the pending head action video, to obtain the 2D digital human video.
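A simple NumPy sketch of the split in claim 7 is given below; the portrait-segmentation mask is assumed to come from an arbitrary off-the-shelf segmenter, and the hard compositing function is only a baseline that the learned fusion device of claim 8 would refine.

```python
import numpy as np

def split_frame(body_frame, head_mask):
    """Split one body-video frame into pending head / pending body parts.

    body_frame: (H, W, 3) uint8 image; head_mask: (H, W) bool, True on head pixels.
    """
    pending_head = body_frame * head_mask[..., None]
    pending_body = body_frame * (~head_mask)[..., None]
    return pending_head, pending_body

def naive_head_replacement(pending_body, rendered_head_frame, head_mask):
    """Hard compositing baseline; claim 8's learned fusion device refines this."""
    return pending_body + rendered_head_frame * head_mask[..., None]

frame = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=bool)
mask[:96, 80:176] = True                         # pretend this is the segmented head
head_part, body_part = split_frame(frame, mask)
composite = naive_head_replacement(body_part, frame, mask)
```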
8. The method of claim 7, wherein the loss function of the discriminator is
$\mathcal{L}_{D} = \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]$,
wherein $x$ represents a real sample, $D(\cdot)$ represents the output of the discriminator, $G(\cdot)$ represents the output of the fusion device, $z$ represents the head action video and the pending body action video input to the fusion device, and a larger $\mathcal{L}_{D}$ indicates that the discriminator judges more accurately;
the loss function of the fusion device is
$\mathcal{L}_{G} = \mathbb{E}_{z}[\log(1 - D(G(z)))] + \lambda \lVert H_{\text{real}} - H_{G} \rVert_{1}$,
wherein $D(\cdot)$ represents the output of the discriminator, $G(\cdot)$ represents the output of the fusion device, $z$ represents the head action video and the pending body action video input to the fusion device, $\lambda$ represents a hyper-parameter, $H_{\text{real}}$ represents the head region of the head action video, $H_{G}$ represents the head region of the video generated by the fusion device, and a smaller $\mathcal{L}_{G}$ indicates that the output of the fusion device is more accurate.
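Because the loss formulas in claim 8 are likewise rendered as image placeholders, the PyTorch sketch below assumes standard adversarial terms plus a lambda-weighted head-region L1 term, consistent with the symbols the claim defines; it is not the patent's exact formulation.

```python
import torch

def discriminator_loss(d_real, d_fake):
    """Negated standard GAN objective so it can be minimised; a larger underlying
    objective means a more accurate discriminator, matching the claim's wording."""
    return -(torch.log(d_real + 1e-8).mean() + torch.log(1.0 - d_fake + 1e-8).mean())

def fuser_loss(d_fake, real_head_region, fused_head_region, lam=10.0):
    """Adversarial term plus a lambda-weighted L1 term on the head region."""
    adversarial = torch.log(1.0 - d_fake + 1e-8).mean()
    head_reconstruction = torch.abs(real_head_region - fused_head_region).mean()
    return adversarial + lam * head_reconstruction

# Example with dummy discriminator scores (in (0, 1)) and head-region crops.
d_real, d_fake = torch.rand(4), torch.rand(4)
loss_d = discriminator_loss(d_real, d_fake)
loss_g = fuser_loss(d_fake, torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64))
```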
9. A voice-driven 2D digital human video generating apparatus, comprising:
an acquisition module, configured to acquire a target voice;
a processing module, configured to acquire a 3D gesture sequence corresponding to audio data according to the audio data of the target voice, obtain a head action video corresponding to the target voice according to the target voice and the 3D gesture sequence, and obtain a body action video according to a user image and the 3D gesture sequence; and
a generation module, configured to fuse the head action video and the body action video to obtain a 2D digital human video.
10. A readable storage medium, characterized in that the readable storage medium comprises a computer program, which when run controls a computer device in which the readable storage medium is located to perform the voice-driven 2D digital human video generation method according to any one of claims 1-8.
CN202410658653.2A 2024-05-27 2024-05-27 Voice-driven 2D digital human video generation method and readable storage medium Pending CN118250529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410658653.2A CN118250529A (en) 2024-05-27 2024-05-27 Voice-driven 2D digital human video generation method and readable storage medium

Publications (1)

Publication Number Publication Date
CN118250529A (en) 2024-06-25

Family

ID=91551429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410658653.2A Pending CN118250529A (en) 2024-05-27 2024-05-27 Voice-driven 2D digital human video generation method and readable storage medium

Country Status (1)

Country Link
CN (1) CN118250529A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113507627A (en) * 2021-07-08 2021-10-15 北京的卢深视科技有限公司 Video generation method and device, electronic equipment and storage medium
US20210390748A1 (en) * 2020-06-12 2021-12-16 Baidu Usa Llc Personalized speech-to-video with three-dimensional (3d) skeleton regularization and expressive body poses
CN116934926A (en) * 2023-09-15 2023-10-24 杭州优航信息技术有限公司 Recognition method and system based on multi-mode data fusion
CN117237521A (en) * 2023-02-27 2023-12-15 长城信息股份有限公司 Speech driving face generation model construction method and target person speaking video generation method


Similar Documents

Publication Publication Date Title
Wang et al. One-shot talking face generation from single-speaker audio-visual correlation learning
Saunders et al. Adversarial training for multi-channel sign language production
US11514634B2 (en) Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses
Ma et al. Styletalk: One-shot talking head generation with controllable speaking styles
CN111383307A (en) Video generation method and device based on portrait and storage medium
CN112967212A (en) Virtual character synthesis method, device, equipment and storage medium
WO2022106654A2 (en) Methods and systems for video translation
EP3912159B1 (en) Text and audio-based real-time face reenactment
CN111401101A (en) Video generation system based on portrait
WO2023284435A1 (en) Method and apparatus for generating animation
Liao et al. Speech2video synthesis with 3d skeleton regularization and expressive body poses
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
CN114863533A (en) Digital human generation method and device and storage medium
Rebol et al. Real-time gesture animation generation from speech for virtual human interaction
Liu et al. Moda: Mapping-once audio-driven portrait animation with dual attentions
Tan et al. Style2talker: High-resolution talking head generation with emotion style and art style
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar
Wang et al. Talking faces: Audio-to-video face generation
CN117115310A (en) Digital face generation method and system based on audio and image
Zhang et al. Dr2: Disentangled recurrent representation learning for data-efficient speech video synthesis
Gowda et al. From pixels to portraits: A comprehensive survey of talking head generation techniques and applications
CN118250529A (en) Voice-driven 2D digital human video generation method and readable storage medium
CN117370605A (en) Virtual digital person driving method, device, equipment and medium
Liu Audio-Driven Talking Face Generation: A Review
Pelykh et al. Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination