CN117119123A - Method and system for generating digital human video based on video material - Google Patents

Method and system for generating digital human video based on video material

Info

Publication number
CN117119123A
Authority
CN
China
Prior art keywords
video
driving
data
image
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211698480.4A
Other languages
Chinese (zh)
Inventor
司马华鹏
刘杰
杜萍萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Silicon Intelligence Technology Co Ltd
Original Assignee
Nanjing Silicon Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Silicon Intelligence Technology Co Ltd filed Critical Nanjing Silicon Intelligence Technology Co Ltd
Publication of CN117119123A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g 3D video
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Some embodiments of the present application provide a method and a system for generating a digital human video based on video material. The method acquires the video material input by a user and detects the driving feature of the video material, where the driving feature is a video feature or an audio feature selected by the user. If the driving feature is an audio feature, the audio features of the audio data are extracted and the target person is driven based on them to generate a target person video; if the driving feature is a video feature, the motion in the video data is detected according to a preset motion migration model and migrated to the target person to generate the target person video. The method can automatically drive the target person to perform the corresponding actions according to the characteristics of the video material, so that the target person video is generated according to the user's requirements and the user experience is improved.

Description

Method and system for generating digital human video based on video material
Technical Field
The application relates to the technical field of digital human image processing, in particular to a method and a system for generating digital human video based on video materials.
Background
A digital person is a human-like figure produced by computer technology or software. Digital persons have the appearance or behavior patterns of humans, but they are not video recordings of a real person, and they can operate and exist independently. The body of a digital person exists in a computing device (e.g., a computer or a mobile phone) and is presented through a display device. Digital persons can perform work such as voice interaction, online live broadcasting, customer service communication, tour guiding and navigation, and shopping guidance.
A digital person can imitate the expressions and actions of a person in a preset template and then present a personalized digital person picture. When using a digital person, the user selects a template and a digital person: after selecting the digital person, the user selects the designated template and edits the text. The system converts the text into corresponding audio data and drives the digital person to perform the corresponding actions according to the selected template and the audio data, thereby generating a digital person video.
However, driving the digital person by manually selecting a template is cumbersome, and the user also has to edit the text corresponding to the speech, so the generation process of the digital human video is complicated. The manual template-selection mode therefore makes digital human video generation complex and inefficient, which reduces the user experience.
Disclosure of Invention
The application provides a method and a system for generating a digital human video based on video materials, which are used for solving the problem of low generation efficiency of the digital human video.
In a first aspect, some embodiments of the present application provide a method of generating digital human video based on video material, the method comprising:
acquiring driving video data, wherein the driving video data is video material, and the video material comprises video data and audio data corresponding to the video material;
detecting a driving form of the video material, wherein the driving form is a video driving or an audio driving selected by a user;
extracting audio features of the audio data if the driving form is an audio driving, and driving a target person based on the audio features to generate a target person video;
if the driving form is a video driving, extracting video characteristics of the video data to obtain video actions, and migrating the video actions to the target person to generate the target person video.
In some embodiments of the present application, the step of driving the target person based on the audio feature further includes: acquiring a target person image and extracting image features of the target person image; encoding the audio features and the image features to obtain audio coding data and image coding data; splicing the audio coding data and the image coding data to obtain image-sound data; synthesizing the image-sound data to generate a dynamic image coding result; and decoding the dynamic image coding result to generate a target dynamic image, wherein the target dynamic image is a video image of the corresponding action when the target person performs the audio data.
In some embodiments of the present application, the step of encoding the audio feature and the image feature includes: performing causal convolution calculation on the audio features to obtain causal convolution data; performing dilated convolution calculation on the causal convolution data to obtain dilated convolution data; obtaining a target residual of the convolution calculation, wherein the target residual is generated during the causal convolution calculation and the dilated convolution calculation; and generating the audio coding data according to the dilated convolution data and the target residual.
In some embodiments of the application, the method further comprises: acquiring text information and a sound sample input by a user, wherein the sound sample comprises tone characteristics selected by the user; converting the text information into driving audio data based on the sound samples; extracting audio features of the driving audio data, and driving a target person based on the audio features.
In some embodiments of the application, the step of migrating the action to the target person includes: analyzing the driving video data to obtain a source image and a driving image of the driving video data; an action migration model is trained using the source image and the driving image.
In some embodiments of the present application, the parsing the driving video data includes: analyzing global video frames of the driving video data, wherein the global video frames are video frames which are sequenced in time sequence in the driving video data; extracting two video frames from the global video frame, and marking one of the two video frames as a source image and the other as a driving image.
In some embodiments of the present application, before the training of the motion migration model using the source image and the driving image, the training method includes: acquiring an effective area of the source image, and marking the effective area of the source image as a first area; calculating a first area ratio of the first area in the source image; and if the first area duty ratio is smaller than a first preset value, preprocessing the source image so that the area duty ratio of the first area in the source image is larger than or equal to the first preset value.
In some embodiments of the present application, before the training of the motion migration model using the source image and the driving image, the method further includes: acquiring the effective area of the driving image, and marking the effective area of the driving image as a second area; calculating a second area ratio of the second area in the driving image; and if the second area ratio is smaller than a second preset value, preprocessing the driving image so that the area ratio of the second area in the driving image is larger than or equal to the second preset value.
In some embodiments of the present application, after obtaining the driving video data, the method further includes: detecting attribute information of the driving video data, wherein the attribute information is the domain information of the driving video data; extracting a target training sample according to the attribute information, wherein the target training sample is a training sample of that domain; and training a target character model according to the target training sample.
In a second aspect, some embodiments of the present application further provide a system for generating a digital human video based on video material, for performing the method for generating a digital human video based on video material. The system comprises an application unit configured to perform the following program steps:
acquiring driving video data, wherein the driving video data is video material, and the video material comprises video data and audio data corresponding to the video material;
detecting a driving form of the video material, wherein the driving form is a video driving or an audio driving selected by a user;
extracting audio features of the audio data if the driving form is an audio driving, and driving a target person based on the audio features to generate a target person video;
if the driving form is a video driving, extracting video characteristics of the video data to obtain video actions, and migrating the video actions to the target person to generate the target person video.
According to the above technical scheme, the method and the system for generating a digital human video based on video material provided by the embodiments of the application acquire the video material input by the user and then detect the driving feature of the video material, where the driving feature is a video feature or an audio feature selected by the user. If the driving feature is an audio feature, the audio features of the audio data are extracted and the target person is driven based on them to generate a target person video; if the driving feature is a video feature, the motion in the video data is detected according to a preset motion migration model and migrated to the target person to generate the target person video. The method can automatically drive the target person to perform the corresponding actions according to the characteristics of the video material, so that the target person video is generated according to the user's requirements and the user experience is improved.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a digital human video system according to some embodiments of the present application;
fig. 2 is a flowchart of a method for generating a digital human video based on video material according to some embodiments of the present application;
FIG. 3 is a flowchart of acquiring a source image and a driving image according to some embodiments of the present application;
FIG. 4 is a flowchart of extracting a target training sample according to attribute information according to some embodiments of the present application;
fig. 5 is a flowchart of a system for generating a digital human video based on video material according to some embodiments of the present application.
Detailed Description
Reference will now be made in detail to the embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the examples below do not represent all implementations consistent with the application; they are merely examples of systems and methods consistent with aspects of the application as set forth in the claims.
A digital person is a human-like figure or software product produced by computer technology, and digital human technology enables activities such as virtual anchoring, virtual idols, and virtual staff. A digital person is composed of preset figures, sounds, and scenes. In some embodiments, speech or other content input by a user can drive the digital person to display different actions or expressions so as to generate a digital person video.
The demand for digital persons varies across user groups and usage scenarios. A user therefore needs to customize the digital person service according to his or her own needs, that is, to communicate repeatedly with the digital person service provider about the scene, content, etc. of the digital person video. The service provider then builds a digital person model according to the user's description and generates the digital person video. However, such customized communication not only takes a long time, but the generated digital human video is also likely to differ from the video content the user expects, so the generation efficiency of the digital human video is low and the user experience is reduced.
As shown in fig. 1, some embodiments of the present application provide a digital human video system including a model unit, a template unit, a sound unit, and a script unit. The model unit is used for displaying digital human figures; the template unit is used for editing the scene of the digital human video, in which characters, images, audio and the like can be further edited to form the background of the digital human video and other displayed content; the sound unit is used for displaying sounds with different timbres and tones so as to drive the digital person to speak according to the sound characteristics selected by the user; and the script unit is used for determining the theme of the digital person video so as to drive the digital person to narrate according to the script selected by the user. Based on the digital person video system provided by this embodiment, the user can customize the figure of the digital person and the content of the digital person video to obtain the digital person video he or she needs.
It should be noted that the digital human video system of the above embodiment may be a combination of one or more of the above units, or may include other unit modules. In addition, the digital person video system is implemented on the basis of a pre-trained neural network model, and the digital person can form various actions according to the output of the neural network model so as to generate a digital person video.
Obviously, this digital person video system reduces the communication process between the user and the digital person service provider, allowing the user to customize the figure and content of the digital person video and improving the generation efficiency of the digital person video. However, the process of editing text, images, audio and the like is still cumbersome, so the generation process of the digital human video is not simple enough, which affects its generation efficiency.
Based on the above application scenario, in order to improve the user experience and solve the problem of low digital human video generation efficiency, some embodiments of the present application provide a method for generating a digital human video based on video material. As shown in fig. 2, the method specifically includes the following steps:
s100: drive video data is acquired.
The driving video data is the video material; it may be video material input by the user, or a video shot by the user based on the digital human video system. The video material includes video data and the audio data corresponding to the video material.
S200: the drive form of the video material is detected.
The drive is in the form of a user selected video drive or audio drive. If the user only needs to generate digital human video based on the audio content corresponding to the video material, selecting a driving mode as audio driving, and driving a target person based on the audio part of the video material to realize corresponding action; if the user needs to generate digital personal video based on the video content corresponding to the video material, for example: and selecting a video driver as a driving mode for dance, gymnastics and the like, and migrating actions of the characters in the video to the target characters so that the target characters generate target character videos according to the actions in the video materials. The target person is the digital person image selected by the user.
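The following is a minimal sketch of this top-level dispatch (S100-S400), for illustration only. The pipeline callables passed in (audio feature extraction, audio driving, action extraction and migration) are hypothetical placeholders for whatever models the system actually uses; only the branching logic comes from the text above.

```python
def generate_digital_human_video(material, driving_form, target_person,
                                 extract_audio_features, drive_by_audio,
                                 extract_actions, migrate_actions):
    """Route the material to the audio-driven or video-driven pipeline."""
    if driving_form == "audio":
        # Audio drive: use only the sound track of the material.
        features = extract_audio_features(material["audio"])
        return drive_by_audio(target_person, features)
    if driving_form == "video":
        # Video drive: detect the actions in the frames and migrate them.
        actions = extract_actions(material["video"])
        return migrate_actions(target_person, actions)
    raise ValueError(f"unsupported driving form: {driving_form!r}")
```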
S300: if the driving form is an audio driving, audio features of the audio data are extracted, and the target person is driven based on the audio features to generate a target person video.
When the driving mode of the video material is detected to be the audio drive, which indicates that the user needs to drive the target person with the audio data in the video material, the audio features of the audio data in the video material are extracted. The target person is then driven by the extracted audio features to generate the target person video required by the user.
In some embodiments, the audio features of the audio data in the video material include time-domain features, frequency-domain features, energy features, loudness, and the like. The audio data may also be speech recorded by the user through the recording function of another mobile terminal, or speech after sound processing, such as speeding up, slowing down, raising or lowering the pitch, or other processed speech data.
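As an illustration of the kinds of features mentioned above, the sketch below extracts frequency-domain coefficients and a frame-wise energy measure with librosa; the specific choice of MFCCs and RMS energy is an assumption, not the feature set prescribed by the patent.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path: str) -> np.ndarray:
    """Return a (14, n_frames) matrix of simple audio features."""
    y, sr = librosa.load(wav_path, sr=16000)              # time-domain waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # frequency-domain features
    rms = librosa.feature.rms(y=y)                        # frame-wise energy (loudness proxy)
    return np.concatenate([mfcc, rms], axis=0)
```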
To drive the target person to perform the corresponding actions with the audio features, in some embodiments the step of driving the target person based on the audio features further includes: acquiring a target person image and extracting the image features of the target person image; encoding the audio features and the image features to obtain audio encoded data and image encoded data; splicing the audio encoded data and the image encoded data to obtain image-sound data; synthesizing the image-sound data to generate a dynamic image encoding result; and decoding the dynamic image encoding result to generate a target dynamic image, where the target dynamic image is a video image of the corresponding action when the target person performs the audio data.
That is, the image of the target person and the audio features of the audio data are encoded together to generate a target dynamic image in which the target person performs the action corresponding to the audio data, and a plurality of target dynamic images constitute the corresponding target person video.
In order to ensure the quality and smoothness of the generated target person video, the time sequence of the audio features needs to be strictly maintained during encoding.
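A minimal PyTorch sketch of the encode-splice-synthesize-decode flow described above is given below. The layer sizes, the 64x64 image resolution, and the module names are illustrative assumptions; the patent does not fix a concrete architecture.

```python
import torch
import torch.nn as nn

class AudioImageFusion(nn.Module):
    def __init__(self, audio_dim=256, image_dim=256, latent_dim=512):
        super().__init__()
        self.audio_encoder = nn.Sequential(nn.Linear(14, audio_dim), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Linear(3 * 64 * 64, image_dim), nn.ReLU())
        self.synthesizer = nn.Linear(audio_dim + image_dim, latent_dim)  # fuse spliced data
        self.decoder = nn.Linear(latent_dim, 3 * 64 * 64)                # back to an image

    def forward(self, audio_feat, image):
        a = self.audio_encoder(audio_feat)          # audio encoded data
        v = self.image_encoder(image.flatten(1))    # image encoded data
        spliced = torch.cat([a, v], dim=-1)         # image-sound data
        code = self.synthesizer(spliced)            # dynamic image encoding result
        frame = self.decoder(code)                  # decoded target dynamic image
        return frame.view(-1, 3, 64, 64)
```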
In some embodiments, the target dynamic image may be a video image instructing the target person to perform the mouth-shape action corresponding to the audio data. The target dynamic image carries the figure of the target person and performs, through the mouth region of that figure, the mouth-shape action of the content corresponding to the audio data.
In some embodiments, the target dynamic image may be personalized according to the user's requirement, that is, the dynamic effect or expression, barrage, sticker, text, etc. with modification function may be added into the target dynamic image according to the user's requirement, so that the target character video formed by the target dynamic image is more vivid.
In some embodiments, conventional convolutional encoding, such as convolutional neural networks, deep neural networks, end-to-end neural network encoders, etc., may be employed for feature processing of the target person image.
Additionally, in some embodiments, causal convolution calculation may be performed on the audio features to obtain causal convolution data, and dilated convolution calculation may then be performed on the causal convolution data to obtain dilated convolution data. A target residual of the convolution calculation is then obtained, where the target residual is generated during the causal convolution calculation and the dilated convolution calculation, and the audio coding data is generated from the dilated convolution data and the target residual.
Illustratively, causal convolution calculation is performed on the audio features to obtain causal convolution data. In causal convolution, the input of each layer is obtained from the output of the previous layer, and the encoding of the audio features can be completed through the calculation of several convolution layers. However, if each output needs to depend on more of the inputs, and the inputs and outputs are farther apart in time, more convolution layers are required; as the number of convolution layers grows, problems such as vanishing gradients, difficult training, and poor fitting can occur. Therefore, dilated convolution calculation is also performed on the causal convolution data to obtain dilated convolution data. Dilated convolution skips part of the input, so the model obtains a larger receptive field with fewer layers, which alleviates the vanishing-gradient problem. Meanwhile, the residual generated during the convolution calculation is obtained, and the residual and the dilated convolution data obtained after the dilated convolution calculation are combined to obtain the audio coding data.
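A sketch of one causal, dilated 1-D convolution block with a residual connection, in the spirit of the encoding described above, is shown below. The channel count, kernel size, and dilation factor are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedBlock(nn.Module):
    def __init__(self, channels=64, kernel_size=3, dilation=2):
        super().__init__()
        # Left-padding of (k - 1) * d makes the convolution causal: each
        # output depends only on the current and earlier time steps.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                   # x: (batch, channels, time)
        y = F.pad(x, (self.pad, 0))         # pad only on the left (past) side
        y = self.conv(y)                    # dilated convolution, same output length
        return x + y                        # residual connection
```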
Based on the above embodiments, when the driving feature is an audio feature, the selected target person image and a piece of audio data may be input into an audio-driven person model, for example a model trained to drive the mouth shape of a person. The image features of the target person and the audio features of the audio data are extracted respectively and encoded to obtain audio coding data and image coding data. The audio coding data and the image coding data are then spliced to obtain image-sound data, and the image-sound data is synthesized and decoded to generate the target dynamic image.
In some embodiments, text information input by the user is also obtained along with a sound sample, where the sound sample includes the timbre characteristics selected by the user. The text information is then converted into driving audio data based on the sound sample, the audio features of the driving audio data are extracted, and the target person is driven based on these audio features.
That is, the user may also generate the audio data by entering text: the user inputs text information and selects the desired sound sample in the digital human video system, and the system automatically generates a piece of audio data as the driving audio of the target person according to the user's operation.
Illustratively, the sound samples include female broadcast voices, male broadcast voices, child voices, regional dialects, and the like. The user can select a suitable sound sample according to his or her own needs.
In order to meet the personalized needs of the user, in some embodiments the sound sample can be trained from voice data input by the user, that is, the voice input by the user is fed into a voice training model to train a specific sound sample according to the characteristics of the user's voice.
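For illustration, a minimal sketch of this text-driven branch follows: speech is synthesized from the user's text with the selected sound sample and then fed through the same audio-feature pipeline as audio driving. The tts() callable is a hypothetical text-to-speech interface; the patent does not name a specific model.

```python
def drive_from_text(text, sound_sample, target_person, tts,
                    extract_audio_features, drive_by_audio):
    """Text -> driving audio -> audio features -> driven target person."""
    wav = tts(text, voice=sound_sample)              # driving audio data
    features = extract_audio_features(wav)           # same features as audio driving
    return drive_by_audio(target_person, features)   # reuse the audio-driven path
```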
S400: If the driving form is a video driving, video features of the video data are extracted to obtain video actions, and the video actions are migrated to the target person to generate the target person video.
When the driving mode of the video material is detected to be the video drive, which indicates that the user needs to drive the target person with the video data in the video material, the video features in the video data, that is, the image features of its video frames, are extracted. The actions in the video features are then migrated to the target person, so that the target person reproduces the actions of the character in the video material to generate the target person video.
In some embodiments, the video data includes, but is not limited to, face video, person motion video, animal motion video, object motion video, animated video, and other video material containing moving persons or objects.
For example, the duration of the video material may be about 10 seconds, with a resolution of 720P or 1080P, and the frame rate of the video material may be 25 frames/second or 30 frames/second.
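A sketch of checking an uploaded material against the suggested parameters above with OpenCV is shown below; the tolerance applied to the "about 10 seconds" duration is an assumption.

```python
import cv2

def material_is_acceptable(path: str) -> bool:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    duration = frame_count / fps if fps else 0.0
    return (round(fps) in (25, 30)             # 25 or 30 frames/second
            and int(height) in (720, 1080)     # 720P or 1080P
            and 5.0 <= duration <= 15.0)       # roughly 10 seconds
```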
In some embodiments, the step of migrating the action to the target person further comprises: and analyzing the driving video data to obtain a source image and a driving image of the driving video data. And training an action migration model by using the source image and the driving image so as to migrate actions in the driving video data.
In some embodiments, the driving video data comprises a plurality of video frames ordered in time sequence, i.e., each video frame is a still picture of the video at a certain instant. During action migration, the trained migration model sequentially extracts one video frame at a time from the driving video data as the driving image.
To facilitate acquisition of the source image and the driving image, in some embodiments the global video frames of the driving video data are also parsed, the global video frames being the chronologically ordered video frames in the driving video data. Two video frames are then extracted from the global video frames; one is marked as the source image and the other as the driving image.
For example, as shown in fig. 3, if the duration of the video material is 10 s and the frame rate is 30 frames/second, parsing the global video frames of the material yields 300 frames in total. Two frames are then randomly extracted from these video frames; as shown in fig. 3, one frame is marked as the source image and the other as the driving image.
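The sketch below illustrates parsing the global video frames and randomly marking one as the source image and another as the driving image, using OpenCV; the uniform random sampling strategy is an assumption.

```python
import random
import cv2

def sample_source_and_driving(path: str):
    """Return (source image, driving image) drawn from the global frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    ok, frame = cap.read()
    while ok:                          # collect the time-ordered global frames
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    idx_src, idx_drv = random.sample(range(len(frames)), 2)
    return frames[idx_src], frames[idx_drv]
```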
In addition, since it is the motion of a person or object in the video data that needs to be migrated, the moving person or object in the video data constitutes the effective image feature. Therefore, in some embodiments, the proportion of the video frame occupied by the moving object in the driving video data is also detected, the moving object being a moving person or object. If the proportion occupied by the moving object is larger than a proportion threshold, the video frame is not processed; if it is smaller than or equal to the proportion threshold, the video frame is cropped around the moving object of the driving video data so as to generate new driving video data for action migration.
The video frames may be cropped automatically, that is, in some embodiments a face, human body, animal or object detection algorithm automatically detects the region where the moving object is located in each frame of the driving video data and computes a target region such that the detection regions of all video frames are a subset of the target region, thereby generating new driving video data containing only the region where the moving object is located.
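A sketch of this automatic cropping step follows: a detector is run on every frame, a target region enclosing all per-frame detections is computed, and each frame is cropped to it. The detect_subject() callable is a hypothetical detector, not an API named in the patent.

```python
import numpy as np

def crop_to_moving_subject(frames, detect_subject):
    """Crop all frames to a region that encloses every detected subject box."""
    # detect_subject(frame) -> (x1, y1, x2, y2) bounding box of the mover
    boxes = np.array([detect_subject(f) for f in frames])
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()   # target region is the union
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()   # of all detection regions
    return [f[int(y1):int(y2), int(x1):int(x2)] for f in frames]
```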
In some embodiments, an effective area of the source image is acquired and marked as a first area. And then calculating a first area ratio of the first area in the source image, and if the first area ratio is smaller than a first preset value, preprocessing the source image so that the area ratio of the first area in the source image is larger than or equal to the first preset value.
Similarly, in some embodiments, before training the action migration model using the source image and the driving image, further comprising: and acquiring the effective area of the driving image, and marking the effective area of the driving image as a second area. And then calculating a second area ratio of the second area in the driving image, and if the second area ratio is smaller than a second preset value, preprocessing the driving image so that the area ratio of the second area in the driving image is larger than or equal to the second preset value.
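For illustration, the sketch below applies the area-ratio check to a source or driving image: if the effective region (given here as an integer pixel box) occupies too small a share of the picture, the image is cropped around it so that it dominates the frame. The 0.3 threshold and the 20% margin are illustrative assumptions for the preset values, not figures from the patent.

```python
def enforce_area_ratio(image, box, min_ratio=0.3, margin=0.2):
    """Crop around the effective region if its area ratio is below min_ratio."""
    x1, y1, x2, y2 = box
    h, w = image.shape[:2]
    ratio = ((x2 - x1) * (y2 - y1)) / float(h * w)
    if ratio >= min_ratio:
        return image                                    # ratio already large enough
    dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)
    return image[max(0, y1 - dy):min(h, y2 + dy),       # crop with a small margin
                 max(0, x1 - dx):min(w, x2 + dx)]
```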
It should be noted that the method of driving the target person through the audio feature and the video feature in the above embodiment is only an exemplary method, and other methods may be adopted. The application is not limited in this regard.
In addition, for some specific fields, because of the particularity of the application scenario or of the user's requirements, a model trained with general training samples is not sufficient to drive the digital person's actions well in that field. For example, a user may expect to generate astronomy-related popular-science videos, but the audio or text content in the astronomy field contains many rare terms that general training samples cannot fully cover; as another example, a user may expect the emotional expressions of the target person in the digital human video to be richer, and a model trained with general training samples is not sufficient to support that need.
Thus, as shown in fig. 4, in some embodiments the attribute information of the driving video data, which is the domain information to which the driving video data belongs, is also detected. A target training sample is extracted according to the attribute information, the target training sample being a training sample of that domain, and the target character model is trained with the target training sample.
That is, according to the embodiments of the application, the target character model, i.e., the underlying neural network model, can be trained in a personalized way with special training samples of a specific field, so that the actions of the target person better fit the application scenario of the target person video.
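A minimal sketch of this domain-aware step is given below: the domain tag of the driving video data is read, training samples from that domain are kept, and the target character model is fine-tuned on them. The "domain" attribute name and the fine_tune() callable are hypothetical placeholders.

```python
def train_for_domain(video_attributes, all_samples, target_model, fine_tune):
    """Fine-tune the target character model on samples from the material's domain."""
    domain = video_attributes.get("domain")                         # e.g. "astronomy"
    domain_samples = [s for s in all_samples if s.get("domain") == domain]
    # Fall back to the general samples if no domain-specific ones exist.
    return fine_tune(target_model, domain_samples or all_samples)
```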
Based on the method for generating the digital human video based on the video material, some embodiments of the present application further provide a system for generating the digital human video based on the video material, as shown in fig. 5, where the system includes an application unit. Wherein, as shown in fig. 2, the application unit is configured to perform the following program steps:
S100: acquiring driving video data, wherein the driving video data is video material, and the video material comprises video data and audio data corresponding to the video material;
S200: detecting a driving form of the video material, wherein the driving form is a video driving or an audio driving selected by a user;
S300: extracting audio features of the audio data if the driving form is an audio driving, and driving a target person based on the audio features to generate a target person video;
S400: if the driving form is a video driving, extracting video characteristics of the video data to obtain video actions, and migrating the video actions to the target person to generate the target person video.
According to the above technical scheme, the method and the system for generating a digital human video based on video material provided by the embodiments of the application acquire the video material input by the user and then detect the driving feature of the video material, where the driving feature is a video feature or an audio feature selected by the user. If the driving feature is an audio feature, the audio features of the audio data are extracted and the target person is driven based on them to generate a target person video; if the driving feature is a video feature, the motion in the video data is detected according to a preset motion migration model and migrated to the target person to generate the target person video. The method can automatically drive the target person to perform the corresponding actions according to the characteristics of the video material, so that the target person video is generated according to the user's requirements and the user experience is improved.
Reference throughout this specification to "an embodiment," "some embodiments," "one embodiment," or the like means that a particular feature, component, or characteristic described in connection with the embodiment is included in at least one embodiment; thus the phrases "in an embodiment," "in some embodiments," "in at least another embodiment," or the like appearing throughout the specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, components, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, a particular feature, component, or characteristic shown or described in connection with one embodiment may be combined, in whole or in part, with features, components, or characteristics of one or more other embodiments without limitation. Such modifications and variations are intended to be included within the scope of the present application.
The above-provided detailed description is merely a few examples under the general inventive concept and does not limit the scope of the present application. Any other embodiments which are extended according to the solution of the application without inventive effort fall within the scope of protection of the application for a person skilled in the art.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application, which are intended to be comprehended within the scope of the present application.

Claims (10)

1. A method for generating digital human video based on video material, comprising:
acquiring driving video data, wherein the driving video data is video material, and the video material comprises video data and audio data corresponding to the video material;
detecting a driving form of the video material, wherein the driving form is a video driving or an audio driving selected by a user;
extracting audio features of the audio data if the driving form is an audio driving, and driving a target person based on the audio features to generate a target person video;
if the driving form is a video driving, extracting video characteristics of the video data to obtain video actions, and migrating the video actions to the target person to generate the target person video.
2. The method for generating digital human video based on video material according to claim 1, wherein the step of driving a target person based on the audio feature further comprises:
acquiring a target person image and extracting image features of the target person image;
encoding the audio features and the image features to obtain audio coding data and image coding data;
splicing the audio coding data and the image coding data to obtain image-sound data;
synthesizing the image-sound data to generate a dynamic image coding result;
and decoding the dynamic image encoding result to generate a target dynamic image, wherein the target dynamic image is a video image of a corresponding action when the target person executes the audio data.
3. The method of generating digital human video based on video material of claim 2, wherein the step of encoding the audio features and the image features comprises:
performing causal convolution calculation on the audio features to obtain causal convolution data;
performing dilated convolution calculation on the causal convolution data to obtain dilated convolution data;
obtaining a target residual of the convolution calculation, wherein the target residual is generated in the process of the causal convolution calculation and the dilated convolution calculation;
and generating the audio coding data according to the dilated convolution data and the target residual.
4. The method for generating digital human video based on video material of claim 1, further comprising:
acquiring text information and a sound sample input by a user, wherein the sound sample comprises tone characteristics selected by the user;
converting the text information into driving audio data based on the sound samples;
extracting audio features of the driving audio data, and driving a target person based on the audio features.
5. The method of generating digital human video based on video material of claim 1, wherein the step of migrating the action to the target person comprises:
analyzing the driving video data to obtain a source image and a driving image of the driving video data;
an action migration model is trained using the source image and the driving image.
6. The method of generating digital human video based on video material of claim 5, wherein the step of parsing the driving video data comprises:
analyzing global video frames of the driving video data, wherein the global video frames are video frames which are sequenced in time sequence in the driving video data;
extracting two video frames from the global video frame, and marking one of the two video frames as a source image and the other as a driving image.
7. The method for generating digital human video based on video material of claim 5, wherein before training an action migration model using the source image and the driving image, comprising:
acquiring an effective area of the source image, and marking the effective area of the source image as a first area;
calculating a first area ratio of the first area in the source image;
and if the first area duty ratio is smaller than a first preset value, preprocessing the source image so that the area duty ratio of the first area in the source image is larger than or equal to the first preset value.
8. The method for generating digital human video based on video material of claim 6, wherein before training an action migration model using the source image and the driving image, further comprising:
acquiring the effective area of the driving image, and marking the effective area of the driving image as a second area;
calculating a second area ratio of the second area in the driving image;
and if the second area ratio is smaller than a second preset value, preprocessing the driving image so that the area ratio of the second area in the driving image is larger than or equal to the second preset value.
9. The method for generating digital human video based on video material of claim 1, further comprising, after obtaining the driving video data:
detecting attribute information of the driving video data, wherein the attribute information is domain information of the driving video data;
extracting a target training sample according to the attribute information, wherein the target training sample is a training sample of the domain information;
and training a target character model according to the target training sample.
10. A system for generating digital human video based on video material for performing the method of generating digital human video based on video material as claimed in any one of claims 1-9, the system comprising an application unit configured to:
acquiring driving video data, wherein the driving video data is video material, and the video material comprises video data and audio data corresponding to the video material;
detecting a driving form of the video material, wherein the driving form is a video driving or an audio driving selected by a user;
extracting audio features of the audio data if the driving form is an audio driving, and driving a target person based on the audio features to generate a target person video;
if the driving form is a video driving, extracting video characteristics of the video data to obtain video actions, and migrating the video actions to the target person to generate the target person video.
CN202211698480.4A 2022-05-23 2022-12-28 Method and system for generating digital human video based on video material Pending CN117119123A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210571898 2022-05-23
CN2022105718982 2022-05-23

Publications (1)

Publication Number Publication Date
CN117119123A true CN117119123A (en) 2023-11-24

Family

ID=88802611

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202211311642.4A Pending CN117156199A (en) 2022-05-23 2022-10-25 Digital human short-video production platform and production method thereof
CN202211312102.8A Pending CN117119207A (en) 2022-05-23 2022-10-25 Digital human live broadcasting method
CN202211698480.4A Pending CN117119123A (en) 2022-05-23 2022-12-28 Method and system for generating digital human video based on video material

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN202211311642.4A Pending CN117156199A (en) 2022-05-23 2022-10-25 Digital human short-video production platform and production method thereof
CN202211312102.8A Pending CN117119207A (en) 2022-05-23 2022-10-25 Digital human live broadcasting method

Country Status (1)

Country Link
CN (3) CN117156199A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117376596B (en) * 2023-12-08 2024-04-26 江西拓世智能科技股份有限公司 Live broadcast method, device and storage medium based on intelligent digital human model

Also Published As

Publication number Publication date
CN117156199A (en) 2023-12-01
CN117119207A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
EP3889912B1 (en) Method and apparatus for generating video
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
JP2021192222A (en) Video image interactive method and apparatus, electronic device, computer readable storage medium, and computer program
WO2001046947A1 (en) Voice-controlled animation system
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
US11928767B2 (en) Method for audio-driven character lip sync, model for audio-driven character lip sync and training method therefor
CN113395569B (en) Video generation method and device
CN117119123A (en) Method and system for generating digital human video based on video material
CN112800263A (en) Video synthesis system, method and medium based on artificial intelligence
CN116524791A (en) Lip language learning auxiliary training system based on meta universe and application thereof
CN115550744B (en) Method and device for generating video by voice
CN115690277A (en) Video generation method, system, device, electronic equipment and computer storage medium
US11461948B2 (en) System and method for voice driven lip syncing and head reenactment
CN115376033A (en) Information generation method and device
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN112383722B (en) Method and apparatus for generating video
CN116257132A (en) Artificial intelligent self-adaptive meta-universe interaction experience system and application method
CN117135331A (en) Method and system for generating 3D digital human video
CN117456064A (en) Method and system for rapidly generating intelligent companion based on photo and short audio
CN117336567A (en) Video generation method, device, equipment and storage medium
CN114399984A (en) Sound effect generation method and device based on visual semantics
Cao et al. Modular Joint Training for Speech-Driven 3D Facial Animation
CN117808934A (en) Data processing method and related equipment
CN115100030A (en) Image regeneration method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination