CN114071204A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN114071204A
Authority
CN
China
Prior art keywords
data
mouth shape
sequence data
picture
audio
Prior art date
Legal status
Granted
Application number
CN202111354441.8A
Other languages
Chinese (zh)
Other versions
CN114071204B (en)
Inventor
向钊豫
范贤武
Current Assignee
Hunan MgtvCom Interactive Entertainment Media Co Ltd
Original Assignee
Hunan MgtvCom Interactive Entertainment Media Co Ltd
Priority date
Filing date
Publication date
Application filed by Hunan MgtvCom Interactive Entertainment Media Co Ltd
Priority to CN202111354441.8A
Publication of CN114071204A
Application granted
Publication of CN114071204B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/74 Browsing; Visualisation therefor
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7834 Retrieval using metadata automatically derived from the content, using audio features
    • G06F16/7837 Retrieval using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F16/784 Retrieval using metadata automatically derived from the content, the detected or recognised objects being people
    • G06F16/7847 Retrieval using metadata automatically derived from the content, using low-level visual features of the video content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302 Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072 Synchronising the rendering of multiple content streams on the same device
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/439 Processing of audio elementary streams
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Studio Circuits (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a data processing method and device that obtain text sequence data of a target video in a predefined period, obtain mouth shape graph sequence data output by a mouth shape generator based on picture sequence data and audio sequence data, where the picture sequence data, the audio sequence data and the text sequence data are matched, determine a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value, and update the mouth shape generator based on the first synchronization loss value. After the text sequence data and the mouth shape graph sequence data are obtained, the first synchronization loss value of the text sequence data and the mouth shape graph sequence data is calculated and the mouth shape generator is updated based on it, which optimizes the model performance of the mouth shape generator, improves its ability to generate mouth shape graphs from audio and pictures, and improves the matching degree, that is, the synchronicity, between the generated mouth shape graphs and the audio, thereby optimizing the sound-and-picture synchronization effect of the target video.

Description

Data processing method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus.
Background
With the development of science and technology, Artificial Intelligence (AI) technology keeps improving, and the variety of AI models keeps growing.
Currently, researchers have designed AI models that can fit a character's mouth shape to audio. Specifically, given a segment of audio and a segment of video, such an AI model can convert the mouth shape of the target character in the video according to the audio and then output a video in which the target character's mouth shape matches the audio. For example, when the AI model obtains a first audio and a first video in which an animated character is speaking, it can convert the mouth shape of the animated character in the first video according to the first audio, and then output the converted first video, that is, a video in which the animated character appears to be speaking the first audio.
However, existing AI models are weak at converting the mouth shape of a target character in a video based on audio, and their mouth shape conversion accuracy is low.
Disclosure of Invention
In view of the above problems, the present invention provides a data processing method and apparatus that overcome, or at least partially solve, the above problems. The technical solution is as follows:
a method of data processing, the method comprising:
obtaining text sequence data of a target video in a predefined period;
obtaining mouth shape graph sequence data output by a mouth shape generator based on picture sequence data and audio sequence data, wherein the picture sequence data and the audio sequence data are matched with the text sequence data;
determining a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value;
updating the mouth shape generator based on the first synchronization loss value.
Optionally, the determining the synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value includes:
inputting the text sequence data and the mouth shape graph sequence data to a first discriminator, wherein the first discriminator is a model used for calculating a synchronization loss value of the text and the mouth shape graph;
and obtaining the first synchronization loss value output by the first discriminator.
Optionally, the picture sequence data and the audio sequence data are matched at the same sequence position; the mouth shape generator outputs the mouth shape graph sequence data based on the picture sequence data and the audio sequence data, including:
the mouth shape generator generates the mouth shape graphs of the frames in the mouth shape graph sequence data based on video data, wherein the video data comprises the data of the picture sequence data and the audio sequence data at the same sequence position.
Optionally, the mouth shape generator generates a first mouth shape graph based on first video data, including:
the mouth shape generator extracts picture features and audio features from the picture and audio sampling point data in the first video data respectively;
the mouth shape generator processes the audio features and the picture features by means of multiple down-sampling, multiple up-sampling and multiple residual connections to obtain processed data;
and the mouth shape generator performs up-sampling on the processed data, and generates and outputs the first mouth shape graph.
Optionally, the mouth shape generator includes: an audio feature extraction layer, a picture feature extraction layer, an audio sampling layer, a first picture sampling layer, a second picture sampling layer, a merging layer, a first upsampling layer, a first connection layer, a second upsampling layer, a second connection layer and a third upsampling layer.
Optionally, the mouth shape generator extracts the picture features and the audio features from the picture and audio sampling point data in the first video data respectively, including:
the picture feature extraction layer extracts picture features from pictures in the first video data;
the audio feature extraction layer extracts audio features from the audio sample point data in the first video data.
Optionally, the mouth shape generator performs data processing on the audio features and the picture features by means of multiple down-sampling, multiple up-sampling and multiple residual connections to obtain the processed data, including:
the audio sampling layer performs downsampling on the audio features to obtain first downsampled data, and performs downsampling on the first downsampled data to obtain second downsampled data;
the first picture sampling layer performs downsampling on the picture characteristics to obtain third downsampled data;
the second picture sampling layer performs downsampling on the third downsampled data to obtain fourth downsampled data;
the merging layer merges the second downsampled data and the fourth downsampled data to obtain merged data;
the first up-sampling layer up-samples the merged data to obtain first up-sampled data;
the first connection layer performs residual connection on the first up-sampling data and the third down-sampling data to obtain first connection data;
the second up-sampling layer up-samples the first connection data to obtain second up-sampled data;
and the second connection layer performs residual connection on the second up-sampling data and the fourth down-sampling data to obtain the processed data.
Optionally, the method further includes:
inputting complete picture sequence data and the mouth shape graph sequence data to a second discriminator to obtain a picture loss value output by the second discriminator, wherein the complete picture sequence data comprises the complete pictures respectively corresponding to the pictures in the picture sequence data;
inputting the audio sequence data and the mouth shape graph sequence data to a third discriminator to obtain a second synchronization loss value output by the third discriminator;
the updating the mouth shape generator based on the first synchronization loss value includes:
obtaining a final loss value based on the first synchronization loss value, the picture loss value and the second synchronization loss value;
updating the mouth shape generator based on the final loss value.
A data processing apparatus, the apparatus comprising: the device comprises a first obtaining unit, a second obtaining unit, a first determining unit and a first updating unit; wherein:
the first obtaining unit is used for obtaining text sequence data of a target video in a predefined time period;
the second obtaining unit is used for obtaining the mouth shape graph sequence data output by the mouth shape generator based on picture sequence data and audio sequence data, wherein the picture sequence data and the audio sequence data are matched with the text sequence data;
the first determination unit is used for determining a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value;
the first updating unit is used for updating the mouth shape generator based on the first synchronization loss value.
Optionally, the first determining unit includes: a first input unit and a third obtaining unit;
the first input unit is used for inputting the text sequence data and the mouth shape graph sequence data to a first discriminator, and the first discriminator is a model used for calculating a synchronization loss value of a text and a mouth shape graph;
the third obtaining unit is configured to obtain the first synchronization loss value output by the first discriminator.
Optionally, the picture sequence data and the audio sequence data are matched at the same sequence position; the mouth shape generator outputs the mouth shape graph sequence data based on the picture sequence data and the audio sequence data, and is configured to:
generate the mouth shape graphs of the frames in the mouth shape graph sequence data based on video data, wherein the video data comprises the data of the picture sequence data and the audio sequence data at the same sequence position.
Optionally, the mouth shape generator generates a first mouth shape graph based on the first video data, and is configured to:
the mouth shape generator extracts picture features and audio features from the picture and audio sampling point data in the first video data respectively;
the mouth shape generator processes the audio features and the picture features by means of multiple down-sampling, multiple up-sampling and multiple residual connections to obtain processed data;
and the mouth shape generator performs up-sampling on the processed data, and generates and outputs the first mouth shape graph.
Optionally, the mouth shape generator includes: an audio feature extraction layer, a picture feature extraction layer, an audio sampling layer, a first picture sampling layer, a second picture sampling layer, a merging layer, a first upsampling layer, a first connection layer, a second upsampling layer, a second connection layer and a third upsampling layer.
Optionally, the mouth shape generator extracts the picture features and the audio features from the picture and audio sampling point data in the first video data respectively, and is configured to:
the picture feature extraction layer extracts picture features from pictures in the first video data;
the audio feature extraction layer extracts audio features from the audio sample point data in the first video data.
Optionally, the mouth shape generator performs data processing on the audio features and the picture features by means of multiple down-sampling, multiple up-sampling and multiple residual connections to obtain the processed data, and is configured to:
the audio sampling layer performs downsampling on the audio features to obtain first downsampled data, and performs downsampling on the first downsampled data to obtain second downsampled data;
the first picture sampling layer performs downsampling on the picture characteristics to obtain third downsampled data;
the second picture sampling layer performs downsampling on the third downsampled data to obtain fourth downsampled data;
the merging layer merges the second downsampled data and the fourth downsampled data to obtain merged data;
the first up-sampling layer up-samples the merged data to obtain first up-sampled data;
the first connection layer performs residual connection on the first up-sampling data and the third down-sampling data to obtain first connection data;
the second up-sampling layer up-samples the first connection data to obtain second up-sampled data;
and the second connection layer performs residual connection on the second up-sampling data and the fourth down-sampling data to obtain the processed data.
Optionally, the apparatus further comprises: a second input unit, a fourth obtaining unit, a third input unit, and a fifth obtaining unit; the first updating unit includes: a sixth obtaining unit and a second updating unit; wherein:
the second input unit is configured to input complete picture sequence data and the mouth shape graph sequence data to a second discriminator, where the complete picture sequence data includes the complete pictures respectively corresponding to the pictures in the picture sequence data;
the fourth obtaining unit is configured to obtain the picture loss value output by the second discriminator;
the third input unit is used for inputting the audio sequence data and the mouth shape graph sequence data to a third discriminator;
the fifth obtaining unit is configured to obtain a second synchronization loss value output by the third discriminator;
the sixth obtaining unit is configured to obtain a final loss value based on the first synchronization loss value, the picture loss value, and the second synchronization loss value;
the second updating unit is used for updating the mouth shape generator based on the final loss value.
According to the data processing method and device provided by this embodiment, text sequence data of a target video in a predefined period can be obtained; mouth shape graph sequence data output by a mouth shape generator based on picture sequence data and audio sequence data can be obtained, where the picture sequence data, the audio sequence data and the text sequence data are matched; a synchronization loss value of the text sequence data and the mouth shape graph sequence data is determined as a first synchronization loss value; and the mouth shape generator is updated based on the first synchronization loss value. After the text sequence data and the mouth shape graph sequence data are obtained, the first synchronization loss value of the text sequence data and the mouth shape graph sequence data is calculated and the mouth shape generator is updated accordingly, which optimizes the model performance of the mouth shape generator, improves its ability to generate mouth shape graphs from audio and pictures, and improves the matching degree, that is, the synchronicity, between the generated mouth shape graphs and the audio, thereby optimizing the sound-and-picture synchronization effect of the target video.
The foregoing is only an overview of the technical solutions of the present invention. To make the technical means of the present invention clearer, and to make the above and other objects, features and advantages of the present invention easier to understand, a detailed description of the present invention follows.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application; for those skilled in the art, other drawings can be derived from the provided drawings without creative effort.
Fig. 1 is a flowchart illustrating a first data processing method according to an embodiment of the present invention;
Fig. 2 is a diagram illustrating a process of generating a mouth shape graph by a mouth shape generator according to an embodiment of the present invention;
Fig. 3 is a flowchart of a second data processing method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram illustrating a first data processing apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, the present embodiment proposes a first data processing method, which may include the steps of:
s101, obtaining text sequence data of a target video in a predefined time period;
it should be noted that the present invention can be applied to electronic devices such as tablet computers and desktop computers.
The target video may be a video in which one or more characters speak within a certain period.
Optionally, the character may be a real character, such as a real person or an animal, and the target video may be a video generated by shooting the character with a camera, such as a video of a news anchor broadcasting news; optionally, the character may also be a virtual character, such as a virtual person or animal, in which case the target video may be a video produced with movie or television special effects.
Optionally, the target video may include text, pictures and audio that can be played synchronously in time.
Wherein the predefined period may be a certain period in the target video. Alternatively, the predefined period may be the whole period of the target video, or may be a partial period of the target video.
The text sequence data may be sequence data generated by arranging the text data of the target video within the predefined period in time order.
Optionally, the text sequence data can be obtained by extracting data from the target video; optionally, the invention can convert the audio in the target video into the corresponding text by using an audio-to-text tool; alternatively, the invention can also directly obtain already-extracted text sequence data, for example from another electronic device.
Specifically, in the process of extracting data from the target video, the invention can extract all text data of the target video within the predefined period and then determine all the text data as the text sequence data.
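For illustration, the following is a minimal sketch of assembling text sequence data from timestamped transcription output. The (timestamp, text) segment format is an assumption: any audio-to-text tool that yields timestamped transcripts could feed it; the patent does not prescribe this representation.

```python
def build_text_sequence(segments, period_start, period_end):
    # segments: (timestamp_seconds, text) pairs produced by some audio-to-text
    # tool; this timestamped-transcript format is an assumption.
    in_period = [(t, txt) for (t, txt) in segments
                 if period_start <= t < period_end]
    in_period.sort(key=lambda pair: pair[0])   # arrange in time order
    return [txt for (_, txt) in in_period]     # the text sequence data
```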
S102, obtaining mouth shape graph sequence data output by a mouth shape generator based on the picture sequence data and the audio sequence data, wherein the picture sequence data and the audio sequence data are matched with the text sequence data;
The mouth shape generator can be a model that outputs a corresponding mouth shape graph based on a picture and audio. Optionally, the mouth shape generator may be an existing mouth shape generator; optionally, it may be a mouth shape generator still in the training process, or an already trained mouth shape generator, which is not limited by the present invention.
The mouth shape graph may be a picture after mouth shape matching is performed on pictures in the picture sequence data.
It can be understood that when the mouth shape conversion performance of an AI model is weak, the picture of the video output by the AI model may be out of sync with the audio. Specifically, the method disclosed by the invention can update the structural parameters of the mouth shape generator, optimize its model structure, improve its mouth shape graph generation performance based on audio and pictures, and improve the matching degree, namely the synchronicity, between the mouth shape graphs generated by the mouth shape generator and the audio, thereby optimizing the sound-and-picture synchronization effect of the target video.
The picture sequence data may be sequence data generated based on each frame of a complete picture of the target video in a predefined period.
Optionally, the pictures in the picture sequence data may be pictures in which the character's mouth is blocked. Such a picture may be a face picture with the mouth region blocked, or a picture containing only the upper half of the character's face.
Optionally, the invention may obtain each frame of complete picture of the target video in the predefined period in advance, extract from each complete picture a picture in which the character's mouth is blocked, and arrange the extracted pictures in time order, thereby generating the picture sequence data (see the sketch below); the invention can also directly obtain already-generated picture sequence data, for example from another electronic device.
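A sketch of the mouth-blocking step: the following assumes each frame is an H x W x 3 uint8 NumPy array already cropped to the character's face, and simply zeroes out the lower half of the crop; the patent does not prescribe this particular masking scheme.

```python
import numpy as np

def block_mouth(frame: np.ndarray) -> np.ndarray:
    # Zero out the lower half of the face crop, hiding the mouth region.
    blocked = frame.copy()
    blocked[frame.shape[0] // 2:, :, :] = 0
    return blocked

def build_picture_sequence(frames):
    # frames are assumed to already be in playback (time) order.
    return [block_mouth(f) for f in frames]
```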
Wherein the audio sequence data may be sequence data generated based on audio data of the target video in a predefined period.
Optionally, the present invention may obtain all audio data of the target video in the predefined period in advance, perform audio sampling on the audio data at a preset audio sampling rate to obtain a corresponding number of audio sampling point data, and then sort the audio sampling point data in time order, thereby obtaining the audio sequence data. The invention can also directly obtain the audio sequence data, for example from another electronic device.
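A sketch of this sampling step, assuming a 16 kHz sampling rate and a 25 fps target video so that each sequence position holds the 640 audio samples covering one frame; these numbers are illustrative, not values fixed by the patent.

```python
import librosa

def build_audio_sequence(audio_path: str, sr: int = 16000, fps: int = 25):
    samples, _ = librosa.load(audio_path, sr=sr, mono=True)
    step = sr // fps  # audio sampling points per video frame
    # One chunk of audio sampling point data per sequence position, in time order.
    return [samples[i:i + step] for i in range(0, len(samples) - step + 1, step)]
```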
It should be noted that, the data corresponding to the text sequence data, the picture sequence data, and the audio sequence data at the same time in the predefined time period may be the text data, the picture data, and the audio data corresponding to the target video at the same time, respectively. Therefore, data corresponding to the text sequence data, the picture sequence data, and the audio sequence data at the same time may be matched, and data at the same sequence position of the text sequence data, the picture sequence data, and the audio sequence data may be matched.
Specifically, the mouth shape generator may output the mouth shape graph sequence data based on the picture sequence data and the audio sequence data. It should be noted that the data of the mouth shape graph sequence data, the picture sequence data, and the audio sequence data at the same sequence position may be matched.
Optionally, in another data processing method proposed in this embodiment, the picture sequence data and the audio sequence data are matched at the same sequence position; the mouth shape generator outputting the mouth shape graph sequence data based on the picture sequence data and the audio sequence data includes:
the mouth shape generator generates the mouth shape graphs of the frames in the mouth shape graph sequence data based on video data, where the video data comprises the data of the picture sequence data and the audio sequence data at the same sequence position.
It should be noted that the mouth shape generator used in the present invention generates the mouth shape graph through multiple rounds of picture feature sampling and residual connection, which improves the matching degree between the mouth shape graph and the audio, thereby improving the model performance of the mouth shape generator and the accuracy of the mouth shape graphs it generates.
Optionally, the mouth shape generator generates the first mouth shape graph based on the first video data, including:
the mouth shape generator extracts audio features and picture features from the picture and audio sampling point data in the first video data respectively;
the mouth shape generator processes the audio features and the picture features by means of multiple down-sampling, multiple up-sampling and multiple residual connections to obtain processed data;
the mouth shape generator performs up-sampling on the processed data, and generates and outputs a first mouth shape graph.
The first video data may include a frame of picture and an audio sample point data at the same sequence position of the picture sequence data and the audio sequence data, respectively. For example, the first video data may include a frame of picture of the picture sequence data at the first sequence position, and an audio sample point data of the audio sequence data at the first sequence position.
The present invention is not limited to the manner of extracting the picture feature and the audio feature. For example, the invention may extract audio features based on the mel-frequency spectrum.
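For example, Mel-spectrum features for one chunk of audio sampling point data could be extracted as below; the FFT size, hop length and number of Mel bands are illustrative assumptions, not values given by the patent.

```python
import numpy as np
import librosa

def extract_audio_features(chunk: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Log-Mel spectrogram of one audio chunk, shape (n_mels, frames).
    mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=80)
    return librosa.power_to_db(mel)
```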
Optionally, the mouth shape generator comprises: an audio feature extraction layer, a picture feature extraction layer, an audio sampling layer, a first picture sampling layer, a second picture sampling layer, a merging layer, a first upsampling layer, a first connection layer, a second upsampling layer, a second connection layer and a third upsampling layer.
Optionally, the mouth shape generator extracts the picture features and the audio features from the picture and audio sampling point data in the first video data respectively, including:
the picture feature extraction layer extracts picture features from pictures in the first video data;
the audio feature extraction layer extracts audio features from the audio sample point data in the first video data.
Optionally, the mouth shape generator performs data processing on the audio features and the picture features by means of multiple down-sampling, multiple up-sampling and multiple residual connections to obtain the processed data, including:
the audio sampling layer performs down-sampling on the audio features to obtain first down-sampling data, and performs down-sampling on the first down-sampling data to obtain second down-sampling data;
the first picture sampling layer performs downsampling on picture characteristics to obtain third downsampled data;
the second picture sampling layer performs downsampling on the third downsampled data to obtain fourth downsampled data;
the merging layer merges the second down-sampling data and the fourth down-sampling data to obtain merged data;
the first up-sampling layer up-samples the combined data to obtain first up-sampled data;
the first connection layer performs residual connection on the first up-sampling data and the third down-sampling data to obtain first connection data;
the second up-sampling layer up-samples the first connection data to obtain second up-sampled data;
and the second connection layer performs residual connection on the second up-sampling data and the fourth down-sampling data to obtain the processed data.
Optionally, the mouth shape generator up-sampling the processed data and generating and outputting the first mouth shape graph may include:
the third up-sampling layer up-samples the processed data to generate and output the first mouth shape graph.
To better describe the process by which the mouth shape generator generates the mouth shape graph in this embodiment, it is explained with reference to fig. 2. In fig. 2, the invention may input the matched first audio sampling point data and first picture to the mouth shape generator, and the audio feature extraction layer and the picture feature extraction layer in the mouth shape generator may extract audio features and picture features from the first audio sampling point data and the first picture, respectively;
then, the audio feature extraction layer can output the audio features to the audio sampling layer, which down-samples the audio features to obtain first downsampled data and continues down-sampling the first downsampled data to obtain second downsampled data; the picture feature extraction layer can output the picture features to the first picture sampling layer, which down-samples them to obtain third downsampled data and outputs the result to the second picture sampling layer; the second picture sampling layer may down-sample the third downsampled data to obtain fourth downsampled data;
then, the audio sampling layer may output the second downsampled data to the merging layer, the second picture sampling layer may output the fourth downsampled data to the merging layer, and the merging layer may merge the second downsampled data and the fourth downsampled data to obtain merged data and output the merged data to the first upsampling layer; the first up-sampling layer may up-sample the merged data to obtain first up-sampled data;
then, the first connection layer may obtain the first up-sampling data and the third down-sampling data from the first up-sampling layer and the first picture sampling layer respectively, perform residual connection on them to obtain first connection data, and output the first connection data to the second up-sampling layer; the second up-sampling layer up-samples the first connection data to obtain second up-sampled data;
then, the second connection layer may obtain fourth down-sampling data and second up-sampling data from the second picture sampling layer and the second up-sampling layer, respectively, perform residual connection on the fourth down-sampling data and the second up-sampling data, obtain processed data, and output the processed data to the third up-sampling layer; the third up-sampling layer up-samples the processed data, so that a frame of mouth shape graph can be generated and output.
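The dataflow of fig. 2 can be summarised in the following PyTorch sketch. The order of operations (feature extraction, two downsampling passes per branch, merging, alternating upsampling and residual connections, and a final upsampling that emits the mouth shape graph) follows the description above; all channel counts, kernel sizes and the bilinear resizing used to align tensors before merging or residually connecting them are assumptions, since the patent does not fix those details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def down(cin, cout):  # halves spatial size
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU())

def up(cin, cout):    # doubles spatial size
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.ReLU())

def fit(x, ref):      # resize x to ref's spatial size so shapes always match
    return F.interpolate(x, size=ref.shape[-2:], mode="bilinear", align_corners=False)

class MouthShapeGenerator(nn.Module):
    def __init__(self, c=32):
        super().__init__()
        self.audio_feat = nn.Conv2d(1, c, 3, padding=1)   # audio feature extraction layer
        self.pic_feat = nn.Conv2d(3, c, 3, padding=1)     # picture feature extraction layer
        self.audio_down1 = down(c, c)                     # audio sampling layer, first pass
        self.audio_down2 = down(c, c)                     # audio sampling layer, second pass
        self.pic_down1 = down(c, c)                       # first picture sampling layer
        self.pic_down2 = down(c, c)                       # second picture sampling layer
        self.merge = nn.Conv2d(2 * c, c, 1)               # merging layer
        self.up1 = up(c, c)                               # first upsampling layer
        self.up2 = up(c, c)                               # second upsampling layer
        self.up3 = nn.Sequential(up(c, c), nn.Conv2d(c, 3, 3, padding=1))  # third upsampling layer

    def forward(self, mel, picture):
        a = self.audio_feat(mel)        # audio features
        p = self.pic_feat(picture)      # picture features
        d1 = self.audio_down1(a)        # first downsampled data
        d2 = self.audio_down2(d1)       # second downsampled data
        d3 = self.pic_down1(p)          # third downsampled data
        d4 = self.pic_down2(d3)         # fourth downsampled data
        merged = self.merge(torch.cat([fit(d2, d4), d4], dim=1))  # merged data
        u1 = self.up1(merged)           # first upsampled data
        c1 = u1 + fit(d3, u1)           # first connection data (residual connection)
        u2 = self.up2(c1)               # second upsampled data
        processed = u2 + fit(d4, u2)    # processed data (residual connection)
        out = torch.sigmoid(self.up3(processed))
        return fit(out, picture)        # one frame of mouth shape graph
```

For instance, `MouthShapeGenerator()(torch.randn(1, 1, 80, 16), torch.randn(1, 3, 96, 96))` returns a (1, 3, 96, 96) tensor: one generated mouth shape graph for one sequence position.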
S103, determining a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value;
The synchronization loss value can be used to measure the gap in synchronicity between the text and the mouth shape graph.
It should be noted that since the data of the text sequence data, the picture sequence data and the audio sequence data at the same sequence position are matched, and the data of the mouth shape graph sequence data, the picture sequence data and the audio sequence data at the same sequence position are matched, the mouth shape graph sequence data and the text sequence data at the same sequence position are matched as well.
Alternatively, the present invention may calculate the synchronization loss value of the mouth shape graph sequence data and the text sequence data at each sequence position through a predefined loss function, add the calculated synchronization loss values, and determine the sum as the first synchronization loss value. For example, when the mouth shape graph sequence data and the text sequence data contain two sequence positions, the synchronization loss value at the first sequence position and the synchronization loss value at the second sequence position are calculated, the two values are added, and the sum is determined as the first synchronization loss value.
Optionally, step S103 may include:
inputting the text sequence data and the mouth shape graph sequence data into a first discriminator, wherein the first discriminator is a model used for calculating a synchronization loss value of the text and the mouth shape graph;
and obtaining a first synchronization loss value output by the first discriminator.
The first discriminator may be a model for measuring the gap in synchronicity between the text and the mouth shape graph.
It should be noted that the present invention may first train the first discriminator through machine learning into a model that satisfies the requirement, and then use the trained first discriminator to calculate the first synchronization loss value.
Optionally, in the process of training the first discriminator, text and mouth shape graphs with high or complete synchronicity may be used as positive samples, marked as 1; text and mouth shape graphs with low synchronicity may be used as negative samples, marked as 0.
Specifically, after receiving the text sequence data and the mouth shape graph sequence data, the first discriminator may calculate the synchronization loss value of the mouth shape graph sequence data and the text sequence data at each sequence position, add the calculated synchronization loss values, and determine the sum as the first synchronization loss value.
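The patent does not disclose the first discriminator's internal structure; the following is a hypothetical sketch in the spirit of SyncNet-style synchronicity scoring, where both inputs are embedded, compared by cosine similarity, and scored against the 1/0 labels mentioned above. The `text_enc` and `mouth_enc` embedding networks are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextMouthDiscriminator(nn.Module):
    """Scores text / mouth-shape-graph synchronicity in (0, 1)."""
    def __init__(self, text_enc: nn.Module, mouth_enc: nn.Module):
        super().__init__()
        self.text_enc = text_enc    # maps text features      -> (B, D) embeddings
        self.mouth_enc = mouth_enc  # maps mouth shape graphs -> (B, D) embeddings

    def forward(self, text_feats, mouth_graphs):
        t = F.normalize(self.text_enc(text_feats), dim=-1)
        m = F.normalize(self.mouth_enc(mouth_graphs), dim=-1)
        sim = (t * m).sum(dim=-1)   # cosine similarity per sequence position
        return (sim + 1) / 2        # map [-1, 1] to [0, 1]

def first_sync_loss(disc, text_feats, mouth_graphs):
    # Per-position synchronization losses against the "synchronous" label 1,
    # summed over sequence positions as described above.
    scores = disc(text_feats, mouth_graphs).clamp(1e-7, 1 - 1e-7)
    return F.binary_cross_entropy(scores, torch.ones_like(scores), reduction="sum")
```

During discriminator training, matched text/mouth-shape-graph pairs would carry label 1 and mismatched pairs label 0 under the same binary cross-entropy.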
And S104, updating the mouth shape generator based on the first synchronization loss value.
Specifically, the method and device can update the structural parameters of the mouth shape generator based on the first synchronization loss value, optimize the model structure and model performance of the mouth shape generator, and improve the synchronicity between the mouth shape and the audio in the mouth shape graphs generated by the mouth shape generator.
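A minimal sketch of one such update, assuming the generator is a trainable module like the `MouthShapeGenerator` sketch above; the optimizer choice and learning rate are assumptions, not patent details.

```python
import torch

def make_optimizer(generator: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam with this learning rate is an assumed, illustrative choice.
    return torch.optim.Adam(generator.parameters(), lr=1e-4)

def update_step(optimizer: torch.optim.Optimizer, loss: torch.Tensor):
    optimizer.zero_grad()
    loss.backward()   # back-propagate the first synchronization loss
    optimizer.step()  # update the generator's structural parameters
```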
Optionally, after the mouth shape generator is updated, the method shown in fig. 1 may be executed again to continue updating the mouth shape generator until its model performance meets the requirement.
Optionally, the invention may combine the mouth shape graph sequence data output by the updated mouth shape generator with the audio sequence data to output a video, so as to improve the sound-and-picture synchronization effect of the target video.
Optionally, the invention may provide a convolution layer at the structure of the mouth shape generator that receives input data, and enlarge the dimensions this convolution layer can receive (for example, from 96 × 96 data to 132 × 132 data), thereby increasing the amount of receivable data, so that the mouth shape graph is generated from more data and its accuracy is improved.
Optionally, the invention may further provide a picture enhancer in the mouth shape generator to improve attributes of the mouth shape graph such as sharpness and resolution.
Specifically, the mouth shape generator updated by the method shown in fig. 1 can be applied to an existing mouth-shape-fitting AI model (such as the Wav2Lip model), so as to improve the accuracy of the mouth shape graphs generated by the AI model, improve the AI model's ability to convert the target character's mouth shape in the video based on the audio, and optimize the synchronicity between the picture and the audio of the video output by the AI model.
After obtaining the text sequence data matched with the picture sequence data and the audio sequence data, and the mouth shape graph sequence data matched with the picture sequence data and the audio sequence data, the invention can calculate the first synchronization loss value of the text sequence data and the mouth shape graph sequence data, update the structural parameters of the mouth shape generator based on the first synchronization loss value, optimize the model structure and model performance of the mouth shape generator, improve its mouth shape graph generation performance based on audio and pictures, and improve the matching degree, namely the synchronicity, between the mouth shape graphs generated by the mouth shape generator and the audio, thereby optimizing the sound-and-picture synchronization effect of the target video.
The data processing method provided by this embodiment can obtain text sequence data of a target video in a predefined period; obtain mouth shape graph sequence data output by a mouth shape generator based on picture sequence data and audio sequence data, where the picture sequence data and the audio sequence data are matched with the text sequence data; determine a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value; and update the mouth shape generator based on the first synchronization loss value. After the matched text sequence data and mouth shape graph sequence data are obtained, the first synchronization loss value of the text sequence data and the mouth shape graph sequence data is calculated, and the structural parameters of the mouth shape generator are updated based on it, which optimizes the model structure and model performance of the mouth shape generator, improves its mouth shape graph generation performance based on audio and pictures, and improves the matching degree, namely the synchronicity, between the generated mouth shape graphs and the audio, thereby optimizing the sound-and-picture synchronization effect of the target video.
Based on the steps shown in fig. 1, the present embodiment proposes a second data processing method, as shown in fig. 3. The method may comprise the steps of:
s201, obtaining text sequence data of a target video in a predefined time period;
s202, obtaining mouth shape graph sequence data output by a mouth shape generator based on the picture sequence data and the audio sequence data, wherein the picture sequence data and the audio sequence data are matched with the text sequence data;
s203, determining a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value;
it should be noted that steps S201, S202, and S203 are consistent with the contents of steps S101, S102, and S103, respectively, and are not described herein again.
S204, inputting complete picture sequence data and the mouth shape graph sequence data into a second discriminator to obtain a picture loss value output by the second discriminator, wherein the complete picture sequence data comprises the complete pictures corresponding to the pictures in the picture sequence data;
specifically, the complete picture corresponding to each picture in the picture sequence data can be a complete picture in the target video.
The complete picture sequence data may be sequence data in which the complete pictures of the frames are arranged in time order. It can be understood that the data of the complete picture sequence data and the mouth shape graph sequence data at the same sequence position are matched.
The second discriminator can be used to measure the gap between the complete picture and the mouth shape graph.
The picture loss value may be a loss value of the complete picture sequence data and the mouth shape graph sequence data.
It should be noted that the invention can introduce the complete picture sequence data and the second discriminator, calculate the picture loss value, and update the mouth shape generator in combination with the picture loss value.
Specifically, after receiving the complete picture sequence data and the mouth shape graph sequence data, the second discriminator may calculate the loss value of the mouth shape graph sequence data and the complete picture sequence data at each sequence position, add the calculated loss values, and determine the sum as the picture loss value.
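As an illustration of a per-position picture loss summed over the sequence: the patent describes an adversarially trained second discriminator whose internals it does not disclose, so the simple L1 reconstruction term below is a swapped-in assumption that only shows the summation pattern, not the patent's loss definition.

```python
import torch.nn.functional as F

def picture_loss(complete_pics, mouth_graphs):
    # One loss term per sequence position, added up as described above.
    return sum(F.l1_loss(m, p) for m, p in zip(mouth_graphs, complete_pics))
```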
S205, inputting the audio sequence data and the mouth shape graph sequence data to a third discriminator to obtain a second synchronization loss value output by the third discriminator;
The third discriminator can be used to measure the gap in synchronicity between the audio and the mouth shape graph.
The second synchronization loss value may be a loss value of the audio sequence data and the mouth shape graph sequence data.
It should be noted that the invention may introduce the audio sequence data and the third discriminator to calculate the second synchronization loss value, and update the mouth shape generator in combination with the second synchronization loss value.
Specifically, after receiving the audio sequence data and the mouth shape graph sequence data, the third discriminator may calculate the synchronization loss value of the mouth shape graph sequence data and the audio sequence data at each sequence position, add the calculated synchronization loss values, and determine the sum as the second synchronization loss value.
Optionally, the first discriminator, the second discriminator and the third discriminator may be the same discriminator. In that case, the present invention may input the text sequence data, the audio sequence data, the complete picture sequence data and the mouth shape graph sequence data to this discriminator together, and have it calculate the first synchronization loss value, the picture loss value and the second synchronization loss value respectively. Optionally, the three discriminators may also be different discriminators.
S206, obtaining a final loss value based on the first synchronization loss value, the picture loss value and the second synchronization loss value;
specifically, after calculating the first synchronization loss value, the picture loss value and the second synchronization loss value, the present invention may perform a mathematical operation on them, for example directly adding the three values, and then determine the result of the operation as the final loss value.
Optionally, in the mathematical operation, the first synchronization loss value, the picture loss value and the second synchronization loss value may be weighted in advance, and a weighted summation may then be performed based on the three loss values and their weights. It should be noted that the present invention does not limit the specific manner or process of the mathematical operation.
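For example, a weighted summation could look like the sketch below; the weights are illustrative hyperparameters, not values given by the patent.

```python
W_SYNC1, W_PIC, W_SYNC2 = 1.0, 1.0, 0.5  # assumed weights

def final_loss(first_sync, picture, second_sync):
    # Weighted summation of the three loss values.
    return W_SYNC1 * first_sync + W_PIC * picture + W_SYNC2 * second_sync
```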
And S207, updating the mouth shape generator based on the final loss value.
It should be noted that the steps S206 and S207 may be an embodiment of the step S104.
Specifically, the method can update the structural parameters of the mouth shape generator based on the final loss value after the final loss value is obtained, and optimize the model structure of the mouth shape generator.
It should be noted that, by introducing the picture loss value and the second synchronization loss value, the invention can further improve the model performance of the mouth shape generator, further improve the mouth shape diagram generation performance of the mouth shape generator based on the audio and the picture, and improve the matching degree, i.e. the synchronicity, of the mouth shape diagram generated by the mouth shape generator and the audio, thereby optimizing the sound and picture synchronization effect of the target video.
The data processing method provided by this embodiment can further improve the model performance of the mouth shape generator, further improve the mouth shape diagram generation performance of the mouth shape generator based on the audio and the picture, and improve the matching degree, i.e., the synchronicity, of the mouth shape diagram generated by the mouth shape generator with the audio by introducing the picture loss value and the second synchronization loss value, thereby optimizing the sound and picture synchronization effect of the target video.
Corresponding to the method shown in fig. 1, as shown in fig. 4, the present embodiment proposes a first data processing apparatus, which may include: a first obtaining unit 101, a second obtaining unit 102, a first determining unit 103, and a first updating unit 104; wherein:
a first obtaining unit 101, configured to obtain text sequence data of a target video in a predefined period;
a second obtaining unit 102 configured to obtain the mouth shape graph sequence data output by the mouth shape generator based on the picture sequence data and the audio sequence data, the picture sequence data, the audio sequence data, and the text sequence data being matched;
a first determination unit 103 for determining a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value;
a first updating unit 104 for updating the mouth shape generator based on the first synchronization loss value.
It should be noted that, the specific processes of the first obtaining unit 101, the second obtaining unit 102, the first determining unit 103 and the first updating unit 104 and the technical effects brought by the specific processes can refer to steps S101, S102, S103 and S104 in fig. 1, respectively.
Optionally, the first determining unit 103 includes: a first input unit and a third obtaining unit;
a first input unit for inputting the text sequence data and the mouth shape graph sequence data to a first discriminator, the first discriminator being a model for calculating a synchronization loss value of the text and the mouth shape graph;
and a third obtaining unit configured to obtain the first synchronization loss value output by the first discriminator.
Optionally, the picture sequence data and the audio sequence data are matched at the same sequence position; the mouth shape generator outputs the mouth shape graph sequence data based on the picture sequence data and the audio sequence data, and is configured to:
generate the mouth shape graphs of the frames in the mouth shape graph sequence data based on video data, where the video data comprises the data of the picture sequence data and the audio sequence data at the same sequence position.
Optionally, the mouth shape generator generates a first mouth shape graph based on the first video data, and is configured to:
the mouth shape generator extracts picture characteristics and audio characteristics from picture and audio sampling point data in the first video data respectively;
the mouth shape generator processes the audio features and the picture features by means of multiple down-sampling, multiple up-sampling and multiple residual connections to obtain processed data;
the mouth shape generator performs up-sampling on the processed data, and generates and outputs a first mouth shape graph.
Optionally, the mouth shape generator comprises: an audio feature extraction layer, a picture feature extraction layer, an audio sampling layer, a first picture sampling layer, a second picture sampling layer, a merging layer, a first upsampling layer, a first connection layer, a second upsampling layer, a second connection layer, and a third upsampling layer.
Optionally, the mouth shape generator extracts the picture features and the audio features from the picture and the audio sample point data in the first video data as follows:
the picture feature extraction layer extracts picture features from pictures in the first video data;
the audio feature extraction layer extracts audio features from the audio sample point data in the first video data.
Optionally, the mouth shape generator processes the audio features and the picture features by means of multiple down-sampling operations, multiple up-sampling operations, and multiple residual connections to obtain the processed data, as follows (see the sketch after this list):
the audio sampling layer down-samples the audio features to obtain first down-sampled data, and down-samples the first down-sampled data to obtain second down-sampled data;
the first picture sampling layer down-samples the picture features to obtain third down-sampled data;
the second picture sampling layer down-samples the third down-sampled data to obtain fourth down-sampled data;
the merging layer merges the second down-sampled data and the fourth down-sampled data to obtain merged data;
the first up-sampling layer up-samples the merged data to obtain first up-sampled data;
the first connection layer performs residual connection on the first up-sampled data and the third down-sampled data to obtain first connection data;
the second up-sampling layer up-samples the first connection data to obtain second up-sampled data;
the second connection layer performs residual connection on the second up-sampled data and the fourth down-sampled data to obtain the processed data.
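For illustration, the layer-by-layer data flow above can be sketched as a small U-Net-like module. Channel counts, kernel sizes, the treatment of the audio as a one-channel spectrogram map, and the interpolation used to reconcile feature-map sizes at the residual connections are all assumptions, since the application does not specify them:

```python
# Hypothetical sketch of the mouth shape generator's data flow.
# Layer names mirror the disclosure; every hyperparameter is assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

def down(c_in, c_out):  # stride-2 convolution halves the spatial size
    return nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)

def up(c_in, c_out):    # stride-2 transposed convolution doubles it
    return nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1)

class MouthShapeGenerator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.audio_feat = nn.Conv2d(1, ch, 3, padding=1)  # audio feature extraction layer
        self.pic_feat = nn.Conv2d(3, ch, 3, padding=1)    # picture feature extraction layer
        self.audio_down1 = down(ch, ch)  # audio sampling layer, first pass
        self.audio_down2 = down(ch, ch)  # audio sampling layer, second pass
        self.pic_down1 = down(ch, ch)    # first picture sampling layer
        self.pic_down2 = down(ch, ch)    # second picture sampling layer
        self.merge = nn.Conv2d(2 * ch, ch, 1)  # merging layer
        self.up1 = up(ch, ch)            # first upsampling layer
        self.up2 = up(ch, ch)            # second upsampling layer
        self.up3 = nn.Sequential(up(ch, ch),   # third upsampling layer
                                 nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, picture, audio):
        a = self.audio_feat(audio)          # audio features
        p = self.pic_feat(picture)          # picture features
        a1 = self.audio_down1(a)            # first down-sampled data
        a2 = self.audio_down2(a1)           # second down-sampled data
        p1 = self.pic_down1(p)              # third down-sampled data
        p2 = self.pic_down2(p1)             # fourth down-sampled data
        # Merging layer: interpolation reconciles audio/picture map sizes
        # (an assumption; the disclosure does not say how this is done).
        a2 = F.interpolate(a2, size=p2.shape[-2:])
        m = self.merge(torch.cat([a2, p2], dim=1))
        u1 = self.up1(m)                    # first up-sampled data
        c1 = u1 + p1                        # first connection layer (residual)
        u2 = self.up2(c1)                   # second up-sampled data
        # Second connection layer: the disclosure pairs u2 with the fourth
        # down-sampled data; their sizes differ here, so p2 is interpolated
        # up to u2's size (again an assumption).
        c2 = u2 + F.interpolate(p2, size=u2.shape[-2:])
        return self.up3(c2)                 # outputs the first mouth shape graph
```

Under these assumptions, a call such as `MouthShapeGenerator()(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))` yields a mouth shape graph tensor at twice the input resolution, reflecting the final up-sampling step.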
Optionally, the apparatus further comprises: a second input unit, a fourth obtaining unit, a third input unit, and a fifth obtaining unit; the first updating unit 104 includes: a sixth obtaining unit and a second updating unit; wherein:
the second input unit is configured to input the complete picture sequence data and the mouth shape graph sequence data to the second discriminator, where the complete picture sequence data comprises complete pictures respectively corresponding to the pictures in the picture sequence data;
a fourth obtaining unit, configured to obtain a picture loss value output by the second discriminator;
a third input unit, configured to input the audio sequence data and the mouth shape graph sequence data to a third discriminator;
a fifth obtaining unit configured to obtain a second synchronization loss value output by the third discriminator;
a sixth obtaining unit, configured to obtain a final loss value based on the first synchronization loss value, the picture loss value, and the second synchronization loss value;
and the second updating unit is used for updating the mouth shape generator based on the final loss value.
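The application states only that the final loss value is obtained "based on" the first synchronization loss value, the picture loss value, and the second synchronization loss value; a weighted sum is one straightforward reading, sketched below with assumed weights:

```python
# Hypothetical combination of the three losses into the final loss used
# by the second updating unit; the weights are illustrative assumptions.
def final_loss(first_sync_loss, picture_loss, second_sync_loss,
               w1=1.0, w2=1.0, w3=1.0):
    # A single scalar lets one backward pass update the mouth shape
    # generator with respect to all three objectives at once.
    return w1 * first_sync_loss + w2 * picture_loss + w3 * second_sync_loss
```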
The data processing apparatus provided by this embodiment can obtain text sequence data matched with the picture sequence data and the audio sequence data, as well as the mouth shape graph sequence data that the mouth shape generator outputs based on the picture sequence data and the audio sequence data; it then calculates the first synchronization loss value of the text sequence data and the mouth shape graph sequence data, and updates the structural parameters of the mouth shape generator based on the first synchronization loss value, thereby optimizing the model structure and model performance of the mouth shape generator, improving its performance in generating mouth shape graphs based on the audio and the pictures, increasing the matching degree, i.e. the synchronicity, between the generated mouth shape graphs and the audio, and optimizing the sound and picture synchronization effect of the target video.
Based on fig. 4, the present embodiment proposes a second data processing apparatus. The apparatus further includes: a second input unit, a fourth obtaining unit, a third input unit, and a fifth obtaining unit; the first updating unit 104 includes: a sixth obtaining unit and a second updating unit; wherein:
the second input unit is configured to input the complete picture sequence data and the mouth shape graph sequence data to the second discriminator, where the complete picture sequence data comprises complete pictures respectively corresponding to the pictures in the picture sequence data;
a fourth obtaining unit, configured to obtain a picture loss value output by the second discriminator;
a third input unit, configured to input the audio sequence data and the mouth shape graph sequence data to a third discriminator;
a fifth obtaining unit configured to obtain a second synchronization loss value output by the third discriminator;
a sixth obtaining unit, configured to obtain a final loss value based on the first synchronization loss value, the picture loss value, and the second synchronization loss value;
and the second updating unit is used for updating the mouth shape generator based on the final loss value.
It should be noted that, for the second input unit, the fourth obtaining unit, the third input unit, the fifth obtaining unit, the sixth obtaining unit, and the second updating unit, reference may be made to the corresponding steps in fig. 3, and details are not described herein again.
The data processing apparatus provided in this embodiment can further improve the model performance of the mouth shape generator by introducing the picture loss value and the second synchronization loss value, further improve the mouth shape graph generation performance of the mouth shape generator based on the audio and the picture, and improve the matching degree, i.e., the synchronicity, of the mouth shape graph generated by the mouth shape generator and the audio, thereby optimizing the sound and picture synchronization effect of the target video.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of data processing, the method comprising:
obtaining text sequence data of a target video in a predefined period;
obtaining mouth shape graph sequence data output by a mouth shape generator based on picture sequence data and audio sequence data, wherein the picture sequence data and the audio sequence data are matched with the text sequence data;
determining a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value;
updating the mouth shape generator based on the first synchronization loss value.
2. The data processing method of claim 1, wherein the determining a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value comprises:
inputting the text sequence data and the mouth shape graph sequence data to a first discriminator, wherein the first discriminator is a model used for calculating a synchronization loss value of the text and the mouth shape graph;
and obtaining the first synchronization loss value output by the first discriminator.
3. The data processing method according to claim 1, wherein the picture sequence data and the audio sequence data are matched at the same sequence positions; and the mouth shape generator outputs the mouth shape graph sequence data based on the picture sequence data and the audio sequence data, including:
the mouth shape generator generates the mouth shape graphs of the frames in the mouth shape graph sequence data based on video data, wherein the video data comprises the data of the picture sequence data and of the audio sequence data at the same sequence position.
4. A data processing method according to claim 3, wherein the mouth shape generator generates a first mouth shape graph based on the first video data, comprising:
the mouth shape generator extracts picture features and audio features from the picture and audio sampling point data in the first video data respectively;
the mouth shape generator processes the audio features and the picture features by means of multiple down-sampling, multiple up-sampling and multiple residual connection to obtain processed data;
and the mouth shape generator performs up-sampling on the processed data, and generates and outputs the first mouth shape graph.
5. The data processing method of claim 4, wherein the mouth shape generator comprises: an audio feature extraction layer, a picture feature extraction layer, an audio sampling layer, a first picture sampling layer, a second picture sampling layer, a merging layer, a first upsampling layer, a first connection layer, a second upsampling layer, a second connection layer, and a third upsampling layer.
6. The data processing method of claim 5, wherein the mouth shape generator extracts picture features and audio features from picture and audio sample point data in the first video data, respectively, comprising:
the picture feature extraction layer extracts picture features from pictures in the first video data;
the audio feature extraction layer extracts audio features from the audio sample point data in the first video data.
7. The data processing method of claim 5, wherein the mouth shape generator processes the audio features and the picture features by means of multiple down-sampling operations, multiple up-sampling operations and multiple residual connections to obtain processed data, comprising:
the audio sampling layer down-samples the audio features to obtain first down-sampled data, and down-samples the first down-sampled data to obtain second down-sampled data;
the first picture sampling layer down-samples the picture features to obtain third down-sampled data;
the second picture sampling layer down-samples the third down-sampled data to obtain fourth down-sampled data;
the merging layer merges the second down-sampled data and the fourth down-sampled data to obtain merged data;
the first up-sampling layer up-samples the merged data to obtain first up-sampled data;
the first connection layer performs residual connection on the first up-sampled data and the third down-sampled data to obtain first connection data;
the second up-sampling layer up-samples the first connection data to obtain second up-sampled data;
the second connection layer performs residual connection on the second up-sampled data and the fourth down-sampled data to obtain the processed data.
8. The data processing method of claim 1, wherein the method further comprises:
inputting the complete picture sequence data and the mouth shape figure sequence data to a second discriminator to obtain a picture loss value output by the second discriminator, wherein the complete picture sequence data comprises complete pictures respectively corresponding to all pictures in the picture sequence data;
inputting the audio sequence data and the mouth shape graph sequence data to a third discriminator to obtain a second synchronization loss value output by the third discriminator;
the updating the mouth shape generator based on the first synchronization loss value includes:
obtaining a final loss value based on the first synchronization loss value, the picture loss value and the second synchronization loss value;
updating the mouth shape generator based on the final loss value.
9. A data processing apparatus, characterized in that the apparatus comprises: the device comprises a first obtaining unit, a second obtaining unit, a first determining unit and a first updating unit; wherein:
the first obtaining unit is used for obtaining text sequence data of a target video in a predefined time period;
the second obtaining unit is used for obtaining the mouth shape graph sequence data output by the mouth shape generator based on picture sequence data and audio sequence data, wherein the picture sequence data and the audio sequence data are matched with the text sequence data;
the first determination unit is used for determining a synchronization loss value of the text sequence data and the mouth shape graph sequence data as a first synchronization loss value;
the first updating unit is used for updating the mouth shape generator based on the first synchronization loss value.
10. The apparatus of claim 9, wherein the first determining unit comprises: a first input unit and a third obtaining unit;
the first input unit is used for inputting the text sequence data and the mouth shape graph sequence data to a first discriminator, and the first discriminator is a model used for calculating a synchronization loss value of a text and a mouth shape graph;
the third obtaining unit is configured to obtain the first synchronization loss value output by the first discriminator.
CN202111354441.8A 2021-11-16 2021-11-16 Data processing method and device Active CN114071204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111354441.8A CN114071204B (en) 2021-11-16 2021-11-16 Data processing method and device

Publications (2)

Publication Number Publication Date
CN114071204A true CN114071204A (en) 2022-02-18
CN114071204B CN114071204B (en) 2024-05-03

Family

ID=80272588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111354441.8A Active CN114071204B (en) 2021-11-16 2021-11-16 Data processing method and device

Country Status (1)

Country Link
CN (1) CN114071204B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109168067A (en) * 2018-11-02 2019-01-08 深圳Tcl新技术有限公司 Video timing correction method, correction terminal and computer readable storage medium
CN111401101A (en) * 2018-12-29 2020-07-10 上海智臻智能网络科技股份有限公司 Video generation system based on portrait
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium

Also Published As

Publication number Publication date
CN114071204B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
CN111381909B (en) Page display method and device, terminal equipment and storage medium
EP3889912A1 (en) Method and apparatus for generating video
CN111933110A (en) Video generation method, generation model training method, device, medium and equipment
CN113487618B (en) Portrait segmentation method, portrait segmentation device, electronic equipment and storage medium
CN113901894A (en) Video generation method, device, server and storage medium
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN112364838B (en) Method for improving handwriting OCR performance by utilizing synthesized online text image
CN112785670B (en) Image synthesis method, device, equipment and storage medium
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN113111812A (en) Mouth action driving model training method and assembly
CN115546575A (en) Training method of driving model, driving method, device, readable medium and equipment
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN111128131B (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN114071204A (en) Data processing method and device
CN111916050A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113658046B (en) Super-resolution image generation method, device, equipment and medium based on feature separation
CN111695670A (en) Neural network model training method and device
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
CN114418835A (en) Image processing method, apparatus, device and medium
CN111524090A (en) Depth prediction image-based RGB-D significance detection method
CN113593527A (en) Acoustic feature generation, voice model training and voice recognition method and device
CN113077536B (en) Mouth action driving model training method and component based on BERT model
CN117373455B (en) Audio and video generation method, device, equipment and storage medium
CN113129925B (en) VC model-based mouth motion driving model training method and component

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant