CN112562720A - Lip-synchronization video generation method, device, equipment and storage medium - Google Patents

Lip-synchronization video generation method, device, equipment and storage medium

Info

Publication number
CN112562720A
Authority
CN
China
Prior art keywords
data
lip
image
network
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011372011.4A
Other languages
Chinese (zh)
Inventor
李权
王伦基
叶俊杰
成秋喜
胡玉针
李嘉雄
朱杰
***
韩蓝青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Original Assignee
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CYAGEN BIOSCIENCES (GUANGZHOU) Inc, Research Institute Of Tsinghua Pearl River Delta filed Critical CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Priority to CN202011372011.4A
Publication of CN112562720A
Pending legal-status Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L21/14 Transforming into visible information by displaying frequency domain information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/18 Details of the transformation process
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a lip-synchronization video generation method, device, equipment and storage medium, wherein the method comprises the following steps: after original video data is obtained, character labeling is performed on the voice data in the original video data to obtain first data, and face detection is performed on the labeled original video data to obtain second data; a generation network, a lip synchronization judging network and an image quality judging network are then trained according to the first data and the second data; a character lip generation model is built according to the generation network, the lip synchronization judging network and the image quality judging network; and finally the input sequence pictures are processed through the character lip generation model to generate lip-synchronized image data. The method can accurately generate lip images of a person speaking in a video and can be widely applied in the technical field of video data processing.

Description

Lip-synchronization video generation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of video data processing, and in particular to a lip-synchronization video generation method, device, equipment and storage medium.
Background
With the continuous growth of rich and diverse video content, new demands are being placed on how video content is created, and enabling videos to be watched across different languages has become a key problem to be solved urgently. Examples include lecture series, major news broadcasts, popular films, and even entertaining animations. If such videos are translated into the desired target language, audiences in more language environments can enjoy them better. Whether dubbing an existing talking-face video or creating a new one in this way, the key problem to be solved is to correct the mouth shape so that it matches the target speech.
Some current techniques can generate character lip shapes only for a specific character's still image, or for video characters seen during training, and only when there are no complex changes in motion and background. In unconstrained talking-face videos with complex, dynamic backgrounds, the lip movements of an arbitrary identity cannot be accurately modified, so the lip region of the character in the video remains out of sync with the new audio.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, a device, and a storage medium for generating a lip-sync video with high accuracy.
One aspect of the present invention provides a lip-synchronized video generation method, including:
acquiring original video data, wherein the original video data comprises voice data and image data of a character in different scenes;
performing character marking on voice data in the original video data to obtain first data, wherein the first data is used for determining the position of a face corresponding to each section of voice data in a video image;
performing face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
training to obtain a generating network, a lip synchronization judging network and an image quality judging network according to the first data and the second data; the lip synchronization judging network is used for judging the synchronization of the lip shape of the person and the voice frequency of the person, and the image quality judging network is used for judging the truth and the quality of the generated image;
constructing a figure lip shape generation model according to the generation network, the lip shape synchronization judging network and the image quality judging network;
and processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, the method further comprises pre-processing the voice data and the image data in the raw video data;
specifically, the preprocessing the voice data in the original video data includes:
carrying out normalization processing on the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum and a linear spectrum;
the preprocessing of the image data in the original video data comprises:
setting 0 for the lower half part pixel point of each frame image containing the lip shape in the sequence frame of the image data so as to enable the generated network to generate a complete lip shape image;
and determining the same number of reference frames as the number of the sequence frames, wherein the reference frames are used for coding the character feature information.
In some embodiments, the generation network comprises a voice encoder, an image encoder and an image decoding generator;
the voice encoder is used for extracting voice features in the first data and the second data from a voice spectrogram obtained by preprocessing through convolutional coding;
the image encoder is used for extracting image characteristics from a sequence frame of the image data obtained by preprocessing through convolution coding;
and the image decoding generator is used for generating a lip image of the person according to the sound characteristic and the image characteristic.
In some embodiments, the objective loss function of the character lip generation model is:
Loss = (1 - Sw - Sg)·L1 + Sw·Lsync + Sg·Lgen
wherein Sw is the weight of the influence of the lip synchronization judging network on the overall loss value; Sg is the weight of the influence of the image quality judging network on the overall loss value; Loss is the overall loss function value of the character lip generation model; L1 is the mean square error loss value between the real image and the generated image; Lsync is the loss value of the synchronization rate between the generated character lip video and the audio; and Lgen is the discrimination loss value of the image judging network for the real image and the generated image.
In some embodiments, the input sequence of pictures is provided with a tag constraint;
the label constraints include a variable size edge pixel outline constraint, a face lip keypoint outline constraint, a head outline constraint, and a background constraint.
Another aspect of the present invention also provides a lip-synchronized video generating apparatus, including:
the acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
the voice labeling module is used for performing character labeling on voice data in the original video data to obtain first data, and the first data is used for determining the position of a face corresponding to each section of voice data in a video image;
the face detection module is used for carrying out face detection on the marked original video data to obtain second data, and the second data is used for determining the position of a face in each frame of image;
the training module is used for training to obtain a generation network, a lip synchronization judging network and an image quality judging network according to the first data and the second data; the lip synchronization judging network is used for judging the synchronization of the lip shape of the person and the voice frequency of the person, and the image quality judging network is used for judging the truth and the quality of the generated image;
the building module is used for building a figure lip shape generating model according to the generating network, the lip shape synchronous judging network and the image quality judging network;
and the generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, a pre-processing module is further included;
the preprocessing module is configured to:
carrying out normalization processing on the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum and a linear spectrum;
and the number of the first and second groups,
setting 0 for the lower half part pixel point of each frame image containing the lip shape in the sequence frame of the image data so as to enable the generated network to generate a complete lip shape image;
and determining the same number of reference frames as the number of the sequence frames, wherein the reference frames are used for coding the character feature information.
Another aspect of the present invention also provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the present invention provides a computer readable storage medium storing a program for execution by a processor to implement the method as described above.
According to the embodiment of the invention, after original video data is obtained, character labeling is performed on the voice data in the original video data to obtain first data, and face detection is performed on the labeled original video data to obtain second data; a generation network, a lip synchronization judging network and an image quality judging network are then trained according to the first data and the second data; a character lip generation model is constructed according to the generation network, the lip synchronization judging network and the image quality judging network; and finally the input sequence pictures are processed through the character lip generation model to generate lip-synchronized image data. The invention can accurately generate lip images of a person in a video while the person is speaking.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is an overall step diagram of a lip-sync video generation method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Aiming at the problems in the prior art, the invention studies character lip generation and speech matching, so that the facial lip shape of any speaker can be matched with any target speech, including real and synthesized speech. Moreover, because real-world videos contain rapidly changing pose, scale and lighting, the generated face must also be fused seamlessly into the original target video.
The invention first uses an end-to-end model to encode the sound and the video images, and then decodes them to generate a lip image matched with the sound. Meanwhile, the invention adopts a strong lip synchronization discriminator, which can accurately judge the synchronization accuracy of the generated lips with the speech and the realism of the lip movement, and is used to guide the generation of more synchronized lips. In addition, the invention adopts a high-quality image quality discriminator, which can accurately judge whether an image is real and how good its quality is, and is used to guide the generation of more realistic lip images. Extensive quantitative and subjective human evaluations show that the invention is substantially superior to current methods on many benchmarks.
An embodiment of the present invention provides a lip-sync video generation method, as shown in fig. 1, the method includes:
s1, acquiring original video data, wherein the original video data comprises voice data and image data of a person in different scenes;
the voice data in the video is multi-user and multi-language mixed voice data, the image data in the video is speaking face data of various scenes, proportions and illumination, and the video resolution is above 1080p as far as possible.
S2, performing character annotation on the voice data in the original video data to obtain first data, wherein the first data are used for determining the position of a face corresponding to each section of voice data in a video image;
specifically, the embodiment of the invention divides the video, through labeling, into a plurality of short segments in which the speech matches the speaker's video, and stores them. The collected data are labeled by matching each voice segment to its speaker, the position of the speaker's face corresponding to each voice segment is marked in the video image, and the durations of the voice and the video are kept synchronized.
S3, performing face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
specifically, the embodiment of the invention performs face detection on each frame of the labeled video segments, obtains the position of the face in each frame, and extends the detected face position downward toward the chin by 5-50 pixels, so that the face detection box covers the whole face. Each frame's face image is then cropped and saved using the adjusted face detection box, and the voice data of the video segment is saved at the same time.
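By way of illustration only, the following is a minimal Python sketch of the face-cropping step described above. It assumes OpenCV and any per-frame face detector returning (x, y, w, h) boxes; the helper names and the default 20-pixel extension are illustrative assumptions, since the embodiment only specifies extending the box toward the chin by 5-50 pixels.

```python
import cv2

def extend_box_toward_chin(box, frame_height, extra_px=20):
    """Extend a detected face box downward so that it also covers the chin."""
    x, y, w, h = box
    h = min(frame_height - y, h + extra_px)   # clamp to the image border
    return x, y, w, h

def crop_faces(video_path, detect_face, extra_px=20):
    """Yield one face crop per frame in which a face is detected."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box = detect_face(frame)              # any external face detector
        if box is None:
            continue
        x, y, w, h = extend_box_toward_chin(box, frame.shape[0], extra_px)
        yield frame[y:y + h, x:x + w]
    cap.release()
```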
S4, training to obtain a generating network, a lip synchronization judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating a person lip image, the lip synchronization judging network is used for judging the synchronism of the person lip and the person audio, and the image quality judging network is used for judging the truth and the quality of the generated image;
s5, constructing a character lip shape generation model according to the generation network, the lip shape synchronization judging network and the image quality judging network;
the embodiment of the invention constructs a high-definition character lip shape generation model based on a conditional GAN (generation countermeasure network), the whole model structure is divided into two parts of a high-definition character image generation network and a judgment network, the generation network is mainly used for generating a high-definition character lip shape image, input data is a preprocessed conditional Mask, a reference frame and audio, and the output is a high-definition character lip shape image frame synchronous with the audio. The judgment network is used in model training and has the functions of judging whether the generated character image is real and synchronous with the lip shape and the audio, and feeding back loss to the generation network after calculating the difference value between the generated image and the real image and between the generated lip shape and the real lip shape, so as to optimize the quality of the generated image and the lip shape synchronization quality of the generation network.
S6, processing the input sequence picture by the character lip shape generation model, and generating lip-synchronized image data.
In some embodiments, before the training step of step S4, the method further includes: preprocessing voice data and image data in original video data;
specifically, the preprocessing the voice data in the original video data includes:
carrying out normalization processing on the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum and a linear spectrum;
the preprocessing of the image data in the original video data comprises:
setting 0 for the lower half part pixel point of each frame image containing the lip shape in the sequence frame of the image data so as to enable the generated network to generate a complete lip shape image;
and determining the same number of reference frames as the number of the sequence frames, wherein the reference frames are used for coding the character feature information.
In the embodiment of the invention, before the sound and the images are input into the conditional GAN model, they are preprocessed separately. The sound preprocessing normalizes the audio data and then converts the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum, a linear spectrum and the like. The image preprocessing sets the lower half of each lip-containing frame in the video sequence to be generated to 0, so that the generation network has to generate the complete lip image, and selects the same number of reference frames as the generated video sequence for encoding the character feature information, which yields a better generation result. Meanwhile, to ensure the correlation between adjacent frames of the generated video, inputs with different numbers of sequence frames are set during training, so that the generation network learns the relationship between preceding and following frames and the generated video is smoother and more natural; the number of generated sequence frames can be chosen as 1, 3, 5, 7, 9 and the like according to the generation requirements of different video scenes and characters.
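By way of illustration only, the following is a minimal Python sketch of the preprocessing described above, assuming librosa and numpy; the sample rate, FFT parameters and the 80 mel bands are illustrative assumptions, as the embodiment does not fix these values.

```python
import librosa
import numpy as np

def audio_to_mel(wav_path, sr=16000, n_fft=800, hop_length=200, n_mels=80):
    """Normalize the audio waveform and convert it to a mel spectrogram."""
    wav, _ = librosa.load(wav_path, sr=sr)
    wav = wav / (np.abs(wav).max() + 1e-8)    # peak normalization
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)

def mask_lower_half(frames):
    """Zero out the lower (lip) half of each face frame, shaped (N, H, W, C)."""
    masked = frames.copy()
    masked[:, frames.shape[1] // 2:, :, :] = 0
    return masked
```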
In some embodiments, the generation network comprises a voice encoder, an image encoder and an image decoding generator;
the voice encoder is used for extracting voice features in the first data and the second data from a voice spectrogram obtained by preprocessing through convolutional coding;
the image encoder is used for extracting image characteristics from a sequence frame of the image data obtained by preprocessing through convolution coding;
and the image decoding generator is used for generating a lip image of the person according to the sound characteristic and the image characteristic.
Specifically, the generation network according to the embodiment of the present invention can be divided into a sound encoder, an image encoder and an image decoding generator. First, the preprocessed sound spectrogram is input into the sound encoder, and the sound features are extracted through convolutional encoding. The preprocessed image sequence data is likewise input into the image encoder, and the image features are extracted through convolutional encoding, where the input image resolution includes but is not limited to 96x96, 128x128, 256x256, 512x512 and the like. The extracted sound and image features are then input into the image decoding generator, which finally generates the character lip images synchronized with the sound; according to different generation requirements, the generated image resolution can likewise include but is not limited to 96x96, 128x128, 256x256, 512x512 and the like.
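By way of illustration only, the following is a minimal PyTorch sketch of the generator layout described above (a sound encoder, an image encoder and an image decoding generator). The layer counts, channel sizes, the 6-channel image input (masked frame stacked with a reference frame) and the 96x96 resolution are illustrative assumptions; the embodiment only fixes the overall encoder/encoder/decoder structure.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, k=3, s=2, p=1):
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, p),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class LipGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Sound encoder: convolutional encoding of the mel spectrogram (1 x 80 x T).
        self.sound_enc = nn.Sequential(
            conv_block(1, 32), conv_block(32, 64), conv_block(64, 128),
            nn.AdaptiveAvgPool2d(1))
        # Image encoder: convolutional encoding of the masked frame stacked with a reference frame.
        self.image_enc = nn.Sequential(
            conv_block(6, 32), conv_block(32, 64), conv_block(64, 128))
        # Image decoding generator: upsample the fused features back to a face image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, 1, 1), nn.Sigmoid())

    def forward(self, mel, frames):
        a = self.sound_enc(mel)                        # (B, 128, 1, 1) sound features
        v = self.image_enc(frames)                     # (B, 128, H/8, W/8) image features
        a = a.expand(-1, -1, v.size(2), v.size(3))     # broadcast sound features over space
        return self.decoder(torch.cat([a, v], dim=1))  # lip image synchronized with the sound
```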
Specifically, the discrimination network can be divided into a lip synchronization discrimination network and an image quality discrimination network. They are used during training to check the image quality and lip synchronization of the output of the generation network, give an image quality discrimination value and a lip synchronization discrimination value, and guide the generation network to produce sharper, more realistic images and more accurately synchronized lips. The lip synchronization discrimination network is a pre-trained network; its inputs are the audio of the current frame and the correspondingly generated image frame, and its output is the degree of synchronization between each generated lip image frame and the corresponding audio. During training, this discriminator guides optimization by judging the output and feeding a value back, so that lip images more synchronized with the sound are generated. The image quality discrimination network is trained together with the generation network; its inputs are a generated image and a real image, and its output is a probability that the image is real, which is used to judge the quality of the generated image and to guide the generation network toward more realistic images during training.
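By way of illustration only, the following is a minimal PyTorch sketch of a lip synchronization discrimination network in the spirit described above: it embeds an audio window and the corresponding lip frames and scores how well they match. The cosine-similarity scoring and the layer sizes are illustrative assumptions in the style of SyncNet, not details fixed by the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipSyncDiscriminator(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        self.audio_enc = nn.Sequential(                 # mel window (1 x 80 x T)
            nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb_dim))
        self.face_enc = nn.Sequential(                  # lip-region frames (3 x H x W)
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb_dim))

    def forward(self, mel, lips):
        a = F.normalize(self.audio_enc(mel), dim=1)
        v = F.normalize(self.face_enc(lips), dim=1)
        # Synchronization score in [0, 1]; 1 means lips and audio match well.
        return (F.cosine_similarity(a, v) + 1) / 2
```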
In some embodiments, the objective loss function of the character lip generation model is:
Loss = (1 - Sw - Sg)·L1 + Sw·Lsync + Sg·Lgen
wherein Sw is the weight of the influence of the lip synchronization judging network on the overall loss value; Sg is the weight of the influence of the image quality judging network on the overall loss value; Loss is the overall loss function value of the character lip generation model; L1 is the mean square error loss value between the real image and the generated image; Lsync is the loss value of the synchronization rate between the generated character lip video and the audio; and Lgen is the discrimination loss value of the image judging network for the real image and the generated image.
Specifically, the overall loss function in the formula is the weighted sum of the image L1 loss, the lip video and audio synchronization loss, and the image quality loss. Sw and Sg are the weight coefficients of the influence of the lip synchronization discriminator and the image quality discriminator on the overall loss, respectively, and the weights of the discriminators in overall image generation can be adjusted as required. In the GAN loss, the discrimination network D iterates continuously to maximize the objective function, while the generation network G iterates continuously to minimize the image L1 loss, the lip video and audio synchronization loss, and the image quality loss, so that lip images with sharper details are generated.
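By way of illustration only, the following is a minimal PyTorch sketch of the weighted objective Loss = (1 - Sw - Sg)·L1 + Sw·Lsync + Sg·Lgen described above. The weight values and the logarithmic form of the synchronization and quality terms are illustrative assumptions; the embodiment describes the reconstruction term as a mean square error between the real and generated images and does not fix the weights.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake, real, sync_score, quality_prob, s_w=0.03, s_g=0.07):
    """Weighted sum of reconstruction, lip-sync and image-quality losses."""
    l_rec = F.mse_loss(fake, real)                  # L1 term of the formula (described as MSE)
    l_sync = -torch.log(sync_score + 1e-8).mean()   # lip video / audio synchronization term
    l_gen = -torch.log(quality_prob + 1e-8).mean()  # image quality (adversarial) term
    return (1 - s_w - s_g) * l_rec + s_w * l_sync + s_g * l_gen
```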
In some embodiments, the input sequence of pictures is provided with a tag constraint;
the label constraints include a variable size edge pixel outline constraint, a face lip keypoint outline constraint, a head outline constraint, and a background constraint.
Specifically, in order to generate realistic character lip images, the input in the embodiment of the present invention is a sequence of pictures carrying label constraints, where the constraints may be a variable-size edge pixel outline, a face lip keypoint outline constraint, a head outline, a background and the like. By embedding these constraints in the pictures, the generated content can be controlled more finely and a more controllable high-definition image can be generated. New input constraints can also be added for new generation requirements that arise in subsequent use, so that the generated content can be extended more richly as required.
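By way of illustration only, the following is a minimal Python sketch of attaching one such label constraint (a face lip keypoint outline) to an input picture, assuming OpenCV and a 68-point facial-landmark layout in which indices 48-67 trace the lips; both the landmark layout and the drawing style are illustrative assumptions.

```python
import cv2
import numpy as np

def add_lip_keypoint_constraint(frame, landmarks):
    """Mask the region to be generated and draw the lip keypoint outline on it."""
    constrained = frame.copy()
    constrained[frame.shape[0] // 2:, :, :] = 0          # lower half will be generated
    lip_points = np.asarray(landmarks[48:68], dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(constrained, [lip_points], isClosed=True,
                  color=(255, 255, 255), thickness=1)
    return constrained
```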
In summary, the invention can generate a high-definition character video matched with the sound from only the sound and the video to be translated, and can serve as a general high-definition video translation and generation framework. Specifically, the invention trains an accurate lip synchronization discriminator to guide the generation network to generate accurate and natural lip movements, and generates high-definition face images of different appearances, matched with the sound, for different application fields (news, lectures and education, film and television drama, and the like). The content is generated entirely and intelligently from scratch, without real people recording each video segment, so production is faster and the forms of extension are richer.
Compared with the prior art, the invention provides a novel video character lip generation and synchronization model, which can generate a lip-synchronized talking-face video of any speaker from any speech, and is more accurate and generalizes better than the lip shapes generated by other current work.
The invention also provides a novel lip synchronization judging model, so as to accurately judge lip synchronization in videos of various complex environments.
The model of the invention does not depend on training with data of a specific speaker; it is a speaker-independent generation model and can generate lip shapes matched with the speech even for persons whose lip data did not appear in training.
Another aspect of the present invention also provides a lip-synchronized video generating apparatus, including:
the acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
the voice labeling module is used for performing character labeling on voice data in the original video data to obtain first data, and the first data is used for determining the position of a face corresponding to each section of voice data in a video image;
the face detection module is used for carrying out face detection on the marked original video data to obtain second data, and the second data is used for determining the position of a face in each frame of image;
the training module is used for training to obtain a generation network, a lip synchronization judging network and an image quality judging network according to the first data and the second data; the generation network is used for generating a person lip image, the lip synchronization judging network is used for judging the synchronism of the person lip and the person audio, and the image quality judging network is used for judging the truth and the quality of the generated image;
the building module is used for building a figure lip shape generating model according to the generating network, the lip shape synchronous judging network and the image quality judging network;
and the generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, a pre-processing module is further included;
the preprocessing module is configured to:
carrying out normalization processing on the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum and a linear spectrum;
and the number of the first and second groups,
setting 0 for the lower half part pixel point of each frame image containing the lip shape in the sequence frame of the image data so as to enable the generated network to generate a complete lip shape image;
and determining the same number of reference frames as the number of the sequence frames, wherein the reference frames are used for coding the character feature information.
Another aspect of the present invention also provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the present invention provides a computer readable storage medium storing a program for execution by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A lip-synchronized video generation method, comprising:
acquiring original video data, wherein the original video data comprises voice data and image data of a character in different scenes;
performing character marking on voice data in the original video data to obtain first data, wherein the first data is used for determining the position of a face corresponding to each section of voice data in a video image;
performing face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
training to obtain a generating network, a lip synchronization judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating a person lip image, the lip synchronization judging network is used for judging the synchronism of the person lip and the person audio, and the image quality judging network is used for judging the truth and the quality of the generated image;
constructing a figure lip shape generation model according to the generation network, the lip shape synchronization judging network and the image quality judging network;
and processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
2. A lip-synchronized video generation method according to claim 1, further comprising preprocessing the voice data and image data in the raw video data;
specifically, the preprocessing the voice data in the original video data includes:
carrying out normalization processing on the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum and a linear spectrum;
the preprocessing of the image data in the original video data comprises:
setting 0 for the lower half part pixel point of each frame image containing the lip shape in the sequence frame of the image data so as to enable the generated network to generate a complete lip shape image;
and determining the same number of reference frames as the number of the sequence frames, wherein the reference frames are used for coding the character feature information.
3. A lip-synchronized video generation method according to claim 2, wherein the generation network includes a voice encoder, an image encoder and an image decoding generator;
the voice encoder is used for extracting voice features in the first data and the second data from a voice spectrogram obtained by preprocessing through convolutional coding;
the image encoder is used for extracting image characteristics from a sequence frame of the image data obtained by preprocessing through convolution coding;
and the image decoding generator is used for generating a lip image of the person according to the sound characteristic and the image characteristic.
4. A lip-synchronized video generation method according to claim 1, wherein the objective loss function of the human lip generation model is:
Loss = (1 - Sw - Sg)·L1 + Sw·Lsync + Sg·Lgen
wherein Sw is the weight of the influence of the lip synchronization judging network on the overall loss value; Sg is the weight of the influence of the image quality judging network on the overall loss value; Loss is the overall loss function value of the human lip generation model; L1 is the mean square error loss value between the real image and the generated image; Lsync is the loss value of the synchronization rate between the generated character lip video and the audio; and Lgen is the discrimination loss value of the image judging network for the real image and the generated image.
5. A lip-synchronized video generation method according to claim 1, wherein the input sequence of pictures is provided with a tag constraint;
the label constraints include a variable size edge pixel outline constraint, a face lip keypoint outline constraint, a head outline constraint, and a background constraint.
6. A lip-synchronized video generating device, comprising:
the acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
the voice labeling module is used for performing character labeling on voice data in the original video data to obtain first data, and the first data is used for determining the position of a face corresponding to each section of voice data in a video image;
the face detection module is used for carrying out face detection on the marked original video data to obtain second data, and the second data is used for determining the position of a face in each frame of image;
the training module is used for training to obtain a generation network, a lip synchronization judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating a person lip image, the lip synchronization judging network is used for judging the synchronism of the person lip and the person audio, and the image quality judging network is used for judging the truth and the quality of the generated image;
the building module is used for building a figure lip shape generating model according to the generating network, the lip shape synchronous judging network and the image quality judging network;
and the generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
7. The lip-synchronized video generating device of claim 6, further comprising a pre-processing module;
the preprocessing module is configured to:
carrying out normalization processing on the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum and a linear spectrum; and the number of the first and second groups,
setting 0 for the lower half part pixel point of each frame image containing the lip shape in the sequence frame of the image data so as to enable the generated network to generate a complete lip shape image;
and determining the same number of reference frames as the number of the sequence frames, wherein the reference frames are used for coding the character feature information.
8. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program realizes the method according to any one of claims 1-5.
9. A computer-readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method according to any one of claims 1-5.
CN202011372011.4A 2020-11-30 2020-11-30 Lip-synchronization video generation method, device, equipment and storage medium Pending CN112562720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011372011.4A CN112562720A (en) 2020-11-30 2020-11-30 Lip-synchronization video generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011372011.4A CN112562720A (en) 2020-11-30 2020-11-30 Lip-synchronization video generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112562720A true CN112562720A (en) 2021-03-26

Family

ID=75045329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011372011.4A Pending CN112562720A (en) 2020-11-30 2020-11-30 Lip-synchronization video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112562720A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018133825A1 (en) * 2017-01-23 2018-07-26 腾讯科技(深圳)有限公司 Method for processing video images in video call, terminal device, server, and storage medium
US20180336471A1 (en) * 2017-05-19 2018-11-22 Mehdi Rezagholizadeh Semi-supervised regression with generative adversarial networks
CN107767325A (en) * 2017-09-12 2018-03-06 深圳市朗形网络科技有限公司 Method for processing video frequency and device
CN109819313A (en) * 2019-01-10 2019-05-28 腾讯科技(深圳)有限公司 Method for processing video frequency, device and storage medium
US20200349682A1 (en) * 2019-05-03 2020-11-05 Amazon Technologies, Inc. Video enhancement using a generator with filters of generative adversarial network
CN110706308A (en) * 2019-09-07 2020-01-17 创新奇智(成都)科技有限公司 GAN-based steel coil end face edge loss artificial sample generation method
CN110610534A (en) * 2019-09-19 2019-12-24 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN111261187A (en) * 2020-02-04 2020-06-09 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement
CN111783603A (en) * 2020-06-24 2020-10-16 有半岛(北京)信息科技有限公司 Training method for generating confrontation network, image face changing method and video face changing method and device

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338959A (en) * 2021-04-15 2022-04-12 西安汉易汉网络科技股份有限公司 End-to-end text-to-video synthesis method, system medium and application
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN113194348A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113179449A (en) * 2021-04-22 2021-07-27 清华珠三角研究院 Method, system, device and storage medium for driving image by voice and motion
CN113362471A (en) * 2021-05-27 2021-09-07 深圳市木愚科技有限公司 Virtual teacher limb action generation method and system based on teaching semantics
CN113542624A (en) * 2021-05-28 2021-10-22 阿里巴巴新加坡控股有限公司 Method and device for generating commodity object explanation video
CN113380269A (en) * 2021-06-08 2021-09-10 北京百度网讯科技有限公司 Video image generation method, apparatus, device, medium, and computer program product
CN113380269B (en) * 2021-06-08 2023-01-10 北京百度网讯科技有限公司 Video image generation method, apparatus, device, medium, and computer program product
CN113242361B (en) * 2021-07-13 2021-09-24 腾讯科技(深圳)有限公司 Video processing method and device and computer readable storage medium
CN113242361A (en) * 2021-07-13 2021-08-10 腾讯科技(深圳)有限公司 Video processing method and device and computer readable storage medium
CN113628635B (en) * 2021-07-19 2023-09-15 武汉理工大学 Voice-driven speaker face video generation method based on teacher student network
CN113628635A (en) * 2021-07-19 2021-11-09 武汉理工大学 Voice-driven speaking face video generation method based on teacher and student network
WO2023035969A1 (en) * 2021-09-09 2023-03-16 马上消费金融股份有限公司 Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN114071204A (en) * 2021-11-16 2022-02-18 湖南快乐阳光互动娱乐传媒有限公司 Data processing method and device
CN114071204B (en) * 2021-11-16 2024-05-03 湖南快乐阳光互动娱乐传媒有限公司 Data processing method and device
CN114419702A (en) * 2021-12-31 2022-04-29 南京硅基智能科技有限公司 Digital human generation model, training method of model, and digital human generation method
CN114419702B (en) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 Digital person generation model, training method of model, and digital person generation method
CN114663962A (en) * 2022-05-19 2022-06-24 浙江大学 Lip-shaped synchronous face forgery generation method and system based on image completion
CN115345968A (en) * 2022-10-19 2022-11-15 北京百度网讯科技有限公司 Virtual object driving method, deep learning network training method and device
CN115345968B (en) * 2022-10-19 2023-02-07 北京百度网讯科技有限公司 Virtual object driving method, deep learning network training method and device
CN115376211A (en) * 2022-10-25 2022-11-22 北京百度网讯科技有限公司 Lip driving method, lip driving model training method, device and equipment
CN115580743A (en) * 2022-12-08 2023-01-06 成都索贝数码科技股份有限公司 Method and system for driving human mouth shape in video
CN116248974A (en) * 2022-12-29 2023-06-09 南京硅基智能科技有限公司 Video language conversion method and system
CN116188637B (en) * 2023-04-23 2023-08-15 世优(北京)科技有限公司 Data synchronization method and device
CN116188637A (en) * 2023-04-23 2023-05-30 世优(北京)科技有限公司 Data synchronization method and device
CN116741198A (en) * 2023-08-15 2023-09-12 合肥工业大学 Lip synchronization method based on multi-scale dictionary
CN116741198B (en) * 2023-08-15 2023-10-20 合肥工业大学 Lip synchronization method based on multi-scale dictionary
CN117150089A (en) * 2023-10-26 2023-12-01 环球数科集团有限公司 Character artistic image changing system based on AIGC technology
CN117150089B (en) * 2023-10-26 2023-12-22 环球数科集团有限公司 Character artistic image changing system based on AIGC technology

Similar Documents

Publication Publication Date Title
CN112562720A (en) Lip-synchronization video generation method, device, equipment and storage medium
CN113192161B (en) Virtual human image video generation method, system, device and storage medium
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
CN112562721B (en) Video translation method, system, device and storage medium
Cao et al. Expressive speech-driven facial animation
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
Zhou et al. An image-based visual speech animation system
WO2021023869A1 (en) Audio-driven speech animation using recurrent neutral network
CN113077537A (en) Video generation method, storage medium and equipment
CN114419702A (en) Digital human generation model, training method of model, and digital human generation method
Yu et al. Multimodal learning for temporally coherent talking face generation with articulator synergy
US11928767B2 (en) Method for audio-driven character lip sync, model for audio-driven character lip sync and training method therefor
CN115761075A (en) Face image generation method, device, equipment, medium and product
Bigioi et al. Speech driven video editing via an audio-conditioned diffusion model
Wang et al. Attention-based lip audio-visual synthesis for talking face generation in the wild
CN113395569B (en) Video generation method and device
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN114793300A (en) Virtual video customer service robot synthesis method and system based on generation countermeasure network
CN116828129A (en) Ultra-clear 2D digital person generation method and system
Wang et al. Talking faces: Audio-to-video face generation
Jha et al. Cross-language speech dependent lip-synchronization
Wang et al. Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
CN114155321B (en) Face animation generation method based on self-supervision and mixed density network
Ravichandran et al. Synthesizing photorealistic virtual humans through cross-modal disentanglement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination