CN110347867B - Method and device for generating lip motion video - Google Patents


Info

Publication number
CN110347867B
Authority
CN
China
Prior art keywords
lip
key point
pronunciation unit
key point sequence
Prior art date
Legal status
Active
Application number
CN201910640823.3A
Other languages
Chinese (zh)
Other versions
CN110347867A (en)
Inventor
龙翔
李鑫
刘霄
赵翔
王平
李甫
张赫男
孙昊
文石磊
丁二锐
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910640823.3A
Publication of CN110347867A
Application granted
Publication of CN110347867B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application discloses a method and a device for generating lip motion videos. One embodiment of the method comprises: acquiring a target text; determining a lip key point sequence corresponding to each pronunciation unit of the target text; generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit; inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text; and splicing the lip action image sequence corresponding to the target text to generate a lip action video corresponding to the target text. This embodiment improves the efficiency of generating lip motion video.

Description

Method and device for generating lip motion video
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating lip motion videos.
Background
Lip motion video generation technology uses computer techniques to synthesize a lip motion video that has specified content, is fully synchronized in time, and is natural and smooth.
At present, a common lip motion video generation method is to record lip motion videos for all possible pronunciation units, split the sentence to be synthesized into a sequence of pronunciation units, scale the lip motion video of each pronunciation unit to its specified duration, and splice the scaled videos into the synthesized lip motion video.
Disclosure of Invention
The embodiment of the application provides a method and a device for generating lip motion videos.
In a first aspect, an embodiment of the present application provides a method for generating a lip motion video, including: acquiring a target text; determining a lip key point sequence corresponding to each pronunciation unit of the target text; generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit; inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text; and splicing the lip action image sequence corresponding to the target text to generate a lip action video corresponding to the target text.
In some embodiments, the method further comprises: synthesizing the voice corresponding to the target text by utilizing a voice synthesis technology; and fusing the voice corresponding to the target text into the lip action video corresponding to the target text.
In some embodiments, determining a lip keypoint sequence corresponding to each pronunciation unit of the target text comprises: acquiring lip action videos of continuous sentences pre-recorded by a target person and original lip action videos of each pronunciation unit; for each pronunciation unit, determining a lip key point sequence corresponding to a lip action video segment similar to an original lip action video of the pronunciation unit in the lip action videos of the continuous sentences, and generating a candidate lip key point sequence set corresponding to the pronunciation unit; and determining the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit.
In some embodiments, determining a lip key point sequence corresponding to a lip motion video segment similar to the original lip motion video of the pronunciation unit in the lip motion videos of the continuous sentences, and generating a candidate lip key point sequence set corresponding to the pronunciation unit includes: extracting lip key points of the lip action video of the continuous sentences to obtain a lip key point sequence of the continuous sentences; extracting lip key points of the original lip motion video of the pronunciation unit to obtain an original lip key point sequence of the pronunciation unit; and determining lip key point sequences similar to the original lip key point sequence of the pronunciation unit from the lip key point sequences of the continuous sentences, and generating a candidate lip key point sequence set corresponding to the pronunciation unit.
In some embodiments, determining a sequence of lip keypoints that is similar to the original sequence of lip keypoints for the pronunciation unit from a sequence of lip keypoints for a continuous sentence comprises: determining the end position of the lip key point sequence similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentence; and performing path backtracking based on the end position of the lip key point sequence similar to the original lip key point sequence of the pronunciation unit, and determining the lip key point sequence similar to the original lip key point sequence of the pronunciation unit.
In some embodiments, determining the lip keypoint sequence corresponding to the pronunciation unit from the set of candidate lip keypoint sequences corresponding to the pronunciation unit includes: calculating the similarity of each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to the adjacent pronunciation unit of the pronunciation unit; determining the ending position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarity; and performing path backtracking based on the ending position of the lip key point sequence corresponding to the pronunciation unit, and determining the lip key point sequence corresponding to the pronunciation unit.
In some embodiments, generating a lip keypoint sequence corresponding to the target text based on the lip keypoint sequence corresponding to each pronunciation unit includes: determining the starting and stopping time of each pronunciation unit based on the voice corresponding to the target text; and matching the lip key point sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit to generate the lip key point sequence corresponding to the target text.
In some embodiments, matching the lip keypoint sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit comprises: and performing linear interpolation on the lip key point sequence corresponding to each pronunciation unit in time sequence, and matching the lip key point sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit.
In some embodiments, after matching the lip keypoint sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit, the method further comprises: and smoothing the lip key point sequence corresponding to the adjacent pronunciation unit.
In some embodiments, smoothing the lip key point sequences corresponding to adjacent pronunciation units includes: selecting the lip key point sequence segment within the last preset time length of the previous pronunciation unit and the lip key point sequence segment within the first preset time length of the next pronunciation unit among the adjacent pronunciation units; and smoothing the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
In some embodiments, the image synthesis network is trained by: obtaining a training sample, wherein the training sample comprises sample lip key points and sample lip action images; and training to obtain the image synthesis network by taking the sample lip key points as input and the sample lip action images as output.
In some embodiments, the sample lip motion image is an image extracted from a lip motion video of a continuous sentence pre-recorded by the target person, and the sample lip keypoints are lip keypoints obtained by performing lip keypoint extraction on the extracted image.
In a second aspect, the present application provides an apparatus for generating a lip motion video, including: a text acquisition unit configured to acquire a target text; the sequence determining unit is configured to determine a lip key point sequence corresponding to each pronunciation unit of the target text; the sequence generating unit is configured to generate a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit; the image synthesis unit is configured to input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text; and the video generation unit is configured to splice the lip action image sequence corresponding to the target text and generate a lip action video corresponding to the target text.
In some embodiments, the apparatus further comprises: a voice synthesis unit configured to synthesize a voice corresponding to the target text using a voice synthesis technique; and the voice fusion unit is configured to fuse the voice corresponding to the target text into the lip action video corresponding to the target text.
In some embodiments, the sequence determination unit comprises: the video acquisition subunit is configured to acquire lip action videos of continuous sentences pre-recorded by a target person and original lip action videos of each pronunciation unit; the set generation subunit is configured to determine lip key point sequences corresponding to lip action video segments similar to the original lip action video of the pronunciation unit in the lip action videos of the continuous sentences and generate candidate lip key point sequence sets corresponding to the pronunciation unit for each pronunciation unit; and the sequence determining subunit is configured to determine the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit.
In some embodiments, the set generation subunit comprises: the first extraction module is configured to extract lip key points of the lip action video of the continuous sentences to obtain lip key point sequences of the continuous sentences; the second extraction module is configured to extract lip key points of the original lip motion video of the pronunciation unit to obtain an original lip key point sequence of the pronunciation unit; and the set generation module is configured to determine a lip key point sequence similar to the original lip key point sequence of the pronunciation unit from the lip key point sequences of the continuous sentences, and generate a candidate lip key point sequence set corresponding to the pronunciation unit.
In some embodiments, the set generation module is further configured to: determining the end position of the lip key point sequence similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentence; and performing path backtracking based on the end position of the lip key point sequence similar to the original lip key point sequence of the pronunciation unit, and determining the lip key point sequence similar to the original lip key point sequence of the pronunciation unit.
In some embodiments, the sequence determination subunit is further configured to: calculating the similarity of each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to the adjacent pronunciation unit of the pronunciation unit; determining the ending position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarity; and performing path backtracking based on the ending position of the lip key point sequence corresponding to the pronunciation unit, and determining the lip key point sequence corresponding to the pronunciation unit.
In some embodiments, the sequence generation unit comprises: the time determining subunit is configured to determine the start-stop time of each pronunciation unit based on the corresponding voice of the target text; and the sequence generating subunit is configured to match the lip key point sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit, and generate the lip key point sequence corresponding to the target text.
In some embodiments, the sequence generation subunit comprises: and the linear interpolation module is configured to perform linear interpolation on the lip key point sequence corresponding to each pronunciation unit in time sequence, and match the lip key point sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit.
In some embodiments, the sequence generation subunit further comprises: and the smoothing processing module is configured to smooth the lip key point sequences corresponding to the adjacent pronunciation units.
In some embodiments, the smoothing module is further configured to: select the lip key point sequence segment within the last preset time length of the previous pronunciation unit and the lip key point sequence segment within the first preset time length of the next pronunciation unit among the adjacent pronunciation units; and smooth the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
In some embodiments, the image synthesis network is trained by: obtaining a training sample, wherein the training sample comprises sample lip key points and sample lip action images; and training to obtain the image synthesis network by taking the sample lip key points as input and the sample lip action images as output.
In some embodiments, the sample lip motion image is an image extracted from a lip motion video of a continuous sentence pre-recorded by the target person, and the sample lip keypoints are lip keypoints obtained by performing lip keypoint extraction on the extracted image.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
According to the method and the device for generating the lip motion video, firstly, a lip key point sequence corresponding to each pronunciation unit of the obtained target text is determined; then generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit; then inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text; and finally, the lip action image sequence corresponding to the target text is spliced to generate a lip action video corresponding to the target text. Thereby improving the efficiency of generating lip motion video.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture to which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating lip motion video according to the present application;
FIG. 3 is a flow diagram of yet another embodiment of a method for generating lip motion video according to the present application;
FIG. 4 is a flow diagram of another embodiment of a method for generating lip motion video according to the present application;
FIG. 5 is a schematic diagram illustrating the structure of one embodiment of an apparatus for generating lip motion video according to the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present method for generating lip motion video or apparatus for generating lip motion video may be applied.
As shown in fig. 1, a system architecture 100 may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various client software, such as a video generation application, etc., may be installed on the terminal device 101.
The terminal apparatus 101 may be hardware or software. When the terminal apparatus 101 is hardware, it may be any of various electronic apparatuses that have a display screen and support video playback, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatus 101 is software, it can be installed in the above-described electronic apparatuses. It may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. It is not particularly limited herein.
The server 103 may be a server that provides various services, such as a video generation server. The video generation server may perform processing such as analysis on data such as the target text, generate a processing result (for example, a lip motion video corresponding to the target text), and push the processing result to the terminal device 101.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for generating the lip motion video provided by the embodiment of the present application is generally performed by the server 103, and accordingly, the apparatus for generating the lip motion video is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating lip motion video in accordance with the present application is shown. The method for generating the lip motion video comprises the following steps:
step 201, obtaining a target text.
In the present embodiment, an executing subject (e.g., the server 103 shown in fig. 1) of the method for generating a lip motion video may acquire a target text from a terminal device (e.g., the terminal device 101 shown in fig. 1) communicatively connected thereto. In practice, a user can open a video generation application installed on the terminal device, input a target text, and submit the target text to the execution main body.
Step 202, determining a lip key point sequence corresponding to each pronunciation unit of the target text.
In this embodiment, the execution subject may determine a lip keypoint sequence corresponding to each pronunciation unit of the target text.
Here, a pronunciation unit is a unit of human speech at a chosen granularity. For example, for Chinese, a pronunciation unit can be a Pinyin combination, a single vowel, and the like. Generally, the finer the granularity at which pronunciation units are split, the fewer distinct units there are and the easier it is to label text with pronunciation units; the coarser the granularity, the more units there are and the smoother the lip motion video synthesized from them. The execution main body may select a suitable splitting granularity as required and split the target text into a sequence of pronunciation units.
In general, the lip shape of the target person changes over the course of uttering different pronunciation units. The lip key point sequence corresponding to each pronunciation unit may be the sequence of lip key point sets at successive time points as the target person's lips move while uttering that pronunciation unit. The target person may be the person in the lip motion video to be synthesized, and may be any designated person. Lip key points may include the lip center, the lip corners, and the like. The lip key points can be represented as a vector, i.e., a one-dimensional vector formed by concatenating, in order, the coordinates of all key points of a single mouth after the lip center and angle have been normalized.
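As an illustrative sketch only (the normalization convention and the mouth-corner indices are assumptions, not part of the disclosure), such a normalized lip keypoint vector could be built as follows:

    import numpy as np

    def lip_keypoints_to_vector(points: np.ndarray) -> np.ndarray:
        """Flatten lip keypoints of shape (K, 2) into a one-dimensional vector
        after normalizing the lip center and angle (assumed convention)."""
        center = points.mean(axis=0)               # lip center
        centered = points - center                 # translate the center to the origin
        left, right = centered[0], centered[6]     # assumed indices of the two mouth corners
        angle = np.arctan2(right[1] - left[1], right[0] - left[0])
        c, s = np.cos(-angle), np.sin(-angle)
        rotation = np.array([[c, -s], [s, c]])
        normalized = centered @ rotation.T         # rotate the corner axis to horizontal
        return normalized.reshape(-1)              # concatenate the coordinates in order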
And step 203, generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit.
In this embodiment, the execution subject may generate a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit. For example, the execution main body may sequentially splice the lip keypoint sequence corresponding to each pronunciation unit according to the position of each pronunciation unit in the target text, so as to obtain the lip keypoint sequence corresponding to the target text.
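A minimal sketch of this splicing step, assuming each pronunciation unit's lip keypoint sequence is a list of keypoint vectors and the units are given in the order in which they occur in the target text (the function and variable names are illustrative):

    from typing import Dict, List
    import numpy as np

    def splice_keypoint_sequences(pronunciation_units: List[str],
                                  unit_sequences: Dict[str, List[np.ndarray]]) -> List[np.ndarray]:
        """Concatenate the per-unit lip keypoint sequences in the order in which
        the pronunciation units occur in the target text."""
        text_sequence: List[np.ndarray] = []
        for unit in pronunciation_units:           # units in text order
            text_sequence.extend(unit_sequences[unit])
        return text_sequence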
And 204, inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text.
In this embodiment, the executing entity may input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text. The image synthesis network can be used for synthesizing the lip action images and representing the corresponding relation between the lip key points and the lip action images.
In some optional implementations of the present embodiment, the image synthesis network may be obtained by performing supervised training on an existing machine learning model using a machine learning method and a training sample. In general, the image synthesis network may employ a pix2pixHD neural network structure to generate high-resolution lip motion images.
Here, the image synthesis network may be trained by:
first, training samples are obtained.
Wherein the training samples may include sample lip keypoints and sample lip motion images. The sample lip motion image may be an image extracted from a lip motion video of consecutive sentences pre-recorded by the target person. The sample lip key points may be lip key points obtained by performing lip key point extraction on the extracted image.
And then, training to obtain an image synthesis network by taking the sample lip key points as input and the sample lip motion images as output.
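As an illustrative sketch of how such training pairs might be assembled (extract_lip_keypoints is a hypothetical helper, and decoding frames with OpenCV is only an example choice):

    import cv2

    def build_training_samples(video_path, extract_lip_keypoints):
        """Pair each frame of the pre-recorded continuous-sentence video with the
        lip keypoints extracted from it: (sample lip keypoints, sample lip image)."""
        samples = []
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            keypoints = extract_lip_keypoints(frame)   # hypothetical keypoint extractor
            if keypoints is not None:
                samples.append((keypoints, frame))
        cap.release()
        return samples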
And step 205, splicing the lip action image sequence corresponding to the target text to generate a lip action video corresponding to the target text.
In this embodiment, the execution subject may stitch the lip motion image sequence corresponding to the target text to generate a lip motion video corresponding to the target text.
In some optional implementation manners of this embodiment, the execution main body may also first synthesize a voice corresponding to the target text by using a voice synthesis technology; and then, the voice corresponding to the target text is fused into the lip action video corresponding to the target text.
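A minimal sketch of these two steps, assuming the lip action images are frames of equal size; writing the silent video with OpenCV and muxing in the synthesized speech with the ffmpeg command-line tool are example choices rather than requirements of the method:

    import subprocess
    import cv2

    def write_lip_video(frames, video_path, fps=25):
        """Splice a lip action image sequence into a silent video file."""
        height, width = frames[0].shape[:2]
        writer = cv2.VideoWriter(video_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                 fps, (width, height))
        for frame in frames:
            writer.write(frame)
        writer.release()

    def fuse_speech(video_path, speech_path, output_path):
        """Fuse the synthesized speech into the lip action video."""
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-i", speech_path,
                        "-c:v", "copy", "-c:a", "aac", "-shortest", output_path],
                       check=True)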
According to the method for generating the lip motion video, firstly, a lip key point sequence corresponding to each pronunciation unit of the obtained target text is determined; then generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit; then inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text; and finally, the lip action image sequence corresponding to the target text is spliced to generate a lip action video corresponding to the target text. Thereby improving the efficiency of generating lip motion video.
With further reference to fig. 3, a flow 300 of yet another embodiment of a method for generating lip motion video according to the present application is shown. The method for generating the lip motion video comprises the following steps:
step 301, obtaining a target text.
In this embodiment, the specific operation of step 301 has been described in detail in step 201 in the embodiment shown in fig. 2, and is not described herein again.
Step 302, obtaining a lip motion video of continuous sentences pre-recorded by the target person and an original lip motion video of each pronunciation unit.
In the present embodiment, an executing subject (e.g., the server 103 shown in fig. 1) of the method for generating a lip motion video may acquire a lip motion video of consecutive sentences pre-recorded by a target person and an original lip motion video of each pronunciation unit.
Generally, the present embodiment requires two kinds of data to be pre-recorded. First, a lip motion video of continuous sentences spoken by the target person: that is, a video is recorded in which the target person speaks continuous sentences in the target language, and the center and angle of the lips are normalized. The target language may be the language spoken by the character in the lip motion video to be synthesized, and may be any single language or a set of multiple languages. Multiple languages entail more pronunciation units, with higher requirements on data labeling and computing power. Second, the original lip motion video of each pronunciation unit of the target person: specifically, a video of the target person speaking each pronunciation unit is recorded, the pronunciation start time and end time of each pronunciation unit are annotated, the segment from the start time to the end time is cut out, and the center and angle of the lips are normalized.
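As an illustrative sketch (the annotation format, in seconds per pronunciation unit, and the fixed frame rate are assumptions), the per-unit recordings could be cut from the annotated video as follows:

    import cv2

    def cut_pronunciation_unit_clips(video_path, annotations, fps=25):
        """Cut, for each pronunciation unit, the segment from its annotated
        pronunciation start time to its end time (times in seconds)."""
        cap = cv2.VideoCapture(video_path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
        cap.release()
        clips = {}
        for unit, (start, end) in annotations.items():
            clips[unit] = frames[int(start * fps):int(end * fps)]
        return clips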
Step 303, for each pronunciation unit, determining a lip keypoint sequence corresponding to a lip motion video segment similar to the original lip motion video of the pronunciation unit in the lip motion videos of the continuous sentences, and generating a candidate lip keypoint sequence set corresponding to the pronunciation unit.
In this embodiment, for each pronunciation unit, the execution subject may determine a lip keypoint sequence corresponding to a lip motion video segment similar to an original lip motion video of the pronunciation unit in the lip motion videos of the continuous sentence, so as to generate a candidate lip keypoint sequence set corresponding to the pronunciation unit.
In some optional implementations of this embodiment, for each pronunciation unit, the execution subject may find a large number of lip motion video segments similar to the original lip motion video of the pronunciation unit from the lip motion videos of the consecutive sentences, and perform lip keypoint extraction on the found lip motion video segments respectively to generate a candidate lip keypoint sequence set corresponding to the pronunciation unit.
In some optional implementation manners of this embodiment, the execution body may first perform lip key point extraction on the lip motion video of the continuous sentence to obtain a lip key point sequence of the continuous sentence; then for each pronunciation unit, extracting lip key points of the original lip motion video of the pronunciation unit to obtain an original lip key point sequence of the pronunciation unit; and finally, determining a lip key point sequence similar to the original lip key point sequence of the pronunciation unit from the lip key point sequences of the continuous sentences, and generating a candidate lip key point sequence set corresponding to the pronunciation unit.
Step 304, determining the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit.
In this embodiment, the execution subject may determine the lip keypoint sequence corresponding to the pronunciation unit from the set of candidate lip keypoint sequences corresponding to the pronunciation unit. In general, the execution subject may select a candidate lip keypoint sequence that can be naturally connected and transitioned to a lip keypoint sequence corresponding to an adjacent pronunciation unit from the set of candidate lip keypoint sequences corresponding to the pronunciation unit, as the lip keypoint sequence corresponding to the pronunciation unit.
And 305, synthesizing the voice corresponding to the target text by using a voice synthesis technology.
In this embodiment, the executing body may synthesize the speech corresponding to the target text by using a speech synthesis technique.
And step 306, determining the starting and stopping time of each pronunciation unit based on the voice corresponding to the target text.
In this embodiment, the execution subject may determine the start-stop time of each pronunciation unit based on the speech corresponding to the target text. In general, the start-stop time of each pronunciation unit is determined during speech synthesis and can be obtained from a conventional speech synthesis system.
And 307, matching the lip key point sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit.
In this embodiment, the execution subject may match the lip keypoint sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit. In general, the execution subject may expand or compress the lip keypoint sequence corresponding to each pronunciation unit into a corresponding start-stop time. For example, the execution subject may perform linear interpolation on the lip keypoint sequence corresponding to each pronunciation unit in time sequence, and match the lip keypoint sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit.
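A sketch of this time-matching step, assuming the per-unit lip keypoint sequence is an array of shape (n_frames, dim) and an assumed frame rate determines how many frames fall within the start-stop interval:

    import numpy as np

    def match_to_duration(keypoints: np.ndarray, start: float, end: float,
                          fps: float = 25.0) -> np.ndarray:
        """Expand or compress a per-unit lip keypoint sequence to the number of
        frames spanned by its start-stop time, by linear interpolation in time."""
        target_len = max(int(round((end - start) * fps)), 2)
        src_t = np.linspace(0.0, 1.0, num=len(keypoints))
        dst_t = np.linspace(0.0, 1.0, num=target_len)
        # Interpolate every keypoint coordinate independently along the time axis.
        return np.stack([np.interp(dst_t, src_t, keypoints[:, d])
                         for d in range(keypoints.shape[1])], axis=1)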
And 308, smoothing the lip key point sequence corresponding to the adjacent pronunciation unit to generate the lip key point sequence corresponding to the target text.
In this embodiment, the execution subject may smooth the lip key point sequences corresponding to adjacent pronunciation units to generate the lip key point sequence corresponding to the target text. For example, the execution subject may first select, for each pair of adjacent pronunciation units, the lip key point sequence segment within the last preset time length of the previous pronunciation unit and the lip key point sequence segment within the first preset time length of the next pronunciation unit; and then smooth the lip key point sequences of the adjacent pronunciation units based on the selected lip key points.
For example, if a transition is to be made between two adjacent pronunciation units, the lip keypoint sequence segment (x_0, ..., x_L) in the last β ms of the previous pronunciation unit is taken (β is selected according to the length of the pronunciation unit; for example, 30 ms can be selected for Chinese Pinyin), together with the lip keypoint sequence segment (y_0, ..., y_L) in the first β ms of the next pronunciation unit, and the two are smoothed using the following formulas:
x_i = x_i + i(y_0 - x_L)/(2L), where i = 0, 1, ..., L;
y_j = y_j - (L - j)(y_0 - x_L)/(2L), where j = 0, 1, ..., L.
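A sketch of this smoothing rule, assuming x and y are arrays of lip keypoint vectors covering the last and first β ms of the two adjacent units, each of length L + 1:

    import numpy as np

    def smooth_transition(x: np.ndarray, y: np.ndarray):
        """Blend the tail segment x (last beta ms of the previous unit) and the
        head segment y (first beta ms of the next unit) so that the two
        sequences meet halfway at the unit boundary."""
        L = len(x) - 1
        if L == 0:
            return x, y                        # nothing to blend over
        gap = y[0] - x[L]                      # offset between the two sequences
        for i in range(L + 1):
            x[i] = x[i] + i * gap / (2 * L)
        for j in range(L + 1):
            y[j] = y[j] - (L - j) * gap / (2 * L)
        return x, y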
Step 309, inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text.
And 310, splicing the lip action image sequence corresponding to the target text to generate a lip action video corresponding to the target text.
In the present embodiment, the specific operations of steps 309-310 have been described in detail in steps 204-205 in the embodiment shown in fig. 2, and are not described herein again.
And 311, fusing the voice corresponding to the target text into the lip action video corresponding to the target text.
In this embodiment, the execution subject may fuse the speech corresponding to the target text into the lip motion video corresponding to the target text.
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the flow 300 of the method for generating a lip motion video in this embodiment highlights a step of matching the lip keypoint sequence corresponding to the pronunciation unit with the start-stop time corresponding to the pronunciation unit. Therefore, the scheme described in the embodiment improves the matching degree of the lip motion video corresponding to the target text and the voice corresponding to the target text, so that the generated lip motion video is more natural and smooth.
With further reference to fig. 4, a flow 400 of another embodiment of a method for generating lip motion video in accordance with the present application is shown. The method for generating the lip motion video comprises the following steps:
step 401, obtaining a target text.
Step 402, obtaining lip motion videos of continuous sentences pre-recorded by the target person and original lip motion videos of each pronunciation unit.
In the present embodiment, the specific operations of steps 401-402 have been described in detail in steps 301-302 in the embodiment shown in fig. 3, and are not described herein again.
And 403, extracting lip key points of the lip motion video of the continuous sentence to obtain a lip key point sequence of the continuous sentence.
In this embodiment, an execution subject (for example, the server 103 shown in fig. 1) of the method for generating lip motion videos may perform lip keypoint extraction on lip motion videos of consecutive sentences to obtain a lip keypoint sequence of the consecutive sentences. For example, the lip keypoint sequence of the continuous sentence can be denoted (c_1, ..., c_M).
In step 404, for each pronunciation unit, lip keypoint extraction is performed on the original lip motion video of the pronunciation unit to obtain an original lip keypoint sequence of the pronunciation unit.
In this embodiment, for each pronunciation unit, the execution subject may perform lip keypoint extraction on the original lip motion video of the pronunciation unit to obtain an original lip keypoint sequence of the pronunciation unit. For example, the original lip keypoint sequence of the pronunciation unit may be denoted (a_1, ..., a_N).
Step 405, determining an end position of the lip keypoint sequence similar to the original lip keypoint sequence of the pronunciation unit based on the original lip keypoints in the original lip keypoint sequence of the pronunciation unit and the lip keypoints in the lip keypoint sequence of the continuous sentence.
In this embodiment, the execution subject may determine the end position of the lip keypoint sequence similar to the original lip keypoint sequence of the pronunciation unit based on the original lip keypoints of the pronunciation unit and the lip keypoints of the lip keypoint sequence of the continuous sentence.
Step 406, performing path backtracking based on the end position of the lip keypoint sequence similar to the original lip keypoint sequence of the pronunciation unit, determining the lip keypoint sequence similar to the original lip keypoint sequence of the pronunciation unit, and generating a candidate lip keypoint sequence set corresponding to the pronunciation unit.
In this embodiment, the executing body may perform path backtracking based on the end position of the lip keypoint sequence similar to the original lip keypoint sequence of the pronunciation unit, and determine the lip keypoint sequence similar to the original lip keypoint sequence of the pronunciation unit, so as to generate the candidate lip keypoint sequence set corresponding to the pronunciation unit.
For example, the lip keypoint sequence similar to the original lip keypoint sequence of the pronunciation unit is a segment of the lip keypoint sequence (c_1, ..., c_M) of the continuous sentence.
Here, the top α most similar lip keypoint sequences can be found by a sequence-similarity dynamic warping algorithm. The specific algorithm is as follows:
First, s(0, 0) is initialized to 0.
Then, an iterative formula is applied to calculate all s(i, j), where i = 1, ..., N and j = 1, ..., M. (The specific recurrence is given as an equation in the original disclosure; it scores how well the original lip keypoint a_i matches the continuous-sentence lip keypoint c_j, subject to the penalty parameters below.) According to which branch of the recurrence is selected, the corresponding optimal path p(i, j) is recorded.
Here, ρ1, ρ2 and ρ3 are penalty parameters, real numbers greater than 0, selected according to the size of the video, the speech rate, and the like.
Finally, the α largest values s(N, j) are found; each such j is the end position of a similar lip keypoint sequence, and path backtracking according to p(i, j) recovers the whole lip keypoint sequence. This yields the lip keypoint sequences most similar to the original lip keypoint sequence of the pronunciation unit.
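Because the recurrence for s(i, j) is available only as an equation image, the following Python sketch implements one plausible reading of it: a similarity term for matching a_i with c_j, with the penalties ρ1, ρ2 and ρ3 charged to the three possible transitions. The function name and the exact scoring are assumptions, not the patent's verbatim formula.

    import numpy as np

    def top_alpha_similar_segments(a, c, alpha, rho1, rho2, rho3):
        """Find segments of the continuous-sentence lip keypoint sequence c that
        resemble the per-unit lip keypoint sequence a. The recurrence below is an
        assumed reading of the patent's equation, not its verbatim formula."""
        N, M = len(a), len(c)
        s = np.full((N + 1, M + 1), -np.inf)
        s[0, :] = 0.0                          # a match may start anywhere in c
        p = np.zeros((N + 1, M + 1), dtype=int)
        for i in range(1, N + 1):
            for j in range(1, M + 1):
                sim = -np.linalg.norm(a[i - 1] - c[j - 1])   # higher means more similar
                moves = (s[i - 1, j - 1] - rho1,             # match a_i with c_j
                         s[i - 1, j] - rho2,                 # skip a frame of a
                         s[i, j - 1] - rho3)                 # skip a frame of c
                best = int(np.argmax(moves))
                s[i, j] = sim + moves[best]
                p[i, j] = best
        end_positions = np.argsort(s[N, 1:])[::-1][:alpha] + 1   # top-alpha s(N, j)
        segments = []
        for j_end in end_positions:
            i, j = N, int(j_end)
            while i > 0 and j > 0:             # path backtracking via p(i, j)
                move = p[i, j]
                if move == 0:
                    i, j = i - 1, j - 1
                elif move == 1:
                    i -= 1
                else:
                    j -= 1
            segments.append(list(c[j:j_end]))  # matched span of the continuous sentence
        return segments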
Step 407, calculating the similarity between each candidate lip keypoint sequence corresponding to the pronunciation unit and each candidate lip keypoint sequence corresponding to the adjacent pronunciation unit of the pronunciation unit.
In this embodiment, the execution subject may calculate a similarity between each candidate lip keypoint sequence corresponding to the pronunciation unit and each candidate lip keypoint sequence corresponding to an adjacent pronunciation unit of the pronunciation unit.
And step 408, determining the ending position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarity.
In this embodiment, the execution subject may determine the end position of the lip keypoint sequence corresponding to the pronunciation unit based on the calculated similarity.
Step 409, performing path backtracking based on the end position of the lip key point sequence corresponding to the pronunciation unit, and determining the lip key point sequence corresponding to the pronunciation unit.
In this embodiment, the execution main body may perform path backtracking based on the end position of the lip key point sequence corresponding to the pronunciation unit, and determine the lip key point sequence corresponding to the pronunciation unit.
For example, the target text may be directly converted into a sequence of pronunciation units, and α candidate sequences for each position are obtained according to the sequence of pronunciation units:
({X_{1,1}, ..., X_{1,α}}, ..., {X_{T,1}, ..., X_{T,α}}).
We need to select a most suitable candidate X_{i,k_i} at each position i = 1, ..., T so that adjacent pronunciation units are joined best, i.e., so that the last lip keypoint of the previous segment X_{i,k_i} is most similar to the first lip keypoint of the next segment X_{i+1,k_{i+1}}. Here, the distance between the two lip keypoint vectors is used to measure similarity, abbreviated as dist(X_{i,k_i}, X_{i+1,k_{i+1}}).
We can find the globally optimal lip keypoint sequence by the adjacent similarity dynamic programming algorithm. The specific algorithm is as follows:
First, d(1, k) = 0 is initialized for k = 1, ..., α.
Then, applying an iterative formula to calculate all d (i, k) ═ 0, where i ═ 1.., T; k 1.
Figure BDA0002131798570000155
And recording the best path q (i, j):
Figure BDA0002131798570000156
finally, find the smallest d (T, k)T) Then, for the Tth pronunciation unit we take the kthTThe candidate is used as the lip key point sequence, path backtracking is carried out according to q (i, j), and all k can be obtainedi1, T. Thus, the best connected lip key point candidate sequence is obtained
Figure BDA0002131798570000157
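A minimal Python sketch of this candidate-selection step follows; the function name select_best_candidates, the dist helper, and the assumption of α candidates per pronunciation unit are illustrative choices that mirror the reconstructed recurrence above, not the patent's verbatim implementation.

    import numpy as np

    def select_best_candidates(candidates):
        """candidates[i][k] is the k-th candidate lip keypoint sequence of the
        i-th pronunciation unit (alpha candidates per unit assumed); choose one
        candidate per unit so that adjacent units join as smoothly as possible."""
        def dist(prev_seq, next_seq):
            # Distance between the last keypoint of the previous candidate and
            # the first keypoint of the next candidate.
            return float(np.linalg.norm(np.asarray(prev_seq[-1]) - np.asarray(next_seq[0])))

        T, alpha = len(candidates), len(candidates[0])
        d = np.zeros((T, alpha))               # d(i, k): best accumulated joining cost
        q = np.zeros((T, alpha), dtype=int)    # q(i, k): best previous candidate index
        for i in range(1, T):
            for k in range(alpha):
                costs = [d[i - 1, kp] + dist(candidates[i - 1][kp], candidates[i][k])
                         for kp in range(alpha)]
                q[i, k] = int(np.argmin(costs))
                d[i, k] = costs[q[i, k]]
        # Path backtracking from the smallest d(T, k_T) recovers k_1, ..., k_T.
        ks = [int(np.argmin(d[T - 1]))]
        for i in range(T - 1, 0, -1):
            ks.append(int(q[i, ks[-1]]))
        ks.reverse()
        return [candidates[i][ks[i]] for i in range(T)]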
And step 410, generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit.
Step 411, inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network, and obtaining the lip action image sequence corresponding to the target text.
And step 412, splicing the lip action image sequence corresponding to the target text to generate a lip action video corresponding to the target text.
In the present embodiment, the specific operations of steps 410-412 have been described in detail in steps 203-205 in the embodiment shown in fig. 2, and are not described herein again.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating a lip motion video in the present embodiment highlights a step of determining a lip key point sequence corresponding to a pronunciation unit. Therefore, the scheme described in the embodiment enables adjacent pronunciation units to naturally connect and transition, so that the generated lip motion video is more natural and smooth.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a lip motion video, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating a lip motion video according to the present embodiment may include: a text acquisition unit 501, a sequence determination unit 502, a sequence generation unit 503, an image synthesis unit 504, and a video generation unit 505. The text acquiring unit 501 is configured to acquire a target text; a sequence determining unit 502 configured to determine a lip key point sequence corresponding to each pronunciation unit of the target text; a sequence generating unit 503 configured to generate a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit; the image synthesis unit 504 is configured to input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network, so as to obtain a lip action image sequence corresponding to the target text; and a video generating unit 505 configured to splice the lip motion image sequence corresponding to the target text and generate a lip motion video corresponding to the target text.
In the present embodiment, in the apparatus 500 for generating a lip motion video: the detailed processing and the technical effects of the text obtaining unit 501, the sequence determining unit 502, the sequence generating unit 503, the image synthesizing unit 504, and the video generating unit 505 can refer to the related descriptions of step 201, step 202, step 203, step 204, and step 205 in the corresponding embodiment of fig. 2, which are not repeated herein.
In some optional implementations of this embodiment, the means 500 for generating lip motion video further includes: a speech synthesis unit (not shown in the figure) configured to synthesize speech corresponding to the target text by using a speech synthesis technique; and a voice fusion unit (not shown in the figure) configured to fuse the voice corresponding to the target text into the lip motion video corresponding to the target text.
In some optional implementations of this embodiment, the sequence determining unit 502 includes: a video acquisition subunit (not shown in the figure) configured to acquire lip motion videos of consecutive sentences pre-recorded by the target person and an original lip motion video of each pronunciation unit; a set generating subunit (not shown in the figure), configured to determine, for each pronunciation unit, a lip keypoint sequence corresponding to a lip motion video segment similar to the original lip motion video of the pronunciation unit in the lip motion video of the continuous sentence, and generate a set of candidate lip keypoint sequences corresponding to the pronunciation unit; and a sequence determining subunit (not shown in the figure) configured to determine the lip keypoint sequence corresponding to the pronunciation unit from the candidate lip keypoint sequence set corresponding to the pronunciation unit.
In some optional implementations of this embodiment, the set generating subunit includes: a first extraction module (not shown in the figure), configured to perform lip key point extraction on the lip motion video of the continuous sentence to obtain a lip key point sequence of the continuous sentence; a second extraction module (not shown in the figure) configured to perform lip keypoint extraction on the original lip motion video of the pronunciation unit to obtain an original lip keypoint sequence of the pronunciation unit; and a set generating module (not shown in the figure) configured to determine a lip key point sequence similar to the original lip key point sequence of the pronunciation unit from the lip key point sequences of the continuous sentences, and generate a candidate lip key point sequence set corresponding to the pronunciation unit.
In some optional implementations of this embodiment, the set generation module is further configured to: determining the end position of the lip key point sequence similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentence; and performing path backtracking based on the end position of the lip key point sequence similar to the original lip key point sequence of the pronunciation unit, and determining the lip key point sequence similar to the original lip key point sequence of the pronunciation unit.
In some optional implementations of this embodiment, the sequence determination subunit is further configured to: calculating the similarity of each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to the adjacent pronunciation unit of the pronunciation unit; determining the ending position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarity; and performing path backtracking based on the ending position of the lip key point sequence corresponding to the pronunciation unit, and determining the lip key point sequence corresponding to the pronunciation unit.
In some optional implementations of this embodiment, the sequence generating unit 503 includes: a time determining subunit (not shown in the figure) configured to determine a start-stop time of each pronunciation unit based on the corresponding voice of the target text; and a sequence generating subunit (not shown in the figure) configured to match the lip key point sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit, and generate the lip key point sequence corresponding to the target text.
In some optional implementations of this embodiment, the sequence generating subunit includes: and a linear interpolation module (not shown in the figure) configured to linearly interpolate the lip keypoint sequence corresponding to each pronunciation unit in time sequence, and match the lip keypoint sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit.
In some optional implementations of this embodiment, the sequence generating subunit further includes: and a smoothing module (not shown in the figure) configured to smooth the lip keypoint sequence corresponding to the adjacent pronunciation unit.
In some optional implementations of this embodiment, the smoothing module is further configured to: select the lip key point sequence segment within the last preset time length of the previous pronunciation unit and the lip key point sequence segment within the first preset time length of the next pronunciation unit among the adjacent pronunciation units; and smooth the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
In some optional implementations of this embodiment, the image synthesis network is trained by: obtaining a training sample, wherein the training sample comprises sample lip key points and sample lip action images; and training to obtain the image synthesis network by taking the sample lip key points as input and the sample lip action images as output.
In some optional implementations of this embodiment, the sample lip motion image is an image extracted from a lip motion video of a continuous sentence prerecorded by the target person, and the sample lip keypoints are lip keypoints obtained by performing lip keypoint extraction on the extracted image.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use in implementing an electronic device (e.g., server 103 shown in FIG. 1) of an embodiment of the present application is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or electronic device. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor including a text acquisition unit, a sequence determination unit, a sequence generation unit, an image synthesis unit, and a video generation unit. The names of these units do not, in some cases, constitute a limitation of the units themselves; for example, the text acquisition unit may also be described as "a unit that acquires a target text".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a target text; determining a lip key point sequence corresponding to each pronunciation unit of the target text; generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit; inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text; and splicing the lip action image sequence corresponding to the target text to generate a lip action video corresponding to the target text.
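Tying the stored-program steps together, the sketch below shows one possible wiring of the overall flow; every callable passed in is a placeholder standing for a component described above, not an API defined by the application:

    def generate_lip_video(target_text,
                           split_into_pronunciation_units,
                           lookup_unit_keypoints,
                           assemble_text_keypoints,
                           image_synthesis_network,
                           splice_frames_to_video):
        """Placeholder pipeline mirroring the stored-program steps above."""
        units = split_into_pronunciation_units(target_text)             # pronunciation units
        unit_sequences = [lookup_unit_keypoints(u) for u in units]      # per-unit keypoints
        text_sequence = assemble_text_keypoints(units, unit_sequences)  # sequence for the text
        frames = image_synthesis_network(text_sequence)                 # lip action images
        return splice_frames_to_video(frames)                           # lip action video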
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (13)

1. A method for generating lip motion video, comprising:
acquiring a target text;
determining a lip key point sequence corresponding to each pronunciation unit of the target text;
generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit;
inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text;
splicing the lip action image sequence corresponding to the target text to generate a lip action video corresponding to the target text;
wherein the determining the lip key point sequence corresponding to each pronunciation unit of the target text comprises:
acquiring lip action videos of continuous sentences pre-recorded by a target person and original lip action videos of each pronunciation unit;
for each pronunciation unit, determining, in the lip action videos of the continuous sentences, the lip key point sequences corresponding to lip action video segments similar to the original lip action video of the pronunciation unit, and generating a candidate lip key point sequence set corresponding to the pronunciation unit;
calculating the similarity of each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to the adjacent pronunciation unit of the pronunciation unit;
determining the ending position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarity;
and performing path backtracking based on the ending position of the lip key point sequence corresponding to the pronunciation unit, and determining the lip key point sequence corresponding to the pronunciation unit.
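Read as an algorithm, the last three steps of claim 1 resemble a Viterbi-style search: score how well candidates of adjacent pronunciation units join, pick the best-scoring ending candidate, and backtrack along stored pointers. The sketch below assumes a generic similarity function and index-based bookkeeping; it is not code taken from the specification:

    import numpy as np

    def select_unit_sequences(candidate_sets, similarity):
        """candidate_sets[i] lists the candidate lip keypoint sequences for
        pronunciation unit i; similarity(a, b) scores how well candidate a of
        one unit joins candidate b of the next. Returns one chosen candidate
        index per unit by picking the best ending candidate and backtracking."""
        best_score = [np.zeros(len(candidate_sets[0]))]
        back_ptr = []
        for prev_set, cur_set in zip(candidate_sets, candidate_sets[1:]):
            scores = np.empty(len(cur_set))
            ptrs = np.empty(len(cur_set), dtype=int)
            for j, cur in enumerate(cur_set):
                trans = [best_score[-1][i] + similarity(prev, cur)
                         for i, prev in enumerate(prev_set)]
                ptrs[j] = int(np.argmax(trans))
                scores[j] = trans[ptrs[j]]
            best_score.append(scores)
            back_ptr.append(ptrs)
        choice = [int(np.argmax(best_score[-1]))]   # ending position
        for ptrs in reversed(back_ptr):             # path backtracking
            choice.append(int(ptrs[choice[-1]]))
        return list(reversed(choice))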
2. The method of claim 1, wherein the method further comprises:
synthesizing the voice corresponding to the target text by utilizing a voice synthesis technology;
and fusing the voice corresponding to the target text into the lip action video corresponding to the target text.
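As a sketch of the fusing step in claim 2 only: any audio/video muxing tool can combine the synthesized speech with the silent lip action video. ffmpeg is merely one common choice, not one named by the application, and the file paths are placeholders:

    import subprocess

    def mux_speech_into_video(video_path, speech_path, output_path):
        """Mux a synthesized speech track into the silent lip action video,
        copying the video stream and encoding the audio as AAC."""
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-i", speech_path,
             "-c:v", "copy", "-c:a", "aac", "-shortest", output_path],
            check=True,
        )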
3. The method according to claim 1, wherein the determining a lip keypoint sequence corresponding to a lip motion video segment similar to the original lip motion video of the pronunciation unit in the lip motion videos of the continuous sentence, and generating a candidate lip keypoint sequence set corresponding to the pronunciation unit comprises:
performing lip key point extraction on the lip action video of the continuous sentence to obtain a lip key point sequence of the continuous sentence;
extracting lip key points of the original lip motion video of the pronunciation unit to obtain an original lip key point sequence of the pronunciation unit;
and determining lip key point sequences similar to the original lip key point sequence of the pronunciation unit from the lip key point sequences of the continuous sentences, and generating a candidate lip key point sequence set corresponding to the pronunciation unit.
4. The method of claim 3, wherein the determining a sequence of lip keypoints that is similar to the original sequence of lip keypoints for the pronunciation unit from the sequence of lip keypoints for the continuous sentence comprises:
determining an end position of a lip key point sequence similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentence;
and performing path backtracking based on the end position of the lip key point sequence similar to the original lip key point sequence of the pronunciation unit, and determining the lip key point sequence similar to the original lip key point sequence of the pronunciation unit.
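The end-position-plus-backtracking formulation of claim 4 is reminiscent of subsequence dynamic time warping. The following sketch, which assumes a per-frame distance function and simple step costs not specified in the claim, finds the best end column and backtracks to the start of the matched segment:

    import numpy as np

    def find_similar_segment(unit_seq, sentence_seq, frame_dist):
        """Subsequence-DTW sketch: locate, in the continuous-sentence keypoint
        sequence, the segment most similar to a unit's original keypoint
        sequence by finding the best end position and backtracking."""
        n, m = len(unit_seq), len(sentence_seq)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, :] = 0.0                          # the match may start anywhere
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = frame_dist(unit_seq[i - 1], sentence_seq[j - 1])
                cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
        end = int(np.argmin(cost[n, 1:])) + 1     # best end position (column index)
        i, j = n, end                             # backtrack to the start position
        while i > 0:
            step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return j, end                             # matched segment is sentence_seq[j:end]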
5. The method according to claim 2, wherein the generating a lip keypoint sequence corresponding to the target text based on the lip keypoint sequence corresponding to each pronunciation unit comprises:
determining the starting and ending time of each pronunciation unit based on the voice corresponding to the target text;
and matching the lip key point sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit to generate the lip key point sequence corresponding to the target text.
6. The method according to claim 5, wherein the matching the lip keypoint sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit comprises:
and performing linear interpolation on the lip key point sequence corresponding to each pronunciation unit in time sequence, and matching the lip key point sequence corresponding to each pronunciation unit to the start-stop time corresponding to each pronunciation unit.
7. The method according to claim 5, wherein, after the lip key point sequence corresponding to each pronunciation unit is matched to the start-stop time corresponding to each pronunciation unit, the method further comprises:
and smoothing the lip key point sequence corresponding to the adjacent pronunciation unit.
8. The method according to claim 7, wherein the smoothing of the lip keypoint sequence corresponding to the adjacent pronunciation unit comprises:
selecting, for adjacent pronunciation units, a lip key point sequence segment of preset duration at the end of the sequence corresponding to the preceding pronunciation unit and a lip key point sequence segment of preset duration at the beginning of the sequence corresponding to the following pronunciation unit;
and smoothing the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
9. The method of one of claims 1 to 8, wherein the image synthesis network is trained by:
obtaining a training sample, wherein the training sample comprises sample lip key points and sample lip action images;
and training to obtain the image synthesis network by taking the sample lip key points as input and the sample lip action images as output.
10. The method of claim 9, wherein the sample lip motion image is an image extracted from a lip motion video of a continuous sentence prerecorded by a target person, and the sample lip keypoints are lip keypoints obtained by performing lip keypoint extraction on the extracted image.
11. An apparatus for generating lip motion video, comprising:
a text acquisition unit configured to acquire a target text;
a sequence determining unit configured to determine a lip key point sequence corresponding to each pronunciation unit of the target text;
a sequence generating unit configured to generate a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit;
an image synthesis unit configured to input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip action image sequence corresponding to the target text;
a video generation unit configured to splice the lip action image sequence corresponding to the target text and generate a lip action video corresponding to the target text;
wherein the sequence determination unit comprises:
a video acquisition subunit configured to acquire lip action videos of continuous sentences pre-recorded by a target person and original lip action videos of each pronunciation unit;
a set generation subunit configured to determine, for each pronunciation unit and in the lip motion videos of the continuous sentences, the lip key point sequences corresponding to lip motion video segments similar to the original lip motion video of the pronunciation unit, and generate a candidate lip key point sequence set corresponding to the pronunciation unit;
a sequence determination subunit configured to calculate a similarity of each candidate lip keypoint sequence corresponding to the pronunciation unit and each candidate lip keypoint sequence corresponding to an adjacent pronunciation unit of the pronunciation unit; determining the ending position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarity; and performing path backtracking based on the ending position of the lip key point sequence corresponding to the pronunciation unit, and determining the lip key point sequence corresponding to the pronunciation unit.
12. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-10.
13. A computer-readable medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, carries out the method according to any one of claims 1-10.
CN201910640823.3A 2019-07-16 2019-07-16 Method and device for generating lip motion video Active CN110347867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910640823.3A CN110347867B (en) 2019-07-16 2019-07-16 Method and device for generating lip motion video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910640823.3A CN110347867B (en) 2019-07-16 2019-07-16 Method and device for generating lip motion video

Publications (2)

Publication Number Publication Date
CN110347867A (en) 2019-10-18
CN110347867B (en) 2022-04-19

Family

ID=68175446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910640823.3A Active CN110347867B (en) 2019-07-16 2019-07-16 Method and device for generating lip motion video

Country Status (1)

Country Link
CN (1) CN110347867B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111147894A (en) * 2019-12-09 2020-05-12 苏宁智能终端有限公司 Sign language video generation method, device and system
CN111261187B (en) * 2020-02-04 2023-02-14 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN112131988B (en) * 2020-09-14 2024-03-26 北京百度网讯科技有限公司 Method, apparatus, device and computer storage medium for determining virtual character lip shape
CN112381926A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112752118B (en) * 2020-12-29 2023-06-27 北京字节跳动网络技术有限公司 Video generation method, device, equipment and storage medium
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment
CN113111812A (en) * 2021-04-20 2021-07-13 深圳追一科技有限公司 Mouth action driving model training method and assembly
CN113223123A (en) * 2021-05-21 2021-08-06 北京大米科技有限公司 Image processing method and image processing apparatus
JP2024513640A (en) * 2021-07-07 2024-03-27 Beijing Sogou Technology Development Co., Ltd. Virtual object action processing method, device, and computer program
CN113744368A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Animation synthesis method and device, electronic equipment and storage medium
CN113873297B (en) * 2021-10-18 2024-04-30 深圳追一科技有限公司 Digital character video generation method and related device
CN114173188B (en) * 2021-10-18 2023-06-02 深圳追一科技有限公司 Video generation method, electronic device, storage medium and digital person server
CN116579298A (en) * 2022-01-30 2023-08-11 腾讯科技(深圳)有限公司 Video generation method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279364A1 (en) * 2014-03-29 2015-10-01 Ajay Krishnan Mouth-Phoneme Model for Computerized Lip Reading

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751692A (en) * 2009-12-24 2010-06-23 Sichuan University Method for voice-driven lip animation
CN103971393A (en) * 2013-01-29 2014-08-06 Kabushiki Kaisha Toshiba Computer generated head
CN104361620A (en) * 2014-11-27 2015-02-18 Han Huijian Mouth shape animation synthesis method based on comprehensive weighted algorithm
CN109409195A (en) * 2018-08-30 2019-03-01 Huaqiao University A neural network-based lip reading recognition method and system
CN109637518A (en) * 2018-11-07 2019-04-16 Beijing Sogou Technology Development Co., Ltd. Virtual newscaster implementation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Video Rewrite: Driving Visual Speech with Audio; Christoph Bregler; ACM SIGGRAPH; 20101231; full text *
A synthesized "talking head" based on a 3D model and photographs; Lai Wei; Journal of Image and Graphics; 20040731; full text *

Also Published As

Publication number Publication date
CN110347867A (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN110347867B (en) Method and device for generating lip motion video
CN107945786B (en) Speech synthesis method and device
KR20190139751A (en) Method and apparatus for processing video
CN109754783B (en) Method and apparatus for determining boundaries of audio sentences
KR102346046B1 (en) 3d virtual figure mouth shape control method and device
CN108121800B (en) Information generation method and device based on artificial intelligence
CN110446066B (en) Method and apparatus for generating video
Saunders et al. Signing at scale: Learning to co-articulate signs for large-scale photo-realistic sign language production
US20190088253A1 (en) Method and apparatus for converting english speech information into text
CN110444203B (en) Voice recognition method and device and electronic equipment
JP7232293B2 (en) MOVIE GENERATION METHOD, APPARATUS, ELECTRONICS AND COMPUTER-READABLE MEDIUM
CN107481715B (en) Method and apparatus for generating information
CN109582825B (en) Method and apparatus for generating information
US20240070397A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN110880198A (en) Animation generation method and device
CN112329451B (en) Sign language action video generation method, device, equipment and storage medium
CN113111812A (en) Mouth action driving model training method and assembly
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
CN112383721B (en) Method, apparatus, device and medium for generating video
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN112381926A (en) Method and apparatus for generating video
CN111415662A (en) Method, apparatus, device and medium for generating video
CN112308950A (en) Video generation method and device
WO2023046016A1 (en) Optimization of lip syncing in natural language translated video
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant