CN117095672A - Digital human lip shape generation method and device - Google Patents

Digital human lip shape generation method and device

Info

Publication number
CN117095672A
Authority
CN
China
Prior art keywords
lip
text
target
audio
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310855500.2A
Other languages
Chinese (zh)
Other versions
CN117095672B (en)
Inventor
伏冠宇
杨明晖
薛吕欣
金春祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310855500.2A priority Critical patent/CN117095672B/en
Priority claimed from CN202310855500.2A external-priority patent/CN117095672B/en
Publication of CN117095672A publication Critical patent/CN117095672A/en
Application granted granted Critical
Publication of CN117095672B publication Critical patent/CN117095672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/20 - 3D [Three Dimensional] animation
    • G06T13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 - Speech enhancement by changing the amplitude
    • G10L21/0356 - Speech enhancement by changing the amplitude for synchronising with other signals, e.g. video signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 - Time compression or expansion
    • G10L21/055 - Time compression or expansion for synchronising with other signals, e.g. video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An embodiment of this specification relates to a digital human lip shape generation method and device. The input data of the method comprise text data and audio data corresponding to the text data. The text data and the audio data are fed into an alignment module for alignment, yielding, for any text unit in the text data, a corresponding audio fragment and time period. Then, using a pre-configured text-to-mouth-shape dictionary, the mouth shape information corresponding to a text unit is obtained; amplitude information corresponding to the mouth shape is obtained from specific audio features of the audio fragment, and the mouth shape information and the amplitude information are combined to obtain the lip shape for that time period. Arranging the lip shapes of the different time periods in chronological order gives a preliminary lip shape sequence. Interpolation and smoothing are applied to the blank time periods between adjacent lip shapes of the sequence, realising transitions between different lip shapes and yielding the final smooth lip shape sequence.

Description

Digital human lip shape generation method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of virtual digital humans, and in particular to a digital human lip shape generation method and apparatus.
Background
In recent years, virtual digital humans have been widely used in fields such as virtual reality, digital marketing, the metaverse, medical care, and education. An important indicator for evaluating a digital human is whether the experience is natural in every dimension, including visual texture, speech, motion, and lip shape. Here, lip shape refers to whether the digital human accurately performs open-and-close movements while broadcasting speech, and whether it performs the correct mouth shape when pronouncing different characters. Manually configuring lip shapes for every segment of audio broadcast by a digital human is prohibitively expensive and does not scale. The lip configuration step for digital humans therefore needs to be automated.
Disclosure of Invention
One or more embodiments of the present specification describe a digital human lip generation method and apparatus, which aim to implement automatic configuration of digital human lips.
In a first aspect, a digital human lip generating method is provided, including:
acquiring text data and corresponding audio data;
aligning the text data with the audio data in the time dimension to obtain a text unit set corresponding to the text data, and an audio fragment and a time period corresponding to any text unit in the text unit set;
For a target text unit in a target time period, determining target mouth shape information corresponding to the target text unit according to a first dictionary, wherein the first dictionary contains mouth shape information corresponding to any text unit;
determining target amplitude information corresponding to the target mouth shape information according to the target audio fragment corresponding to the target text unit;
and determining a target lip shape in the target time period according to the target mouth shape information and the target amplitude information.
In one possible implementation, the audio data is pre-recorded audio and the text units are words; aligning the text data with the audio data in the time dimension to obtain a text unit set corresponding to the text data, and an audio fragment and a time period corresponding to any text unit in the text unit set, wherein the method comprises the following steps:
and generating a word set, and an audio fragment and a time period corresponding to any word in the word set by using a forced alignment model according to the text data and the audio data.
In one possible implementation, the audio data is pre-recorded audio, and the text units are phonemes; aligning the text data with the audio data in the time dimension to obtain a text unit set corresponding to the text data, and an audio fragment and a time period corresponding to any text unit in the text unit set, wherein the method comprises the following steps:
Determining phonemes corresponding to any word in the text data to obtain a phoneme set;
and generating the audio fragments and the time periods corresponding to any phoneme in the phoneme set by using a forced alignment model according to the phoneme set and the audio data.
In a possible implementation manner, the audio data is generated according to the text data by using a text-to-speech model, and the text-to-speech model generates the audio data and simultaneously generates a text unit set corresponding to the text data and an audio fragment and a time period corresponding to any text unit in the text unit set;
aligning the text data with the audio data in the time dimension to obtain a text unit set corresponding to the text data, and an audio fragment and a time period corresponding to any text unit in the text unit set, wherein the method comprises the following steps:
and acquiring the text unit set and the audio frequency fragment and the time period corresponding to any text unit in the text unit set from the generation result of the text-to-speech model.
In one possible implementation manner, determining, according to the target audio segment corresponding to the target text unit, target amplitude information corresponding to the target mouth shape information includes:
And determining the target amplitude information according to the audio characteristics of the target audio fragment, wherein the audio characteristics comprise at least one of volume characteristics, fundamental frequency characteristics and formant characteristics.
In one possible implementation, the audio feature is a volume feature; determining the target amplitude information according to the audio characteristics of the target audio fragment, including:
and determining the target amplitude information according to the maximum volume value and the minimum volume value in the specific text range of the target text unit.
In one possible implementation, the specific text range in which the target text unit is located is a sentence or paragraph in which the target text unit is located.
In one possible embodiment, the method further comprises:
forming a set of lips from a plurality of said target lips, said set of lips comprising a first lip and a second lip adjacent in a time dimension;
and carrying out interpolation operation on a blank time period between the first lip and the second lip according to the first lip and the second lip to obtain a first lip sequence.
In one possible implementation, the interpolation operation is a linear interpolation.
In one possible implementation manner, according to the first lip and the second lip, interpolation operation is performed on a blank time period between the first lip and the second lip, so as to obtain a first lip sequence, which includes:
And determining the lip shape corresponding to any minimum time unit in the blank time period according to the first lip shape and the second lip shape, and obtaining a first lip shape sequence.
In one possible embodiment, the method further comprises:
and carrying out smoothing operation on the target lip sequence formed by the first lip, the second lip and the first lip sequence in the time dimension to obtain a target smooth lip sequence.
In one possible implementation, the smoothing operation is exponential smoothing or one-dimensional gaussian filtering smoothing.
In a second aspect, there is provided a digital human lip generating apparatus comprising:
an acquisition unit configured to acquire text data and corresponding audio data;
an alignment unit configured to align the text data with the audio data in a time dimension to obtain a text unit set corresponding to the text data, and an audio clip and a time period corresponding to any text unit in the text unit set;
a mouth shape determining unit configured to determine, for a target text unit in a target time period, target mouth shape information corresponding to the target text unit according to a first dictionary, wherein the first dictionary contains mouth shape information corresponding to any text unit;
The amplitude determining unit is configured to determine target amplitude information corresponding to the target mouth shape information according to the target audio fragment corresponding to the target text unit;
and a lip determining unit configured to determine a target lip over the target period of time based on the target mouth shape information and the target amplitude information.
In one possible embodiment, the method further comprises:
a set determination unit configured to compose a lip set including a first lip and a second lip adjacent in a time dimension from a plurality of the target lips;
and the interpolation unit is configured to perform interpolation operation on a blank time period between the first lip and the second lip according to the first lip and the second lip to obtain a first lip sequence.
In one possible embodiment, the method further comprises:
and the smoothing unit is configured to carry out smoothing operation on the target lip sequence formed by the first lip, the second lip and the first lip sequence in the time dimension to obtain the target smooth lip sequence.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements the method of the first aspect.
According to the digital human lip shape generation method and device, for the two different application scenarios of offline video production and real-time broadcasting by a digital human, fine control of mouth opening and closing and of the various mouth shapes is achieved without using annotated audio-text alignment information.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates an implementation scenario diagram of a digital human lip generation method according to one embodiment;
FIG. 2 illustrates a flow chart of a digital human lip generation method according to one embodiment;
FIG. 3 illustrates a schematic diagram of lip interpolation according to one embodiment;
FIG. 4 shows a schematic diagram of lip interpolation according to another embodiment;
fig. 5 shows a schematic block diagram of a digital human lip generating apparatus according to one embodiment.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
A digital human, also called an avatar, is a human-like digital character created with digital technology. Lip driving is a key capability of a digital human: when the digital human plays speech, the mouth animation should change plausibly with the audio, so that audio and mouth animation stay synchronized. In digital human scenarios such as customer-service broadcasting and intelligent assistants, the text and audio change continuously and cannot be enumerated, so it is impossible to reproduce every lip situation by manual production. A method is therefore needed to automatically generate a lip shape sequence in real time from the audio and text content.
FIG. 1 illustrates an implementation scenario diagram of digital human lip generation according to one embodiment. In the example of fig. 1, the input data includes Text data and audio data corresponding thereto, wherein the audio data may be prerecorded or may be machine generated from the Text data using a Text To Speech (TTS) model. After inputting the text data and the audio data into the alignment module, the alignment module first determines whether the audio data is prerecorded or generated in real time using TTS based on the text. If the audio data is prerecorded, and the application scene at the moment is an offline video production scene, an Alignment module aligns the text and the audio by using a Forced Alignment (Forced Alignment) model, splits the text data into a plurality of text units, and simultaneously obtains an audio fragment and a time period corresponding to each text unit; if the audio data is generated by using a TTS model, the application scene at the moment is an online real-time broadcasting scene, the TTS model can output a time period corresponding to any text unit while generating the audio, and the alignment module only needs to acquire the time period corresponding to any text unit from the generation result of the TTS model, and the time period can be represented according to a start time stamp and an end time stamp of the time period. After obtaining the time period corresponding to the text unit, the audio can be split into corresponding audio fragments according to the time period, and at the moment, the audio fragments and the time period corresponding to any text unit can be obtained.
After the audio fragment and time period corresponding to any text unit are obtained, the corresponding mouth shape information is looked up for the text unit in a pre-configured text-to-mouth-shape dictionary; amplitude information corresponding to the mouth shape is obtained from specific audio features of the audio fragment, and the mouth shape information and the amplitude information are combined to obtain the lip shape for that time period. Arranging the lip shapes of the different time periods in chronological order then gives a preliminary lip sequence.
The blank time periods between adjacent lips of the lip sequence are interpolated and smoothed to realise transitions between different lips, so that the overall lip animation is smoother and more natural, yielding the final smooth lip sequence. The smooth lip sequence is then sent to a digital human lip animation generation module (not shown in the figure), so that the digital human can generate the correct, smooth, and natural lip animation for each time point from the lip sequence.
Specific implementation steps of the above lip generation method are described below in connection with specific embodiments. Fig. 2 shows a flow chart of a digital human lip generation method according to one embodiment; the method may be executed by any platform, server, or device cluster with computing and processing capabilities.
In step 202, text data and corresponding audio data are acquired.
The audio data may be pre-recorded or may be machine generated in real-time from the text data using a TTS model.
When audio data are generated in real time from text data using a TTS model, the text data are first split into several phonemes. Phonemes are the basic phonetic units divided according to the natural properties of speech: the smallest distinguishable segments that make up an utterance. After splitting the text data into phonemes, the TTS model generates the corresponding audio data and, using digital signal processing and speech synthesis techniques, predicts the time period of each phoneme within the whole audio. The time period of a phoneme may be represented by a start timestamp and an end timestamp, from which the phoneme's corresponding audio fragment in the audio data can be obtained. In one embodiment, a duration prediction module (duration predictor) in a non-autoregressive TTS model predicts the number of frames and the start and end frames of each phoneme, and the duration of any phoneme is obtained by multiplying its number of frames by the duration of each frame.
When the audio data are pre-recorded, the time periods corresponding to the phonemes cannot be obtained at the same time. The pre-recorded audio may be obtained by having a person read the text aloud and recording the speech with an audio capture device, or it may be generated from the text data using a restricted TTS model. With a restricted TTS model, the duration-related modules inside the model cannot be accessed; for example, the TTS model may be an interface provided by an upstream service provider, or may have been exported as an executable file, in which case the time periods corresponding to the phonemes cannot be obtained from the model.
In step 204, the text data and the audio data are aligned in a time dimension, so as to obtain a text unit set corresponding to the text data, and an audio clip and a time period corresponding to any text unit in the text unit set.
A text unit is the smallest, manually chosen unit of text in the embodiment that is not split further. For example, in Chinese text a text unit may be a single Chinese character or a Chinese phoneme, and in English text it may be a word or an English phoneme.
In one embodiment, the audio data are pre-recorded audio and the text units are words. Here a word is the language unit one level above the phoneme in whichever language the text data contain, for example a single Chinese character in Chinese or a single word in English. A word set, and the audio fragment and time period corresponding to any word in the word set, are generated with a forced alignment model from the text data and the audio data.
The forced alignment model may be any existing model, for example MFA (Montreal Forced Aligner) or the forced-alignment module of the pre-trained acoustic model Wav2Vec2, and is not limited here. A forced alignment model aligns phrases, words, or phonemes in the text with the corresponding speech fragments in the audio based on statistics of the co-occurrence and correlation between the audio features and the text sequence, making the correspondence between text and speech more accurate.
After the text data and the audio data are fed into the forced alignment model to obtain the start and end timestamps of any word in the text data, the audio fragment corresponding to that word can be cut from the audio data according to those timestamps, as the sketch below illustrates.
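The following sketch assumes the alignment is available as (unit, start-second, end-second) tuples and the audio as a mono NumPy array with a known sample rate; the function and variable names are illustrative, not part of the filing.

```python
import numpy as np

def slice_audio_by_alignment(audio: np.ndarray, sample_rate: int,
                             alignment: list[tuple[str, float, float]]):
    """Split a mono waveform into per-unit clips using the (unit, start_s, end_s)
    timestamps produced by a forced aligner."""
    clips = []
    for unit, start_s, end_s in alignment:
        start = int(round(start_s * sample_rate))   # convert seconds to sample index
        end = int(round(end_s * sample_rate))
        clips.append((unit, (start_s, end_s), audio[start:end]))
    return clips

# Toy usage: one second of silence "aligned" to two characters.
audio = np.zeros(16000, dtype=np.float32)
clips = slice_audio_by_alignment(audio, 16000, [("你", 0.00, 0.45), ("好", 0.45, 0.90)])
```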
In another embodiment, the audio data is pre-recorded audio and the text units are phonemes.
Firstly, determining phonemes corresponding to any word in the text data to obtain a phoneme set.
According to the texts in different languages in the text data, the words are split into phonemes by using corresponding methods.
In a specific embodiment, when the text language is Chinese, then because Chinese contains polyphonic characters (characters with more than one pronunciation), a polyphone set is first used to decide, for any character, whether it has a single pronunciation or several. Since the number of polyphonic Chinese characters is limited, the polyphone set can be constructed in advance.
When the character has a single pronunciation, it is converted into the corresponding Chinese pinyin using a pinyin library, and the pinyin is then split into the corresponding phonemes according to a preset splitting rule. For example, the pinyin library converts the character 你 ("you") into "ni", which is then split into the phonemes "n" and "i" according to the preset splitting rule. The pinyin library may be an existing open-source library, such as pypinyin, or may be custom-built for specific needs, and is not limited here.
It should be noted that, since the set of pinyin syllables and the set of Chinese phonemes are both finite, the splitting rule can be formulated in advance by establishing a correspondence between every pinyin syllable and its phonemes. Moreover, because tone has very little influence on the digital human lip shape, tones are not distinguished in this process. A minimal sketch of this single-pronunciation path is given below.
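The sketch below converts a character to toneless pinyin with the open-source pypinyin library and then splits the pinyin into an initial and a final. The initial/final split is a simplifying assumption standing in for the pre-built pinyin-to-phoneme rule table described above; a production rule table may split some syllables differently.

```python
from pypinyin import lazy_pinyin

_INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
             "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def pinyin_to_phonemes(syllable: str) -> list[str]:
    """Split one toneless pinyin syllable into [initial, final] (or [final] alone)."""
    for ini in _INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return [ini, syllable[len(ini):]]
    return [syllable]                     # zero-initial syllables such as "a", "er"

def char_to_phonemes(char: str) -> list[str]:
    syllable = lazy_pinyin(char)[0]       # e.g. "你" -> "ni" (tone dropped)
    return pinyin_to_phonemes(syllable)

print(char_to_phonemes("你"))             # ['n', 'i']
```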
When the character is a polyphone, it cannot be directly converted into the corresponding pinyin with the pinyin library. In this case a polyphone prediction model is used to predict its pronunciation from the context in which it appears; for example, a pre-trained BERT-based text classification model takes the sentence or paragraph containing the character and predicts its pronunciation. Once the correct pronunciation of the polyphone is obtained, it is split into the corresponding phonemes with the same scheme as described above for single-pronunciation characters.
In a specific embodiment, when the text language is English, the English text is first split into a set of English words, and any English word is split into its phonemes using a phoneme transcription model. The phoneme transcription model may be an existing one, such as the CMU Pronouncing Dictionary or an ARPAbet-based transcriber, or may be custom-built for specific needs, and is not limited here.
When other languages exist in the text data, the corresponding existing technology of the language can also be used to split the words into phonemes, which is not described here again.
After the phoneme set is obtained, the audio fragments and time periods corresponding to any phoneme in the phoneme set are generated with a forced alignment model from the phoneme set and the audio data. This is similar to the earlier embodiment in which the text unit is a word; refer to the description of that embodiment, which is not repeated here.
In yet another embodiment, the audio data are generated in real time from the text data using a TTS model. Because the TTS model, while generating the audio data in step 202, has already produced the time period of every phoneme in the audio and the audio clip corresponding to that period, the required data only need to be read from the TTS model's output. If the text unit is a word, the time periods of the phonemes making up any word are spliced together to obtain the word's time period, and the corresponding audio fragments are spliced together (or the audio data are re-segmented according to the time period) to obtain the word's audio fragment, as sketched below.
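A minimal sketch of the splicing step, assuming each phoneme's time period is a (start, end) pair in seconds; the helper name is illustrative.

```python
def merge_phoneme_spans(phoneme_spans: list[tuple[float, float]]) -> tuple[float, float]:
    """Merge the time periods of the phonemes making up one word into the word's
    time period: earliest start to latest end."""
    starts, ends = zip(*phoneme_spans)
    return (min(starts), max(ends))

# e.g. the phonemes "n" and "i" of one character -> that character's time period
print(merge_phoneme_spans([(0.00, 0.18), (0.18, 0.45)]))  # (0.0, 0.45)
```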
In step 206, for a target text unit in a target time period, determining target mouth shape information corresponding to the target text unit according to a first dictionary, where the first dictionary includes mouth shape information corresponding to any text unit.
The first dictionary may be an existing text-to-mouth-shape dictionary, or may be custom-built for specific needs, and is not limited here. When custom-built, a person wearing a motion capture device reads each text unit aloud, and the mouth shape information captured and analysed by the device is used to build the first dictionary. The first dictionary may be built on blend shapes or on any other animation technique, without limitation.
In step 208, according to the target audio segment corresponding to the target text unit, the target amplitude information corresponding to the target mouth shape information is determined.
Specifically, the target amplitude information is determined according to audio characteristics of the target audio piece, wherein the audio characteristics comprise at least one of volume characteristics, fundamental frequency characteristics and formant characteristics.
In one embodiment, the target amplitude information is determined from the volume feature alone. For any text unit, the maximum volume value X_max and the minimum volume value X_min within a specific text range are first determined. The specific text range may be the sentence or paragraph in which the text unit is located, or another preset range. In a longer piece of audio, the audio features of one section may differ greatly from those of another; for the volume feature, for example, one section may be loud overall while another is quiet overall. If only the maximum volume value X_max and the minimum volume value X_min of the whole audio were considered, the target amplitude in the quiet sections would be too small, and the mouth movement of the finally generated animation would barely be visible there. Determining the maximum and minimum of the audio feature segment by segment, over a specific text range, avoids this problem.
After X_max and X_min are determined within a specific text range, the amplitude information of a text unit whose volume value within that range is X can be represented by an amplitude value P, calculated as shown in formula (1), where α is a hyperparameter controlling the minimum value of the amplitude value P. Under formula (1), P takes values in the range [α, 2α]. Preferably, α lies in the range 0.5 to 0.8.
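Formula (1) itself is not reproduced in this text. The sketch below is a linear mapping consistent with the stated properties (α sets the floor, and P spans [α, 2α] as X spans [X_min, X_max]); the function name and the clipping behaviour are assumptions, and the exact formula in the filing may differ.

```python
def amplitude_from_volume(x: float, x_min: float, x_max: float, alpha: float = 0.6) -> float:
    """Map a volume value x in [x_min, x_max] to an amplitude P in [alpha, 2*alpha]."""
    if x_max <= x_min:                       # degenerate range: fall back to the floor
        return alpha
    ratio = (x - x_min) / (x_max - x_min)    # normalised volume in [0, 1]
    ratio = min(max(ratio, 0.0), 1.0)        # clip in case x lies outside the range
    return alpha + alpha * ratio             # P in [alpha, 2*alpha]

print(amplitude_from_volume(0.8, x_min=0.2, x_max=1.0))  # 1.05 with alpha = 0.6
```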
In other embodiments, following the approach above, the target amplitude information corresponding to the target mouth shape information may be determined using the fundamental frequency feature, the formant feature, or any combination of these with the volume feature.
In step 210, a target lip over the target time period is determined based on the target mouth shape information and target amplitude information.
For the target text unit, after the corresponding target mouth shape information and the target amplitude information are determined, the corresponding target lip shape can be determined.
In one embodiment, one-hot encoding is used to encode the mouth shape information, giving a one-dimensional vector that is 1 in one dimension and 0 in the remaining dimensions. The vector dimension equals the total number of mouth shapes. For example, mouth shape A corresponds to [1,0,...,0], mouth shape B corresponds to [0,1,0,...,0], and each remaining mouth shape likewise has a 1 in its own dimension and 0 elsewhere.
Then, the vector corresponding to the mouth shape information is multiplied by the target amplitude to obtain the vector corresponding to the target lip shape. For example, if the target mouth shape of a certain text unit is mouth shape A and the target amplitude is 0.8, the vector corresponding to the target lip shape is [0.8,0,0,...,0]; if the target mouth shape of another text unit is mouth shape B and the target amplitude is 0.5, the vector corresponding to the target lip shape is [0,0.5,0,...,0].
For any lip vector, the value in any dimension represents the amplitude of the lip in the mouth shape corresponding to that dimension. For example, the lip vector [0,0.5,0,...,0] represents an amplitude of 0.5 for mouth shape B with all other mouth shapes at 0; the lip vector [0.2,0.4,0,...,0] represents an amplitude of 0.2 for mouth shape A, 0.4 for mouth shape B, and 0 for all other mouth shapes.
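A minimal sketch of this encoding; the mouth shape inventory and its ordering below are assumptions for illustration.

```python
import numpy as np

MOUTH_SHAPES = ["A", "B", "C", "D", "E", "F", "G", "H"]  # illustrative inventory

def lip_vector(mouth_shape: str, amplitude: float) -> np.ndarray:
    """One-hot encode the mouth shape and scale it by the amplitude, giving the
    lip vector described above."""
    vec = np.zeros(len(MOUTH_SHAPES), dtype=np.float32)
    vec[MOUTH_SHAPES.index(mouth_shape)] = amplitude
    return vec

print(lip_vector("A", 0.8))  # [0.8 0.  0.  0.  0.  0.  0.  0. ]
print(lip_vector("B", 0.5))  # [0.  0.5 0.  0.  0.  0.  0.  0. ]
```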
In other embodiments, other ways of representing lips may be devised, provided that different lips can be distinguished. For example, the mouth shape information may be encoded differently, or the mouth shape information and the amplitude information may be combined differently, such as a two-dimensional representation whose first row is the encoded vector of the mouth shape information and whose second row is the value or encoded vector of the amplitude information.
The lip shape corresponding to a single text unit in the text can be determined through steps 202-210. Arranging the lip shapes corresponding to the text units of a piece of text in chronological order yields the lip shape sequence corresponding to that piece of text.
In some possible embodiments, the method further comprises:
step 212, composing a lip set according to a plurality of the target lips, wherein the lip set comprises a first lip and a second lip which are adjacent in the time dimension.
And step 214, performing interpolation operation on a blank time period between the first lip and the second lip according to the first lip and the second lip to obtain a first lip sequence.
When speaking, there is silent blank time between adjacent utterances. If the lips were simply set to the closed state during the blank time, the resulting lip animation would appear hard and abrupt: the mouth would suddenly open during voiced time and suddenly close during unvoiced blank time. It is therefore necessary to set transition lip shapes in the blank time between adjacent text units, so that the final lip animation looks more natural.
Specifically, the first lip corresponds to a first text unit and the second lip to a second text unit, and the blank time is divided into several minimum time units according to a preset minimum time unit. The minimum time unit may be set arbitrarily according to actual needs, for example the duration of one frame, or 0.05 seconds, 0.02 seconds, and so on. The blank time between the first lip and the second lip is then interpolated from the first lip and the second lip, and the lip corresponding to each minimum time unit in the blank time period is determined, giving the first lip sequence.
In one embodiment, the first text unit and the second text unit are located in the same sentence. Without loss of generality, assume the first lip precedes the second lip in time. The blank time is divided into several minimum time units; linear interpolation between the first lip and the second lip gives the lip corresponding to each minimum time unit, and combining these lips yields the first lip sequence.
Illustratively, in one specific example, assume there are 5 mouth shapes in total, the first lip is [0,1,0,0,0], the second lip is [0,0,0,0,0.8], the minimum time unit is the duration of one frame, and there is a 3-frame blank time (three all-zero frames) between the first and second lips. The lip sequence before interpolation is:
[0,1,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0.8]
Interpolating the blank time linearly between the first lip and the second lip gives the interpolated first lip sequence:
[0,1,0,0,0],
[0,0.75,0,0,0.2],
[0,0.5,0,0,0.4],
[0,0.25,0,0,0.6],
[0,0,0,0,0.8]
A schematic representation of the resulting first lip sequence is shown in FIG. 3.
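A minimal sketch that reproduces the worked example above using NumPy's linspace; the function name is illustrative.

```python
import numpy as np

def interpolate_blank(first_lip, second_lip, n_blank_frames: int) -> np.ndarray:
    """Linearly interpolate the lip vectors of the blank frames lying between
    two adjacent lips, as in the example above."""
    # linspace includes both endpoints; drop them to keep only the blank frames
    steps = np.linspace(np.asarray(first_lip, dtype=float),
                        np.asarray(second_lip, dtype=float),
                        num=n_blank_frames + 2)
    return steps[1:-1]

first = [0, 1, 0, 0, 0]
second = [0, 0, 0, 0, 0.8]
print(interpolate_blank(first, second, n_blank_frames=3))
# [[0.   0.75 0.   0.   0.2 ]
#  [0.   0.5  0.   0.   0.4 ]
#  [0.   0.25 0.   0.   0.6 ]]
```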
In another embodiment, the first text unit and the second text unit are located in different sentences, the first at the end of the preceding sentence and the second at the beginning of the following sentence. Because there is a longer pause between sentences, a closed lip should appear during part of the pause; if a lip were assigned to every minimum time unit of the blank period as in the previous embodiment, the transition from the first lip to the second lip would be too slow and would contain no closed lip.
In this case, the blank time between the first lip and the second lip is first split into two sections from the middle: the first half belongs to the trailing blank time of the sentence containing the first text unit, and the second half to the leading blank time of the sentence containing the second text unit. The trailing blank time after the first text unit is interpolated in chronological order at a fixed rate; the leading blank time before the second text unit is interpolated in reverse chronological order at a fixed rate. The rate is a preset value indicating how much the lip amplitude decreases per time unit. Sequential interpolation reduces the lips of the time units after a given lip step by step in chronological order; reverse-chronological interpolation reduces the lips of the time units before a given lip step by step in reverse time order.
In a specific example, initially, as shown in fig. 4(a), there is a blank time between the first lip and the second lip; the blank time contains several time units and is split from the middle into a trailing blank time and a leading blank time. Assuming the first lip is [0,0.6,0,0,0], the trailing blank time after it is 4 frames, and the rate is 0.2 per frame, the lip sequence after sequential interpolation is:
[0,0.6,0,0,0],
[0,0.4,0,0,0],
[0,0.2,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0]
Assuming the second lip is [0,0.4,0,0,0] and the leading blank time before it is 4 frames, the lip sequence after reverse-chronological interpolation is:
[0,0,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0],
[0,0.2,0,0,0],
[0,0.4,0,0,0]
Splicing the lip sequences of the trailing blank time and the leading blank time in chronological order gives the first lip sequence. A schematic representation of the resulting first lip sequence is shown in FIG. 4(b).
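A minimal sketch of the fixed-rate interpolation at a sentence boundary, reproducing the example above; the function names are illustrative, and clamping the amplitude at zero is an assumption consistent with the worked numbers.

```python
import numpy as np

def decay_after(lip, n_frames: int, rate: float = 0.2) -> np.ndarray:
    """Sequential interpolation: starting from `lip`, reduce the amplitude by
    `rate` for each blank frame after it, clamping at zero (trailing blank)."""
    lip = np.asarray(lip, dtype=float)
    return np.stack([np.maximum(lip - rate * (i + 1), 0.0) for i in range(n_frames)])

def decay_before(lip, n_frames: int, rate: float = 0.2) -> np.ndarray:
    """Reverse-chronological interpolation: the blank frames before `lip`, decayed
    by `rate` per frame going backwards, returned in chronological order (leading blank)."""
    lip = np.asarray(lip, dtype=float)
    return np.stack([np.maximum(lip - rate * (n_frames - i), 0.0) for i in range(n_frames)])

first_lip = [0, 0.6, 0, 0, 0]
second_lip = [0, 0.4, 0, 0, 0]
trailing = decay_after(first_lip, n_frames=4)     # amplitudes 0.4, 0.2, 0.0, 0.0
leading = decay_before(second_lip, n_frames=4)    # amplitudes 0.0, 0.0, 0.0, 0.2
boundary = np.concatenate([trailing, leading])    # spliced in chronological order
```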
In other embodiments, other interpolation methods may be used, such as second-order interpolation, third-order interpolation, or spline interpolation, as long as the transition lips for the blank time can be set reasonably.
Through steps 212-214, reasonable transition lips can be set in the blank time between lips, making the lip animation generated from the overall lip sequence more natural than it would be without them.
In some possible embodiments, the method further comprises:
and step 216, performing smoothing operation on the target lip sequence formed by the first lip, the second lip and the first lip sequence in the time dimension to obtain a target smooth lip sequence.
After the transition lips are set by interpolation, the lip change between adjacent time units may still not be natural enough. A smoothing operation adjusts the trajectory of the lip changes so that they become smoother and free of abrupt jumps.
Specifically, the target lip sequence is a two-dimensional array whose vertical direction is the time dimension and whose horizontal direction is the amplitude dimension; the value at any position is the amplitude of the corresponding mouth shape at that time point. The one-dimensional amplitude sequence formed by any one amplitude dimension over the time points is smoothed, and the smoothing results of all amplitude dimensions are combined to obtain the target smooth lip sequence. The smoothing operation may be exponential smoothing, one-dimensional Gaussian filtering, or any other smoothing method.
Illustratively, in one specific example, the amplitude sequence [10,12,13,15,17,20,19,18,16,14,12,11] of the target lip sequence in column 2 over 12 consecutive time units is exponentially smoothed to give the smoothed sequence [10.00,10.30,10.91,12.13,13.29,15.01,16.20,16.94,16.46,15.12,13.88,12.82]. The smoothed sequences of the other columns are computed in the same way and then reassembled, in the original order, into a two-dimensional array to obtain the target smooth lip sequence.
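A minimal sketch of column-wise exponential smoothing. The smoothing constant used to produce the numbers in the example above is not stated, so the α below is an assumption and the printed output will not reproduce those values exactly.

```python
import numpy as np

def exponential_smooth(values: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Simple exponential smoothing of a one-dimensional amplitude sequence:
    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = np.empty(len(values), dtype=float)
    smoothed[0] = values[0]
    for t in range(1, len(values)):
        smoothed[t] = alpha * values[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

def smooth_lip_sequence(lips: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Smooth a (time x mouth-shape) lip sequence column by column and
    reassemble the columns into the smoothed sequence."""
    return np.column_stack([exponential_smooth(lips[:, j], alpha)
                            for j in range(lips.shape[1])])

column = np.array([10, 12, 13, 15, 17, 20, 19, 18, 16, 14, 12, 11], dtype=float)
print(np.round(exponential_smooth(column), 2))
```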
One or more embodiments of this specification only need the timestamps obtained by aligning audio and text, and configure different lip driving schemes for the two kinds of digital human animation scenarios: real-time broadcasting and offline production. One or more embodiments of this specification involve no collection of supervised data and therefore incur almost no data cost; at the same time, single-instance inference is fast and requires little storage, so the scheme supports high concurrency and deployment cost can be greatly reduced. Because lips are arranged within specific time intervals, the correct lip appears at every time point, which greatly reduces the error rate. In addition, being based on timestamp matching and lip rules, one or more embodiments of this specification can be migrated to different digital humans at almost zero cost and adapted to different lip schemes, free of the limitations imposed by data and by a specific virtual human and skipping the long, costly data collection stage, and therefore offer high compatibility.
According to an embodiment of another aspect, there is also provided a digital human lip shape generating device. Fig. 5 illustrates a schematic block diagram of a digital human lip generating apparatus, which may be deployed in any device, platform, or cluster of devices having computing, processing capabilities, according to one embodiment. As shown in fig. 5, the apparatus 500 includes:
an obtaining unit 501 configured to obtain text data and corresponding audio data;
an alignment unit 502, configured to align the text data with the audio data in a time dimension, so as to obtain a text unit set corresponding to the text data, and an audio clip and a time period corresponding to any text unit in the text unit set;
a mouth shape determining unit 503 configured to determine, for a target text unit over a target time period, target mouth shape information corresponding to the target text unit according to a first dictionary, where the first dictionary includes mouth shape information corresponding to any text unit;
an amplitude determining unit 504, configured to determine, according to a target audio segment corresponding to the target text unit, target amplitude information corresponding to the target mouth shape information;
a lip determining unit 505 configured to determine a target lip over the target period of time based on the target mouth shape information and target amplitude information.
In some possible embodiments, the apparatus 500 further comprises:
a set determination unit 506 configured to compose a lip set including a first lip and a second lip adjacent in a time dimension from a plurality of the target lips;
and an interpolation unit 507 configured to interpolate a blank time period between the first lip and the second lip according to the first lip and the second lip, so as to obtain a first lip sequence.
In some possible embodiments, the apparatus 500 further comprises:
and a smoothing unit 508, configured to perform smoothing operation on the target lip sequence formed by the first lip, the second lip and the first lip sequence in a time dimension, so as to obtain a target smooth lip sequence.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in any of the above embodiments.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory and a processor, wherein the memory has executable code stored therein, and the processor, when executing the executable code, implements the method described in any of the above embodiments.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not meant to limit the scope of the invention, but to limit the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (17)

1. A digital human lip generation method, comprising:
acquiring text data and corresponding audio data;
aligning the text data with the audio data in the time dimension to obtain a text unit set corresponding to the text data, and an audio fragment and a time period corresponding to any text unit in the text unit set;
for a target text unit in a target time period, determining target mouth shape information corresponding to the target text unit according to a first dictionary, wherein the first dictionary contains mouth shape information corresponding to any text unit;
Determining target amplitude information corresponding to the target mouth shape information according to the target audio fragment corresponding to the target text unit;
and determining a target lip shape in the target time period according to the target mouth shape information and the target amplitude information.
2. The method of claim 1, wherein the audio data is pre-recorded audio and the text units are words; aligning the text data with the audio data in the time dimension to obtain a text unit set corresponding to the text data, and an audio fragment and a time period corresponding to any text unit in the text unit set, wherein the method comprises the following steps:
and generating a word set, and an audio fragment and a time period corresponding to any word in the word set by using a forced alignment model according to the text data and the audio data.
3. The method of claim 1, wherein the audio data is pre-recorded audio and the text units are phonemes; aligning the text data with the audio data in the time dimension to obtain a text unit set corresponding to the text data, and an audio fragment and a time period corresponding to any text unit in the text unit set, wherein the method comprises the following steps:
Determining phonemes corresponding to any word in the text data to obtain a phoneme set;
and generating the audio fragments and the time periods corresponding to any phoneme in the phoneme set by using a forced alignment model according to the phoneme set and the audio data.
4. The method of claim 1, wherein the audio data is generated from the text data using a text-to-speech model that generates a set of text units corresponding to the text data and an audio clip and time period corresponding to any one of the set of text units while generating the audio data;
aligning the text data with the audio data in the time dimension to obtain a text unit set corresponding to the text data, and an audio fragment and a time period corresponding to any text unit in the text unit set, wherein the method comprises the following steps:
and acquiring the text unit set and the audio frequency fragment and the time period corresponding to any text unit in the text unit set from the generation result of the text-to-speech model.
5. The method of claim 1, wherein determining target amplitude information corresponding to the target mouth shape information according to a target audio clip corresponding to the target text unit comprises:
And determining the target amplitude information according to the audio characteristics of the target audio fragment, wherein the audio characteristics comprise at least one of volume characteristics, fundamental frequency characteristics and formant characteristics.
6. The method of claim 5, wherein the audio feature is a volume feature; determining the target amplitude information according to the audio characteristics of the target audio fragment, including:
and determining the target amplitude information according to the maximum volume value and the minimum volume value in the specific text range of the target text unit.
7. The method of claim 6, wherein the particular text range in which the target text unit is located is a sentence or paragraph in which the target text unit is located.
8. The method of claim 1, further comprising:
forming a set of lips from a plurality of said target lips, said set of lips comprising a first lip and a second lip adjacent in a time dimension;
and carrying out interpolation operation on a blank time period between the first lip and the second lip according to the first lip and the second lip to obtain a first lip sequence.
9. The method of claim 8, wherein the interpolation operation is linear interpolation.
10. The method of claim 8, wherein interpolating a blank time period between the first lip and the second lip from the first lip and the second lip to obtain a first lip sequence comprises:
and determining the lip shape corresponding to any minimum time unit in the blank time period according to the first lip shape and the second lip shape, and obtaining a first lip shape sequence.
11. The method of claim 8, further comprising:
and carrying out smoothing operation on the target lip sequence formed by the first lip, the second lip and the first lip sequence in the time dimension to obtain a target smooth lip sequence.
12. The method of claim 11, wherein the smoothing operation is exponential smoothing or one-dimensional gaussian filtering smoothing.
13. A digital human lip generating apparatus comprising:
an acquisition unit configured to acquire text data and corresponding audio data;
an alignment unit configured to align the text data with the audio data in a time dimension to obtain a text unit set corresponding to the text data, and an audio clip and a time period corresponding to any text unit in the text unit set;
A mouth shape determining unit configured to determine, for a target text unit in a target time period, target mouth shape information corresponding to the target text unit according to a first dictionary, wherein the first dictionary contains mouth shape information corresponding to any text unit;
the amplitude determining unit is configured to determine target amplitude information corresponding to the target mouth shape information according to the target audio fragment corresponding to the target text unit;
and a lip determining unit configured to determine a target lip over the target period of time based on the target mouth shape information and the target amplitude information.
14. The apparatus of claim 13, further comprising:
a set determination unit configured to compose a lip set including a first lip and a second lip adjacent in a time dimension from a plurality of the target lips;
and the interpolation unit is configured to perform interpolation operation on a blank time period between the first lip and the second lip according to the first lip and the second lip to obtain a first lip sequence.
15. The apparatus of claim 14, further comprising:
and the smoothing unit is configured to carry out smoothing operation on the target lip sequence formed by the first lip, the second lip and the first lip sequence in the time dimension to obtain the target smooth lip sequence.
16. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-12.
17. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-12.
CN202310855500.2A 2023-07-12 Digital human lip shape generation method and device Active CN117095672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310855500.2A CN117095672B (en) 2023-07-12 Digital human lip shape generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310855500.2A CN117095672B (en) 2023-07-12 Digital human lip shape generation method and device

Publications (2)

Publication Number Publication Date
CN117095672A 2023-11-21
CN117095672B (en) 2024-07-30


Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19990001062A (en) * 1997-06-12 1999-01-15 배순훈 Improved Lip Deformation Variable Generator
CN101826216A (en) * 2010-03-31 2010-09-08 中国科学院自动化研究所 Automatic generating system for role Chinese mouth shape cartoon
CN110751708A (en) * 2019-10-21 2020-02-04 北京中科深智科技有限公司 Method and system for driving face animation in real time through voice
CN110853614A (en) * 2018-08-03 2020-02-28 Tcl集团股份有限公司 Virtual object mouth shape driving method and device and terminal equipment
CN111798868A (en) * 2020-09-07 2020-10-20 北京世纪好未来教育科技有限公司 Voice forced alignment model evaluation method and device, electronic equipment and storage medium
CN112131988A (en) * 2020-09-14 2020-12-25 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for determining virtual character lip shape
CN112750187A (en) * 2021-01-19 2021-05-04 腾讯科技(深圳)有限公司 Animation generation method, device and equipment and computer readable storage medium
CN112770063A (en) * 2020-12-22 2021-05-07 北京奇艺世纪科技有限公司 Image generation method and device
CN112766166A (en) * 2021-01-20 2021-05-07 中国科学技术大学 Lip-shaped forged video detection method and system based on polyphone selection
CN112837401A (en) * 2021-01-27 2021-05-25 网易(杭州)网络有限公司 Information processing method and device, computer equipment and storage medium
CN113327615A (en) * 2021-08-02 2021-08-31 北京世纪好未来教育科技有限公司 Voice evaluation method, device, equipment and storage medium
CN115346511A (en) * 2022-07-01 2022-11-15 广东机电职业技术学院 Virtual object lip synchronization method and system based on text
CN115965722A (en) * 2022-12-21 2023-04-14 中国电信股份有限公司 3D digital human lip shape driving method and device, electronic equipment and storage medium
WO2023096275A1 (en) * 2021-11-23 2023-06-01 네이버 주식회사 Method and system for text-based avatar generation
CN116309984A (en) * 2023-01-30 2023-06-23 中国传媒大学 Mouth shape animation generation method and system based on text driving

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19990001062A (en) * 1997-06-12 1999-01-15 배순훈 Improved Lip Deformation Variable Generator
CN101826216A (en) * 2010-03-31 2010-09-08 中国科学院自动化研究所 Automatic generating system for role Chinese mouth shape cartoon
CN110853614A (en) * 2018-08-03 2020-02-28 Tcl集团股份有限公司 Virtual object mouth shape driving method and device and terminal equipment
CN110751708A (en) * 2019-10-21 2020-02-04 北京中科深智科技有限公司 Method and system for driving face animation in real time through voice
CN111798868A (en) * 2020-09-07 2020-10-20 北京世纪好未来教育科技有限公司 Voice forced alignment model evaluation method and device, electronic equipment and storage medium
US20220084502A1 (en) * 2020-09-14 2022-03-17 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for determining shape of lips of virtual character, device and computer storage medium
CN112131988A (en) * 2020-09-14 2020-12-25 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for determining virtual character lip shape
CN112770063A (en) * 2020-12-22 2021-05-07 北京奇艺世纪科技有限公司 Image generation method and device
CN112750187A (en) * 2021-01-19 2021-05-04 腾讯科技(深圳)有限公司 Animation generation method, device and equipment and computer readable storage medium
CN112766166A (en) * 2021-01-20 2021-05-07 中国科学技术大学 Lip-shaped forged video detection method and system based on polyphone selection
CN112837401A (en) * 2021-01-27 2021-05-25 网易(杭州)网络有限公司 Information processing method and device, computer equipment and storage medium
CN113327615A (en) * 2021-08-02 2021-08-31 北京世纪好未来教育科技有限公司 Voice evaluation method, device, equipment and storage medium
WO2023096275A1 (en) * 2021-11-23 2023-06-01 네이버 주식회사 Method and system for text-based avatar generation
CN115346511A (en) * 2022-07-01 2022-11-15 广东机电职业技术学院 Virtual object lip synchronization method and system based on text
CN115965722A (en) * 2022-12-21 2023-04-14 中国电信股份有限公司 3D digital human lip shape driving method and device, electronic equipment and storage medium
CN116309984A (en) * 2023-01-30 2023-06-23 中国传媒大学 Mouth shape animation generation method and system based on text driving

Similar Documents

Publication Publication Date Title
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
Federico et al. From speech-to-speech translation to automatic dubbing
CN112562721A (en) Video translation method, system, device and storage medium
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN112541078A (en) Intelligent news broadcasting method, device, equipment and storage medium
JPH10153998A (en) Auxiliary information utilizing type voice synthesizing method, recording medium recording procedure performing this method, and device performing this method
CN113628610B (en) Voice synthesis method and device and electronic equipment
CN118043884A (en) Audio and video converter
Airaksinen et al. Data augmentation strategies for neural network F0 estimation
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
Tits et al. Laughter synthesis: Combining seq2seq modeling with transfer learning
Razak et al. Emotion pitch variation analysis in Malay and English voice samples
CN117095672B (en) Digital human lip shape generation method and device
Um et al. Facetron: A Multi-speaker Face-to-Speech Model based on Cross-Modal Latent Representations
CN117095672A (en) Digital human lip shape generation method and device
CN113990295A (en) Video generation method and device
Aso et al. Speakbysinging: Converting singing voices to speaking voices while retaining voice timbre
Kim et al. SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer
GB2598563A (en) System and method for speech processing
Ronanki Prosody generation for text-to-speech synthesis
Kadam et al. A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation
CN113628609A (en) Automatic audio content generation
d’Alessandro et al. Reactive statistical mapping: Towards the sketching of performative control with data
Alastalo Finnish end-to-end speech synthesis with Tacotron 2 and WaveNet
Hasanabadi An overview of text-to-speech systems and media applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40100574

Country of ref document: HK

GR01 Patent grant