CN113539240B - Animation generation method, device, electronic equipment and storage medium - Google Patents

Animation generation method, device, electronic equipment and storage medium

Info

Publication number
CN113539240B
CN113539240B (application CN202110812403.6A)
Authority
CN
China
Prior art keywords
phoneme
mouth shape
pronunciation
current
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110812403.6A
Other languages
Chinese (zh)
Other versions
CN113539240A (en)
Inventor
王海新
杜峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202110812403.6A priority Critical patent/CN113539240B/en
Publication of CN113539240A publication Critical patent/CN113539240A/en
Application granted granted Critical
Publication of CN113539240B publication Critical patent/CN113539240B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the invention discloses an animation generation method and apparatus, an electronic device, and a storage medium. The animation generation method includes: acquiring target voice data and target text data corresponding to the target voice data, where the target voice data includes voice data of different languages; analyzing and recognizing the target text data to obtain each phoneme contained in it, and analyzing and recognizing the target voice data to obtain the pronunciation period of each of those phonemes; determining the language to which each phoneme belongs; querying the mouth shape configuration table of that language to obtain the mouth shape configured for each phoneme; and driving the avatar with the corresponding mouth shape during the pronunciation period of each phoneme to generate a mouth shape animation. The embodiment of the invention improves how closely the avatar's mouth shapes fit the expressed sentence, so that the avatar's mouth shapes are richer and its expression is smoother and more natural.

Description

Animation generation method, device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to computer technology, in particular to an animation generation method, an animation generation device, electronic equipment and a storage medium.
Background
With the development of the live-streaming industry, major platforms are rushing to launch their own avatars and have them speak interactive sentences to users, for example saying "welcome, xx" when a user enters the live-streaming room, or explaining product information to users in the room. When the avatar interacts with a user, the interactive sentence usually contains languages other than Chinese (such as an English user name or an English product name). For these other-language parts of the sentence, the common practice today is to substitute a Chinese mouth shape for the mouth shape of the other language when driving the avatar.
In the course of realizing the invention, the inventors found that driving the avatar with Chinese mouth shapes in place of the mouth shapes of other languages causes problems such as mouth shapes that do not match the interactive sentence, monotonous mouth movement of the avatar, and stiff, unnatural expression.
Disclosure of Invention
The embodiment of the invention provides an animation generation method and apparatus, an electronic device, and a storage medium, which improve how closely the avatar's mouth shapes fit the expressed sentence, so that the avatar's mouth shapes are more varied and its expression is smoother and more natural.
In a first aspect, an embodiment of the present invention provides an animation generation method, including:
Acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
Analyzing and recognizing the target text data to obtain each phoneme contained in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each of those phonemes;
determining the language to which each phoneme belongs;
inquiring a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
the avatar is driven according to the corresponding mouth shape in the pronunciation period of each phoneme to generate mouth shape animation.
In a second aspect, an embodiment of the present invention provides an animation generating apparatus, including:
an acquisition module, used for acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
a recognition module, used for analyzing and recognizing the target text data to obtain each phoneme contained in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each of those phonemes;
the determining module is used for determining the language to which each phoneme belongs;
the query module is used for querying a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
And the generating module is used for driving the avatar according to the corresponding mouth shape in the pronunciation period of each phoneme so as to generate mouth shape animation.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the animation generation method according to any one of the embodiments of the present invention when executing the program.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an animation generation method according to any of the embodiments of the present invention.
In the embodiment of the invention, the voice data (that is, the target voice data) and the text data (that is, the target text data) of a target sentence composed of different languages can be analyzed and recognized to obtain each phoneme contained in the target text data and the pronunciation period of each of those phonemes, and the language to which each phoneme belongs is determined; the mouth shape configuration table of that language is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven with the corresponding mouth shape during the pronunciation period of each phoneme to generate the mouth shape animation. Because the phonemes of the different languages contained in the target text data are recognized, and the mouth shape configuration table of each language is queried to obtain mouth shapes of the corresponding languages for those phonemes, the avatar is driven with mouth shapes of the different languages and its mouth shapes vary richly. This avoids the problems caused by substituting Chinese mouth shapes for the mouth shapes of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth movement, and stiff, unnatural expression; it improves how closely the avatar's mouth shapes fit the expressed sentence and makes the avatar's expression smoother and more natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an animation generation method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a method for creating a mouth shape configuration table according to an embodiment of the present invention.
Fig. 3 is an exemplary diagram of a mouth shape configuration table provided in an embodiment of the present invention.
Fig. 4 is another exemplary diagram of a mouth shape configuration table provided in an embodiment of the present invention.
Fig. 5 is a flowchart illustrating a method of driving an avatar according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a vector change rule according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a structure of an animation generating device according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1 is a schematic flow chart of an animation generation method according to an embodiment of the present invention, where the method may be performed by an animation generation device according to an embodiment of the present invention, and the device may be implemented in software and/or hardware. In a specific embodiment, the apparatus may be integrated in an electronic device, which may be a computer, for example. The following embodiments will be described taking the example of the integration of the apparatus in an electronic device, and referring to fig. 1, the method may specifically include the following steps:
Step 101, target voice data and target text data corresponding to the target voice data are obtained, wherein the target voice data comprise voice data of different languages.
By way of example, take the case where the avatar is driven to express a target sentence: the target voice data can be understood as the target sentence in speech form, and the target text data as the target sentence in text form. The target sentence may be composed of different languages, such as Chinese, English, French, or Russian. A specific target sentence (represented by text data) is, for example, "你好Sam" ("Hello, Sam"), which is composed of Chinese and English.
In a specific implementation, the avatar to be driven may be a virtual character whose appearance is not limited, and it may be two-dimensional or three-dimensional. Specifically, the target voice data and target text data can be obtained from real-time user input: for example, speech uttered by the user can be picked up by a microphone to obtain the target voice data, and text provided by the user can be captured through a keyboard or a screen to obtain the target text data corresponding to the target voice data. Alternatively, preset target voice data and corresponding target text data can be obtained, or the target voice data can be obtained first and then converted into the corresponding target text data.
Specifically, after the target text data is obtained, it can be segmented by language to obtain segmented text data; for example, when the target text data contains both Chinese and English, the Chinese and English parts can be separated. For example, when the target text data is "你好Sam", it can be split into "你好" and "Sam". After segmentation, the segmented text data can be preprocessed to obtain the text data to be recognized, which facilitates subsequent recognition. The preprocessing may include, but is not limited to, converting special symbols into words of the corresponding language and splitting contractions; for example, the symbol "*" can be converted to "star", "&" to "and", "#" to "well", and the contraction "what's" can be split into "what is".
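As a rough illustration of this segmentation and preprocessing step, the following Python sketch splits a sentence into Chinese and non-Chinese runs and normalizes special symbols and contractions; the helper names and mapping tables are hypothetical and only reuse the examples above.

```python
import re

# Hypothetical mappings reusing the examples above; a real system would be far more complete.
SYMBOL_MAP = {"*": "star", "&": "and", "#": "well"}
CONTRACTIONS = {"what's": "what is"}

def segment_by_language(text):
    """Split text into runs of Chinese characters and runs of other text (e.g. English)."""
    return [seg for seg in re.findall(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", text) if seg.strip()]

def preprocess(segments):
    """Convert special symbols to words and split contractions, yielding text to be recognized."""
    cleaned = []
    for seg in segments:
        for symbol, word in SYMBOL_MAP.items():
            seg = seg.replace(symbol, word)
        for contraction, expansion in CONTRACTIONS.items():
            seg = re.sub(contraction, expansion, seg, flags=re.IGNORECASE)
        cleaned.append(seg.strip())
    return cleaned

print(preprocess(segment_by_language("你好Sam")))  # ['你好', 'Sam']
```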
Step 102, analyzing and recognizing the target text data to obtain each phoneme contained in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each of those phonemes.
Specifically, the target text data after preprocessing may be analyzed and identified, that is, the text data to be identified is analyzed and identified, so as to obtain each phoneme included in the target text data. In a specific implementation, the text data to be recognized can be analyzed and recognized by utilizing a pre-established pronunciation dictionary of each language, so that each phoneme included in the target text data is obtained. For example, a pronunciation dictionary of a corresponding language may be queried based on each word or word in the text data to be recognized, thereby obtaining each phoneme.
Illustratively, the pronunciation dictionary libraries include, for example, a Chinese pronunciation dictionary library, an English pronunciation dictionary library, and a French pronunciation dictionary library. The Chinese pronunciation dictionary library may contain each Chinese character or word together with its pronunciation (phonemes), and the English pronunciation dictionary library may contain each word together with its pronunciation (phonemes).
In a specific embodiment, part of the data in the Chinese pronunciation dictionary library may be as follows, where the Arabic numerals indicate the tones of the Chinese phonemes:
穿过 (through): ch uan1 g uo4
穿梭 (shuttle): ch uan1 s uo1
穿着 (wearing): ch uan1 zh uo2
传 (pass): ch uan2
In a specific embodiment, part of the data in the English pronunciation dictionary library may be as follows, where the Arabic numerals mark the lexical stress of the English phonemes:
ABILENE AE1 B IH0 L IY2 N
ABILITIES AH0 B IH1 L AH0 T IY0 Z
ABILITY AH0 B IH1 L AH0 T IY0
ABIMELECH AE0 B IH0 M EH0 L EH1 K
ABINADAB AH0 B AY1 N AH0 D AE1 B
ABINGDON AE1 B IH0 NG D AH0 N
ABINGDON'S AE1 B IH0 NG D AH0 N Z
ABINGER AE1 B IH0 NG ER0
In addition, to broaden the scenarios in which the invention can be applied, English brand names, English person names, and the like can be added to the English pronunciation dictionary library.
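A minimal sketch of how such per-language pronunciation dictionaries might be queried is shown below; the dictionary contents are only the sample entries quoted above plus an assumed entry for the name "Sam", and the function names are hypothetical.

```python
# Tiny excerpts standing in for full pronunciation dictionary libraries.
CHINESE_DICT = {
    "穿过": ["ch", "uan1", "g", "uo4"],
    "传": ["ch", "uan2"],
}
ENGLISH_DICT = {
    "ABILITY": ["AH0", "B", "IH1", "L", "AH0", "T", "IY0"],
    "SAM": ["S", "AE1", "M"],  # added name, as suggested above, to widen coverage
}

def lookup_phonemes(token, language):
    """Return the phoneme sequence of a word or character from the dictionary of its language."""
    if language == "zh":
        return CHINESE_DICT.get(token, [])
    return ENGLISH_DICT.get(token.upper(), [])

print(lookup_phonemes("Sam", "en"))  # ['S', 'AE1', 'M']
```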
For example, the target speech data may be analyzed and recognized by a pre-trained acoustic model, which may include a hidden Markov model (HMM) combined with a Gaussian mixture model (GMM), and a deep neural network (DNN) model, to obtain the pronunciation period of each of the phonemes. For example, the target voice data may be divided into frames to obtain a plurality of audio frames; acoustic features are extracted from each audio frame and input into the pre-trained acoustic model, so that the acoustic model predicts the probabilities of the candidate phonemes in each audio frame. The phoneme sequence corresponding to the target voice data is then determined from the candidate-phoneme probabilities of each audio frame together with the phonemes obtained by recognizing the target text data, and the pronunciation period of each phoneme is obtained from its pronunciation start time and pronunciation end time in that sequence.
The extracted acoustic features may be Mel-frequency cepstral coefficients (MFCCs). Studies of human auditory perception show that hearing concentrates on certain specific frequency regions rather than the entire spectrum; MFCCs are designed around these auditory characteristics and are therefore well suited to speech recognition scenarios.
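As an illustration only, the framing and MFCC extraction described above could be sketched as follows, here using the open-source librosa library (an assumption; the patent does not name a toolkit) with 25 ms frames and a 10 ms hop at 16 kHz:

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load speech at 16 kHz, split it into frames, and return one MFCC vector per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=400,        # 25 ms analysis window at 16 kHz
        hop_length=160,   # 10 ms hop between frames
    )
    return mfcc.T  # shape: (num_frames, n_mfcc), fed to the acoustic model frame by frame
```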
The acoustic model may be trained on a mixed corpus, which may include a corpus for each language, such as a Chinese corpus, an English corpus, and a French corpus. The Chinese corpus may use the open-source AISHELL corpus, recorded by 400 Chinese speakers from different dialect regions, with 16000 Hz audio totaling about 170 hours of speech and a text transcript for each utterance. The English corpus may use the Ireland English Dialect speech data set, consisting of English sentences recorded by volunteers with different dialects; its 48000 Hz recordings are converted to 16000 Hz for training, and a text transcript is provided for each utterance.
In a specific implementation, a training sample set may be constructed by drawing data from the mixed corpus at a first proportion, where each sample in the training sample set includes speech data and the corresponding text data. The phonemes contained in the text data of each training sample are obtained from the pronunciation dictionary libraries of the respective languages, and acoustic features are extracted from the speech data of each training sample and recognized to obtain the pronunciation duration of each of those phonemes. The phonemes contained in each training sample and their pronunciation durations are used as the label of that sample, the labeled training sample set is used to train the model, and the model is optimized backwards through a loss function to obtain the acoustic model. For example, the loss function may be a cross-entropy loss function.
In addition, data can be extracted from the mixed corpus according to a second proportion to construct a test sample set, the test sample set is utilized to perform performance test on the trained acoustic model, and if the test result meets the requirement, the trained acoustic model is put into use. The first proportion and the second proportion can be set according to actual requirements or experience, for example, the first proportion can be set to 80%, and the second proportion can be set to 5%.
In a specific embodiment, the recognized pronunciation period of each phoneme may include a pronunciation start time and a pronunciation end time. Taking the target text data "你好" as an example, the obtained phonemes and the pronunciation period of each phoneme may be as shown in Table 1 below:
TABLE 1 (phonemes of "你好" and the pronunciation start time and end time of each phoneme)
The data shown in table 1 is merely illustrative, and does not constitute a final limitation of actual data processing.
Step 103, determining the language to which each phoneme belongs.
For example, the pronunciation dictionary library of each language may be queried with each phoneme, and the language of the pronunciation dictionary library that matches a phoneme is determined as the language to which that phoneme belongs; a pronunciation dictionary library matches a phoneme if it contains that phoneme. For example, if a phoneme is contained in the Chinese pronunciation dictionary library, the phoneme can be determined to belong to Chinese; likewise, if a phoneme is contained in the English pronunciation dictionary library, it can be determined to belong to English.
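A minimal sketch of this lookup, assuming the dictionary libraries have been flattened into per-language phoneme sets (the sets below are illustrative fragments, not the full libraries):

```python
# Which pronunciation dictionary library contains the phoneme decides its language.
PHONEMES_BY_LANGUAGE = {
    "zh": {"n", "in", "h", "ao", "ch", "uan1", "uan2"},
    "en": {"S", "AE1", "M", "AH0", "B", "IH1", "L"},
}

def language_of(phoneme):
    for language, phonemes in PHONEMES_BY_LANGUAGE.items():
        if phoneme in phonemes:
            return language
    return None

print([language_of(p) for p in ["n", "in", "h", "ao", "S", "AE1", "M"]])
# ['zh', 'zh', 'zh', 'zh', 'en', 'en', 'en']
```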
Step 104, querying the mouth shape configuration table of the language to which each phoneme belongs to obtain the mouth shape configured for each phoneme.
Specifically, a mouth shape configuration table may be created in advance for each language, where the table created for a language contains all phonemes of that language and the mouth shape corresponding to each phoneme. After each phoneme contained in the target sentence is obtained, the mouth shape configuration table of the corresponding language can be queried according to the language of each phoneme, thereby obtaining the mouth shape configured for that phoneme. For example, if the language to which a phoneme belongs is Chinese, the Chinese mouth shape configuration table can be queried to obtain the Chinese mouth shape configured for the phoneme; if the language is English, the English mouth shape configuration table can be queried to obtain the English mouth shape configured for the phoneme.
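A sketch of such per-language mouth shape configuration tables and their lookup is shown below; the Chinese entries reuse the mapping given in the worked example later (phoneme n to mouth shape five, in to eight, h to six, ao to three), while the English entries are purely hypothetical placeholders.

```python
# Per-language mouth shape configuration tables (illustrative fragments).
MOUTH_SHAPE_TABLES = {
    "zh": {"n": "mouth_shape_5", "in": "mouth_shape_8", "h": "mouth_shape_6", "ao": "mouth_shape_3"},
    "en": {"AE1": "mouth_shape_ai", "AY1": "mouth_shape_ai"},  # phonemes sharing a viseme share a shape
}

def mouth_shape_for(phoneme, language):
    """Query the mouth shape configuration table of the phoneme's language."""
    return MOUTH_SHAPE_TABLES[language].get(phoneme)

print([mouth_shape_for(p, "zh") for p in ["n", "in", "h", "ao"]])
# ['mouth_shape_5', 'mouth_shape_8', 'mouth_shape_6', 'mouth_shape_3']
```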
Step 105, driving the avatar according to the corresponding mouth shape in the pronunciation period of each phoneme to generate mouth shape animation.
In the embodiment of the invention, the voice data (that is, the target voice data) and the text data (that is, the target text data) of a target sentence composed of different languages can be analyzed and recognized to obtain each phoneme contained in the target text data and the pronunciation period of each of those phonemes, and the language to which each phoneme belongs is determined; the mouth shape configuration table of that language is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven with the corresponding mouth shape during the pronunciation period of each phoneme to generate the mouth shape animation. Because the phonemes of the different languages contained in the target text data are recognized, and the mouth shape configuration table of each language is queried to obtain mouth shapes of the corresponding languages for those phonemes, the avatar is driven with mouth shapes of the different languages and its mouth shapes vary richly. This avoids the problems caused by substituting Chinese mouth shapes for the mouth shapes of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth movement, and stiff, unnatural expression; it improves how closely the avatar's mouth shapes fit the expressed sentence and makes the avatar's expression smoother and more natural.
The following describes a method for creating a mouth shape configuration table according to an embodiment of the present invention, as shown in fig. 2, the method may include the following steps:
In step 201, the phonemes of each language are collected and the speech viseme of each phoneme is determined.
Illustratively, the languages may include Chinese, English, French, Russian, and so on. Each language includes a plurality of phonemes, a phoneme being the smallest phonetic unit of speech. For example, Chinese includes phonemes such as b, p, m, f, z, c, s, a, o, e, i, u, and ü, and English includes phonemes such as L, R, S, F, V, CH, SH, and ZH.
A speech viseme represents the position of the face and mouth when a character or word is spoken; it is the visual counterpart of a phoneme, the basic acoustic unit that forms words, and the basic visual building block of speech. In each language, every phoneme has a corresponding speech viseme that represents the mouth shape when the phoneme is pronounced.
Step 202, categorizing phonemes with the same speech viseme in each language.
In a specific implementation, different phonemes may have the same speech viseme, and the phonemes of a language that share the same speech viseme can be grouped into one class. For example, among the phonemes of Chinese, if the phonemes in, ing, and ie all have the speech viseme "in", then in, ing, and ie can be grouped into one class. As another example, among the phonemes of English, if the phonemes AE and AY both have the speech viseme "ai", they are grouped into one class.
Step 203, configuring a mouth shape for each class of phonemes in each language, thereby obtaining the mouth shape configuration table created for each language.
The same mouth shape can be configured for the phonemes of a language that share the same speech viseme. For example, since the Chinese phonemes in, ing, and ie share the same speech viseme, the same mouth shape can be configured for in, ing, and ie; since the English phonemes AE and AY share the same speech viseme, the same mouth shape can be configured for AE and AY. In a specific implementation, each configured mouth shape can be produced by combining various deformers (blendshapes).
In a specific embodiment, taking languages that include Chinese and English as an example, the mouth shape configuration table created for Chinese may be as shown in Fig. 3, and the one created for English as shown in Fig. 4. The mouth shape configuration table created for each language can include items such as a mouth shape identifier, phonemes, speech viseme, and mouth shape. Note that the phonemes, speech visemes, mouth shapes, and so on in Fig. 3 and Fig. 4 are merely examples and do not constitute a final limitation on the actual configuration.
With this method of creating the mouth shape configuration table, the phonemes of each language that share the same speech viseme are grouped into classes and the table is created per class, which simplifies data processing and improves creation efficiency.
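The grouping-then-configuring procedure of steps 201 to 203 could be sketched as follows; the viseme assignments reuse the examples above, and the mouth shape names are hypothetical.

```python
from collections import defaultdict

# Phoneme -> speech viseme assignments (illustrative, from the examples above).
PHONEME_VISEMES = {
    "zh": {"in": "in", "ing": "in", "ie": "in"},
    "en": {"AE": "ai", "AY": "ai"},
}

def build_mouth_shape_table(language, mouth_shape_per_viseme):
    """Group phonemes by shared viseme, then give every phoneme in a class the same mouth shape."""
    classes = defaultdict(list)
    for phoneme, viseme in PHONEME_VISEMES[language].items():
        classes[viseme].append(phoneme)
    table = {}
    for viseme, phonemes in classes.items():
        for phoneme in phonemes:
            table[phoneme] = mouth_shape_per_viseme[viseme]
    return table

print(build_mouth_shape_table("zh", {"in": "mouth_shape_8"}))
# {'in': 'mouth_shape_8', 'ing': 'mouth_shape_8', 'ie': 'mouth_shape_8'}
```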
The following describes the avatar driving method according to the embodiment of the present invention, as shown in fig. 5, that is, step 105 in fig. 1 may specifically include the following steps:
In step 1051, the multi-dimensional state vector of the mouth shape configured for the current phoneme is determined, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme is determined.
Specifically, the previous phoneme is the phoneme whose pronunciation period lies immediately before, and adjacent to, the pronunciation period of the current phoneme. For example, if phoneme 1, phoneme 2, and phoneme 3 are ordered on the time axis by pronunciation time, then when the current phoneme is phoneme 2 its previous phoneme is phoneme 1, and when the current phoneme is phoneme 3 its previous phoneme is phoneme 2. In addition, if the current phoneme is the first phoneme in the time ordering, its previous phoneme can be regarded as empty and the multi-dimensional state vector of the previous mouth shape taken as 0.
Each mouth shape may be represented by a multi-dimensional state vector, which includes state vectors in a plurality of dimensions; the state vector of each dimension represents the state characteristic value of one part that makes up the mouth shape. The dimensions may include, for example, the upper lip, the lower lip, the tongue tip, the tongue position, and the tongue surface, and the values of the multi-dimensional state vectors of different mouth shapes may differ.
Step 1052, using an easing function to compute, from the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, the multi-dimensional state vectors at each moment within the pronunciation period of the current phoneme.
Illustratively, the easing function in embodiments of the present invention may be an ease-out easing function, which may be as follows:
f(x_i) = -x_i^2 + 2x_i
where x_i denotes the i-th moment within the pronunciation period of the current phoneme, and f(x_i) denotes the rate of vector change at the i-th moment within that period. For example, if the pronunciation period of the current phoneme corresponds to a pronunciation duration of 5 seconds, the i-th moment can be the 1st, 2nd, 3rd, 4th, or 5th second. This easing function is a variable-speed function: the change is fast at the beginning, which looks smooth, and then gradually slows down, so the viewer does not perceive any stutter.
To simplify the calculation, the pronunciation duration of the current phoneme may be determined from its pronunciation start time and pronunciation end time, each moment x_i within the pronunciation period of the current phoneme may be normalized by that duration, and the normalized x_i is then substituted into the easing function.
Specifically, the easing function can be used to compute the rate of vector change at each moment within the pronunciation period of the current phoneme. From the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, the vector difference between the two mouth shapes in each dimension is computed. Then, from the per-dimension vector differences, the rate of vector change at each moment within the pronunciation period of the current phoneme, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, the vector in each dimension at each moment within the pronunciation period is computed; the per-dimension vectors at each moment together form the multi-dimensional state vector at that moment.
For example, the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme may be combined according to the following formula to obtain the multi-dimensional state vectors at each moment within the pronunciation period of the current phoneme:
E_ij = Δ_j · f(x_i) + s_j
E_i = (E_i1, E_i2, ..., E_ij)
where E_ij denotes the vector in the j-th dimension at the i-th moment within the pronunciation period of the current phoneme, Δ_j denotes the difference in the j-th dimension between the multi-dimensional state vector of the mouth shape configured for the current phoneme and that of the mouth shape configured for the previous phoneme, s_j denotes the vector in the j-th dimension of the multi-dimensional state vector of the mouth shape configured for the previous phoneme, and E_i denotes the multi-dimensional state vector at the i-th moment within the pronunciation period of the current phoneme.
Analyzing the formula E_ij = Δ_j · f(x_i) + s_j shows that when Δ_j > 0 the vector follows the change rule shown in diagram (a) of Fig. 6, and when Δ_j < 0 it follows the rule shown in diagram (b) of Fig. 6. Reflected in the animation, a mouth shape changes quickly when it begins to change and then eases gradually into the next mouth shape; this design avoids the impression that a mouth shape stops abruptly, prevents a stuttering effect, and makes the animation transition smoother.
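A minimal sketch of this interpolation, directly following f(x) = -x^2 + 2x and E_ij = Δ_j · f(x_i) + s_j (the function and variable names below are illustrative, not the patent's):

```python
def ease_out(x):
    """Ease-out easing function f(x) = -x^2 + 2x for normalized time x in [0, 1]."""
    return -x * x + 2 * x

def state_vector_at(t, duration, prev_shape, curr_shape):
    """Multi-dimensional state vector E_i at time t (seconds) within the current phoneme's
    pronunciation period, interpolating from the previous mouth shape to the current one."""
    x = t / duration              # normalize the moment by the pronunciation duration
    rate = ease_out(x)            # rate of vector change f(x_i) at this moment
    return [s + (c - s) * rate    # E_ij = s_j + Delta_j * f(x_i)
            for s, c in zip(prev_shape, curr_shape)]
```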
Step 1053, providing the multi-dimensional state vector at each moment within the pronunciation period of the current phoneme to the deformers of the corresponding dimensions, so that the deformers drive the avatar and generate the mouth shape animation.
Each dimension of the multi-dimensional state vector corresponds to a deformer, and each deformer is used to drive the corresponding part of the avatar. For example, if the multi-dimensional state vector contains state vectors for three dimensions, the upper lip, the lower lip, and the tongue tip, each of these corresponds to one deformer; when the multi-dimensional state vector at a given moment has been computed, the upper-lip component is provided to the deformer of the upper lip, the lower-lip component to the deformer of the lower lip, and the tongue-tip component to the deformer of the tongue tip, so that each deformer drives the corresponding part of the avatar according to the vector of its dimension, thereby generating the mouth shape animation. In addition, when the mouth shape animation is generated, it can be played in sync with the target voice data, presenting the animation effect of the avatar expressing the target voice data.
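As an illustration only, passing one such state vector to the deformers might look like the sketch below; avatar.set_blendshape is a hypothetical engine call (the actual deformer API depends on the animation engine), and the dimension names are assumptions.

```python
DIMENSIONS = ["upper_lip", "lower_lip", "tongue_tip"]

def apply_state_vector(avatar, state_vector):
    """Send each dimension of a multi-dimensional state vector to the deformer of that dimension."""
    for name, value in zip(DIMENSIONS, state_vector):
        avatar.set_blendshape(name, value)  # each deformer drives its own part of the avatar

# Called once per animation frame with the state vector computed for that moment,
# e.g. apply_state_vector(avatar, [39.2, 27.2, 79.2]) for the example worked out below.
```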
Computing the multi-dimensional state vector at each moment within the pronunciation period of the current phoneme with the easing function provided by the embodiment of the invention, and driving the avatar accordingly, makes the avatar's mouth shape changes more realistic and natural and improves the animation display effect. In practical applications, other types of easing functions, such as a linear easing function, may also be used to compute the multi-dimensional state vectors at each moment within the pronunciation period of the current phoneme; this is not specifically limited here.
In the following, the avatar driving method provided by the embodiment of the present invention is described with a specific example. Take the target text data "你好", with the pronunciation period of each phoneme recognized as shown in Table 1; each of these phonemes is a Chinese phoneme. According to Fig. 3, phoneme n corresponds to mouth shape five, phoneme in to mouth shape eight, phoneme h to mouth shape six, and phoneme ao to mouth shape three. That is, for the avatar to express "你好", the avatar's mouth must be driven to change through mouth shape five, mouth shape eight, mouth shape six, and mouth shape three in sequence.
For example, suppose the current phoneme is ao, whose configured mouth shape is mouth shape three, and the mouth shape configured for the previous phoneme is mouth shape six; the multi-dimensional state vector of mouth shape six is [20, 40, 60], the multi-dimensional state vector of mouth shape three is [50, 20, 90], and the pronunciation duration of the current phoneme is 5 seconds. The multi-dimensional state vector at the 2nd second of the pronunciation period of the current phoneme is calculated as follows:
Normalized pronunciation moment (2nd second): x = 2 / 5 = 0.4, so f(0.4) = -0.16 + 0.8 = 0.64;
State vector of the first dimension at the 2nd second: 30 × 0.64 + 20 = 39.2;
State vector of the second dimension at the 2nd second: (-20) × 0.64 + 40 = 27.2;
State vector of the third dimension at the 2nd second: 30 × 0.64 + 60 = 79.2;
That is, the multi-dimensional state vector at the 2nd second of the pronunciation period of the phoneme ao is (39.2, 27.2, 79.2). Assuming the dimensions of this multi-dimensional state vector correspond to the upper lip, the lower lip, and the tongue tip respectively, 39.2 can be provided to the deformer of the upper lip, 27.2 to the deformer of the lower lip, and 79.2 to the deformer of the tongue tip, so that each deformer drives the corresponding part of the avatar according to the vector of its dimension.
The other moments within the pronunciation period of the current phoneme, and the moments within the pronunciation periods of the other phonemes, can be calculated in the same way, giving the multi-dimensional state vector at every moment of every phoneme's pronunciation period. Providing the state vector of each dimension to the deformer of that dimension, and letting the deformers drive the avatar in time order, achieves the animation effect of the avatar expressing "你好".
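The worked example can be checked with a few lines of Python (a standalone sketch reusing the ease-out function defined above):

```python
def ease_out(x):
    return -x * x + 2 * x

prev_shape, curr_shape = [20, 40, 60], [50, 20, 90]  # mouth shape six -> mouth shape three
x = 2 / 5                                            # 2nd second of a 5-second pronunciation period
state = [s + (c - s) * ease_out(x) for s, c in zip(prev_shape, curr_shape)]
print(state)  # approximately [39.2, 27.2, 79.2] (up to floating-point rounding), matching the values above
```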
Fig. 7 is a block diagram of an animation generating apparatus according to an embodiment of the present invention, which is adapted to perform the animation generating method according to the embodiment of the present invention. As shown in fig. 7, the apparatus may specifically include:
an obtaining module 401, configured to obtain target voice data and target text data corresponding to the target voice data, where the target voice data includes voice data of different languages;
a recognition module 402, configured to perform analysis and recognition on the target text data to obtain each phoneme included in the target text data, and to perform analysis and recognition on the target speech data to obtain the pronunciation period of each of those phonemes;
a determining module 403, configured to determine a language to which each phoneme belongs;
a query module 404, configured to query a mouth shape configuration table of a language to which each phoneme belongs, to obtain a mouth shape configured for each phoneme;
a generating module 405 for driving the avatar according to the corresponding mouth shape in the pronunciation period of each phoneme to generate a mouth shape animation.
In one embodiment, the generating module 405 is specifically configured to:
determining a multi-dimensional state vector of a mouth shape configured for a current phoneme, and determining a multi-dimensional state vector of a mouth shape configured for a previous phoneme, wherein the previous phoneme is a phoneme with a pronunciation period before and adjacent to the pronunciation period of the current phoneme;
calculating, by using an easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, to obtain the multi-dimensional state vectors at each moment within the pronunciation period of the current phoneme;
And providing the multidimensional state vector of each moment in the pronunciation period of the current phoneme to a deformer with corresponding dimension so as to drive the avatar by using the deformer with corresponding dimension to generate the mouth shape animation.
In one embodiment, the easing function includes an ease-out easing function.
In one embodiment, the generating module 405 calculates, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, so as to obtain the multi-dimensional state vectors at each moment within the pronunciation period of the current phoneme, including:
calculating, by using the easing function, the rate of vector change at each moment within the pronunciation period of the current phoneme;
Calculating vector differences in each dimension of the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme according to the multidimensional state vector of the mouth shape configured for the current phoneme and the multidimensional state vector of the mouth shape configured for the previous phoneme;
Calculating vectors of each dimension at each time in the pronunciation period of the current phoneme according to vector differences of the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme at each dimension, vector change rates of each time in the pronunciation period of the current phoneme and multi-dimensional state vectors of the mouth shape configured for the previous phoneme;
And determining the multidimensional state vector at each moment in the pronunciation period of the current phoneme according to the vector at each moment in each dimension in the pronunciation period of the current phoneme.
In one embodiment, the pronunciation period of the current phoneme includes a pronunciation start time and a pronunciation end time of the current phoneme, and before calculating the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme by using the easing function, the generating module 405 is further configured to:
determining the pronunciation duration of the current phoneme according to the pronunciation start time and the pronunciation end time of the current phoneme;
And normalizing each moment in the pronunciation period of the current phoneme according to the pronunciation time length of the current phoneme.
In one embodiment, the determining module 403 is specifically configured to:
Inquiring a pronunciation dictionary base of each language based on each phoneme;
And determining the language corresponding to the pronunciation dictionary library matched with each phoneme as the language to which the corresponding phoneme belongs.
In one embodiment, the apparatus further comprises:
and a creation module, used for collecting the phonemes of each language, grouping the phonemes of each language that have the same speech viseme into classes, and configuring a mouth shape for each class of phonemes in each language, so as to obtain the mouth shape configuration table created for each language.
In one embodiment, the apparatus further comprises:
The preprocessing module is used for dividing the target text data according to languages to obtain divided text data; preprocessing the segmented text data to obtain text data to be identified;
The recognition module 402 performs analysis and recognition on the target text data to obtain each phoneme included in the target text data, where the method includes:
and analyzing and identifying the text data to be identified to obtain each phoneme included in the target text data.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above. The specific working process of the functional module described above may refer to the corresponding process in the foregoing method embodiment, and will not be described herein.
The device of the embodiment of the invention can analyze and recognize the voice data (that is, the target voice data) and the text data (that is, the target text data) of a target sentence composed of different languages to obtain each phoneme contained in the target text data and the pronunciation period of each of those phonemes, and determine the language to which each phoneme belongs; the mouth shape configuration table of that language is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven with the corresponding mouth shape during the pronunciation period of each phoneme to generate the mouth shape animation. Because the phonemes of the different languages contained in the target text data are recognized, and the mouth shape configuration table of each language is queried to obtain mouth shapes of the corresponding languages for those phonemes, the avatar is driven with mouth shapes of the different languages and its mouth shapes vary richly. This avoids the problems caused by substituting Chinese mouth shapes for the mouth shapes of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth movement, and stiff, unnatural expression; it improves how closely the avatar's mouth shapes fit the expressed sentence and makes the avatar's expression smoother and more natural.
The embodiment of the invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the animation generation method provided by any embodiment when executing the program.
The embodiment of the invention also provides a computer readable medium, on which a computer program is stored, the program, when executed by a processor, implementing the animation generation method provided in any of the above embodiments.
Referring now to FIG. 8, there is illustrated a schematic diagram of a computer system 500 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 8 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the invention.
As shown in fig. 8, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output portion 507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as needed so that a computer program read therefrom is mounted into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509, and/or installed from the removable media 511. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 501.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units involved in the embodiments of the present invention may be implemented in software, or may be implemented in hardware. The described modules and/or units may also be provided in a processor, e.g., may be described as: a processor includes an acquisition module, an identification module, a determination module, a query module, and a generation module. The names of these modules do not constitute a limitation on the module itself in some cases.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to include:
Acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages; analyzing and recognizing the target text data to obtain each phoneme contained in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each of those phonemes; determining the language to which each phoneme belongs; querying the mouth shape configuration table of the language to which each phoneme belongs to obtain the mouth shape configured for each phoneme; and driving the avatar with the corresponding mouth shape during the pronunciation period of each phoneme to generate the mouth shape animation.
According to the technical scheme of the embodiment of the present invention, the voice data (namely, the target voice data) and the text data (namely, the target text data) of a target sentence composed of words of different languages can be analyzed and recognized to obtain each phoneme contained in the target text data and the pronunciation period of each phoneme, and the language to which each phoneme belongs is determined; the mouth shape configuration table of that language is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape during the pronunciation period of each phoneme to generate a mouth shape animation. In the embodiment of the present invention, phonemes of different languages contained in the target text data can be recognized, and the mouth shape configuration table of each language can be queried to obtain the mouth shapes of the corresponding language configured for those phonemes, so that the avatar is driven by mouth shapes of the different languages. The avatar's mouth shapes therefore vary richly, avoiding the problems that arise when Chinese mouth shapes are substituted for the mouth shapes of other languages to drive the avatar, namely mouth shapes that do not match the interactive sentences, vary monotonously, and appear stiff and unnatural. This improves how closely the avatar's mouth shapes fit the spoken sentences and makes the avatar's expression smoother and more natural.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. An animation generation method, comprising:
acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
analyzing and recognizing the target text data to obtain each phoneme contained in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each phoneme;
determining the language to which each phoneme belongs;
querying a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme; and
driving the avatar according to the corresponding mouth shape during the pronunciation period of each phoneme to generate a mouth shape animation;
wherein driving the avatar according to the corresponding mouth shape during the pronunciation period of each phoneme to generate the mouth shape animation comprises:
determining a multi-dimensional state vector of the mouth shape configured for a current phoneme and a multi-dimensional state vector of the mouth shape configured for a previous phoneme, wherein the previous phoneme is the phoneme whose pronunciation period immediately precedes that of the current phoneme, the multi-dimensional state vector comprises state vectors of multiple dimensions, and the state vector of each dimension represents a state characteristic value of a specific part forming the mouth shape;
calculating, by using an easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme to obtain multi-dimensional state vectors at respective times within the pronunciation period of the current phoneme; and
providing the multi-dimensional state vector at each time within the pronunciation period of the current phoneme to a deformer of the corresponding dimension, so as to drive the avatar with the deformer of the corresponding dimension to generate the mouth shape animation.
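A minimal sketch of the last step of claim 1: each dimension of a mouth shape's state vector is fed to a deformer of the same dimension. The dimension names (jaw_open, lip_stretch, lip_round) and the Deformer class are illustrative stand-ins, not names from the patent or any real engine API; the claim only says each dimension describes the state of one specific part of the mouth.

```python
from typing import Dict

class Deformer:
    """Stand-in for one deformer (blend shape) dimension on the avatar's face."""
    def __init__(self, name: str):
        self.name = name
        self.weight = 0.0

    def set_weight(self, value: float) -> None:
        self.weight = value  # a real engine would deform the face mesh here

# Illustrative mouth shapes expressed as multi-dimensional state vectors.
MOUTH_SHAPE_VECTORS: Dict[str, Dict[str, float]] = {
    "zh_I":  {"jaw_open": 0.2, "lip_stretch": 0.8, "lip_round": 0.0},
    "en_AY": {"jaw_open": 0.7, "lip_stretch": 0.5, "lip_round": 0.1},
}

def apply_state_vector(state: Dict[str, float], deformers: Dict[str, Deformer]) -> None:
    """Provide each dimension of the state vector to the deformer of that dimension."""
    for dim, value in state.items():
        deformers[dim].set_weight(value)

deformers = {dim: Deformer(dim) for dim in ("jaw_open", "lip_stretch", "lip_round")}
apply_state_vector(MOUTH_SHAPE_VECTORS["en_AY"], deformers)
print({d.name: d.weight for d in deformers.values()})
```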
2. The animation generation method of claim 1, wherein the easing function comprises an ease-out function.
3. The animation generation method according to claim 2, wherein calculating, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme to obtain the multi-dimensional state vectors at respective times within the pronunciation period of the current phoneme comprises:
calculating, by using the easing function, the vector change rate at each time within the pronunciation period of the current phoneme;
calculating, according to the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, the vector difference in each dimension between the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme;
calculating the vector of each dimension at each time within the pronunciation period of the current phoneme according to the vector difference in each dimension between the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme, the vector change rate at each time within the pronunciation period of the current phoneme, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme; and
determining the multi-dimensional state vector at each time within the pronunciation period of the current phoneme according to the vector of each dimension at each time within the pronunciation period of the current phoneme.
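A small sketch of the interpolation described in claims 2 and 3: the easing function gives a change rate at each (normalized) time, the per-dimension difference between the current and previous mouth shapes is scaled by that rate, and the result is added to the previous shape. The cubic ease-out curve used here is an assumption; the claims only require an ease-out easing function.

```python
from typing import Dict, List

def ease_out(t: float) -> float:
    """Cubic ease-out: changes quickly at first and slows toward the end (t in [0, 1]).
    The exact curve is an assumption, not specified by the claims."""
    return 1.0 - (1.0 - t) ** 3

def interpolate_mouth_states(
    prev_shape: Dict[str, float],
    cur_shape: Dict[str, float],
    normalized_times: List[float],
) -> List[Dict[str, float]]:
    """For each normalized time in the current phoneme's pronunciation period,
    compute the state vector of every dimension as
        value = prev + (cur - prev) * ease_out(t)."""
    states = []
    for t in normalized_times:
        rate = ease_out(t)  # vector change rate at time t
        states.append({
            dim: prev_shape[dim] + (cur_shape[dim] - prev_shape[dim]) * rate
            for dim in cur_shape
        })
    return states

prev_shape = {"jaw_open": 0.2, "lip_stretch": 0.8}
cur_shape = {"jaw_open": 0.7, "lip_stretch": 0.5}
print(interpolate_mouth_states(prev_shape, cur_shape, [0.0, 0.5, 1.0]))
```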
4. The animation generation method according to claim 2, wherein the pronunciation period of the current phoneme includes a pronunciation start time and a pronunciation end time of the current phoneme, and the method further comprises, before calculating, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme to obtain the multi-dimensional state vectors at respective times within the pronunciation period of the current phoneme:
determining the pronunciation duration of the current phoneme according to the pronunciation start time and the pronunciation end time of the current phoneme;
normalizing each time within the pronunciation period of the current phoneme according to the pronunciation duration of the current phoneme.
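A short sketch of the normalization in claim 4: sample times within the pronunciation period and divide each offset by the pronunciation duration so the easing function receives input in [0, 1]. The frame rate and function name are hypothetical.

```python
from typing import List

def normalized_frame_times(start: float, end: float, fps: int = 30) -> List[float]:
    """Sample frame times within [start, end] and normalize each one by the
    pronunciation duration (end - start)."""
    duration = end - start
    n_frames = max(1, round(duration * fps))
    frame_times = [start + i * duration / n_frames for i in range(n_frames + 1)]
    return [(t - start) / duration for t in frame_times]

print(normalized_frame_times(0.40, 0.60))  # a 0.2 s phoneme sampled at 30 fps
```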
5. The animation generation method of any one of claims 1 to 4, wherein determining the language to which each phoneme belongs comprises:
querying a pronunciation dictionary of each language based on each phoneme; and
determining the language corresponding to the pronunciation dictionary that matches each phoneme as the language to which that phoneme belongs.
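A minimal sketch of the dictionary lookup in claim 5, assuming each language's pronunciation dictionary can be reduced to a set of phoneme symbols. The dictionaries shown are illustrative fragments; if a symbol appeared in more than one dictionary, the per-language text segmentation of claim 7 could be used to disambiguate, which is an assumption rather than something the claim states.

```python
from typing import Dict, Optional, Set

def language_of_phoneme(phoneme: str, dictionaries: Dict[str, Set[str]]) -> Optional[str]:
    """Query each language's pronunciation dictionary and return the language
    whose dictionary contains the phoneme, or None if no dictionary matches."""
    for language, phoneme_set in dictionaries.items():
        if phoneme in phoneme_set:
            return language
    return None

# Illustrative dictionaries only; real ones hold the full phoneme inventory.
dictionaries = {"zh": {"zh", "ch", "sh", "i", "u"}, "en": {"AY", "HH", "TH"}}
print(language_of_phoneme("TH", dictionaries))  # -> "en"
```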
6. The animation generation method according to any one of claims 1 to 4, wherein the mouth shape configuration table is created by:
collecting the phonemes of each language, and determining the viseme of each phoneme of each language;
grouping phonemes with the same viseme within each language into one class; and
configuring a mouth shape for each class of phonemes in each language, thereby obtaining a mouth shape configuration table created for each language.
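A sketch of how such a table could be built per claim 6: phonemes sharing a viseme form one class, each class is assigned one mouth shape, and the result is flattened into a phoneme-to-mouth-shape table for one language. The phoneme-to-viseme mapping and mouth shape names below are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict

def build_mouth_shape_table(
    phoneme_to_viseme: Dict[str, str],
    viseme_to_mouth_shape: Dict[str, str],
) -> Dict[str, str]:
    """Group phonemes that share a viseme into one class, configure one mouth
    shape per class, and flatten the result into a phoneme -> mouth shape table."""
    classes = defaultdict(list)
    for phoneme, viseme in phoneme_to_viseme.items():
        classes[viseme].append(phoneme)

    table = {}
    for viseme, phonemes in classes.items():
        mouth_shape = viseme_to_mouth_shape[viseme]
        for phoneme in phonemes:
            table[phoneme] = mouth_shape
    return table

# Illustrative English mapping: P, B and M all look like closed lips.
phoneme_to_viseme = {"P": "PP", "B": "PP", "M": "PP", "AY": "AA", "AA": "AA"}
viseme_to_mouth_shape = {"PP": "lips_closed", "AA": "jaw_wide_open"}
print(build_mouth_shape_table(phoneme_to_viseme, viseme_to_mouth_shape))
```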
7. The animation generation method of any one of claims 1 to 4, further comprising, prior to analyzing and recognizing the target text data:
segmenting the target text data by language to obtain segmented text data; and
preprocessing the segmented text data to obtain text data to be recognized;
wherein analyzing and recognizing the target text data to obtain each phoneme contained in the target text data comprises:
analyzing and recognizing the text data to be recognized to obtain each phoneme contained in the target text data.
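One plausible way to do the segmentation and preprocessing of claim 7, using Unicode script ranges to split mixed Chinese/English text into language runs; this is an assumption about implementation, not something the claim specifies, and the preprocessing shown is minimal.

```python
import re
from typing import List, Tuple

def split_by_language(text: str) -> List[Tuple[str, str]]:
    """Split mixed Chinese/English text into (language, run) pairs.
    Script detection via Unicode ranges is one plausible implementation."""
    runs = []
    for match in re.finditer(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z' ]*", text):
        run = match.group()
        language = "zh" if re.match(r"[\u4e00-\u9fff]", run) else "en"
        runs.append((language, run.strip()))
    return runs

def preprocess(run: str, language: str) -> str:
    """Minimal cleanup before phoneme analysis: lower-case English, strip spaces."""
    return run.lower().strip() if language == "en" else run.strip()

segments = split_by_language("你好 hello world 再见")
print([(lang, preprocess(run, lang)) for lang, run in segments])
```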
8. An animation generation device, comprising:
an acquisition module, configured to acquire target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
a recognition module, configured to analyze and recognize the target text data to obtain each phoneme contained in the target text data, and to analyze and recognize the target voice data to obtain the pronunciation period of each phoneme;
a determination module, configured to determine the language to which each phoneme belongs;
a query module, configured to query a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme; and
a generation module, configured to drive the avatar according to the corresponding mouth shape during the pronunciation period of each phoneme to generate a mouth shape animation;
wherein the generation module is further configured to:
determine a multi-dimensional state vector of the mouth shape configured for a current phoneme and a multi-dimensional state vector of the mouth shape configured for a previous phoneme, wherein the previous phoneme is the phoneme whose pronunciation period immediately precedes that of the current phoneme, the multi-dimensional state vector comprises state vectors of multiple dimensions, and the state vector of each dimension represents a state characteristic value of a specific part forming the mouth shape;
calculate, by using an easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme to obtain multi-dimensional state vectors at respective times within the pronunciation period of the current phoneme; and
provide the multi-dimensional state vector at each time within the pronunciation period of the current phoneme to a deformer of the corresponding dimension, so as to drive the avatar with the deformer of the corresponding dimension to generate the mouth shape animation.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the animation generation method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the animation generation method according to any of claims 1 to 7.
CN202110812403.6A 2021-07-19 2021-07-19 Animation generation method, device, electronic equipment and storage medium Active CN113539240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110812403.6A CN113539240B (en) 2021-07-19 2021-07-19 Animation generation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110812403.6A CN113539240B (en) 2021-07-19 2021-07-19 Animation generation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113539240A CN113539240A (en) 2021-10-22
CN113539240B true CN113539240B (en) 2024-06-18

Family

ID=78128611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110812403.6A Active CN113539240B (en) 2021-07-19 2021-07-19 Animation generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113539240B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420088A (en) * 2022-01-20 2022-04-29 安徽淘云科技股份有限公司 Display method and related equipment thereof
CN114581567B (en) * 2022-05-06 2022-08-02 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN115222856B (en) * 2022-05-20 2023-09-26 一点灵犀信息技术(广州)有限公司 Expression animation generation method and electronic equipment
CN115662388A (en) * 2022-10-27 2023-01-31 维沃移动通信有限公司 Avatar face driving method, apparatus, electronic device and medium
CN117275485B (en) * 2023-11-22 2024-03-12 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1604072A (en) * 2004-11-09 2005-04-06 北京中星微电子有限公司 Karaoke broadcasting method for mobile terminals
CN107945253A (en) * 2017-11-21 2018-04-20 腾讯数码(天津)有限公司 A kind of animation effect implementation method, device and storage device
CN108447474A (en) * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 A kind of modeling and the control method of virtual portrait voice and Hp-synchronization
CN112837401A (en) * 2021-01-27 2021-05-25 网易(杭州)网络有限公司 Information processing method and device, computer equipment and storage medium

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539354B1 (en) * 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
KR20040076524A (en) * 2003-02-26 2004-09-01 주식회사 메세지 베이 아시아 Method to make animation character and System for Internet service using the animation character
KR20050026774A (en) * 2003-09-09 2005-03-16 정보통신연구진흥원 The lip sync controling method
KR100902861B1 (en) * 2007-11-20 2009-06-16 경원대학교 산학협력단 Mobile communication terminal for outputting voice received text message to voice using avatar and Method thereof
CN102637071A (en) * 2011-02-09 2012-08-15 英华达(上海)电子有限公司 Multimedia input method applied to multimedia input device
CN106504304B (en) * 2016-09-14 2019-09-24 厦门黑镜科技有限公司 A kind of method and device of animation compound
CN110853614A (en) * 2018-08-03 2020-02-28 Tcl集团股份有限公司 Virtual object mouth shape driving method and device and terminal equipment
CN110874557B (en) * 2018-09-03 2023-06-16 阿里巴巴集团控股有限公司 Voice-driven virtual face video generation method and device
WO2021023869A1 (en) * 2019-08-08 2021-02-11 Universite De Lorraine Audio-driven speech animation using recurrent neutral network
CN111260761B (en) * 2020-01-15 2023-05-09 北京猿力未来科技有限公司 Method and device for generating mouth shape of animation character
CN111915707B (en) * 2020-07-01 2024-01-09 天津洪恩完美未来教育科技有限公司 Mouth shape animation display method and device based on audio information and storage medium
CN112634861B (en) * 2020-12-30 2024-07-05 北京大米科技有限公司 Data processing method, device, electronic equipment and readable storage medium
CN112734889A (en) * 2021-02-19 2021-04-30 北京中科深智科技有限公司 Mouth shape animation real-time driving method and system for 2D character
CN113112575B (en) * 2021-04-08 2024-04-30 深圳市山水原创动漫文化有限公司 Mouth shape generating method and device, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1604072A (en) * 2004-11-09 2005-04-06 北京中星微电子有限公司 Karaoke broadcasting method for mobile terminals
CN107945253A (en) * 2017-11-21 2018-04-20 腾讯数码(天津)有限公司 A kind of animation effect implementation method, device and storage device
CN108447474A (en) * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 A kind of modeling and the control method of virtual portrait voice and Hp-synchronization
CN112837401A (en) * 2021-01-27 2021-05-25 网易(杭州)网络有限公司 Information processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113539240A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113539240B (en) Animation generation method, device, electronic equipment and storage medium
CN109377539B (en) Method and apparatus for generating animation
CN107945786B (en) Speech synthesis method and device
CN109859772B (en) Emotion recognition method, emotion recognition device and computer-readable storage medium
CN107945805B (en) A kind of across language voice identification method for transformation of intelligence
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN107195296B (en) Voice recognition method, device, terminal and system
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN107972028B (en) Man-machine interaction method and device and electronic equipment
US20150325240A1 (en) Method and system for speech input
CN108428446A (en) Audio recognition method and device
CN109741732A (en) Name entity recognition method, name entity recognition device, equipment and medium
KR20060090687A (en) System and method for audio-visual content synthesis
CN114895817B (en) Interactive information processing method, network model training method and device
CN112784696A (en) Lip language identification method, device, equipment and storage medium based on image identification
Karpov An automatic multimodal speech recognition system with audio and video information
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
CN115312030A (en) Display control method and device of virtual role and electronic equipment
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
CN113689633A (en) Scenic spot human-computer interaction method, device and system
CN110782916B (en) Multi-mode complaint identification method, device and system
Bharti et al. Automated speech to sign language conversion using Google API and NLP
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
Shih et al. Speech-driven talking face using embedded confusable system for real time mobile multimedia
Babykutty et al. Development of multilingual phonetic engine for four Indian languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant