WO2019196306A1 - Speech-based mouth-shape animation synthesis apparatus and method, and readable storage medium - Google Patents
- Publication number: WO2019196306A1
- Application number: PCT/CN2018/102209
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- feature
- model
- neural network
- voice data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- the present application relates to the field of computer technologies, and in particular, to a voice-based lip animation synthesis apparatus, method, and readable storage medium.
- Speech synthesis, also known as text-to-speech technology, is a technology that converts text information into speech and reads it aloud. It involves many disciplines such as acoustics, linguistics, digital signal processing, and computer science, and is a cutting-edge technology in the field of Chinese information processing. The main problem it solves is how to convert text information into audible sound information.
- The application scenario of computer-assisted pronunciation training needs to dynamically display the speaker's mouth-shape changes while playing voice data, to help the user practice pronunciation.
- The voice data played back in this scenario is synthesized.
- For synthesized voice data, since there is no real speaker's mouth data corresponding to it, a realistic mouth-shape animation matching the synthesized voice data cannot be displayed.
- The present application provides a speech-based mouth-shape animation synthesis apparatus, method, and readable storage medium, the main purpose of which is to solve the problem that the prior art cannot display a realistic mouth-shape animation matching synthesized voice data.
- the present application provides a voice-based lip animation synthesis device, the device comprising a memory and a processor, wherein the memory stores a lip animation synthesis program executable on the processor,
- when the mouth-shape animation synthesis program is executed by the processor, the following steps are implemented:
- inputting the phoneme feature into a pre-trained deep neural network model, and outputting an acoustic feature corresponding to the phoneme feature, the acoustic feature including a Mel-frequency cepstral coefficient (MFCC) feature, a pronunciation duration, and a pronunciation fundamental frequency;
- the present application further provides a voice-based lip animation synthesis method, the method comprising:
- inputting the phoneme feature into a pre-trained deep neural network model, and outputting an acoustic feature corresponding to the phoneme feature, the acoustic feature including a Mel-frequency cepstral coefficient (MFCC) feature, a pronunciation duration, and a pronunciation fundamental frequency;
- The present application further provides a computer-readable storage medium having a mouth-shape animation synthesis program stored thereon, the mouth-shape animation synthesis program being executable by one or more processors to implement the steps of the speech-based mouth-shape animation synthesis method described above.
- FIG. 1 is a schematic diagram of a voice-based lip-shaped animation synthesizing device of the present application
- FIG. 2 is a schematic diagram of the program modules of the mouth-shape animation synthesis program in an embodiment of the speech-based mouth-shape animation synthesis apparatus of the present application;
- FIG. 3 is a flow chart of a preferred embodiment of a speech-based lip animation synthesis method of the present application.
- the present application provides a voice-based lip animation synthesis device.
- FIG. 1 is a schematic diagram of a preferred embodiment of the speech-based mouth-shape animation synthesis apparatus of the present application.
- the voice-based lip animation synthesis device may be a PC (Personal Computer), or may be a terminal device such as a smart phone, a tablet computer, or a portable computer.
- the speech-based lip animation synthesis apparatus includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
- the memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (for example, an SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
- The memory 11 may be, in some embodiments, an internal storage unit of the speech-based mouth-shape animation synthesis device, such as a hard disk of the device.
- The memory 11 may also be, in other embodiments, an external storage device of the speech-based mouth-shape animation synthesis device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the device.
- The memory 11 may also include both an internal storage unit and an external storage device of the speech-based mouth-shape animation synthesis device.
- the memory 11 can be used not only for storing application software and various types of data installed in the voice-based lip animation synthesizing device, such as code of a lip animation synthesis program, but also for temporarily storing data that has been output or is to be output.
- The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, for running program code stored in the memory 11 or processing data, for example executing the mouth-shape animation synthesis program.
- Communication bus 13 is used to implement connection communication between these components.
- The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device and other electronic devices.
- Figure 1 shows only a speech-based mouth-shape animation synthesis device having components 11-14 and a mouth-shape animation synthesis program; it should be understood that not all of the illustrated components are required, and fewer components may be implemented instead.
- The device may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and the optional user interface may further include a standard wired interface and a wireless interface.
- the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like.
- The display may also be appropriately referred to as a display screen or display unit, for displaying information processed in the speech-based mouth-shape animation synthesis device and for displaying a visualized user interface.
- A mouth-shape animation synthesis program is stored in the memory 11; when the processor 12 executes the mouth-shape animation synthesis program stored in the memory 11, the following steps are implemented:
- the phoneme feature is input into a pre-trained deep neural network model, and an acoustic feature corresponding to the phoneme feature is output, the acoustic feature including a Mel-frequency cepstral coefficient (MFCC) feature, a pronunciation duration, and a pronunciation fundamental frequency.
- the acoustic feature is input to a speech synthesizer, and speech data corresponding to the target text data is output.
- the target text data is converted into voice data through a pre-established deep neural network model, and the voice data is converted into the mouth data through a pre-established tensor model.
- The target text data to be synthesized is obtained, and the target text data is split into words or characters by a word segmentation tool; the split words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features. For Chinese, the phonemes include initial phonemes and final (vowel) phonemes.
- The phoneme features mainly include the following: the pronunciation feature of the current phoneme, the pronunciation feature of the previous phoneme, the pronunciation feature of the next phoneme, and the position of the current phoneme in the word.
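The segmentation-and-lookup step above can be sketched as follows. The toy pronunciation dictionary, the segmented word list, and the exact feature layout are illustrative assumptions, not the patent's actual data:

```python
# Toy pronunciation dictionary: for Chinese, each syllable splits into
# an initial phoneme and a final phoneme. Entries here are placeholders.
PRONUNCIATION_DICT = {
    "ni": ["n", "i"],
    "hao": ["h", "ao"],
}

def text_to_phonemes(words):
    """Split each segmented word into phonemes via the dictionary."""
    phonemes = []
    for word in words:
        phonemes.extend(PRONUNCIATION_DICT.get(word, []))
    return phonemes

def phoneme_features(phonemes, i):
    """Context features for one phoneme: previous, current, next, position."""
    return {
        "prev": phonemes[i - 1] if i > 0 else "<pad>",
        "current": phonemes[i],
        "next": phonemes[i + 1] if i + 1 < len(phonemes) else "<pad>",
        "position": i,
    }

phones = text_to_phonemes(["ni", "hao"])   # e.g. output of a word segmenter
print(phones)  # ['n', 'i', 'h', 'ao']
```

Each feature dict would then be encoded as a numeric vector before being fed to the neural network.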
- A deep neural network model for expressing the correlation between phoneme features and acoustic features is pre-trained, and the above feature vectors are input into the model to obtain the corresponding acoustic features; the acoustic features include the Mel-frequency cepstral coefficient (MFCC) features, the pronunciation duration, and the pronunciation fundamental frequency of each sound.
- The MFCC features, the pronunciation duration, and the pronunciation fundamental frequency are synthesized by a speech synthesizer to obtain a speech signal.
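As a rough illustration of the phoneme-feature-to-acoustic-feature mapping, the following feed-forward pass maps one feature vector to 13 MFCCs plus a duration and a fundamental frequency. The layer sizes and random weights are placeholders standing in for the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, N_MFCC = 32, 64, 13
D_OUT = N_MFCC + 2                      # 13 MFCCs + duration + F0

W1 = rng.normal(size=(D_IN, D_HID)) * 0.1
b1 = np.zeros(D_HID)
W2 = rng.normal(size=(D_HID, D_OUT)) * 0.1
b2 = np.zeros(D_OUT)

def predict_acoustic(phoneme_vec):
    """One forward pass: ReLU hidden layer, linear output head."""
    h = np.maximum(0.0, phoneme_vec @ W1 + b1)
    out = h @ W2 + b2
    return out[:N_MFCC], out[N_MFCC], out[N_MFCC + 1]

mfcc, duration, f0 = predict_acoustic(rng.normal(size=D_IN))
print(mfcc.shape)  # (13,)
```

The three outputs would then be handed to the speech synthesizer, as described above.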
- Before applying the deep neural network model of this embodiment, the model needs to be trained.
- A corpus is collected and a sample library is constructed based on the corpus of at least one speaker. The corpus includes voice data together with the text data and mouth-shape data corresponding to the voice data; that is, the voice data obtained by having one or more speakers read the same text data aloud, and the corresponding mouth-shape data, are collected to establish the sample library. The mouth-shape data is physiological electromagnetic articulography (EMA) data obtained by capturing mouth-shape motion-change information, and can reflect the mouth shape of the speaker's pronunciation.
- the deep neural network model is trained according to the text data in the sample library and the voice data, and the model parameters of the deep neural network model are acquired.
- The pronunciation duration can be predicted from the length characteristics and syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as pitch and accent position in the phoneme features.
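The training step can be pictured as ordinary backpropagation on a mean-squared-error loss. The random arrays below stand in for feature vectors extracted from the sample library; the network size and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))            # phoneme feature vectors
Y = rng.normal(size=(100, 3))            # target acoustic features

W1 = rng.normal(size=(8, 16)) * 0.1
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 3)) * 0.1
b2 = np.zeros(3)
lr = 0.05

def forward(X):
    """ReLU hidden layer, linear output."""
    H = np.maximum(0.0, X @ W1 + b1)
    return H, H @ W2 + b2

losses = []
for _ in range(200):
    H, P = forward(X)
    err = P - Y
    losses.append(float((err ** 2).mean()))
    # Backpropagate the squared-error gradient through both layers.
    gP = 2 * err / len(X)
    gW2 = H.T @ gP
    gb2 = gP.sum(0)
    gH = gP @ W2.T
    gH[H <= 0] = 0.0                     # ReLU gradient mask
    gW1 = X.T @ gH
    gb1 = gH.sum(0)
    W1 -= lr * gW1
    b1 -= lr * gb1
    W2 -= lr * gW2
    b2 -= lr * gb2
```

With a small learning rate the loss decreases over the epochs, which is all this sketch is meant to show.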
- The mouth-shape data in this embodiment is physiological electromagnetic articulography (EMA) data obtained by capturing mouth-shape motion-change information; the EMA data mainly includes the coordinate information of a specific mouth shape and the corresponding mouth-shape image.
- The mouth-position features in the mouth-shape data are used directly. The mouth-position features mainly include the coordinate information of the following positions: the tongue tip, tongue body, tongue back, upper lip, lower lip, upper incisors, and lower incisors.
- A tensor model for expressing the correlation between the acoustic features and the mouth-shape data is pre-trained. The tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, mouth-shape data, and speaker identification information, respectively.
- The third-order tensor model is trained to obtain its model parameters.
- The third-order tensor model in this embodiment is constructed and trained as follows: the set of pronunciation features is used as one parameter space, and the set of mouth-shape data corresponding to the pronunciation features is used as another parameter space.
- A third-order tensor is constructed based on the multilinear-space representation described above; the three dimensions of the third-order tensor correspond to acoustic features, mouth-shape data, and speaker identification information, respectively. Its expression is as follows:
- The left side of the equation contains the model parameters to be solved, mainly including the weights of the features in the two parameter spaces; the right side of the equation is the feature input used when training the model, namely the pronunciation features and mouth-shape features extracted from the text data and mouth-shape data in the sample library, where C is the core tensor.
- The mean term in the expression is the mouth-position information averaged over different speakers. Taking the sound "a" as an example, the corresponding mean is the average of the mouth-position information of different speakers when producing the "a" sound.
- The third-order tensor model is trained using a higher-order singular value decomposition (HOSVD) algorithm to solve for the model parameters on the left side of the above expression.
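The HOSVD step can be sketched with NumPy as follows: each mode of the tensor is matricized and decomposed by SVD, and the resulting factor matrices project the tensor onto a core. The tensor contents and mode sizes below are random placeholders, not data from the sample library:

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Mode-n product: multiply matrix M into tensor T along `mode`."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def hosvd(T):
    """Higher-order SVD: per-mode factor matrices plus a core tensor."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0]
               for m in range(T.ndim)]
    core = T
    for m, U in enumerate(factors):
        core = mode_product(core, U.T, m)
    return core, factors

rng = np.random.default_rng(0)
# Toy third-order tensor: pronunciation features x mouth-shape data x speakers.
T = rng.normal(size=(5, 7, 3))
core, (U1, U2, U3) = hosvd(T)

# Multiplying the core back by the factor matrices recovers the tensor.
R = core
for m, U in enumerate((U1, U2, U3)):
    R = mode_product(R, U, m)
print(np.allclose(R, T))  # True
```

The per-mode factors here play the role of the parameter-space weights on the left of the expression, and `core` plays the role of the core tensor C.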
- The voice data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the mouth-shape data corresponding to the voice data. That is, when the sample library used to train the third-order tensor model contains the corpora of multiple speakers, the user can select speaker identification information in advance, and the finally generated mouth-shape data will be closer to that speaker's mouth-shape data.
- The deep neural network model is used to model the mapping between phoneme features and acoustic features. This mapping is a nonlinear problem, and the deep neural network can achieve better feature mining.
- The mouth-shape data is used to realize dynamic display of the mouth shape while the voice data is played.
- The speech-based mouth-shape animation synthesis device of this embodiment obtains the phoneme features in the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the acoustic features corresponding to the phoneme features.
- The acoustic features include the MFCC features, pronunciation duration, and pronunciation fundamental frequency; these acoustic features are input into a speech synthesizer, and voice data corresponding to the target text data is obtained.
- According to the voice data, the pre-trained tensor model, and the preset speaker identification information, mouth-shape data corresponding to the voice data and the speaker identification information is acquired, and a mouth-shape animation corresponding to the voice data is generated from the mouth-shape data, so that the mouth-shape animation is displayed while the voice data is played.
- This scheme uses the deep neural network model to transform the target text data into acoustic features, which achieves better feature mining and enables the speech synthesis system to obtain more accurate and natural output; at the same time, through a tensor model that expresses the correlation between the acoustic features and the mouth-shape data, the synthesized voice data is converted into corresponding mouth-shape data, and a mouth-shape animation corresponding to the target text data is generated from that data, solving the technical problem that the prior art cannot display a realistic mouth-shape animation matching the synthesized voice data.
- In addition, the mouth-shape animation synthesis program may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application.
- A module referred to in the present application is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution process of the mouth-shape animation synthesis program in the speech-based mouth-shape animation synthesis device.
- FIG. 2 is a schematic diagram of the program modules of the mouth-shape animation synthesis program in an embodiment of the speech-based mouth-shape animation synthesis device of the present application.
- The mouth-shape animation synthesis program can be divided into a feature extraction module 10, a feature conversion module 20, a speech synthesis module 30, a mouth-shape generation module 40, and an animation synthesis module 50. Illustratively:
- the feature extraction module 10 is configured to: acquire target text data, and acquire phoneme features in the target text data according to a pronunciation dictionary;
- the feature conversion module 20 is configured to: input the phoneme feature into a pre-trained deep neural network model, and output an acoustic feature corresponding to the phoneme feature, the acoustic feature including a Mel-frequency cepstral coefficient (MFCC) feature, a pronunciation duration, and a pronunciation fundamental frequency;
- the speech synthesis module 30 is configured to: input the acoustic feature into a speech synthesizer, and output speech data corresponding to the target text data;
- the mouth-shape generation module 40 is configured to: acquire, according to the voice data, the pre-trained tensor model, and the preset speaker identification information, the mouth-shape data corresponding to the voice data and the speaker identification information,
- the tensor model expresses a correlation between the pronunciation features of the speech data and the lip position characteristics of the lip data;
- the animation synthesizing module 50 is configured to: generate a lip animation corresponding to the voice data according to the lip shape data, to display the lip animation while playing the voice data.
- the present application also provides a voice-based lip animation synthesis method.
- FIG. 3 is a flowchart of a preferred embodiment of the speech-based mouth-shape animation synthesis method of the present application. The method may be performed by a device, which may be implemented by software and/or hardware; the method of this embodiment is described below with the speech-based mouth-shape animation synthesis device as the execution subject.
- the voice-based lip animation synthesis method includes:
- Step S10: acquire target text data, and acquire phoneme features in the target text data according to the pronunciation dictionary.
- Step S20: input the phoneme features into a pre-trained deep neural network model, and output the acoustic features corresponding to the phoneme features, the acoustic features including a Mel-frequency cepstral coefficient (MFCC) feature, a pronunciation duration, and a pronunciation fundamental frequency.
- Step S30: input the acoustic features into a speech synthesizer, and output voice data corresponding to the target text data.
- the target text data is converted into voice data through a pre-established deep neural network model, and the voice data is converted into the mouth data through a pre-established tensor model.
- The target text data to be synthesized is obtained, and the target text data is split into words or characters by a word segmentation tool; the split words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features. For Chinese, the phonemes include initial phonemes and final (vowel) phonemes.
- The phoneme features mainly include the following: the pronunciation feature of the current phoneme, the pronunciation feature of the previous phoneme, the pronunciation feature of the next phoneme, and the position of the current phoneme in the word.
- A deep neural network model for expressing the correlation between phoneme features and acoustic features is pre-trained, and the above feature vectors are input into the model to obtain the corresponding acoustic features; the acoustic features include the Mel-frequency cepstral coefficient (MFCC) features, the pronunciation duration, and the pronunciation fundamental frequency of each sound.
- The MFCC features, the pronunciation duration, and the pronunciation fundamental frequency are synthesized by a speech synthesizer to obtain a speech signal.
- Before applying the deep neural network model of this embodiment, the model needs to be trained.
- A corpus is collected and a sample library is constructed based on the corpus of at least one speaker. The corpus includes voice data together with the text data and mouth-shape data corresponding to the voice data; that is, the voice data obtained by having one or more speakers read the same text data aloud, and the corresponding mouth-shape data, are collected to establish the sample library. The mouth-shape data is physiological electromagnetic articulography (EMA) data obtained by capturing mouth-shape motion-change information, and can reflect the mouth shape of the speaker's pronunciation.
- the deep neural network model is trained according to the text data in the sample library and the voice data, and the model parameters of the deep neural network model are acquired.
- The pronunciation duration can be predicted from the length characteristics and syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as pitch and accent position in the phoneme features.
- Step S40: acquire, according to the voice data, the pre-trained tensor model, and the preset speaker identification information, the mouth-shape data corresponding to the voice data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of the voice data and the mouth-position features of the mouth-shape data.
- The mouth-shape data in this embodiment is physiological electromagnetic articulography (EMA) data obtained by capturing mouth-shape motion-change information; the EMA data mainly includes the coordinate information of a specific mouth shape and the corresponding mouth-shape image.
- The mouth-position features in the mouth-shape data are used directly. The mouth-position features mainly include the coordinate information of the following positions: the tongue tip, tongue body, tongue back, upper lip, lower lip, upper incisors, and lower incisors.
- A tensor model for expressing the correlation between the acoustic features and the mouth-shape data is pre-trained. The tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, mouth-shape data, and speaker identification information, respectively.
- The third-order tensor model is trained to obtain its model parameters.
- The third-order tensor model in this embodiment is constructed and trained as follows: the set of pronunciation features is used as one parameter space, and the set of mouth-shape data corresponding to the pronunciation features is used as another parameter space.
- A third-order tensor is constructed based on the multilinear-space representation described above; the three dimensions of the third-order tensor correspond to acoustic features, mouth-shape data, and speaker identification information, respectively. Its expression is as follows:
- The left side of the equation contains the model parameters to be solved, mainly including the weights of the features in the two parameter spaces; the right side of the equation is the feature input used when training the model, namely the pronunciation features and mouth-shape features extracted from the text data and mouth-shape data in the sample library, where C is the core tensor.
- The mean term in the expression is the mouth-position information averaged over different speakers. Taking the sound "a" as an example, the corresponding mean is the average of the mouth-position information of different speakers when producing the "a" sound.
- The third-order tensor model is trained using a higher-order singular value decomposition (HOSVD) algorithm to solve for the model parameters on the left side of the above expression.
- The voice data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the mouth-shape data corresponding to the voice data. That is, when the sample library used to train the third-order tensor model contains the corpora of multiple speakers, the user can select speaker identification information in advance, and the finally generated mouth-shape data will be closer to that speaker's mouth-shape data.
- Step S50: generate a mouth-shape animation corresponding to the voice data according to the mouth-shape data, so as to display the mouth-shape animation while playing the voice data.
- the deep neural network model is used to realize the modeling mapping between the phoneme feature and the acoustic feature. This mapping relationship is a nonlinear mapping problem, and the deep neural network can achieve better feature mining. And expression, so that the speech synthesis system can obtain more accurate and more natural output results; and by constructing the tensor model to realize the expression of the correlation between the pronunciation feature and the lip shape feature, the acquired speech can be matched and realistic.
- the mouth-shape data is used to realize dynamic display of the mouth shape while the voice data is played.
- the speech-based mouth-shape animation synthesis method proposed in this embodiment obtains the phoneme features in the target text data according to a pronunciation dictionary, inputs the phoneme features into a pre-trained deep neural network model, and outputs the acoustic features corresponding to the phoneme features. The acoustic features include MFCC features, pronunciation duration, and pronunciation fundamental frequency; these acoustic features are input into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model, and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information is acquired, and a mouth-shape animation corresponding to the speech data is generated from the mouth-shape data, so that the mouth-shape animation is displayed while the speech data is played.
- This scheme uses the deep neural network model to transform the target text data into acoustic features, which achieves better feature extraction and makes the speech synthesis system produce more accurate and natural output. At the same time, through a tensor model that expresses the correlation between acoustic features and mouth-shape data, the synthesized voice data is converted into corresponding mouth-shape data, from which a mouth-shape animation corresponding to the target text data is generated. This solves the technical problem in the prior art of being unable to display a realistic mouth-shape animation that matches the synthesized voice data.
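The overall S10–S50 flow summarized above can be laid out as a skeleton in which every function is a named stub; none of these are real library calls, and the bodies only mimic the data flow between the steps:

```python
def extract_phoneme_features(text, pronunciation_dictionary):
    # S10: look each word up in the pronunciation dictionary
    return [pronunciation_dictionary[w] for w in text.split()]

def dnn_acoustic_model(phonemes):
    # S20: stand-in for the trained deep neural network
    return [{"mfcc": [0.0] * 13, "duration": 0.1, "f0": 120.0} for _ in phonemes]

def speech_synthesizer(acoustic_features):
    # S30: stand-in vocoder; returns fake waveform samples at 16 kHz
    return [0.0] * int(sum(f["duration"] for f in acoustic_features) * 16000)

def tensor_mouth_model(speech, speaker_id):
    # S40: stand-in for the trained third-order tensor model
    return [("mouth_frame", speaker_id)] * 3

def render_mouth_animation(mouth_data, speech):
    # S50: pair mouth frames with audio for synchronized playback
    return {"audio": speech, "frames": mouth_data}

pron_dict = {"hello": "HH-AH-L-OW", "world": "W-ER-L-D"}
phonemes = extract_phoneme_features("hello world", pron_dict)
acoustic = dnn_acoustic_model(phonemes)
speech = speech_synthesizer(acoustic)
mouth = tensor_mouth_model(speech, speaker_id="spk01")
animation = render_mouth_animation(mouth, speech)
print(len(animation["frames"]))  # 3
```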
- an embodiment of the present application further provides a computer-readable storage medium on which a mouth-shape animation synthesis program is stored. The mouth-shape animation synthesis program can be executed by one or more processors to implement the following operations:
- acquiring the phoneme features; inputting the phoneme features into a pre-trained deep neural network model, and outputting the acoustic features corresponding to the phoneme features, the acoustic features including Mel-cepstrum coefficient (MFCC) features, pronunciation duration, and pronunciation fundamental frequency;
- the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as the ROM/RAM described above, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present application.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Theoretical Computer Science (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
- Processing Or Creating Images (AREA)
Abstract
Disclosed are a speech-based mouth-shape animation synthesis device and method. The device comprises a memory and a processor. A mouth-shape animation synthesis program executable on the processor is stored in the memory. When executed by the processor, the program implements the following steps: acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary (S10); inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features (S20); inputting the acoustic features into a speech synthesizer, and outputting speech data (S30); acquiring mouth-shape data according to the speech data, a pre-trained tensor model, and speaker identification information (S40); and generating a corresponding mouth-shape animation according to the mouth-shape data and the speech data (S50). The device and method solve the technical problem in the prior art of being unable to present a realistic mouth-shape animation matching speech data.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810327672.1A CN108763190B (zh) | 2018-04-12 | 2018-04-12 | 基于语音的口型动画合成装置、方法及可读存储介质 |
CN201810327672.1 | 2018-04-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019196306A1 true WO2019196306A1 (fr) | 2019-10-17 |
Family
ID=63981728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/102209 WO2019196306A1 (fr) | 2018-04-12 | 2018-08-24 | Dispositif et procédé de mélange d'animation de forme de bouche basé sur la parole, et support de stockage lisible |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108763190B (fr) |
WO (1) | WO2019196306A1 (fr) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110827799A (zh) * | 2019-11-21 | 2020-02-21 | 百度在线网络技术(北京)有限公司 | 用于处理语音信号的方法、装置、设备和介质 |
EP3866166A1 (fr) * | 2020-02-13 | 2021-08-18 | Baidu Online Network Technology (Beijing) Co., Ltd. | Procédé et appareil permettant de prédire une fonctionnalité en forme de bouche, dispositif électronique, support de stockage et produit programme informatique |
CN117173292A (zh) * | 2023-09-07 | 2023-12-05 | 河北日凌智能科技有限公司 | 一种基于元音切片的数字人交互方法及装置 |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109447234B (zh) * | 2018-11-14 | 2022-10-21 | 腾讯科技(深圳)有限公司 | 一种模型训练方法、合成说话表情的方法和相关装置 |
CN109523616B (zh) * | 2018-12-04 | 2023-05-30 | 科大讯飞股份有限公司 | 一种面部动画生成方法、装置、设备及可读存储介质 |
CN111326141A (zh) * | 2018-12-13 | 2020-06-23 | 南京硅基智能科技有限公司 | 一种处理获取人声数据的方法 |
CN109801349B (zh) * | 2018-12-19 | 2023-01-24 | 武汉西山艺创文化有限公司 | 一种声音驱动的三维动画角色实时表情生成方法和*** |
CN109599113A (zh) | 2019-01-22 | 2019-04-09 | 北京百度网讯科技有限公司 | 用于处理信息的方法和装置 |
CN110136698B (zh) * | 2019-04-11 | 2021-09-24 | 北京百度网讯科技有限公司 | 用于确定嘴型的方法、装置、设备和存储介质 |
CN110189394B (zh) * | 2019-05-14 | 2020-12-29 | 北京字节跳动网络技术有限公司 | 口型生成方法、装置及电子设备 |
CN110288682B (zh) * | 2019-06-28 | 2023-09-26 | 北京百度网讯科技有限公司 | 用于控制三维虚拟人像口型变化的方法和装置 |
CN112181127A (zh) * | 2019-07-02 | 2021-01-05 | 上海浦东发展银行股份有限公司 | 用于人机交互的方法和装置 |
WO2021127821A1 (fr) * | 2019-12-23 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Procédé d'apprentissage de modèle de synthèse vocale, dispositif informatique et support de stockage |
CN110992926B (zh) * | 2019-12-26 | 2022-06-10 | 标贝(北京)科技有限公司 | 语音合成方法、装置、***和存储介质 |
CN111340920B (zh) * | 2020-03-02 | 2024-04-09 | 长沙千博信息技术有限公司 | 一种语义驱动的二维动画自动生成方法 |
CN111698552A (zh) * | 2020-05-15 | 2020-09-22 | 完美世界(北京)软件科技发展有限公司 | 一种视频资源的生成方法和装置 |
CN112331184B (zh) * | 2020-10-29 | 2024-03-15 | 网易(杭州)网络有限公司 | 语音口型同步方法、装置、电子设备及存储介质 |
CN112927712B (zh) * | 2021-01-25 | 2024-06-04 | 网易(杭州)网络有限公司 | 视频生成方法、装置和电子设备 |
CN112837401B (zh) * | 2021-01-27 | 2024-04-09 | 网易(杭州)网络有限公司 | 一种信息处理方法、装置、计算机设备及存储介质 |
CN113079328B (zh) * | 2021-03-19 | 2023-03-28 | 北京有竹居网络技术有限公司 | 视频生成方法和装置、存储介质和电子设备 |
CN113314094B (zh) * | 2021-05-28 | 2024-05-07 | 北京达佳互联信息技术有限公司 | 唇形模型的训练方法和装置及语音动画合成方法和装置 |
CN113707124A (zh) * | 2021-08-30 | 2021-11-26 | 平安银行股份有限公司 | 话术语音的联动播报方法、装置、电子设备及存储介质 |
CN113870396B (zh) * | 2021-10-11 | 2023-08-15 | 北京字跳网络技术有限公司 | 一种口型动画生成方法、装置、计算机设备及存储介质 |
CN114420088A (zh) * | 2022-01-20 | 2022-04-29 | 安徽淘云科技股份有限公司 | 一种展示方法及其相关设备 |
CN114581567B (zh) * | 2022-05-06 | 2022-08-02 | 成都市谛视无限科技有限公司 | 一种声音驱动虚拟形象口型方法、装置及介质 |
CN116257762B (zh) * | 2023-05-16 | 2023-07-14 | 世优(北京)科技有限公司 | 深度学习模型的训练方法及控制虚拟形象口型变化的方法 |
CN117894064A (zh) * | 2023-12-11 | 2024-04-16 | 中新金桥数字科技(北京)有限公司 | 一种基于遍历声母韵母及整体发音的训练的口型对齐方法 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080312930A1 (en) * | 1997-08-05 | 2008-12-18 | At&T Corp. | Method and system for aligning natural and synthetic video to speech synthesis |
CN104361620A (zh) * | 2014-11-27 | 2015-02-18 | 韩慧健 | 一种基于综合加权算法的口型动画合成方法 |
US9262857B2 (en) * | 2013-01-16 | 2016-02-16 | Disney Enterprises, Inc. | Multi-linear dynamic hair or clothing model with efficient collision handling |
CN106297792A (zh) * | 2016-09-14 | 2017-01-04 | 厦门幻世网络科技有限公司 | 一种语音口型动画的识别方法及装置 |
CN106531150A (zh) * | 2016-12-23 | 2017-03-22 | 上海语知义信息技术有限公司 | 一种基于深度神经网络模型的情感合成方法 |
2018
- 2018-04-12 CN CN201810327672.1A patent/CN108763190B/zh active Active
- 2018-08-24 WO PCT/CN2018/102209 patent/WO2019196306A1/fr active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080312930A1 (en) * | 1997-08-05 | 2008-12-18 | At&T Corp. | Method and system for aligning natural and synthetic video to speech synthesis |
US9262857B2 (en) * | 2013-01-16 | 2016-02-16 | Disney Enterprises, Inc. | Multi-linear dynamic hair or clothing model with efficient collision handling |
CN104361620A (zh) * | 2014-11-27 | 2015-02-18 | 韩慧健 | 一种基于综合加权算法的口型动画合成方法 |
CN106297792A (zh) * | 2016-09-14 | 2017-01-04 | 厦门幻世网络科技有限公司 | 一种语音口型动画的识别方法及装置 |
CN106531150A (zh) * | 2016-12-23 | 2017-03-22 | 上海语知义信息技术有限公司 | 一种基于深度神经网络模型的情感合成方法 |
Non-Patent Citations (1)
Title |
---|
GRALEWSKI, L. ET AL.: "Using a Tensor Framework for the Analysis of Facial Dynamics", 7TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FGR06, 24 April 2006 (2006-04-24), pages 217 - 222, XP010911558, DOI: 10.1109/FGR.2006.108 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110827799A (zh) * | 2019-11-21 | 2020-02-21 | 百度在线网络技术(北京)有限公司 | 用于处理语音信号的方法、装置、设备和介质 |
CN110827799B (zh) * | 2019-11-21 | 2022-06-10 | 百度在线网络技术(北京)有限公司 | 用于处理语音信号的方法、装置、设备和介质 |
EP3866166A1 (fr) * | 2020-02-13 | 2021-08-18 | Baidu Online Network Technology (Beijing) Co., Ltd. | Procédé et appareil permettant de prédire une fonctionnalité en forme de bouche, dispositif électronique, support de stockage et produit programme informatique |
US11562732B2 (en) | 2020-02-13 | 2023-01-24 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for predicting mouth-shape feature, and electronic device |
CN117173292A (zh) * | 2023-09-07 | 2023-12-05 | 河北日凌智能科技有限公司 | 一种基于元音切片的数字人交互方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN108763190B (zh) | 2019-04-02 |
CN108763190A (zh) | 2018-11-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019196306A1 (fr) | Dispositif et procédé de mélange d'animation de forme de bouche basé sur la parole, et support de stockage lisible | |
CN110688911B (zh) | 视频处理方法、装置、***、终端设备及存储介质 | |
CN106575500B (zh) | 基于面部结构合成话音的方法和装置 | |
US9361722B2 (en) | Synthetic audiovisual storyteller | |
Lee et al. | MMDAgent—A fully open-source toolkit for voice interaction systems | |
WO2017067206A1 (fr) | Procédé d'apprentissage de plusieurs modèles acoustiques personnalisés, et procédé et dispositif de synthèse de la parole | |
KR102116309B1 (ko) | 가상 캐릭터와 텍스트의 동기화 애니메이션 출력 시스템 | |
WO2019056500A1 (fr) | Appareil électronique, procédé de synthèse vocale, et support de stockage lisible par ordinateur | |
JP6206960B2 (ja) | 発音動作可視化装置および発音学習装置 | |
CN111145777A (zh) | 一种虚拟形象展示方法、装置、电子设备及存储介质 | |
JP2018146803A (ja) | 音声合成装置及びプログラム | |
JP5913394B2 (ja) | 音声同期処理装置、音声同期処理プログラム、音声同期処理方法及び音声同期システム | |
CN109949791A (zh) | 基于hmm的情感语音合成方法、装置及存储介质 | |
Karpov et al. | Automatic technologies for processing spoken sign languages | |
CN112599113B (zh) | 方言语音合成方法、装置、电子设备和可读存储介质 | |
WO2024088321A1 (fr) | Procédé et appareil de commande de visage d'image virtuelle, dispositif électronique et support | |
CN114121006A (zh) | 虚拟角色的形象输出方法、装置、设备以及存储介质 | |
CN112735371A (zh) | 一种基于文本信息生成说话人视频的方法及装置 | |
JP5807921B2 (ja) | 定量的f0パターン生成装置及び方法、f0パターン生成のためのモデル学習装置、並びにコンピュータプログラム | |
TWI574254B (zh) | 用於電子系統的語音合成方法及裝置 | |
Mukherjee et al. | A Bengali speech synthesizer on Android OS | |
JP7510562B2 (ja) | オーディオデータの処理方法、装置、電子機器、媒体及びプログラム製品 | |
CN112634861B (zh) | 数据处理方法、装置、电子设备和可读存储介质 | |
JP2016142936A (ja) | 音声合成用データ作成方法、及び音声合成用データ作成装置 | |
JP6475572B2 (ja) | 発話リズム変換装置、方法及びプログラム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18914626; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 18914626; Country of ref document: EP; Kind code of ref document: A1 |