WO2019196306A1 - Speech-based lip animation synthesis apparatus and method, and readable storage medium - Google Patents

Speech-based lip animation synthesis apparatus and method, and readable storage medium

Info

Publication number
WO2019196306A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature
model
neural network
voice data
Prior art date
Application number
PCT/CN2018/102209
Other languages
English (en)
Chinese (zh)
Inventor
梁浩
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019196306A1 publication Critical patent/WO2019196306A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/2053D [Three Dimensional] animation driven by audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a voice-based lip animation synthesis apparatus, method, and readable storage medium.
  • Speech synthesis, also known as text-to-speech (TTS) technology, is a technology that converts text information into speech and reads it aloud. It involves many disciplines such as acoustics, linguistics, digital signal processing, and computer science, and is a cutting-edge technology in the field of Chinese information processing. The main problem it solves is how to convert text information into audible sound information.
  • For example, the application scenario of computer-assisted pronunciation training needs to dynamically display the speaker's mouth-shape changes while the voice data is played, so as to help the user practice pronunciation.
  • For synthesized voice data, however, there is no real speaker's mouth-shape data corresponding to it, so it is impossible to display a realistic mouth-shape animation that matches the synthesized voice data.
  • In view of this, the present application provides a voice-based lip animation synthesis apparatus, method, and readable storage medium, the main purpose of which is to solve the prior-art technical problem that a realistic mouth-shape animation matching synthesized speech data cannot be displayed.
  • To achieve the above object, the present application provides a voice-based lip animation synthesis device, the device comprising a memory and a processor, wherein the memory stores a lip animation synthesis program executable on the processor, and the lip animation synthesis program, when executed by the processor, implements the following steps:
  • inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, a pronunciation duration, and a pronunciation fundamental frequency;
  • the present application further provides a voice-based lip animation synthesis method, the method comprising:
  • inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, a pronunciation duration, and a pronunciation fundamental frequency;
  • In addition, the present application further provides a computer-readable storage medium having a lip animation synthesis program stored thereon, the lip animation synthesis program being executable by one or more processors to implement the steps of the speech-based lip animation synthesis method described above.
  • FIG. 1 is a schematic diagram of a voice-based lip-shaped animation synthesizing device of the present application
  • FIG. 2 is a schematic diagram of a program module of a lip-shaped animation synthesis program in an embodiment of a speech-based lip animation synthesis apparatus according to an embodiment of the present invention
  • FIG. 3 is a flow chart of a preferred embodiment of a speech-based lip animation synthesis method of the present application.
  • the present application provides a voice-based lip animation synthesis device.
  • FIG. 1 a schematic diagram of a preferred embodiment of a speech-based lip animation synthesis apparatus of the present application is shown.
  • the voice-based lip animation synthesis device may be a PC (Personal Computer), or may be a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the speech-based lip animation synthesis apparatus includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (for example, an SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be, in some embodiments, an internal storage unit of a voice-based lip animation composition device, such as a hard disk of the speech-based lip animation synthesis device.
  • The memory 11 may also be an external storage device of the voice-based lip animation synthesis device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the voice-based lip animation synthesis device.
  • the memory 11 may also include an internal storage unit of the voice-based lip animation composition device and an external storage device.
  • the memory 11 can be used not only for storing application software and various types of data installed in the voice-based lip animation synthesizing device, such as code of a lip animation synthesis program, but also for temporarily storing data that has been output or is to be output.
  • The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, for running program code stored in the memory 11 or processing data, for example executing the lip animation synthesis program.
  • Communication bus 13 is used to implement connection communication between these components.
  • the network interface 14 can optionally include a standard wired interface, a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device and other electronic devices.
  • Figure 1 shows only a speech-based lip animation synthesis device having components 11-14 and a lip animation synthesis program, but it should be understood that not all of the illustrated components need to be implemented, and alternative implementations may include more or fewer components.
  • Optionally, the device may further include a user interface, which may include a display and an input unit such as a keyboard; the optional user interface may further include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like.
  • The display may also be appropriately referred to as a display screen or a display unit, for displaying the information processed in the voice-based lip animation synthesis device and for displaying a visualized user interface.
  • A lip animation synthesis program is stored in the memory 11; when the processor 12 executes the lip animation synthesis program stored in the memory 11, the following steps are implemented:
  • the phoneme feature is input into a pre-trained deep neural network model, and an acoustic feature corresponding to the phoneme feature is output, the acoustic feature including a Mel cepstrum coefficient MFCC feature, a pronunciation duration, and a pronunciation fundamental frequency.
  • the acoustic feature is input to a speech synthesizer, and speech data corresponding to the target text data is output.
  • the target text data is converted into voice data through a pre-established deep neural network model, and the voice data is converted into the mouth data through a pre-established tensor model.
  • Specifically, the target text data to be synthesized is obtained, and the target text data is split into words or characters by a word segmentation tool; the split words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features. For Chinese, the phonemes include initial (consonant) phonemes and final (vowel) phonemes.
  • The phoneme features mainly include the following: the pronunciation feature of the current phoneme, the pronunciation feature of the previous phoneme, the pronunciation feature of the next phoneme, and the position of the current phoneme in the word.
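  • As a concrete illustration of this step, the following is a minimal sketch of phoneme-feature extraction for Chinese text. It is only a sketch: the open-source jieba segmenter and pypinyin library stand in for the word segmentation tool and pronunciation dictionary, which the application does not name, and the contextual feature layout is an assumption for demonstration.

```python
# Hedged sketch: jieba/pypinyin stand in for the unnamed word-segmentation
# tool and pronunciation dictionary; the feature layout is illustrative only.
import jieba
from pypinyin import pinyin, Style

def text_to_phonemes(text: str):
    """Split text into words, then each word into initial/final phonemes."""
    phones = []
    for word in jieba.lcut(text):
        initials = pinyin(word, style=Style.INITIALS, strict=False)
        finals = pinyin(word, style=Style.FINALS, strict=False)
        for ini, fin in zip(initials, finals):
            if ini[0]:
                phones.append(ini[0])   # initial (consonant) phoneme
            if fin[0]:
                phones.append(fin[0])   # final (vowel) phoneme
    return phones

def phoneme_features(phones):
    """Context features: previous/current/next phoneme plus relative position."""
    feats = []
    for i, p in enumerate(phones):
        feats.append({
            "prev": phones[i - 1] if i > 0 else "<s>",
            "cur": p,
            "next": phones[i + 1] if i + 1 < len(phones) else "</s>",
            "pos": i / max(len(phones) - 1, 1),
        })
    return feats

if __name__ == "__main__":
    print(phoneme_features(text_to_phonemes("你好世界")))
```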
  • In this embodiment, a deep neural network model expressing the correlation between the phoneme features and the acoustic features is pre-trained, and the above feature vectors are input into the model to obtain the corresponding acoustic features; the acoustic features include the Mel-frequency cepstral coefficient (MFCC) features, the pronunciation duration, and the pronunciation fundamental frequency of each sound.
  • Finally, the MFCC features, the pronunciation duration, and the pronunciation fundamental frequency are synthesized by a speech synthesizer to obtain the speech signal.
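  • The application does not disclose the internals of the speech synthesizer. Purely as a hedged stand-in, the sketch below inverts predicted MFCC features back to a rough waveform with librosa's Griffin-Lim-based MFCC inversion and stretches it toward the predicted duration; the fundamental frequency is not used in this crude approximation, and the sampling rate and feature sizes are assumptions.

```python
# Hedged sketch: librosa's MFCC inversion is a crude stand-in for the
# speech synthesizer described in the application (F0 control is omitted).
import numpy as np
import librosa
import soundfile as sf

def synthesize_from_acoustic_features(mfcc: np.ndarray,
                                      duration_s: float,
                                      sr: int = 16000) -> np.ndarray:
    """mfcc: (n_mfcc, frames) array predicted by the acoustic model."""
    wav = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
    # Stretch or compress the rough waveform toward the predicted duration.
    current = len(wav) / sr
    if current > 0 and duration_s > 0:
        wav = librosa.effects.time_stretch(wav, rate=current / duration_s)
    return wav

if __name__ == "__main__":
    fake_mfcc = np.random.randn(13, 200).astype(np.float32)  # placeholder features
    audio = synthesize_from_acoustic_features(fake_mfcc, duration_s=2.0)
    sf.write("synth_demo.wav", audio, 16000)
```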
  • Before the deep neural network model is applied in this embodiment, the model needs to be trained.
  • Specifically, corpora are collected and a sample library is constructed based on the corpus of at least one speaker, where each corpus includes voice data together with the text data and mouth-shape data corresponding to the voice data. That is, one or more speakers read the same text data, the resulting voice data and the corresponding mouth-shape data are recorded, and a sample library is established; the mouth-shape data here are physiological electromagnetic articulography (EMA) data obtained by capturing mouth-shape motion change information, which can reflect the mouth shapes of the speaker's pronunciation.
  • the deep neural network model is trained according to the text data in the sample library and the voice data, and the model parameters of the deep neural network model are acquired.
  • The pronunciation duration can be predicted from the length features and the syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as the pitch and the stress position in the phoneme features; one possible model layout for these predictions is sketched below.
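  • The application does not specify the network architecture. Under that caveat, the sketch below shows one plausible layout: a small feed-forward network trained by backpropagation, with shared hidden layers and three output heads for the MFCC frame, the pronunciation duration, and the fundamental frequency; all layer sizes and the summed MSE loss are assumptions.

```python
# Hedged sketch of the phoneme-to-acoustic-feature mapping: the layer sizes,
# loss weights, and three-head layout are assumptions, not taken from the patent.
import torch
import torch.nn as nn

class PhonemeToAcoustic(nn.Module):
    def __init__(self, phoneme_dim: int = 64, hidden: int = 256, n_mfcc: int = 13):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(phoneme_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mfcc_head = nn.Linear(hidden, n_mfcc)   # MFCC frame
        self.duration_head = nn.Linear(hidden, 1)    # pronunciation duration
        self.f0_head = nn.Linear(hidden, 1)          # fundamental frequency

    def forward(self, x):
        h = self.backbone(x)
        return self.mfcc_head(h), self.duration_head(h), self.f0_head(h)

def train_step(model, optimizer, x, mfcc_t, dur_t, f0_t):
    """One backpropagation step with a simple summed MSE loss."""
    mfcc_p, dur_p, f0_p = model(x)
    loss = (nn.functional.mse_loss(mfcc_p, mfcc_t)
            + nn.functional.mse_loss(dur_p, dur_t)
            + nn.functional.mse_loss(f0_p, f0_t))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = PhonemeToAcoustic()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(32, 64)                       # batch of phoneme feature vectors
    targets = (torch.randn(32, 13), torch.rand(32, 1), torch.rand(32, 1))
    print(train_step(model, opt, x, *targets))
```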
  • The lip-shape data in this embodiment are physiological electromagnetic articulography (EMA) data obtained by capturing mouth-shape motion change information; the EMA data mainly include the coordinate information of a specific mouth shape and the corresponding mouth-shape image.
  • In this embodiment, the mouth-position features in the mouth-shape data are used directly; the mouth-position features mainly include the coordinate information of the following positions: the tongue tip, the tongue body, the tongue dorsum, the upper lip, the lower lip, the upper incisors, and the lower incisors.
  • In this embodiment, a tensor model expressing the correlation between the acoustic features and the mouth-shape data is pre-trained; the tensor model is a third-order tensor model whose three dimensions correspond to the pronunciation features, the lip-shape data, and the speaker identification information, respectively.
  • The third-order tensor model is trained to obtain the model parameters of the third-order tensor model.
  • The third-order tensor model in the present embodiment is constructed and trained as follows. The set of pronunciation features is taken as one parameter space, and the set of lip-shape data corresponding to the pronunciation features is taken as another parameter space; based on this multilinear-space representation, a third-order tensor is constructed whose three dimensions correspond to the acoustic features, the lip-shape data, and the speaker identification information, respectively. In its expression, the left-hand side contains the model parameters to be solved, mainly the bases of the two parameter spaces and the weight of each feature, while the right-hand side is the feature input used when training the model, namely the pronunciation and mouth-shape features extracted from the text data and mouth-shape data in the sample library; C denotes the core tensor, and μ denotes the mouth-position information averaged over different speakers. Taking the sound "a" as an example, the corresponding μ is the average of the different speakers' mouth-position information when producing the "a" sound. One common form of such an expression is sketched below.
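  • The original equation is not reproduced in this text. Purely as an illustrative reconstruction, and not necessarily the patent's exact formulation, a typical multilinear (Tucker-style) expression with the ingredients described above could be written as:

```latex
% Illustrative reconstruction only -- not necessarily the patent's exact equation.
% D          : third-order data tensor of mean-removed mouth-position features
% C          : core tensor; U_ac, U_lip, U_spk : bases of the parameter spaces
% w_a, w_s   : weight vectors of a given pronunciation feature and speaker
% \mu        : mouth-position information averaged over the speakers
\mathcal{D} = \mathcal{C} \times_{1} U_{\mathrm{ac}} \times_{2} U_{\mathrm{lip}} \times_{3} U_{\mathrm{spk}},
\qquad
\hat{\boldsymbol{\ell}}(a, s) = \boldsymbol{\mu}
  + \mathcal{C} \times_{1} \mathbf{w}_{a}^{\top} \times_{2} U_{\mathrm{lip}} \times_{3} \mathbf{w}_{s}^{\top}
```

  • In this illustrative form, the core tensor and the mode bases correspond to the "model parameters to be solved" on the left, while the data tensor assembled from the sample library corresponds to the feature input on the right.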
  • In this embodiment, the third-order tensor model is trained using a high-order singular value decomposition (HOSVD) algorithm to solve the model parameters on the left-hand side of the above expression.
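  • A minimal sketch of higher-order singular value decomposition for such a third-order tensor is given below, in plain NumPy. The tensor layout (acoustic features × lip coordinates × speakers) and the random placeholder data are assumptions for demonstration, not the application's corpus; the returned mode bases and core tensor play the role of the model parameters described above.

```python
# Hedged HOSVD sketch: unfold the data tensor along each mode, take the left
# singular vectors as that mode's basis, then form the core tensor.
import numpy as np

def unfold(T: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T: np.ndarray, M: np.ndarray, mode: int) -> np.ndarray:
    """Multiply tensor T by matrix M along the given mode."""
    out = M @ unfold(T, mode)
    rest = [s for i, s in enumerate(T.shape) if i != mode]
    return np.moveaxis(out.reshape([M.shape[0]] + rest), 0, mode)

def hosvd(T: np.ndarray):
    """Return one orthogonal basis per mode and the core tensor."""
    U = []
    for mode in range(T.ndim):
        u, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        U.append(u)
    C = T.copy()
    for mode, u in enumerate(U):
        C = mode_dot(C, u.T, mode)
    return U, C

if __name__ == "__main__":
    # Placeholder tensor: 20 acoustic features x 14 lip coordinates x 5 speakers,
    # assumed to be mean-removed as described in the embodiment.
    D = np.random.randn(20, 14, 5)
    U, C = hosvd(D)
    # Reconstruction check: D should equal C x1 U0 x2 U1 x3 U2 for full-rank HOSVD.
    R = C
    for mode, u in enumerate(U):
        R = mode_dot(R, u, mode)
    print(np.allclose(D, R))
```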
  • In this embodiment, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the lip-shape data corresponding to the speech data. That is, when the sample library used to train the third-order tensor model contains the corpora of multiple speakers, the user can select the speaker identification information in advance, and the finally generated mouth-shape data will be closer to that speaker's mouth-shape data.
  • In this embodiment, the deep neural network model is used to model the mapping between the phoneme features and the acoustic features. This mapping is a nonlinear problem, and the deep neural network achieves better feature mining and expression, so that the speech synthesis system obtains more accurate and more natural output results.
  • The lip-shape data are used to realize dynamic display of the lip shapes while the voice data is played.
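  • As an illustration of this display step, the sketch below plays a waveform with the sounddevice library while animating mouth-position frames with matplotlib. The frame rate, the number of articulator points, and the placeholder data are assumptions for demonstration only.

```python
# Hedged sketch: play audio while animating mouth-position frames.
# Assumes lip_frames has shape (n_frames, n_points, 2) of x/y coordinates.
import numpy as np
import sounddevice as sd
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def play_with_lip_animation(audio: np.ndarray, sr: int,
                            lip_frames: np.ndarray, fps: int = 25):
    fig, ax = plt.subplots()
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    scatter = ax.scatter(lip_frames[0, :, 0], lip_frames[0, :, 1])

    def update(i):
        scatter.set_offsets(lip_frames[i])
        return scatter,

    sd.play(audio, sr)  # start audio playback (non-blocking)
    anim = FuncAnimation(fig, update, frames=len(lip_frames),
                         interval=1000 / fps, blit=True, repeat=False)
    plt.show()
    sd.wait()

if __name__ == "__main__":
    sr = 16000
    audio = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr * 2) / sr)  # 2 s placeholder tone
    lip_frames = 0.5 * np.random.randn(50, 7, 2)  # 7 articulator points per frame
    play_with_lip_animation(audio.astype(np.float32), sr, lip_frames)
```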
  • The speech-based lip animation synthesis device of this embodiment obtains the phoneme features in the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the acoustic features corresponding to the phoneme features, the acoustic features including the MFCC features, the pronunciation duration, and the pronunciation fundamental frequency. These acoustic features are input into a speech synthesizer to obtain the speech data corresponding to the target text data; according to the speech data, the pre-trained tensor model, and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information are acquired, and a mouth-shape animation corresponding to the speech data is generated from the mouth-shape data, so that the lip animation is displayed while the voice data is played.
  • This scheme uses the deep neural network model to transform the target text data into acoustic features, which achieves better feature mining and makes the speech synthesis system produce more accurate and more natural output; at the same time, through the tensor model that expresses the correlation between the acoustic features and the mouth-shape data, the synthesized voice data is converted into the corresponding mouth-shape data, and a mouth-shape animation corresponding to the target text data is generated from the mouth-shape data. This solves the prior-art technical problem that a realistic mouth-shape animation matching the synthesized voice data could not be displayed.
  • Optionally, the lip animation synthesis program may also be divided into one or more modules, the one or more modules being stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to complete the present application. The module referred to in the present application is a series of computer program instruction segments capable of performing a specific function, used to describe the execution process of the lip animation synthesis program in the voice-based lip animation synthesis device.
  • Referring to FIG. 2, it is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the speech-based lip animation synthesis device of the present application.
  • In this embodiment, the lip animation synthesis program can be divided into a feature extraction module 10, a feature conversion module 20, a speech synthesis module 30, a lip-shape generation module 40, and an animation synthesis module 50. Exemplarily:
  • the feature extraction module 10 is configured to: acquire target text data, and acquire phoneme features in the target text data according to a pronunciation dictionary;
  • The feature conversion module 20 is configured to: input the phoneme features into a pre-trained deep neural network model, and output acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, a pronunciation duration, and a pronunciation fundamental frequency;
  • the speech synthesis module 30 is configured to: input the acoustic feature into a speech synthesizer, and output speech data corresponding to the target text data;
  • The lip-shape generation module 40 is configured to: acquire, according to the voice data, the pre-trained tensor model, and the preset speaker identification information, the mouth-shape data corresponding to the voice data and the speaker identification information, wherein the tensor model expresses the correlation between the pronunciation features of the speech data and the lip-position features of the lip-shape data;
  • the animation synthesizing module 50 is configured to: generate a lip animation corresponding to the voice data according to the lip shape data, to display the lip animation while playing the voice data.
  • the present application also provides a voice-based lip animation synthesis method.
  • Referring to FIG. 3, it is a flowchart of a preferred embodiment of the speech-based lip animation synthesis method of the present application. The method may be performed by a device, and the device may be implemented by software and/or hardware; the method of this embodiment is described below with the voice-based lip animation synthesis device as the execution subject.
  • the voice-based lip animation synthesis method includes:
  • Step S10: Acquire target text data, and acquire phoneme features in the target text data according to the pronunciation dictionary.
  • Step S20: Input the phoneme features into a pre-trained deep neural network model, and output acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, a pronunciation duration, and a pronunciation fundamental frequency.
  • Step S30: Input the acoustic features into a speech synthesizer, and output speech data corresponding to the target text data.
  • the target text data is converted into voice data through a pre-established deep neural network model, and the voice data is converted into the mouth data through a pre-established tensor model.
  • Specifically, the target text data to be synthesized is obtained, and the target text data is split into words or characters by a word segmentation tool; the split words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features. For Chinese, the phonemes include initial (consonant) phonemes and final (vowel) phonemes.
  • The phoneme features mainly include the following: the pronunciation feature of the current phoneme, the pronunciation feature of the previous phoneme, the pronunciation feature of the next phoneme, and the position of the current phoneme in the word.
  • In this embodiment, a deep neural network model expressing the correlation between the phoneme features and the acoustic features is pre-trained, and the above feature vectors are input into the model to obtain the corresponding acoustic features; the acoustic features include the Mel-frequency cepstral coefficient (MFCC) features, the pronunciation duration, and the pronunciation fundamental frequency of each sound.
  • Finally, the MFCC features, the pronunciation duration, and the pronunciation fundamental frequency are synthesized by a speech synthesizer to obtain the speech signal.
  • Before the deep neural network model is applied in this embodiment, the model needs to be trained.
  • Specifically, corpora are collected and a sample library is constructed based on the corpus of at least one speaker, where each corpus includes voice data together with the text data and mouth-shape data corresponding to the voice data. That is, one or more speakers read the same text data, the resulting voice data and the corresponding mouth-shape data are recorded, and a sample library is established; the mouth-shape data here are physiological electromagnetic articulography (EMA) data obtained by capturing mouth-shape motion change information, which can reflect the mouth shapes of the speaker's pronunciation.
  • the deep neural network model is trained according to the text data in the sample library and the voice data, and the model parameters of the deep neural network model are acquired.
  • The pronunciation duration can be predicted from the length features and the syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as the pitch and the stress position in the phoneme features.
  • Step S40: Acquire, according to the voice data, the pre-trained tensor model, and the preset speaker identification information, the mouth-shape data corresponding to the voice data and the speaker identification information, where the tensor model expresses the correlation between the pronunciation features of the speech data and the lip-position features of the lip-shape data.
  • The lip-shape data in this embodiment are physiological electromagnetic articulography (EMA) data obtained by capturing mouth-shape motion change information; the EMA data mainly include the coordinate information of a specific mouth shape and the corresponding mouth-shape image.
  • In this embodiment, the mouth-position features in the mouth-shape data are used directly; the mouth-position features mainly include the coordinate information of the following positions: the tongue tip, the tongue body, the tongue dorsum, the upper lip, the lower lip, the upper incisors, and the lower incisors.
  • In this embodiment, a tensor model expressing the correlation between the acoustic features and the mouth-shape data is pre-trained; the tensor model is a third-order tensor model whose three dimensions correspond to the pronunciation features, the lip-shape data, and the speaker identification information, respectively.
  • The third-order tensor model is trained to obtain the model parameters of the third-order tensor model.
  • The third-order tensor model in the present embodiment is constructed and trained as follows. The set of pronunciation features is taken as one parameter space, and the set of lip-shape data corresponding to the pronunciation features is taken as another parameter space; based on this multilinear-space representation, a third-order tensor is constructed whose three dimensions correspond to the acoustic features, the lip-shape data, and the speaker identification information, respectively. In its expression, the left-hand side contains the model parameters to be solved, mainly the bases of the two parameter spaces and the weight of each feature, while the right-hand side is the feature input used when training the model, namely the pronunciation and mouth-shape features extracted from the text data and mouth-shape data in the sample library; C denotes the core tensor, and μ denotes the mouth-position information averaged over different speakers. Taking the sound "a" as an example, the corresponding μ is the average of the different speakers' mouth-position information when producing the "a" sound.
  • In this embodiment, the third-order tensor model is trained using a high-order singular value decomposition (HOSVD) algorithm to solve the model parameters on the left-hand side of the above expression.
  • In this embodiment, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the lip-shape data corresponding to the speech data. That is, when the sample library used to train the third-order tensor model contains the corpora of multiple speakers, the user can select the speaker identification information in advance, and the finally generated mouth-shape data will be closer to that speaker's mouth-shape data.
  • Step S50: Generate a lip animation corresponding to the voice data according to the lip-shape data, so as to display the lip animation while the voice data is played.
  • In this embodiment, the deep neural network model is used to model the mapping between the phoneme features and the acoustic features. This mapping is a nonlinear problem, and the deep neural network achieves better feature mining and expression, so that the speech synthesis system obtains more accurate and more natural output results; by constructing the tensor model that expresses the correlation between the pronunciation features and the lip-shape features, mouth-shape data that match the acquired speech and look realistic can be obtained.
  • The lip-shape data are used to realize dynamic display of the lip shapes while the voice data is played.
  • The speech-based lip animation synthesis method proposed in this embodiment obtains the phoneme features in the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the acoustic features corresponding to the phoneme features, the acoustic features including the MFCC features, the pronunciation duration, and the pronunciation fundamental frequency. These acoustic features are input into a speech synthesizer to obtain the speech data corresponding to the target text data; according to the speech data, the pre-trained tensor model, and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information are acquired, and a mouth-shape animation corresponding to the speech data is generated from the mouth-shape data, so that the lip animation is displayed while the voice data is played.
  • This scheme uses the deep neural network model to transform the target text data into acoustic features, which achieves better feature mining and makes the speech synthesis system produce more accurate and more natural output; at the same time, through the tensor model that expresses the correlation between the acoustic features and the mouth-shape data, the synthesized voice data is converted into the corresponding mouth-shape data, and a mouth-shape animation corresponding to the target text data is generated from the mouth-shape data. This solves the prior-art technical problem that a realistic mouth-shape animation matching the synthesized voice data could not be displayed.
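  • Tying steps S10 through S50 together, a minimal driver could look like the sketch below. It simply composes the illustrative helpers sketched earlier (text_to_phonemes, phoneme_features, PhonemeToAcoustic, synthesize_from_acoustic_features, play_with_lip_animation), and lip_from_tensor_model is a hypothetical placeholder for inference with the trained third-order tensor model; none of these names come from the application.

```python
# Hedged end-to-end driver for steps S10-S50; the helper functions are the
# illustrative sketches above, and lip_from_tensor_model is hypothetical.
import numpy as np
import torch

def lip_from_tensor_model(acoustic_frames: np.ndarray, speaker_id: int) -> np.ndarray:
    """Hypothetical stand-in for querying the trained third-order tensor model."""
    rng = np.random.default_rng(speaker_id)
    return 0.5 * rng.standard_normal((len(acoustic_frames), 7, 2))

def synthesize_lip_animation(text: str, speaker_id: int, sr: int = 16000):
    phones = text_to_phonemes(text)                    # S10: phoneme features
    feats = phoneme_features(phones)
    x = torch.randn(len(feats), 64)                    # placeholder feature encoding
    model = PhonemeToAcoustic()
    mfcc, dur, f0 = model(x)                           # S20: acoustic features
    mfcc_np = mfcc.detach().numpy().T                  # (n_mfcc, frames)
    audio = synthesize_from_acoustic_features(mfcc_np, float(dur.sum()), sr)  # S30
    lip_frames = lip_from_tensor_model(mfcc_np.T, speaker_id)                 # S40
    play_with_lip_animation(audio, sr, lip_frames)                            # S50

if __name__ == "__main__":
    synthesize_lip_animation("你好世界", speaker_id=0)
```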
  • An embodiment of the present application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the lip animation synthesis program can be executed by one or more processors to implement the following operations:
  • inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, a pronunciation duration, and a pronunciation fundamental frequency;
  • The technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as the ROM/RAM, magnetic disk, or optical disk described above), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A speech-based lip animation synthesis device and method are provided. The device comprises a memory and a processor. A lip animation synthesis program executable on the processor is stored in the memory. When executed by the processor, the program implements the following steps: acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary (S10); inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features (S20); inputting the acoustic features into a speech synthesizer, and outputting speech data (S30); acquiring mouth-shape data according to the speech data, a pre-trained tensor model, and speaker identification information (S40); and generating a corresponding mouth-shape animation according to the mouth-shape data and the speech data (S50). The device and method solve the prior-art technical problem that a realistic mouth-shape animation matching speech data could not be displayed.
PCT/CN2018/102209 2018-04-12 2018-08-24 Speech-based lip animation synthesis apparatus and method, and readable storage medium WO2019196306A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810327672.1A CN108763190B (zh) 2018-04-12 2018-04-12 Speech-based lip animation synthesis apparatus and method, and readable storage medium
CN201810327672.1 2018-04-12

Publications (1)

Publication Number Publication Date
WO2019196306A1 true WO2019196306A1 (fr) 2019-10-17

Family

ID=63981728

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102209 WO2019196306A1 (fr) 2018-04-12 2018-08-24 Speech-based lip animation synthesis apparatus and method, and readable storage medium

Country Status (2)

Country Link
CN (1) CN108763190B (fr)
WO (1) WO2019196306A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827799A (zh) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 用于处理语音信号的方法、装置、设备和介质
EP3866166A1 (fr) * 2020-02-13 2021-08-18 Baidu Online Network Technology (Beijing) Co., Ltd. Procédé et appareil permettant de prédire une fonctionnalité en forme de bouche, dispositif électronique, support de stockage et produit programme informatique
CN117173292A (zh) * 2023-09-07 2023-12-05 河北日凌智能科技有限公司 一种基于元音切片的数字人交互方法及装置

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447234B (zh) * 2018-11-14 2022-10-21 腾讯科技(深圳)有限公司 一种模型训练方法、合成说话表情的方法和相关装置
CN109523616B (zh) * 2018-12-04 2023-05-30 科大讯飞股份有限公司 一种面部动画生成方法、装置、设备及可读存储介质
CN111326141A (zh) * 2018-12-13 2020-06-23 南京硅基智能科技有限公司 一种处理获取人声数据的方法
CN109801349B (zh) * 2018-12-19 2023-01-24 武汉西山艺创文化有限公司 一种声音驱动的三维动画角色实时表情生成方法和***
CN109599113A (zh) 2019-01-22 2019-04-09 北京百度网讯科技有限公司 用于处理信息的方法和装置
CN110136698B (zh) * 2019-04-11 2021-09-24 北京百度网讯科技有限公司 用于确定嘴型的方法、装置、设备和存储介质
CN110189394B (zh) * 2019-05-14 2020-12-29 北京字节跳动网络技术有限公司 口型生成方法、装置及电子设备
CN110288682B (zh) * 2019-06-28 2023-09-26 北京百度网讯科技有限公司 用于控制三维虚拟人像口型变化的方法和装置
CN112181127A (zh) * 2019-07-02 2021-01-05 上海浦东发展银行股份有限公司 用于人机交互的方法和装置
WO2021127821A1 (fr) * 2019-12-23 2021-07-01 深圳市优必选科技股份有限公司 Procédé d'apprentissage de modèle de synthèse vocale, dispositif informatique et support de stockage
CN110992926B (zh) * 2019-12-26 2022-06-10 标贝(北京)科技有限公司 语音合成方法、装置、***和存储介质
CN111340920B (zh) * 2020-03-02 2024-04-09 长沙千博信息技术有限公司 一种语义驱动的二维动画自动生成方法
CN111698552A (zh) * 2020-05-15 2020-09-22 完美世界(北京)软件科技发展有限公司 一种视频资源的生成方法和装置
CN112331184B (zh) * 2020-10-29 2024-03-15 网易(杭州)网络有限公司 语音口型同步方法、装置、电子设备及存储介质
CN112927712B (zh) * 2021-01-25 2024-06-04 网易(杭州)网络有限公司 视频生成方法、装置和电子设备
CN112837401B (zh) * 2021-01-27 2024-04-09 网易(杭州)网络有限公司 一种信息处理方法、装置、计算机设备及存储介质
CN113079328B (zh) * 2021-03-19 2023-03-28 北京有竹居网络技术有限公司 视频生成方法和装置、存储介质和电子设备
CN113314094B (zh) * 2021-05-28 2024-05-07 北京达佳互联信息技术有限公司 唇形模型的训练方法和装置及语音动画合成方法和装置
CN113707124A (zh) * 2021-08-30 2021-11-26 平安银行股份有限公司 话术语音的联动播报方法、装置、电子设备及存储介质
CN113870396B (zh) * 2021-10-11 2023-08-15 北京字跳网络技术有限公司 一种口型动画生成方法、装置、计算机设备及存储介质
CN114420088A (zh) * 2022-01-20 2022-04-29 安徽淘云科技股份有限公司 一种展示方法及其相关设备
CN114581567B (zh) * 2022-05-06 2022-08-02 成都市谛视无限科技有限公司 一种声音驱动虚拟形象口型方法、装置及介质
CN116257762B (zh) * 2023-05-16 2023-07-14 世优(北京)科技有限公司 深度学习模型的训练方法及控制虚拟形象口型变化的方法
CN117894064A (zh) * 2023-12-11 2024-04-16 中新金桥数字科技(北京)有限公司 一种基于遍历声母韵母及整体发音的训练的口型对齐方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312930A1 (en) * 1997-08-05 2008-12-18 At&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
CN104361620A (zh) * 2014-11-27 2015-02-18 韩慧健 一种基于综合加权算法的口型动画合成方法
US9262857B2 (en) * 2013-01-16 2016-02-16 Disney Enterprises, Inc. Multi-linear dynamic hair or clothing model with efficient collision handling
CN106297792A (zh) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 一种语音口型动画的识别方法及装置
CN106531150A (zh) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 一种基于深度神经网络模型的情感合成方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312930A1 (en) * 1997-08-05 2008-12-18 At&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
US9262857B2 (en) * 2013-01-16 2016-02-16 Disney Enterprises, Inc. Multi-linear dynamic hair or clothing model with efficient collision handling
CN104361620A (zh) * 2014-11-27 2015-02-18 韩慧健 一种基于综合加权算法的口型动画合成方法
CN106297792A (zh) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 一种语音口型动画的识别方法及装置
CN106531150A (zh) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 一种基于深度神经网络模型的情感合成方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GRALEWSKI, L. ET AL.: "Using a Tensor Framework for the Analysis of Facial Dynamics", 7TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FGR06, 24 April 2006 (2006-04-24), pages 217 - 222, XP010911558, DOI: 10.1109/FGR.2006.108 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827799A (zh) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 用于处理语音信号的方法、装置、设备和介质
CN110827799B (zh) * 2019-11-21 2022-06-10 百度在线网络技术(北京)有限公司 用于处理语音信号的方法、装置、设备和介质
EP3866166A1 (fr) * 2020-02-13 2021-08-18 Baidu Online Network Technology (Beijing) Co., Ltd. Procédé et appareil permettant de prédire une fonctionnalité en forme de bouche, dispositif électronique, support de stockage et produit programme informatique
US11562732B2 (en) 2020-02-13 2023-01-24 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for predicting mouth-shape feature, and electronic device
CN117173292A (zh) * 2023-09-07 2023-12-05 河北日凌智能科技有限公司 一种基于元音切片的数字人交互方法及装置

Also Published As

Publication number Publication date
CN108763190B (zh) 2019-04-02
CN108763190A (zh) 2018-11-06

Similar Documents

Publication Publication Date Title
WO2019196306A1 (fr) Speech-based lip animation synthesis apparatus and method, and readable storage medium
CN110688911B (zh) 视频处理方法、装置、***、终端设备及存储介质
CN106575500B (zh) 基于面部结构合成话音的方法和装置
US9361722B2 (en) Synthetic audiovisual storyteller
Lee et al. MMDAgent—A fully open-source toolkit for voice interaction systems
WO2017067206A1 (fr) Procédé d'apprentissage de plusieurs modèles acoustiques personnalisés, et procédé et dispositif de synthèse de la parole
KR102116309B1 (ko) 가상 캐릭터와 텍스트의 동기화 애니메이션 출력 시스템
WO2019056500A1 (fr) Appareil électronique, procédé de synthèse vocale, et support de stockage lisible par ordinateur
JP6206960B2 (ja) 発音動作可視化装置および発音学習装置
CN111145777A (zh) 一种虚拟形象展示方法、装置、电子设备及存储介质
JP2018146803A (ja) 音声合成装置及びプログラム
JP5913394B2 (ja) 音声同期処理装置、音声同期処理プログラム、音声同期処理方法及び音声同期システム
CN109949791A (zh) 基于hmm的情感语音合成方法、装置及存储介质
Karpov et al. Automatic technologies for processing spoken sign languages
CN112599113B (zh) 方言语音合成方法、装置、电子设备和可读存储介质
WO2024088321A1 (fr) Procédé et appareil de commande de visage d'image virtuelle, dispositif électronique et support
CN114121006A (zh) 虚拟角色的形象输出方法、装置、设备以及存储介质
CN112735371A (zh) 一种基于文本信息生成说话人视频的方法及装置
JP5807921B2 (ja) 定量的f0パターン生成装置及び方法、f0パターン生成のためのモデル学習装置、並びにコンピュータプログラム
TWI574254B (zh) 用於電子系統的語音合成方法及裝置
Mukherjee et al. A Bengali speech synthesizer on Android OS
JP7510562B2 (ja) オーディオデータの処理方法、装置、電子機器、媒体及びプログラム製品
CN112634861B (zh) 数据处理方法、装置、电子设备和可读存储介质
JP2016142936A (ja) 音声合成用データ作成方法、及び音声合成用データ作成装置
JP6475572B2 (ja) 発話リズム変換装置、方法及びプログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914626

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18914626

Country of ref document: EP

Kind code of ref document: A1