CN111161368A - Method for synthesizing human body vocal organ motion image in real time by inputting voice - Google Patents

Method for synthesizing human body vocal organ motion image in real time by inputting voice

Info

Publication number
CN111161368A
Authority
CN
China
Prior art keywords
voice
magnetic resonance
image
nuclear magnetic
resonance image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911277445.3A
Other languages
Chinese (zh)
Inventor
于瑞国
付钊
刘志强
于健
赵满坤
喻梅
王建荣
黄竑垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201911277445.3A
Publication of CN111161368A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 - 2D [Two Dimensional] image generation
    • G06T11/003 - Reconstruction from projections, e.g. tomography
    • G06T11/006 - Inverse problem, transformation from projection-space into object-space, e.g. transform methods, back-projection, algebraic methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 - Feature extraction based on approximation criteria, e.g. principal component analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

A method for synthesizing a moving image of the human vocal organs in real time from input speech, comprising: synchronously acquiring speech data and nuclear magnetic resonance images of vocal organ motion to obtain training data; extracting speech feature vectors; preprocessing the nuclear magnetic resonance images and extracting image feature vectors; establishing a Gaussian mixture model from the speech feature vectors to the nuclear magnetic resonance image feature vectors and calculating the feature vectors of the synthesized image; and reconstructing the nuclear magnetic resonance image. The invention establishes a continuous mapping model from the speech signal to magnetic resonance imaging (MRI) images, so that an image of the human vocal organs, including the motion of the lips, mandible, tongue, throat, soft palate and other parts during phonation, can be synthesized in real time from input speech. It realizes the synthesis of vocal organ nuclear magnetic resonance images for continuous speech, alleviates the difficulty of collecting such images, and has wide application in the field of speech recognition.

Description

Method for synthesizing human body vocal organ motion image in real time by inputting voice
Technical Field
The invention relates to a method for synthesizing a human vocal organ motion image. In particular, it relates to a method for synthesizing such an image in real time from input speech.
Background
There are two main ways of synthesizing vocal organ motion from an acoustic signal. The first is the multi-stream approach, which typically uses an artificial neural network (ANN) to extract articulatory information and then uses the extracted result to replace, or to supplement, the original speech feature vector obtained from the original measurements.
Another way to synthesize vocal organ motion from acoustic signals is to use a frame-to-frame model. A frame-to-frame model does not rely on linguistic knowledge, so it is language-independent and more broadly applicable, but it generally requires a large amount of data for modeling. Mid-sagittal images of a speaker's vocal organs can be obtained in real time by magnetic resonance imaging (MRI). Because MRI data carries a large amount of physiological information about the vocal organs, it can better assist in improving the recognition rate of automatic speech recognition.
In real-world speech recognition scenarios, physiological data of the vocalization process cannot be obtained by direct measurement, yet the motion of the physiological organs during vocalization plays an important role in improving the recognition rate of automatic speech recognition. It is therefore an important task to synthesize real-time information about the movement of these organs during vocalization.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for synthesizing a human vocal organ moving image in real time from input speech that represents the important characteristics of the vocal organs in a nuclear magnetic resonance image relatively well.
The technical scheme adopted by the invention is as follows: a method for synthesizing a moving image of a vocal organ of a human body in real time by inputting voice, comprising the steps of:
1) synchronously acquiring voice data and a nuclear magnetic resonance image of the movement of a sounding organ to obtain training data;
2) extracting a voice feature vector;
3) preprocessing a nuclear magnetic resonance image and extracting an image feature vector;
4) establishing a Gaussian mixture model from the speech feature vectors to the nuclear magnetic resonance image feature vectors, and calculating the feature vectors of the synthesized image;
5) reconstructing the nuclear magnetic resonance image.
The extracting of the voice feature vector in the step 2) is realized by adopting a Mel cepstrum coefficient, and comprises the following steps:
(1) pre-emphasis, filtering the original voice signal;
(2) dividing each voice signal into a short-time frame;
(3) windowing, namely multiplying the voice signal of each frame by a window function in order to increase the continuity of each frame at the left end and the right end of the time domain;
(4) performing fast Fourier transform to convert the voice signal from time domain to frequency domain;
(5) smoothing the voice signals converted into the frequency spectrum by using a Mel filter group, highlighting formants of the original voice and eliminating harmonic waves at the same time;
(6) calculating logarithmic energy from the output of each filter bank;
(7) discrete cosine transform is carried out on the solved logarithmic energy to obtain a Mel frequency cepstrum coefficient, namely a voice characteristic vector;
(8) extracting dynamic differential parameters from the voice feature vector to obtain an expanded voice feature vector.
The preprocessing the nuclear magnetic resonance image and extracting the image feature vector in the step 3) comprises the following steps:
(1) respectively carrying out discrete cosine transform on the acquired nuclear magnetic resonance images to respectively obtain matrixes;
(2) respectively calculating a covariance matrix of each matrix;
(3) obtaining a corresponding projection matrix for each covariance matrix by a singular value decomposition method;
(4) extracting the first k-dimensional principal component analysis feature vectors from each projection matrix to form the image feature vectors.
The step 4) comprises: establishing a Gaussian mixture model to obtain the relation between the voice feature vectors and the image feature vectors.
The nuclear magnetic resonance image reconstruction method in the step 5) comprises the following steps:
(1) inputting the expanded voice feature vector into the Gaussian mixture model to obtain a matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image;
(2) for the first k-dimensional feature vector x_in of the matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image, performing an inverse projection using the formula x_out = U_k × x_in, where x_out is the result of the back projection and U_k is the projection matrix;
(3) performing the inverse discrete cosine transform on the result of the inverse projection to obtain a new matrix, namely the synthesized human vocal organ nuclear magnetic resonance image.
The method for synthesizing a human vocal organ moving image in real time from input speech establishes a continuous mapping model from the speech signal to magnetic resonance imaging (MRI) images. Given input speech, it synthesizes in real time nuclear magnetic resonance images of the human vocal organs, including the motion of the lips, mandible, tongue, throat, soft palate and other parts during phonation, and thus realizes the synthesis of vocal organ nuclear magnetic resonance images for continuous speech.
Drawings
Fig. 1 is a schematic diagram of the reconstruction effect of a PCA image.
Detailed Description
The method for synthesizing a moving image of a vocal organ of a human body in real time by inputting speech according to the present invention will be described in detail with reference to the accompanying drawings.
The invention discloses a method for synthesizing a human vocal organ motion image in real time by inputting voice, which comprises the following steps:
1) synchronously acquiring voice data and a nuclear magnetic resonance image of the movement of a sounding organ to obtain training data;
2) extracting a voice feature vector; in the invention this is realized using Mel-frequency cepstral coefficients (MFCC), comprising:
(1) pre-emphasis, filtering the original voice signal;
(2) dividing each voice signal into a short-time frame;
(3) windowing: the voice frames obtained after framing easily lose the dynamic information of the voice signal, so in order to increase the continuity of each frame at its left and right ends in the time domain, the voice signal of each frame is multiplied by a window function;
(4) performing fast Fourier transform to convert the voice signal from time domain to frequency domain;
(5) smoothing the voice signals converted into the frequency spectrum by using a Mel filter group, highlighting formants of the original voice and eliminating harmonic waves at the same time;
(6) calculating logarithmic energy from the output of each filter bank;
(7) discrete cosine transform is carried out on the solved logarithmic energy to obtain a Mel frequency cepstrum coefficient, namely a voice characteristic vector;
(8) extracting dynamic differential parameters from the voice feature vector to obtain an expanded voice feature vector. The purpose of the dynamic difference parameters is to add the dynamics of speech to the voice feature vector (a compact, library-based sketch of this feature extraction step is given after step 5) below).
3) Preprocessing a nuclear magnetic resonance image and extracting an image feature vector; the method comprises the following steps:
(1) respectively carrying out Discrete Cosine Transform (DCT) on the acquired nuclear magnetic resonance images to respectively obtain matrixes;
(2) respectively calculating a covariance matrix of each matrix;
(3) obtaining a corresponding projection matrix from each covariance matrix by a Singular Value Decomposition (SVD);
(4) extracting the first k-dimensional principal component analysis feature vectors from each projection matrix to form the image feature vectors.
4) Establishing a Gaussian mixture model from the speech feature vectors to the nuclear magnetic resonance image feature vectors and calculating the feature vectors of the synthesized image; specifically, the speech feature vectors and the image feature vectors are combined, and a Gaussian Mixture Model (GMM) is established to obtain the relation between the speech feature vectors and the image feature vectors.
5) Reconstructing the nuclear magnetic resonance image. The method comprises the following steps:
(1) inputting the expanded voice feature vector into the Gaussian mixture model to obtain a matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image;
(2) for the first k-dimensional feature vector x_in of the matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image, performing an inverse projection using the formula x_out = U_k × x_in, which projects the image from the data represented by the first k principal component feature dimensions back onto the original dimensions, where x_out is the result of the back projection and U_k is the projection matrix;
(3) performing the inverse discrete cosine transform on the result of the inverse projection to obtain a new matrix, namely the synthesized human vocal organ nuclear magnetic resonance image.
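As a compact illustration of the speech-side feature extraction in step 2), the following sketch computes 13 MFCCs plus first- and second-order dynamic differences, giving 39 dimensions per frame as in the embodiment described below. The use of the librosa library, the function name and the choice of keeping the native sampling rate are illustrative assumptions and are not prescribed by the invention.

```python
# Hedged sketch: MFCC extraction with dynamic-difference expansion (step 2).
# librosa and all default parameters here are assumptions, not part of the patent.
import numpy as np
import librosa

def extract_expanded_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return per-frame 39-dimensional vectors: MFCC + delta + delta-delta."""
    y, sr = librosa.load(wav_path, sr=None)                   # keep native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)                 # dynamic difference parameters
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T                        # shape: (n_frames, 3 * n_mfcc)
```

The step-by-step construction of the same features (pre-emphasis, framing, windowing, Mel filtering, DCT and differences) is detailed in steps S0101 to S0108 below.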
The embodiment of the invention uses the USC-TIMIT database recorded by the University of Southern California as the source of experimental voice data and of the originally measured vocal organ nuclear magnetic resonance image data. In the experiment, each speaker's 460 sentences of voice data were randomly split into a training set and a test set at a ratio of 8:2. The training set combines the MFCC feature vectors of the voice data with the PCA features of the nuclear magnetic resonance images and is used to train the parameters of the Gaussian mixture model. For the test set, 39-dimensional feature vectors are extracted from the voice data, the trained Gaussian mixture model synthesizes the information contained in the audio, and the result is compared with the MRI images measured by the nuclear magnetic resonance scanner to evaluate the model. The time-averaged Euclidean distance is used for evaluation, and the time-averaged Euclidean distance of the first-order difference is used to evaluate the dynamic information. This evaluation computes the sum of the errors over the multidimensional data, which must be divided by the PCA dimensionality to obtain the average error in a given dimension; in the experiments of the invention the PCA dimensionality is 40. As the number of components in the Gaussian mixture model varies, the time-averaged Euclidean distance between the synthesized PCA feature vectors and the original PCA feature vectors fluctuates, reaching a minimum when the number of GMM components n is 32. The first-order-difference time-averaged Euclidean distance increases slightly as the number of GMM components n increases. The specific experimental process is as follows:
step S0101: the pre-emphasis process uses a FIR high-pass filter to filter the original speech signal. The pre-emphasis process high-pass filter is represented by equation (1). In the formula, μ is a pre-emphasis coefficient, and μ is selected to be 0.97.
H(z) = 1 - μz^(-1)    (1)
Step S0102: each segment of the speech signal is divided into short-time frames. The frame length N is 256 samples, and each frame lasts about 30 ms. Adjacent frames overlap, and the overlap region contains M sampling points, where M is about N/2.
Step S0103: the speech signal of each frame is multiplied by a window function. The invention uses a Hamming window as the window function in the speech feature extraction process. The Hamming window function is w(n) and the framed speech signal is s(n), n = 0, 1, ..., N-1, where N is the frame length. The speech signal multiplied by the Hamming window is s'(n) = s(n) × w(n). Equation (2) gives w(n):
w(n) = (1 - a) - a·cos( 2πn / (N - 1) ),  0 ≤ n ≤ N - 1    (2)
The general value a = 0.46 is used in this example.
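A minimal NumPy sketch of steps S0101 to S0103 is given below, assuming a one-dimensional speech signal; the frame length of 256 samples, roughly 50% overlap and μ = 0.97 follow the text above, while the function names are illustrative assumptions.

```python
# Hedged sketch of steps S0101-S0103: pre-emphasis (eq. 1), framing and Hamming
# windowing (eq. 2). Only NumPy is used; names and defaults are assumptions.
import numpy as np

def preemphasize(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply H(z) = 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1]."""
    return np.append(x[0], x[1:] - mu * x[:-1])

def frame_and_window(x: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Split into overlapping frames (M = N/2 overlap) and apply a Hamming window."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    window = np.hamming(frame_len)      # 0.54 - 0.46 * cos(2*pi*n/(N-1)), i.e. a = 0.46
    return frames * window              # shape: (n_frames, frame_len)
```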
Step S0104: a discrete Fourier transform is applied to each framed and windowed frame to obtain its spectrum, and the squared magnitude of the spectrum gives the power spectrum of the speech signal. The discrete Fourier transform is represented by equation (3):
X(k) = Σ_{n=0}^{N-1} s'(n)·e^(-j2πkn/N),  0 ≤ k ≤ N - 1    (3)
Step S0105: the power spectrum obtained above is passed through a set of Mel-scale triangular filter banks. Equation (4) gives the definition of the frequency response of the triangular filter:
H_m(k) = 0                                  for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))     for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))     for f(m) < k ≤ f(m+1)
H_m(k) = 0                                  for k > f(m+1)    (4)
where f(m) denotes the center frequency of the m-th triangular filter.
step S0106: after the frequency domain signal passes through the Mel filter banks, the logarithmic energy of the output of each filter bank is calculated, and equation (5) is an expression of the logarithmic energy calculation:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ),  0 ≤ m < M    (5)
where M is the number of triangular filters.
step S0107: after obtaining the logarithmic energy result, a Discrete Cosine Transform (DCT) is used to obtain the final MFCC parameters. Equation (6) gives the mathematical expression for the discrete cosine transform:
C(n) = Σ_{m=0}^{M-1} s(m)·cos( πn(m + 0.5) / M ),  n = 1, 2, ..., L    (6)
In this example, MFCC features are extracted with L = 13.
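A minimal sketch of steps S0104 to S0107 follows, taking the windowed frames from the previous sketch; the sampling rate, the number of Mel filters and the use of librosa only to build the triangular filter bank are illustrative assumptions.

```python
# Hedged sketch of steps S0104-S0107: power spectrum (eq. 3), Mel triangular
# filter bank (eq. 4), log energy (eq. 5) and DCT (eq. 6) yielding L = 13 MFCCs.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frames(frames: np.ndarray, sr: int = 16000, n_fft: int = 256,
                     n_mels: int = 26, n_mfcc: int = 13) -> np.ndarray:
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)                   # per-frame DFT
    power = np.abs(spectrum) ** 2                                     # |X(k)|^2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (n_mels, n_fft//2 + 1)
    log_energy = np.log(power @ mel_fb.T + 1e-10)                     # eq. (5), per frame
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_mfcc]  # eq. (6), keep first L
```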
Step S0108: the features of the speech data are augmented by dynamic difference parameters. Equation (7) gives a mathematical representation of the dynamic difference:
d_t = ( Σ_{k=1}^{K} k·(c_{t+k} - c_{t-k}) ) / ( 2·Σ_{k=1}^{K} k² )    (7)
where d_t represents the result of the t-th first-order difference; c_t represents the Mel cepstral coefficients (MFCC) of the t-th frame of the speech data; Q represents the order of the cepstral coefficients; K represents the time span of the first derivative, and K = 1 here.
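A minimal sketch of step S0108 is shown below, implementing equation (7) with K = 1; padding the edge frames by repetition is an assumption not specified in the text.

```python
# Hedged sketch of step S0108: dynamic difference parameters per eq. (7).
import numpy as np

def delta_features(c: np.ndarray, K: int = 1) -> np.ndarray:
    """c: (n_frames, n_coeff) MFCC matrix; returns the same-shape delta matrix."""
    n = len(c)
    padded = np.pad(c, ((K, K), (0, 0)), mode='edge')      # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(c, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k: K + k + n] - padded[K - k: K - k + n])
    return d / denom
```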
Step S0201: the original input image is subjected to a Discrete Cosine Transform (DCT) operation. Equation (8) gives the Discrete Cosine Transform (DCT) process for a two-dimensional image.
F(u,v) = c(u)·c(v)·Σ_{x=0}^{N-1} Σ_{y=0}^{N-1} f(x,y)·cos[ (2x+1)uπ / (2N) ]·cos[ (2y+1)vπ / (2N) ]    (8)
where
c(u) = √(1/N) for u = 0, and c(u) = √(2/N) for u ≠ 0    (9)
Step S0202: the covariance matrix of each input is calculated to find the relations among the features of each dimension. Equation (10) gives a mathematical representation of the covariance matrix computation.
Σ = (1/m)·Σ_{i=1}^{m} x^(i)·(x^(i))^T    (10)
where m represents the number of training samples and x^(i) represents the i-th training sample.
Step S0203: the projection matrix is obtained by Singular Value Decomposition (SVD). Equation (11) represents singular value decomposition.
[U, S, V] = svd(Σ)    (11)
In equation (11), U represents an n × n projection matrix, where n is the dimension of the original input data.
Step S0204: the first k columns of U are selected as needed to form an n × k matrix U_k, and the first k-dimensional PCA feature vector of the new input data is extracted using equation (12), where x_new is the k-dimensional PCA feature of the new input data.
x_new = (U_k)^T × x    (12)
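A minimal sketch of steps S0201 to S0204 follows, under the assumption that the 2-D DCT coefficients of each MRI frame are flattened into a single vector before the covariance and SVD steps; the choice of k = 40 matches the PCA dimensionality used in the experiments, everything else is illustrative.

```python
# Hedged sketch of steps S0201-S0204: 2-D DCT (eqs. 8-9), covariance (eq. 10),
# SVD (eq. 11) and projection onto the first k principal components (eq. 12).
import numpy as np
from scipy.fftpack import dctn

def image_pca_features(images: np.ndarray, k: int = 40):
    """images: (m, H, W) MRI frames; returns (features of shape (m, k), U_k)."""
    dct_vecs = np.stack([dctn(img, norm='ortho').ravel() for img in images])  # (m, H*W)
    cov = (dct_vecs.T @ dct_vecs) / len(dct_vecs)     # eq. (10): (1/m) * sum x x^T
    U, S, Vt = np.linalg.svd(cov)                     # eq. (11)
    U_k = U[:, :k]                                    # n x k projection matrix
    features = dct_vecs @ U_k                         # eq. (12): x_new = U_k^T x, per row
    return features, U_k
```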
Step S0301: a mapping model from the speech signal feature vectors to the nuclear magnetic resonance image feature vectors is established and constructed with a Gaussian mixture model (GMM). The experimental results show that, under the same experimental conditions, the average time required to train the model is 4771 s (about 1.5 hours) when n = 64 and 16831 s (about 5 hours) when the number of GMM components n = 256. Beyond n = 64, increasing n does not bring a significant improvement. Therefore, training with 64 Gaussian mixture components is appropriate in this case, where n is the number of mixture components.
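A minimal sketch of step S0301 is given below: a joint Gaussian mixture model is fitted on concatenated speech and image feature vectors, and image features are predicted from speech features by the standard conditional-expectation mapping of a joint GMM. The use of scikit-learn, the default of 64 components and the function names are illustrative assumptions.

```python
# Hedged sketch of step S0301: joint GMM over [speech; image] features and the
# conditional-mean mapping used to synthesize image features from speech features.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def fit_joint_gmm(speech_feats: np.ndarray, image_feats: np.ndarray,
                  n_components: int = 64) -> GaussianMixture:
    joint = np.hstack([speech_feats, image_feats])          # (n_frames, dx + dy)
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    return gmm.fit(joint)

def predict_image_features(gmm: GaussianMixture, x: np.ndarray, dx: int) -> np.ndarray:
    """E[y | x] under the joint GMM; x is a (dx,) speech feature vector."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    resp = np.array([w * multivariate_normal.pdf(x, m[:dx], c[:dx, :dx])
                     for w, m, c in zip(weights, means, covs)])
    resp /= resp.sum()                                      # component responsibilities
    y = np.zeros(means.shape[1] - dx)
    for r, m, c in zip(resp, means, covs):
        y += r * (m[dx:] + c[dx:, :dx] @ np.linalg.solve(c[:dx, :dx], x - m[:dx]))
    return y
```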
Step S0401: for the synthesis result, the k-dimensional feature vector x_in is subjected to an inverse projection using equation (13), which projects the image from the data represented by the first k principal component features back onto the original dimensions. The result is x_out.
x_out = U_k × x_in    (13)
Step S0402: an Inverse Discrete Cosine Transform (iDCT) is performed on the synthesized image feature vector whose dimensions were restored to those of the original image by the inverse projection. The inverse discrete cosine transform can be represented by equation (14):
f(x,y) = Σ_{u=0}^{N-1} Σ_{v=0}^{N-1} c(u)·c(v)·F(u,v)·cos[ (2x+1)uπ / (2N) ]·cos[ (2y+1)vπ / (2N) ]    (14)
where
c(u) = √(1/N) for u = 0, and c(u) = √(2/N) for u ≠ 0, and likewise for c(v)    (15)
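A minimal sketch of steps S0401 and S0402 follows, assuming the synthesized PCA vector and the projection matrix U_k from the earlier sketch; the image-shape argument is illustrative.

```python
# Hedged sketch of steps S0401-S0402: back-projection x_out = U_k * x_in (eq. 13)
# followed by the inverse 2-D DCT (eqs. 14-15) to recover the synthesized image.
import numpy as np
from scipy.fftpack import idctn

def reconstruct_image(x_in: np.ndarray, U_k: np.ndarray, shape: tuple) -> np.ndarray:
    """x_in: (k,) synthesized PCA features; U_k: (H*W, k); shape: (H, W)."""
    x_out = U_k @ x_in                                    # eq. (13), back to DCT space
    return idctn(x_out.reshape(shape), norm='ortho')      # eq. (14), inverse 2-D DCT
```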
The experimental results show that the first-order-difference time-averaged Euclidean distance increases slightly with the number of GMM components n. When n is 32 or 64, the synthesized image represents the important features of the vocal organs in the nuclear magnetic resonance image relatively well, as shown in fig. 1.
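A minimal sketch of the evaluation measure used above is shown below, assuming the synthesized and measured PCA feature sequences are aligned frame by frame; the function names and the per-dimension normalization by the PCA dimensionality (40) follow the description but are otherwise illustrative.

```python
# Hedged sketch: time-averaged Euclidean distance and its first-order-difference
# variant, used to compare synthesized and measured PCA feature sequences.
import numpy as np

def time_averaged_euclidean(pred: np.ndarray, target: np.ndarray) -> float:
    """pred, target: (n_frames, dim) aligned PCA feature sequences."""
    return float(np.mean(np.linalg.norm(pred - target, axis=1)))

def time_averaged_euclidean_delta(pred: np.ndarray, target: np.ndarray) -> float:
    """The same measure applied to first-order differences (dynamic error)."""
    return time_averaged_euclidean(np.diff(pred, axis=0), np.diff(target, axis=0))

# Average error in a given dimension: divide by the PCA dimensionality (40 here), e.g.
# per_dim_error = time_averaged_euclidean(pred, target) / 40
```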

Claims (5)

1. A method for synthesizing a moving image of a vocal organ of a human body in real time by inputting voice, comprising the steps of:
1) synchronously acquiring voice data and a nuclear magnetic resonance image of the movement of a sounding organ to obtain training data;
2) extracting a voice feature vector;
3) preprocessing a nuclear magnetic resonance image and extracting an image feature vector;
4) establishing a Gaussian mixture model from the speech feature vectors to the nuclear magnetic resonance image feature vectors, and calculating the feature vectors of the synthesized image;
5) reconstructing the nuclear magnetic resonance image.
2. The method for synthesizing a moving image of a vocal organ of a human body in real time by inputting speech according to claim 1, wherein the extracting of the speech feature vector of step 2) is performed by using mel cepstral coefficients, and comprises:
(1) pre-emphasis, filtering the original voice signal;
(2) dividing each voice signal into a short-time frame;
(3) windowing, namely multiplying the voice signal of each frame by a window function in order to increase the continuity of each frame at the left end and the right end of the time domain;
(4) performing fast Fourier transform to convert the voice signal from time domain to frequency domain;
(5) smoothing the voice signals converted into the frequency spectrum by using a Mel filter group, highlighting formants of the original voice and eliminating harmonic waves at the same time;
(6) calculating logarithmic energy from the output of each filter bank;
(7) discrete cosine transform is carried out on the solved logarithmic energy to obtain a Mel frequency cepstrum coefficient, namely a voice characteristic vector;
(8) extracting dynamic differential parameters from the voice feature vector to obtain an expanded voice feature vector.
3. The method for synthesizing a human vocal organ moving image in real time through input speech according to claim 1, wherein the preprocessing the magnetic resonance image and extracting the image feature vector in step 3) comprises:
(1) respectively carrying out discrete cosine transform on the acquired nuclear magnetic resonance images to respectively obtain matrixes;
(2) respectively calculating a covariance matrix of each matrix;
(3) obtaining a corresponding projection matrix for each covariance matrix by a singular value decomposition method;
(4) extracting the first k-dimensional principal component analysis feature vectors from each projection matrix to form the image feature vectors.
4. The method for synthesizing a moving image of a vocal organ of a human body in real time by an input voice according to claim 1, wherein the step 4) comprises: establishing a Gaussian mixture model to obtain the relation between the voice feature vectors and the image feature vectors.
5. The method for synthesizing a moving image of a vocal organ of a human body in real time by using an input voice according to claim 1, wherein the reconstructing of the magnetic resonance image in the step 5) comprises:
(1) inputting the expanded voice feature vector into the Gaussian mixture model to obtain a matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image;
(2) for the first k-dimensional feature vector x_in of the matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image, performing an inverse projection using the formula x_out = U_k × x_in, where x_out is the result of the back projection and U_k is the projection matrix;
(3) performing the inverse discrete cosine transform on the result of the inverse projection to obtain a new matrix, namely the synthesized human vocal organ nuclear magnetic resonance image.
CN201911277445.3A 2019-12-13 2019-12-13 Method for synthesizing human body vocal organ motion image in real time by inputting voice Pending CN111161368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911277445.3A CN111161368A (en) 2019-12-13 2019-12-13 Method for synthesizing human body vocal organ motion image in real time by inputting voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911277445.3A CN111161368A (en) 2019-12-13 2019-12-13 Method for synthesizing human body vocal organ motion image in real time by inputting voice

Publications (1)

Publication Number Publication Date
CN111161368A true CN111161368A (en) 2020-05-15

Family

ID=70557031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911277445.3A Pending CN111161368A (en) 2019-12-13 2019-12-13 Method for synthesizing human body vocal organ motion image in real time by inputting voice

Country Status (1)

Country Link
CN (1) CN111161368A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102820030A (en) * 2012-07-27 2012-12-12 中国科学院自动化研究所 Vocal organ visible speech synthesis system
CN105551071A (en) * 2015-12-02 2016-05-04 中国科学院计算技术研究所 Method and system of face animation generation driven by text voice
CN106782503A (en) * 2016-12-29 2017-05-31 天津大学 Automatic speech recognition method based on physiologic information in phonation

Similar Documents

Publication Publication Date Title
Marafioti et al. Adversarial generation of time-frequency features with application in audio synthesis
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
Le Cornu et al. Reconstructing intelligible audio speech from visual speech features.
Rajan et al. Using group delay functions from all-pole models for speaker recognition
Su et al. Bandwidth extension is all you need
Gurbuz et al. Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition
Porras et al. DNN-based acoustic-to-articulatory inversion using ultrasound tongue imaging
CN110428812B (en) Method for synthesizing tongue ultrasonic video according to voice information based on dynamic time programming
CN112634920A (en) Method and device for training voice conversion model based on domain separation
CN108198566B (en) Information processing method and device, electronic device and storage medium
Adiga et al. Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN.
Taguchi et al. Articulatory-to-speech Conversion Using Bi-directional Long Short-term Memory.
Chien et al. Evaluation of glottal inverse filtering algorithms using a physiologically based articulatory speech synthesizer
Yu et al. Reconstructing speech from real-time articulatory MRI using neural vocoders
Csapó Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract
Douros et al. Towards a method of dynamic vocal tract shapes generation by combining static 3D and dynamic 2D MRI speech data
Saleem et al. E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis
Wu et al. Deep Speech Synthesis from MRI-Based Articulatory Representations
CN110176243A (en) Sound enhancement method, model training method, device and computer equipment
Sanches Noise-compensated hidden Markov models
Douros et al. Using silence MR image to synthesise dynamic MRI vocal tract data of CV
Veena et al. Study of vocal tract shape estimation techniques for children
CN111161368A (en) Method for synthesizing human body vocal organ motion image in real time by inputting voice
Shandiz et al. Improving neural silent speech interface models by adversarial training
Ou et al. Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200515