CN111161368A - Method for synthesizing human body vocal organ motion image in real time by inputting voice - Google Patents
Method for synthesizing human body vocal organ motion image in real time by inputting voice
- Publication number
- CN111161368A CN111161368A CN201911277445.3A CN201911277445A CN111161368A CN 111161368 A CN111161368 A CN 111161368A CN 201911277445 A CN201911277445 A CN 201911277445A CN 111161368 A CN111161368 A CN 111161368A
- Authority
- CN
- China
- Prior art keywords
- voice
- magnetic resonance
- image
- nuclear magnetic
- resonance image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/003—Reconstruction from projections, e.g. tomography
- G06T11/006—Inverse problem, transformation from projection-space into object-space, e.g. transform methods, back-projection, algebraic methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
A method for synthesizing a moving image of the human vocal organs in real time from input speech, comprising: synchronously acquiring speech data and magnetic resonance images of vocal organ motion to obtain training data; extracting speech feature vectors; preprocessing the magnetic resonance images and extracting image feature vectors; establishing a Gaussian mixture model mapping the speech feature vectors to the magnetic resonance image feature vectors, and computing the feature vectors of the synthesized image; and reconstructing the magnetic resonance image. The invention establishes a continuous mapping model from the speech signal to a magnetic resonance imaging (MRI) image, so that magnetic resonance images of the human vocal organs (including the motion of the lips, mandible, tongue, pharynx, soft palate and other parts during vocalization) can be synthesized in real time from continuous input speech. This solves the problem that vocal organ magnetic resonance images are difficult to collect, and has wide application in the field of speech recognition.
Description
Technical Field
The invention relates to methods for synthesizing a human vocal organ motion image, and in particular to a method for synthesizing such an image in real time from input speech.
Background
There are two main approaches to synthesizing vocal organ motion from an acoustic signal. The first is the multi-stream method, which typically uses an artificial neural network (ANN) to estimate articulatory features and then uses the extracted result to replace, or to supplement, the original speech feature vector in the original measurement.
The second approach uses a frame-to-frame model. Because frame-to-frame modeling requires no linguistic knowledge, the model is language-independent and more widely applicable, although it generally needs a large amount of data. The mid-sagittal image of a speaker's vocal organs can be acquired in real time by magnetic resonance imaging (MRI). Because MRI data carry rich physiological information about the vocal organs, they can effectively help improve the recognition rate of automatic speech recognition.
In real-world speech recognition scenarios, physiological data of the vocalization process cannot be obtained by direct measurement, yet the motion of the physiological organs during vocalization plays an important role in improving the recognition rate of automatic speech recognition. Synthesizing real-time information about vocal organ motion therefore becomes an important task.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for synthesizing a moving image of the human vocal organs in real time from input speech, one whose synthesized images represent the important features of the vocal organs in a magnetic resonance image relatively well.
The technical scheme adopted by the invention is as follows: a method for synthesizing a moving image of a vocal organ of a human body in real time by inputting voice, comprising the steps of:
1) synchronously acquiring speech data and magnetic resonance images of vocal organ motion to obtain training data;
2) extracting speech feature vectors;
3) preprocessing the magnetic resonance images and extracting image feature vectors;
4) establishing a Gaussian mixture model mapping the speech feature vectors to the magnetic resonance image feature vectors, and computing the feature vectors of the synthesized image;
5) reconstructing the magnetic resonance image.
The speech feature vector extraction of step 2) uses Mel-frequency cepstral coefficients and comprises:
(1) pre-emphasis: filtering the original speech signal;
(2) framing: dividing the speech signal into short-time frames;
(3) windowing: multiplying each frame of the speech signal by a window function to increase the continuity of the frame at both ends in the time domain;
(4) fast Fourier transform: converting the speech signal from the time domain to the frequency domain;
(5) smoothing the resulting spectrum with a Mel filter bank, which highlights the formants of the original speech while suppressing harmonics;
(6) computing the logarithmic energy of each filter-bank output;
(7) applying a discrete cosine transform to the logarithmic energies to obtain the Mel-frequency cepstral coefficients, i.e. the speech feature vector;
(8) extracting dynamic difference parameters from the speech feature vector to obtain an extended speech feature vector.
The preprocessing of the magnetic resonance images and extraction of image feature vectors in step 3) comprises:
(1) applying a discrete cosine transform to each acquired magnetic resonance image to obtain a coefficient matrix;
(2) computing the covariance matrix of each coefficient matrix;
(3) obtaining the corresponding projection matrix from each covariance matrix by singular value decomposition;
(4) extracting the first k principal component analysis dimensions from each projection to form the image feature vector.
Step 4) comprises: establishing a Gaussian mixture model to obtain the relation between the speech feature vector and the image feature vector.
The magnetic resonance image reconstruction of step 5) comprises:
(1) inputting the extended speech feature vector into the Gaussian mixture model to obtain a matrix that preliminarily synthesizes the vocal organ magnetic resonance image;
(2) applying the inverse projection x_out = U_k × x_in to the first-k-dimensional feature vector x_in of that matrix, where x_out is the back-projection result and U_k is the projection matrix;
(3) applying an inverse discrete cosine transform to the back-projection result to obtain a new matrix, which is the synthesized vocal organ magnetic resonance image.
By establishing a continuous mapping model from the speech signal to a magnetic resonance imaging (MRI) image, the method synthesizes, in real time from continuous input speech, magnetic resonance images of the human vocal organs (lips, mandible, tongue, pharynx, soft palate and other parts) as they move during vocalization.
Drawings
Fig. 1 is a schematic diagram of the reconstruction effect of a PCA image.
Detailed Description
The method for synthesizing a moving image of a vocal organ of a human body in real time by inputting speech according to the present invention will be described in detail with reference to the accompanying drawings.
The invention discloses a method for synthesizing a human vocal organ motion image in real time by inputting voice, which comprises the following steps:
1) synchronously acquiring speech data and magnetic resonance images of vocal organ motion to obtain training data;
2) extracting speech feature vectors, implemented here with Mel-frequency cepstral coefficients (MFCC):
(1) pre-emphasis: filtering the original speech signal;
(2) framing: dividing the speech signal into short-time frames;
(3) windowing: the frames obtained after framing easily lose the dynamic information of the speech signal, so each frame is multiplied by a window function to increase its continuity at both ends in the time domain;
(4) fast Fourier transform: converting the speech signal from the time domain to the frequency domain;
(5) smoothing the resulting spectrum with a Mel filter bank, which highlights the formants of the original speech while suppressing harmonics;
(6) computing the logarithmic energy of each filter-bank output;
(7) applying a discrete cosine transform to the logarithmic energies to obtain the Mel-frequency cepstral coefficients, i.e. the speech feature vector;
(8) extracting dynamic difference parameters from the speech feature vector to obtain an extended speech feature vector. The purpose of the dynamic difference parameters is to add the dynamics of the speech to the feature vector.
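The MFCC steps above can be sketched in Python with NumPy. The frame length (256), roughly 50% overlap, and 13 cepstral coefficients follow the values in this embodiment; the sample rate, filter count, and all function names are illustrative assumptions, not values from the original disclosure.

```python
import numpy as np

def mfcc(signal, sr=16000, mu=0.97, frame_len=256, n_filters=26, n_ceps=13):
    """Sketch of the MFCC pipeline described in steps (1)-(7)."""
    # (1) pre-emphasis: y[n] = x[n] - mu * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # (2) framing with ~50% overlap between adjacent frames
    hop = frame_len // 2
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i*hop : i*hop + frame_len] for i in range(n_frames)])
    # (3) Hamming window applied to every frame
    frames *= np.hamming(frame_len)
    # (4) FFT, then squared modulus -> power spectrum
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # (5) triangular Mel filter bank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # (6) logarithmic energy of each filter output
    log_e = np.log(power @ fbank.T + 1e-10)
    # (7) DCT-II of the log energies -> cepstral coefficients
    m_idx = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), m_idx + 0.5) / n_filters)
    return log_e @ dct.T

feat = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
print(feat.shape)  # (124, 13): one 13-dim vector per frame
```

Step (8), the dynamic-difference extension, is applied afterwards to these per-frame vectors.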
3) Preprocessing the magnetic resonance images and extracting image feature vectors:
(1) applying a discrete cosine transform (DCT) to each acquired magnetic resonance image to obtain a coefficient matrix;
(2) computing the covariance matrix of each coefficient matrix;
(3) obtaining the corresponding projection matrix from each covariance matrix by singular value decomposition (SVD);
(4) extracting the first k principal component analysis dimensions from each projection to form the image feature vector.
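The image-feature steps above can be sketched as follows. This is a minimal NumPy sketch assuming flattened DCT coefficients, random stand-in images, and illustrative function names; the image size is hypothetical.

```python
import numpy as np

def dct2(img):
    """Orthonormal 2-D DCT via a separable 1-D DCT-II matrix."""
    n = img.shape[0]
    k = np.arange(n)
    C = np.cos(np.pi * np.outer(k, 2 * k + 1) / (2 * n)) * np.sqrt(2 / n)
    C[0] /= np.sqrt(2)
    return C @ img @ C.T

def image_features(images, k=40):
    """Steps (1)-(4): DCT each image, then PCA on the flattened coefficients."""
    X = np.stack([dct2(im).ravel() for im in images])   # (m, n_pixels)
    X = X - X.mean(axis=0)                              # center the data
    cov = X.T @ X / len(images)                         # covariance matrix
    U, S, Vt = np.linalg.svd(cov, hermitian=True)       # SVD -> projection basis
    Uk = U[:, :k]                                       # first k components
    return X @ Uk, Uk                                   # features and projection

rng = np.random.default_rng(0)
imgs = rng.standard_normal((50, 16, 16))                # stand-in for MRI frames
feats, Uk = image_features(imgs, k=40)
print(feats.shape, Uk.shape)  # (50, 40) (256, 40)
```

The returned `Uk` is the projection matrix reused later for the inverse projection in step 5).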
4) Establishing a Gaussian mixture model mapping the speech feature vectors to the magnetic resonance image feature vectors, and computing the feature vectors of the synthesized image. Specifically, the speech feature vectors and image feature vectors are combined, and a Gaussian mixture model (GMM) is established to obtain the relation between them.
5) Reconstructing the magnetic resonance image:
(1) inputting the extended speech feature vector into the Gaussian mixture model to obtain a matrix that preliminarily synthesizes the vocal organ magnetic resonance image;
(2) applying the inverse projection x_out = U_k × x_in to the first-k-dimensional feature vector x_in of that matrix, which projects the image from its representation in the first k principal components back onto the original dimension; x_out is the back-projection result and U_k is the projection matrix;
(3) applying an inverse discrete cosine transform to the back-projection result to obtain a new matrix, which is the synthesized vocal organ magnetic resonance image.
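The back-projection and inverse DCT of steps (2)-(3) can be sketched as a round trip. The image size, the random stand-in projection basis, and all function names are illustrative assumptions; only the two operations themselves follow the text.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (C @ img @ C.T is the forward 2-D DCT)."""
    k = np.arange(n)
    C = np.cos(np.pi * np.outer(k, 2 * k + 1) / (2 * n)) * np.sqrt(2 / n)
    C[0] /= np.sqrt(2)
    return C

def reconstruct(x_in, Uk, shape):
    """Steps (2)-(3): back-project the k-dim feature vector onto the
    original DCT dimension (x_out = Uk @ x_in), then inverse-DCT."""
    x_out = Uk @ x_in                        # inverse projection
    C = dct_matrix(shape[0])
    return C.T @ x_out.reshape(shape) @ C    # inverse 2-D DCT: f = C^T F C

# round-trip sanity check with a hypothetical 16x16 "image" and k = 40
rng = np.random.default_rng(2)
img = rng.standard_normal((16, 16))
C = dct_matrix(16)
coef = (C @ img @ C.T).ravel()               # forward DCT coefficients
Uk, _ = np.linalg.qr(rng.standard_normal((256, 40)))  # stand-in projection
x_in = Uk.T @ coef                           # k-dim feature vector
approx = reconstruct(x_in, Uk, (16, 16))     # lossy (rank-k) reconstruction
```

Because only k of the 256 DCT dimensions survive, the reconstruction is lossy, which is exactly the trade-off the PCA step accepts.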
The embodiment uses the USC-TIMIT database recorded by the University of Southern California as the source of experimental speech data and of the measured vocal organ MRI data. For each speaker's 460 sentences of speech data, sentences were randomly split 8:2 into a training set and a test set. The training set combines the MFCC feature vectors of the speech data with the PCA features of the magnetic resonance images and is used to train the parameters of the Gaussian mixture model. For the test set, 39-dimensional feature vectors are extracted from the speech data, the trained Gaussian mixture model synthesizes images from the information contained in the audio, and the result is compared with the MRI images measured by the scanner to evaluate the model.
The time-averaged Euclidean distance is used as the error measure, and the Euclidean distance of the first-order difference is used to evaluate the dynamic information. This measure sums the errors over all dimensions of the data, so it is divided by the PCA dimensionality to obtain the average per-dimension error; in these experiments the PCA dimensionality is 40. As the number of components in the Gaussian mixture model varies, the time-averaged Euclidean error between the synthesized and original PCA feature vectors fluctuates, reaching its minimum when the number of GMM components n is 32. The time-averaged Euclidean error of the first-order difference increases slightly as n grows. The specific experimental procedure is as follows:
step S0101: the pre-emphasis process uses a FIR high-pass filter to filter the original speech signal. The pre-emphasis process high-pass filter is represented by equation (1). In the formula, μ is a pre-emphasis coefficient, and μ is selected to be 0.97.
H(Z)=1-μz-1(1)
Step S0102: each speech signal is divided into short-time frames. The frame length N is 256, giving a frame duration of about 30 ms. Adjacent frames overlap by M sampling points, where M is about N/2.
Step S0103: each frame of the speech signal is multiplied by a window function; this embodiment uses a Hamming window for speech feature extraction. Let the Hamming window be w(n) and the framed speech signal be s(n), n = 0, 1, ..., N-1, where N is the frame length. The windowed speech signal is S'(n) = s(n) × w(n), with w(n) given by equation (2):

w(n) = (1 - a) - a·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1    (2)

This example uses the usual value a = 0.46.
Step S0104: a discrete Fourier transform is applied to each framed and windowed signal to obtain its spectrum, and the squared modulus of the spectrum gives the power spectrum of the speech signal. The discrete Fourier transform is given by equation (3):

X(k) = Σ_{n=0}^{N-1} S'(n) · e^(-j2πkn/N),  k = 0, 1, ..., N - 1    (3)
step S0105: after passing through the energy spectrum obtained in equation 3-4, the obtained energy spectrum is passed through a set of Mel-scale triangular filter banks. Equation (4) gives the definition of the frequency response of the triangular filter:
step S0106: after the frequency domain signal passes through the Mel filter banks, the logarithmic energy of the output of each filter bank is calculated, and equation (5) is an expression of the logarithmic energy calculation:
step S0107: after obtaining the logarithmic energy result, a Discrete Cosine Transform (DCT) is used to obtain the final MFCC parameters. Equation (6) gives the mathematical expression for the discrete cosine transform:
in this example, MFCC features were extracted using L-13.
Step S0108: the speech features are augmented with dynamic difference parameters, computed as in equation (7):

d_t = ( Σ_{k=1}^{K} k · (c_{t+k} - c_{t-k}) ) / ( 2 · Σ_{k=1}^{K} k² )    (7)

where d_t is the t-th first-order difference, c_t is the MFCC of frame t, Q is the order of the cepstral coefficients, and K is the time span of the first derivative, here K = 1.
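The dynamic-difference computation of equation (7) can be sketched directly. Edge handling (repeating the first and last frames) and the second-order extension are common conventions assumed here, not specified by the text.

```python
import numpy as np

def delta(ceps, K=1):
    """Dynamic difference of equation (7):
    d_t = sum_{k=1..K} k*(c_{t+k} - c_{t-k}) / (2 * sum_{k=1..K} k^2),
    with edge frames padded by repetition."""
    padded = np.pad(ceps, ((K, K), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, K + 1))
    T = ceps.shape[0]
    d = np.zeros_like(ceps)
    for k in range(1, K + 1):
        d += k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
    return d / denom

# extend 13-dim MFCCs with first- and second-order differences -> 39 dims
mfcc_feats = np.random.default_rng(3).standard_normal((100, 13))
d1 = delta(mfcc_feats)          # K = 1, as in this embodiment
d2 = delta(d1)                  # delta of the delta
extended = np.hstack([mfcc_feats, d1, d2])
print(extended.shape)  # (100, 39)
```

Stacking the statics with the two difference orders yields the 39-dimensional extended feature vectors used in the experiments.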
Step S0201: a discrete cosine transform (DCT) is applied to the original input image. Equation (8) gives the 2-D DCT of an N × N image f(x, y):

F(u, v) = c(u) · c(v) · Σ_{x=0}^{N-1} Σ_{y=0}^{N-1} f(x, y) · cos((2x+1)uπ / 2N) · cos((2y+1)vπ / 2N)    (8)

where

c(u) = √(1/N) for u = 0, and c(u) = √(2/N) otherwise    (9)
step S0202: and calculating the covariance matrix of each input, and finding out the relation among the characteristics of each bit. Equation (10) gives a mathematical representation of the computation process of the covariance matrix.
m represents the number of training sets, and x (i) represents the ith training sample.
Step S0203: the projection matrix is obtained by singular value decomposition (SVD), as in equation (11):

[U, S, V] = svd(∑)    (11)

In equation (11), U is an n × n projection matrix, where n is the dimension of the original input data.
Step S0204: the first k columns of U are selected as needed to form an n × k matrix U_k, and the first k PCA dimensions of new input data x are extracted using equation (12), where x_new is the k-dimensional PCA feature of x:

x_new = (U_k)^T × x    (12)
Step S0301: a mapping model from the speech feature vectors to the magnetic resonance image feature vectors is constructed using a Gaussian mixture model. Under the same experimental conditions, the average time required to train the model is 4771 s (about 1.5 hours) when the number of GMM components n is 64, and 16831 s (about 5 hours) when n is 256. Beyond n = 64, increasing n brings no significant improvement, so training with a 64-component Gaussian mixture model is appropriate in this setting.
Step S0401: for synthesis, the k-dimensional feature vector x_in is back-projected using equation (13), projecting the image from its k-dimensional principal component representation onto the original dimension; the result is x_out:

x_out = U_k × x_in    (13)
Step S0402: an inverse discrete cosine transform (iDCT) restores the synthesized image, whose dimensions were recovered by the back projection, to the image domain. The inverse transform is given by equation (14):

f(x, y) = Σ_{u=0}^{N-1} Σ_{v=0}^{N-1} c(u) · c(v) · F(u, v) · cos((2x+1)uπ / 2N) · cos((2y+1)vπ / 2N)    (14)

where c(u) = √(1/N) for u = 0 and c(u) = √(2/N) otherwise.
the experimental results show that the first order difference euclidean distance time average error increases slightly with the number of components n of the GMM. When n is 32 or 64, the synthesized image can relatively well represent important features of the vocal organs in the nuclear magnetic resonance image, as shown in fig. 1.
Claims (5)
1. A method for synthesizing a moving image of the human vocal organs in real time from input speech, comprising the steps of:
1) synchronously acquiring speech data and magnetic resonance images of vocal organ motion to obtain training data;
2) extracting speech feature vectors;
3) preprocessing the magnetic resonance images and extracting image feature vectors;
4) establishing a Gaussian mixture model mapping the speech feature vectors to the magnetic resonance image feature vectors, and computing the feature vectors of the synthesized image;
5) reconstructing the magnetic resonance image.
2. The method for synthesizing a moving image of the human vocal organs in real time from input speech according to claim 1, wherein the speech feature vector extraction of step 2) uses Mel-frequency cepstral coefficients and comprises:
(1) pre-emphasis: filtering the original speech signal;
(2) framing: dividing the speech signal into short-time frames;
(3) windowing: multiplying each frame of the speech signal by a window function to increase the continuity of the frame at both ends in the time domain;
(4) fast Fourier transform: converting the speech signal from the time domain to the frequency domain;
(5) smoothing the resulting spectrum with a Mel filter bank, which highlights the formants of the original speech while suppressing harmonics;
(6) computing the logarithmic energy of each filter-bank output;
(7) applying a discrete cosine transform to the logarithmic energies to obtain the Mel-frequency cepstral coefficients, i.e. the speech feature vector;
(8) extracting dynamic difference parameters from the speech feature vector to obtain an extended speech feature vector.
3. The method for synthesizing a moving image of the human vocal organs in real time from input speech according to claim 1, wherein the preprocessing of the magnetic resonance images and extraction of image feature vectors in step 3) comprises:
(1) applying a discrete cosine transform to each acquired magnetic resonance image to obtain a coefficient matrix;
(2) computing the covariance matrix of each coefficient matrix;
(3) obtaining the corresponding projection matrix from each covariance matrix by singular value decomposition;
(4) extracting the first k principal component analysis dimensions from each projection to form the image feature vector.
4. The method for synthesizing a moving image of the human vocal organs in real time from input speech according to claim 1, wherein step 4) comprises: establishing a Gaussian mixture model to obtain the relation between the speech feature vector and the image feature vector.
5. The method for synthesizing a moving image of the human vocal organs in real time from input speech according to claim 1, wherein the magnetic resonance image reconstruction of step 5) comprises:
(1) inputting the extended speech feature vector into the Gaussian mixture model to obtain a matrix that preliminarily synthesizes the vocal organ magnetic resonance image;
(2) applying the inverse projection x_out = U_k × x_in to the first-k-dimensional feature vector x_in of that matrix, where x_out is the back-projection result and U_k is the projection matrix;
(3) applying an inverse discrete cosine transform to the back-projection result to obtain a new matrix, which is the synthesized vocal organ magnetic resonance image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911277445.3A CN111161368A (en) | 2019-12-13 | 2019-12-13 | Method for synthesizing human body vocal organ motion image in real time by inputting voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111161368A true CN111161368A (en) | 2020-05-15 |
Family
ID=70557031
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911277445.3A Pending CN111161368A (en) | 2019-12-13 | 2019-12-13 | Method for synthesizing human body vocal organ motion image in real time by inputting voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111161368A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102820030A (en) * | 2012-07-27 | 2012-12-12 | 中国科学院自动化研究所 | Vocal organ visible speech synthesis system |
CN105551071A (en) * | 2015-12-02 | 2016-05-04 | 中国科学院计算技术研究所 | Method and system of face animation generation driven by text voice |
CN106782503A (en) * | 2016-12-29 | 2017-05-31 | 天津大学 | Automatic speech recognition method based on physiologic information in phonation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200515 |