CN111161368A - Method for synthesizing human body vocal organ motion image in real time by inputting voice - Google Patents

Method for synthesizing human body vocal organ motion image in real time by inputting voice

Info

Publication number
CN111161368A
Authority
CN
China
Prior art keywords
voice
magnetic resonance
image
nuclear magnetic
resonance image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911277445.3A
Other languages
Chinese (zh)
Inventor
于瑞国
付钊
刘志强
于健
赵满坤
喻梅
王建荣
黄竑垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201911277445.3A
Publication of CN111161368A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 - 2D [Two Dimensional] image generation
    • G06T11/003 - Reconstruction from projections, e.g. tomography
    • G06T11/006 - Inverse problem, transformation from projection-space into object-space, e.g. transform methods, back-projection, algebraic methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 - Feature extraction based on approximation criteria, e.g. principal component analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

A method for synthesizing a moving image of the human vocal organs in real time from input speech, comprising: synchronously acquiring speech data and nuclear magnetic resonance images of vocal organ motion to obtain training data; extracting speech feature vectors; preprocessing the nuclear magnetic resonance images and extracting image feature vectors; establishing a Gaussian mixture model from the speech feature vectors to the nuclear magnetic resonance image feature vectors and calculating the feature vectors of the synthesized image; and reconstructing the nuclear magnetic resonance image. The invention establishes a continuous mapping model from the speech signal to magnetic resonance imaging (MRI) images, so that an image of the human vocal organs, including the motion of the lips, mandible, tongue, throat, soft palate and other parts during phonation, can be synthesized in real time from input speech. It realizes the synthesis of vocal organ nuclear magnetic resonance images for continuous speech, alleviates the difficulty of collecting such images, and has wide application in the field of speech recognition.

Description

Method for synthesizing human body vocal organ motion image in real time by inputting voice
Technical Field
The invention relates to a method for synthesizing a human vocal organ motion image. In particular, it relates to a method for synthesizing such an image in real time from input speech.
Background
There are two main ways of synthesizing vocal organ motion from an acoustic signal. The first is the multi-stream approach, which typically uses an artificial neural network (ANN) to extract articulatory information and then uses the extracted result to replace, or to supplement, the original speech feature vector obtained from the original measurements.
Another way to synthesize vocal organ motion from acoustic signals is to use a frame-to-frame model. A frame-to-frame model does not rely on linguistic knowledge, so it is language-independent and more broadly applicable, but it generally requires a large amount of data for modeling. Mid-sagittal images of a speaker's vocal organs can be obtained in real time by magnetic resonance imaging (MRI). Because MRI data carries a large amount of physiological information about the vocal organs, it can better assist in improving the recognition rate of automatic speech recognition.
In real-world speech recognition scenarios, physiological data of the vocalization process cannot be obtained by direct measurement, yet the motion of the physiological organs during vocalization plays an important role in improving the recognition rate of automatic speech recognition. It is therefore an important task to synthesize real-time information about the movement of these organs during vocalization.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for synthesizing a human vocal organ moving image in real time from input speech that represents the important characteristics of the vocal organs in a nuclear magnetic resonance image relatively well.
The technical scheme adopted by the invention is as follows: a method for synthesizing a moving image of a vocal organ of a human body in real time by inputting voice, comprising the steps of:
1) synchronously acquiring voice data and a nuclear magnetic resonance image of the movement of a sounding organ to obtain training data;
2) extracting a voice feature vector;
3) preprocessing a nuclear magnetic resonance image and extracting an image feature vector;
4) establishing a Gaussian mixture model from the speech feature vectors to the nuclear magnetic resonance image feature vectors, and calculating the feature vectors of the synthesized image;
5) reconstructing the nuclear magnetic resonance image.
The extracting of the voice feature vector in the step 2) is realized by adopting a Mel cepstrum coefficient, and comprises the following steps:
(1) pre-emphasis, filtering the original voice signal;
(2) dividing each voice signal into a short-time frame;
(3) windowing, namely multiplying the voice signal of each frame by a window function in order to increase the continuity of each frame at the left end and the right end of the time domain;
(4) performing fast Fourier transform to convert the voice signal from time domain to frequency domain;
(5) smoothing the voice signals converted into the frequency spectrum by using a Mel filter group, highlighting formants of the original voice and eliminating harmonic waves at the same time;
(6) calculating logarithmic energy from the output of each filter bank;
(7) discrete cosine transform is carried out on the solved logarithmic energy to obtain a Mel frequency cepstrum coefficient, namely a voice characteristic vector;
(8) extracting dynamic differential parameters from the voice feature vector to obtain an expanded voice feature vector.
The preprocessing the nuclear magnetic resonance image and extracting the image feature vector in the step 3) comprises the following steps:
(1) respectively carrying out discrete cosine transform on the acquired nuclear magnetic resonance images to respectively obtain matrixes;
(2) respectively calculating a covariance matrix of each matrix;
(3) obtaining a corresponding projection matrix for each covariance matrix by a singular value decomposition method;
(4) extracting the first k-dimensional principal component analysis feature vectors from each projection matrix to form the image feature vectors.
The step 4) comprises: establishing a Gaussian mixture model to obtain the relation between the voice feature vectors and the image feature vectors.
The nuclear magnetic resonance image reconstruction method in the step 5) comprises the following steps:
(1) inputting the expanded voice feature vector into the Gaussian mixture model to obtain a matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image;
(2) for the first k-dimensional feature vector x_in of the matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image, performing an inverse projection using the formula x_out = U_k × x_in, where x_out is the result of the back projection and U_k is the projection matrix;
(3) performing the inverse discrete cosine transform on the result of the inverse projection to obtain a new matrix, namely the synthesized human vocal organ nuclear magnetic resonance image.
The method for synthesizing a human vocal organ moving image in real time from input speech establishes a continuous mapping model from the speech signal to magnetic resonance imaging (MRI) images. Given input speech, it synthesizes in real time nuclear magnetic resonance images of the human vocal organs, including the motion of the lips, mandible, tongue, throat, soft palate and other parts during phonation, and thus realizes the synthesis of vocal organ nuclear magnetic resonance images for continuous speech.
Drawings
Fig. 1 is a schematic diagram of the reconstruction effect of a PCA image.
Detailed Description
The method for synthesizing a moving image of a vocal organ of a human body in real time by inputting speech according to the present invention will be described in detail with reference to the accompanying drawings.
The invention discloses a method for synthesizing a human vocal organ motion image in real time by inputting voice, which comprises the following steps:
1) synchronously acquiring voice data and a nuclear magnetic resonance image of the movement of a sounding organ to obtain training data;
2) extracting a voice feature vector; in the invention this is realized using Mel-frequency cepstral coefficients (MFCC), comprising:
(1) pre-emphasis, filtering the original voice signal;
(2) dividing each voice signal into a short-time frame;
(3) windowing: the voice frames obtained after framing easily lose the dynamic information of the voice signal, so in order to increase the continuity of each frame at its left and right ends in the time domain, the voice signal of each frame is multiplied by a window function;
(4) performing fast Fourier transform to convert the voice signal from time domain to frequency domain;
(5) smoothing the voice signals converted into the frequency spectrum by using a Mel filter group, highlighting formants of the original voice and eliminating harmonic waves at the same time;
(6) calculating logarithmic energy from the output of each filter bank;
(7) discrete cosine transform is carried out on the solved logarithmic energy to obtain a Mel frequency cepstrum coefficient, namely a voice characteristic vector;
(8) extracting dynamic differential parameters from the voice feature vector to obtain an expanded voice feature vector. The purpose of the dynamic difference parameters is to add the dynamics of speech to the voice feature vector (a compact, library-based sketch of this feature extraction step is given after step 5) below).
3) Preprocessing a nuclear magnetic resonance image and extracting an image feature vector; the method comprises the following steps:
(1) respectively carrying out Discrete Cosine Transform (DCT) on the acquired nuclear magnetic resonance images to respectively obtain matrixes;
(2) respectively calculating a covariance matrix of each matrix;
(3) obtaining a corresponding projection matrix from each covariance matrix by a Singular Value Decomposition (SVD);
(4) extracting the first k-dimensional principal component analysis feature vectors from each projection matrix to form the image feature vectors.
4) Establishing a Gaussian mixture model from the speech feature vectors to the nuclear magnetic resonance image feature vectors and calculating the feature vectors of the synthesized image; specifically, the speech feature vectors and the image feature vectors are combined, and a Gaussian Mixture Model (GMM) is established to obtain the relation between the speech feature vectors and the image feature vectors.
5) Reconstructing the nuclear magnetic resonance image. The method comprises the following steps:
(1) inputting the expanded voice feature vector into the Gaussian mixture model to obtain a matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image;
(2) for the first k-dimensional feature vector x_in of the matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image, performing an inverse projection using the formula x_out = U_k × x_in, which projects the image from the data represented by the first k principal component feature dimensions back onto the original dimensions, where x_out is the result of the back projection and U_k is the projection matrix;
(3) performing the inverse discrete cosine transform on the result of the inverse projection to obtain a new matrix, namely the synthesized human vocal organ nuclear magnetic resonance image.
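As a compact illustration of the speech-side feature extraction in step 2), the following sketch computes 13 MFCCs plus first- and second-order dynamic differences, giving 39 dimensions per frame as in the embodiment described below. The use of the librosa library, the function name and the choice of keeping the native sampling rate are illustrative assumptions and are not prescribed by the invention.

```python
# Hedged sketch: MFCC extraction with dynamic-difference expansion (step 2).
# librosa and all default parameters here are assumptions, not part of the patent.
import numpy as np
import librosa

def extract_expanded_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return per-frame 39-dimensional vectors: MFCC + delta + delta-delta."""
    y, sr = librosa.load(wav_path, sr=None)                   # keep native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)                 # dynamic difference parameters
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T                        # shape: (n_frames, 3 * n_mfcc)
```

The step-by-step construction of the same features (pre-emphasis, framing, windowing, Mel filtering, DCT and differences) is detailed in steps S0101 to S0108 below.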
The embodiment of the invention uses the USC-TIMIT database recorded by the University of Southern California as the source of experimental voice data and of the originally measured vocal organ nuclear magnetic resonance image data. In the experiment, each speaker's 460 sentences of voice data were randomly split into a training set and a test set at a ratio of 8:2. The training set combines the MFCC feature vectors of the voice data with the PCA features of the nuclear magnetic resonance images and is used to train the parameters of the Gaussian mixture model. For the test set, 39-dimensional feature vectors are extracted from the voice data, the trained Gaussian mixture model synthesizes the information contained in the audio, and the result is compared with the MRI images measured by the nuclear magnetic resonance scanner to evaluate the model. The time-averaged Euclidean distance is used for evaluation, and the time-averaged Euclidean distance of the first-order difference is used to evaluate the dynamic information. This evaluation computes the sum of the errors over the multidimensional data, which must be divided by the PCA dimensionality to obtain the average error in a given dimension; in the experiments of the invention the PCA dimensionality is 40. As the number of components in the Gaussian mixture model varies, the time-averaged Euclidean distance between the synthesized PCA feature vectors and the original PCA feature vectors fluctuates, reaching a minimum when the number of GMM components n is 32. The first-order-difference time-averaged Euclidean distance increases slightly as the number of GMM components n increases. The specific experimental process is as follows:
step S0101: the pre-emphasis process uses a FIR high-pass filter to filter the original speech signal. The pre-emphasis process high-pass filter is represented by equation (1). In the formula, μ is a pre-emphasis coefficient, and μ is selected to be 0.97.
H(z) = 1 - μz^(-1)    (1)
Step S0102: each segment of the speech signal is divided into short-time frames. The frame length N is 256 samples, and each frame lasts about 30 ms. Adjacent frames overlap, and the overlap region contains M sampling points, where M is about N/2.
Step S0103: the speech signal of each frame is multiplied by a window function. The invention uses a Hamming window as the window function in the speech feature extraction process. The Hamming window function is w(n) and the framed speech signal is s(n), n = 0, 1, ..., N-1, where N is the frame length. The speech signal multiplied by the Hamming window is s'(n) = s(n) × w(n). Equation (2) gives w(n):
w(n) = (1 - a) - a·cos( 2πn / (N - 1) ),  0 ≤ n ≤ N - 1    (2)
The general value a = 0.46 is used in this example.
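A minimal NumPy sketch of steps S0101 to S0103 is given below, assuming a one-dimensional speech signal; the frame length of 256 samples, roughly 50% overlap and μ = 0.97 follow the text above, while the function names are illustrative assumptions.

```python
# Hedged sketch of steps S0101-S0103: pre-emphasis (eq. 1), framing and Hamming
# windowing (eq. 2). Only NumPy is used; names and defaults are assumptions.
import numpy as np

def preemphasize(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply H(z) = 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1]."""
    return np.append(x[0], x[1:] - mu * x[:-1])

def frame_and_window(x: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Split into overlapping frames (M = N/2 overlap) and apply a Hamming window."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    window = np.hamming(frame_len)      # 0.54 - 0.46 * cos(2*pi*n/(N-1)), i.e. a = 0.46
    return frames * window              # shape: (n_frames, frame_len)
```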
Step S0104: a discrete Fourier transform is applied to each framed and windowed frame to obtain its spectrum, and the squared magnitude of the spectrum gives the power spectrum of the speech signal. The discrete Fourier transform is represented by equation (3):
X(k) = Σ_{n=0}^{N-1} s'(n)·e^(-j2πkn/N),  0 ≤ k ≤ N - 1    (3)
Step S0105: the power spectrum obtained above is passed through a set of Mel-scale triangular filter banks. Equation (4) gives the definition of the frequency response of the triangular filter:
H_m(k) = 0                                  for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))     for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))     for f(m) < k ≤ f(m+1)
H_m(k) = 0                                  for k > f(m+1)    (4)
where f(m) denotes the center frequency of the m-th triangular filter.
step S0106: after the frequency domain signal passes through the Mel filter banks, the logarithmic energy of the output of each filter bank is calculated, and equation (5) is an expression of the logarithmic energy calculation:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ),  0 ≤ m < M    (5)
where M is the number of triangular filters.
step S0107: after obtaining the logarithmic energy result, a Discrete Cosine Transform (DCT) is used to obtain the final MFCC parameters. Equation (6) gives the mathematical expression for the discrete cosine transform:
C(n) = Σ_{m=0}^{M-1} s(m)·cos( πn(m + 0.5) / M ),  n = 1, 2, ..., L    (6)
In this example, MFCC features are extracted with L = 13.
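A minimal sketch of steps S0104 to S0107 follows, taking the windowed frames from the previous sketch; the sampling rate, the number of Mel filters and the use of librosa only to build the triangular filter bank are illustrative assumptions.

```python
# Hedged sketch of steps S0104-S0107: power spectrum (eq. 3), Mel triangular
# filter bank (eq. 4), log energy (eq. 5) and DCT (eq. 6) yielding L = 13 MFCCs.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frames(frames: np.ndarray, sr: int = 16000, n_fft: int = 256,
                     n_mels: int = 26, n_mfcc: int = 13) -> np.ndarray:
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)                   # per-frame DFT
    power = np.abs(spectrum) ** 2                                     # |X(k)|^2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (n_mels, n_fft//2 + 1)
    log_energy = np.log(power @ mel_fb.T + 1e-10)                     # eq. (5), per frame
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_mfcc]  # eq. (6), keep first L
```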
Step S0108: the features of the speech data are augmented by dynamic difference parameters. Equation (7) gives a mathematical representation of the dynamic difference:
d_t = ( Σ_{k=1}^{K} k·(c_{t+k} - c_{t-k}) ) / ( 2·Σ_{k=1}^{K} k² )    (7)
where d_t represents the result of the t-th first-order difference; c_t represents the Mel cepstral coefficients (MFCC) of the t-th frame of the speech data; Q represents the order of the cepstral coefficients; K represents the time span of the first derivative, and K = 1 here.
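A minimal sketch of step S0108 is shown below, implementing equation (7) with K = 1; padding the edge frames by repetition is an assumption not specified in the text.

```python
# Hedged sketch of step S0108: dynamic difference parameters per eq. (7).
import numpy as np

def delta_features(c: np.ndarray, K: int = 1) -> np.ndarray:
    """c: (n_frames, n_coeff) MFCC matrix; returns the same-shape delta matrix."""
    n = len(c)
    padded = np.pad(c, ((K, K), (0, 0)), mode='edge')      # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(c, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k: K + k + n] - padded[K - k: K - k + n])
    return d / denom
```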
Step S0201: the original input image is subjected to a Discrete Cosine Transform (DCT) operation. Equation (8) gives the Discrete Cosine Transform (DCT) process for a two-dimensional image.
F(u,v) = c(u)·c(v)·Σ_{x=0}^{N-1} Σ_{y=0}^{N-1} f(x,y)·cos[ (2x+1)uπ / (2N) ]·cos[ (2y+1)vπ / (2N) ]    (8)
where
c(u) = √(1/N) for u = 0, and c(u) = √(2/N) for u ≠ 0    (9)
Step S0202: the covariance matrix of each input is calculated to find the relations among the features of each dimension. Equation (10) gives a mathematical representation of the covariance matrix computation.
Σ = (1/m)·Σ_{i=1}^{m} x^(i)·(x^(i))^T    (10)
where m represents the number of training samples and x^(i) represents the i-th training sample.
Step S0203: the projection matrix is obtained by Singular Value Decomposition (SVD). Equation (11) represents singular value decomposition.
[U, S, V] = svd(Σ)    (11)
In equation (11), U represents an n × n projection matrix, where n is the dimension of the original input data.
Step S0204: the first k columns of U are selected as needed to form an n × k matrix U_k, and the first k-dimensional PCA feature vector of the new input data is extracted using equation (12), where x_new is the k-dimensional PCA feature of the new input data.
x_new = (U_k)^T × x    (12)
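A minimal sketch of steps S0201 to S0204 follows, under the assumption that the 2-D DCT coefficients of each MRI frame are flattened into a single vector before the covariance and SVD steps; the choice of k = 40 matches the PCA dimensionality used in the experiments, everything else is illustrative.

```python
# Hedged sketch of steps S0201-S0204: 2-D DCT (eqs. 8-9), covariance (eq. 10),
# SVD (eq. 11) and projection onto the first k principal components (eq. 12).
import numpy as np
from scipy.fftpack import dctn

def image_pca_features(images: np.ndarray, k: int = 40):
    """images: (m, H, W) MRI frames; returns (features of shape (m, k), U_k)."""
    dct_vecs = np.stack([dctn(img, norm='ortho').ravel() for img in images])  # (m, H*W)
    cov = (dct_vecs.T @ dct_vecs) / len(dct_vecs)     # eq. (10): (1/m) * sum x x^T
    U, S, Vt = np.linalg.svd(cov)                     # eq. (11)
    U_k = U[:, :k]                                    # n x k projection matrix
    features = dct_vecs @ U_k                         # eq. (12): x_new = U_k^T x, per row
    return features, U_k
```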
Step S0301: a mapping model from the speech signal feature vectors to the nuclear magnetic resonance image feature vectors is established and constructed with a Gaussian mixture model (GMM). The experimental results show that, under the same experimental conditions, the average time required to train the model is 4771 s (about 1.5 hours) when n = 64 and 16831 s (about 5 hours) when the number of GMM components n = 256. Beyond n = 64, increasing n does not bring a significant improvement. Therefore, training with 64 Gaussian mixture components is appropriate in this case, where n is the number of mixture components.
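A minimal sketch of step S0301 is given below: a joint Gaussian mixture model is fitted on concatenated speech and image feature vectors, and image features are predicted from speech features by the standard conditional-expectation mapping of a joint GMM. The use of scikit-learn, the default of 64 components and the function names are illustrative assumptions.

```python
# Hedged sketch of step S0301: joint GMM over [speech; image] features and the
# conditional-mean mapping used to synthesize image features from speech features.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def fit_joint_gmm(speech_feats: np.ndarray, image_feats: np.ndarray,
                  n_components: int = 64) -> GaussianMixture:
    joint = np.hstack([speech_feats, image_feats])          # (n_frames, dx + dy)
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    return gmm.fit(joint)

def predict_image_features(gmm: GaussianMixture, x: np.ndarray, dx: int) -> np.ndarray:
    """E[y | x] under the joint GMM; x is a (dx,) speech feature vector."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    resp = np.array([w * multivariate_normal.pdf(x, m[:dx], c[:dx, :dx])
                     for w, m, c in zip(weights, means, covs)])
    resp /= resp.sum()                                      # component responsibilities
    y = np.zeros(means.shape[1] - dx)
    for r, m, c in zip(resp, means, covs):
        y += r * (m[dx:] + c[dx:, :dx] @ np.linalg.solve(c[:dx, :dx], x - m[:dx]))
    return y
```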
Step S0401: for the synthesis result, the k-dimensional feature vector x_in is subjected to an inverse projection using equation (13), which projects the image from the data represented by the first k principal component features back onto the original dimensions. The result is x_out.
x_out = U_k × x_in    (13)
Step S0402: an Inverse Discrete Cosine Transform (iDCT) is performed on the synthesized image feature vector whose dimensions were restored to those of the original image by the inverse projection. The inverse discrete cosine transform can be represented by equation (14):
f(x,y) = Σ_{u=0}^{N-1} Σ_{v=0}^{N-1} c(u)·c(v)·F(u,v)·cos[ (2x+1)uπ / (2N) ]·cos[ (2y+1)vπ / (2N) ]    (14)
where
c(u) = √(1/N) for u = 0, and c(u) = √(2/N) for u ≠ 0, and likewise for c(v)    (15)
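A minimal sketch of steps S0401 and S0402 follows, assuming the synthesized PCA vector and the projection matrix U_k from the earlier sketch; the image-shape argument is illustrative.

```python
# Hedged sketch of steps S0401-S0402: back-projection x_out = U_k * x_in (eq. 13)
# followed by the inverse 2-D DCT (eqs. 14-15) to recover the synthesized image.
import numpy as np
from scipy.fftpack import idctn

def reconstruct_image(x_in: np.ndarray, U_k: np.ndarray, shape: tuple) -> np.ndarray:
    """x_in: (k,) synthesized PCA features; U_k: (H*W, k); shape: (H, W)."""
    x_out = U_k @ x_in                                    # eq. (13), back to DCT space
    return idctn(x_out.reshape(shape), norm='ortho')      # eq. (14), inverse 2-D DCT
```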
The experimental results show that the first-order-difference time-averaged Euclidean distance increases slightly with the number of GMM components n. When n is 32 or 64, the synthesized image represents the important features of the vocal organs in the nuclear magnetic resonance image relatively well, as shown in fig. 1.
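A minimal sketch of the evaluation measure used above is shown below, assuming the synthesized and measured PCA feature sequences are aligned frame by frame; the function names and the per-dimension normalization by the PCA dimensionality (40) follow the description but are otherwise illustrative.

```python
# Hedged sketch: time-averaged Euclidean distance and its first-order-difference
# variant, used to compare synthesized and measured PCA feature sequences.
import numpy as np

def time_averaged_euclidean(pred: np.ndarray, target: np.ndarray) -> float:
    """pred, target: (n_frames, dim) aligned PCA feature sequences."""
    return float(np.mean(np.linalg.norm(pred - target, axis=1)))

def time_averaged_euclidean_delta(pred: np.ndarray, target: np.ndarray) -> float:
    """The same measure applied to first-order differences (dynamic error)."""
    return time_averaged_euclidean(np.diff(pred, axis=0), np.diff(target, axis=0))

# Average error in a given dimension: divide by the PCA dimensionality (40 here), e.g.
# per_dim_error = time_averaged_euclidean(pred, target) / 40
```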

Claims (5)

1. A method for synthesizing a moving image of a vocal organ of a human body in real time by inputting voice, comprising the steps of:
1) synchronously acquiring voice data and a nuclear magnetic resonance image of the movement of a sounding organ to obtain training data;
2) extracting a voice feature vector;
3) preprocessing a nuclear magnetic resonance image and extracting an image feature vector;
4) establishing a Gaussian mixture model from the speech feature vectors to the nuclear magnetic resonance image feature vectors, and calculating the feature vectors of the synthesized image;
5) reconstructing the nuclear magnetic resonance image.
2. The method for synthesizing a moving image of a vocal organ of a human body in real time by inputting speech according to claim 1, wherein the extracting of the speech feature vector of step 2) is performed by using mel cepstral coefficients, and comprises:
(1) pre-emphasis, filtering the original voice signal;
(2) dividing each voice signal into a short-time frame;
(3) windowing, namely multiplying the voice signal of each frame by a window function in order to increase the continuity of each frame at the left end and the right end of the time domain;
(4) performing fast Fourier transform to convert the voice signal from time domain to frequency domain;
(5) smoothing the voice signals converted into the frequency spectrum by using a Mel filter group, highlighting formants of the original voice and eliminating harmonic waves at the same time;
(6) calculating logarithmic energy from the output of each filter bank;
(7) discrete cosine transform is carried out on the solved logarithmic energy to obtain a Mel frequency cepstrum coefficient, namely a voice characteristic vector;
(8) extracting dynamic differential parameters from the voice feature vector to obtain an expanded voice feature vector.
3. The method for synthesizing a human vocal organ moving image in real time through input speech according to claim 1, wherein the preprocessing the magnetic resonance image and extracting the image feature vector in step 3) comprises:
(1) respectively carrying out discrete cosine transform on the acquired nuclear magnetic resonance images to respectively obtain matrixes;
(2) respectively calculating a covariance matrix of each matrix;
(3) obtaining a corresponding projection matrix for each covariance matrix by a singular value decomposition method;
(4) extracting the first k-dimensional principal component analysis feature vectors from each projection matrix to form the image feature vectors.
4. The method for synthesizing a moving image of a vocal organ of a human body in real time by an input voice according to claim 1, wherein the step 4) comprises: establishing a Gaussian mixture model to obtain the relation between the voice feature vectors and the image feature vectors.
5. The method for synthesizing a moving image of a vocal organ of a human body in real time by using an input voice according to claim 1, wherein the reconstructing of the magnetic resonance image in the step 5) comprises:
(1) inputting the expanded voice feature vector into the Gaussian mixture model to obtain a matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image;
(2) for the first k-dimensional feature vector x_in of the matrix of the preliminarily synthesized human vocal organ nuclear magnetic resonance image, performing an inverse projection using the formula x_out = U_k × x_in, where x_out is the result of the back projection and U_k is the projection matrix;
(3) performing the inverse discrete cosine transform on the result of the inverse projection to obtain a new matrix, namely the synthesized human vocal organ nuclear magnetic resonance image.
CN201911277445.3A 2019-12-13 2019-12-13 Method for synthesizing human body vocal organ motion image in real time by inputting voice Pending CN111161368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911277445.3A CN111161368A (en) 2019-12-13 2019-12-13 Method for synthesizing human body vocal organ motion image in real time by inputting voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911277445.3A CN111161368A (en) 2019-12-13 2019-12-13 Method for synthesizing human body vocal organ motion image in real time by inputting voice

Publications (1)

Publication Number Publication Date
CN111161368A true CN111161368A (en) 2020-05-15

Family

ID=70557031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911277445.3A Pending CN111161368A (en) 2019-12-13 2019-12-13 Method for synthesizing human body vocal organ motion image in real time by inputting voice

Country Status (1)

Country Link
CN (1) CN111161368A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102820030A (en) * 2012-07-27 2012-12-12 中国科学院自动化研究所 Vocal organ visible speech synthesis system
CN105551071A (en) * 2015-12-02 2016-05-04 中国科学院计算技术研究所 Method and system of face animation generation driven by text voice
CN106782503A (en) * 2016-12-29 2017-05-31 天津大学 Automatic speech recognition method based on physiologic information in phonation

Similar Documents

Publication Publication Date Title
Marafioti et al. Adversarial generation of time-frequency features with application in audio synthesis
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
Le Cornu et al. Reconstructing intelligible audio speech from visual speech features.
Rajan et al. Using group delay functions from all-pole models for speaker recognition
Su et al. Bandwidth extension is all you need
Gurbuz et al. Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition
Porras et al. DNN-based acoustic-to-articulatory inversion using ultrasound tongue imaging
CN110428812B (en) Method for synthesizing tongue ultrasonic video according to voice information based on dynamic time programming
CN112634920A (en) Method and device for training voice conversion model based on domain separation
CN108198566B (en) Information processing method and device, electronic device and storage medium
Adiga et al. Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN.
Taguchi et al. Articulatory-to-speech Conversion Using Bi-directional Long Short-term Memory.
Chien et al. Evaluation of glottal inverse filtering algorithms using a physiologically based articulatory speech synthesizer
Yu et al. Reconstructing speech from real-time articulatory MRI using neural vocoders
Csapó Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract
Douros et al. Towards a method of dynamic vocal tract shapes generation by combining static 3D and dynamic 2D MRI speech data
Saleem et al. E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis
Wu et al. Deep Speech Synthesis from MRI-Based Articulatory Representations
CN110176243A (en) Sound enhancement method, model training method, device and computer equipment
Sanches Noise-compensated hidden Markov models
Douros et al. Using silence MR image to synthesise dynamic MRI vocal tract data of CV
Veena et al. Study of vocal tract shape estimation techniques for children
CN111161368A (en) Method for synthesizing human body vocal organ motion image in real time by inputting voice
Shandiz et al. Improving neural silent speech interface models by adversarial training
Ou et al. Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200515