CN110148408A - Chinese speech recognition method based on a deep residual network - Google Patents

Chinese speech recognition method based on a deep residual network Download PDF

Info

Publication number
CN110148408A
Authority
CN
China
Prior art keywords
layer
residual
feature parameter
speech recognition
deep residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910458947.XA
Other languages
Chinese (zh)
Inventor
袁三男
刘虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Electric Power
University of Shanghai for Science and Technology
Original Assignee
Shanghai University of Electric Power
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Electric Power
Priority to CN201910458947.XA
Publication of CN110148408A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a Chinese speech recognition method based on a deep residual network. The method comprises the following steps: 1) acquiring raw data containing speech information; 2) extracting MFCC feature parameters from the raw data, and computing the first-order and second-order differences of the MFCC feature parameters; 3) concatenating each frame with its first-order and second-order differences to obtain the final feature parameters, and converting the two-dimensional array of these feature parameters into a three-dimensional array; 4) feeding all the final three-dimensional feature parameters of step 3) into the convolutional neural network, and training the convolutional neural network repeatedly until a satisfactory recognition rate is obtained; 5) testing the trained convolutional neural network model and outputting the recognized text. Compared with the prior art, the present invention has the advantages of faster model training and a higher speech recognition rate.

Description

Chinese speech recognition method based on a deep residual network
Technical field
The present invention relates to the field of speech signal processing and recognition, and more particularly to a Chinese speech recognition method based on a deep residual network.
Background technique
As the most convenient and natural form of communication, speech carries both information and emotional expression. With the progress of speech recognition technology, more and more people wish to communicate with machines by voice, so speech recognition has attracted growing attention. The most widely used structure in speech recognition at present is the long short-term memory (LSTM) network, which can model the long-term temporal correlations of speech and thereby improve recognition accuracy. A bidirectional LSTM network can achieve even better performance, but it suffers from high training complexity and high decoding latency.
Summary of the invention
The object of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a Chinese speech recognition method based on a deep residual network.
The object of the present invention is achieved through the following technical solution:
A Chinese speech recognition method based on a deep residual network comprises the following steps:
Step (1): acquire raw data containing speech information.
Step (2): extract MFCC feature parameters from the raw data, and compute the first-order and second-order differences of the MFCC feature parameters.
Extracting the MFCC feature parameters specifically comprises the following steps (an illustrative sketch follows the list):
21) pre-process the speech by pre-emphasis, framing and windowing;
22) for each short-time analysis window, obtain the corresponding spectrum by FFT;
23) pass the spectrum obtained in step 22) through a Mel filter bank to obtain the Mel spectrum, which converts the linear natural spectrum into a Mel spectrum reflecting human auditory perception;
24) perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are used as the speech features.
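By way of illustration only, steps 21) to 24) can be sketched in Python with the librosa library; the patent prescribes no library, and the file name, sampling rate and frame parameters below are assumptions:

```python
# Illustrative sketch of MFCC extraction (steps 21-24); not the patent's own code.
import librosa
import numpy as np

# Assumed inputs: any speech file and parameters; not fixed by the patent.
signal, sr = librosa.load("speech.wav", sr=16000)

# Pre-emphasis (step 21). Framing/windowing and FFT (step 22), Mel filtering
# (step 23) and the log+DCT cepstral analysis (step 24) are all performed
# inside librosa.feature.mfcc.
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160, win_length=400)
mfcc = mfcc.T  # shape (num_frames, 13): one 13-dim MFCC vector per frame
```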
The first-order difference of the MFCC feature parameters is the difference between two consecutive adjacent frames of the discrete sequence, expressed as:
Y(k) = X(k+1) - X(k)
where k is the frame index, X(k) is the MFCC feature parameter of the k-th frame, and X(k+1) is the MFCC feature parameter of the (k+1)-th frame.
The second-order difference expresses the relationship between the first-order difference of the (k+1)-th frame and that of the k-th frame, expressed as:
Z(k) = Y(k+1) - Y(k) = X(k+2) - 2X(k+1) + X(k)
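A minimal numpy sketch of these two difference formulas (the array shapes are illustrative only):

```python
import numpy as np

def first_difference(X):
    # Y(k) = X(k+1) - X(k); one fewer frame than the input.
    return X[1:] - X[:-1]

def second_difference(X):
    # Z(k) = Y(k+1) - Y(k) = X(k+2) - 2X(k+1) + X(k); two fewer frames.
    return X[2:] - 2 * X[1:-1] + X[:-2]

# X: (num_frames, 13) MFCC matrix; Y and Z are its dynamic features.
X = np.random.randn(500, 13)
Y = first_difference(X)   # shape (499, 13)
Z = second_difference(X)  # shape (498, 13)
```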
Step (3): concatenate the current frame with its first-order and second-order differences to obtain the final feature parameters, and add a channel dimension to the two-dimensional array of these feature parameters to obtain the final three-dimensional feature array.
A residual block comprises two convolutional layers and one dropout layer; the output of the dropout layer is added directly to the input taken after one convolutional layer, giving the final target mapping. The deep residual network consists of several convolutional layers, four residual blocks, two pooling layers, two fully connected layers and a softmax layer. The first fully connected layer has 512 neurons and the second fully connected layer has 1422 neurons. The kernel size of all convolutional layers is 3x3; the first and second convolutional layers and the first residual block have 32 kernels each; the stride of the first pooling layer is 2x2; the third convolutional layer and the second residual block have 64 kernels; the fourth convolutional layer and the third residual block have 128 kernels; the fifth convolutional layer and the fourth residual block have 256 kernels; the stride of the second pooling layer is 1x2; and the last convolutional layer has 512 kernels.
Preferably, the kernel size in the residual block is 3x3, the dropout rate of the dropout layer is set to 0.2, and the dropout layer responds to the input selectively.
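As a non-authoritative sketch, such a residual block could be written in PyTorch as follows; the patent fixes only the 3x3 kernels and the 0.2 dropout rate, so the framework, activation function and channel count are assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutional layers plus one dropout layer; the dropout
    output is added to the input taken after the first convolution."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(p=0.2)  # the "random deactivating layer"
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.conv1(x))       # input after one convolution
        out = self.dropout(self.conv2(h))  # residual branch: conv + dropout
        return self.relu(h + out)          # residual addition -> target mapping

# Assumed usage: a first-stage block with 32 kernels on a (frames, features) map.
block = ResidualBlock(32)
y = block(torch.randn(1, 32, 500, 39))  # (batch, channels, frames, features)
```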
Step (4): feed all the final three-dimensional feature parameters of step (3) into the deep residual network, and train the deep residual network repeatedly until a satisfactory recognition rate is obtained, where the recognition rate is measured by the phoneme error rate of speech recognition.
Preferably, if the trained model reaches a phoneme error rate of 15.42%, the result of model training is judged to have reached the satisfactory recognition rate.
Step (5): test the trained deep residual network model and output the recognized text.
To test the trained model, the speech to be tested undergoes feature extraction by the same method used during training; the extracted feature parameters are fed into the trained model, and the output of the model is the recognized text.
Compared with the prior art, the present invention has the following advantages:
1) The method uses a deep residual network, applying the residual block structure in a convolutional neural network. A convolutional neural network generally comprises convolutional layers, pooling layers and fully connected layers. The input of a convolutional layer is the feature parameters; its kernels slide with a set stride and learn different local features of the feature map, so the more convolutional layers there are, the more features are extracted. The pooling layers mainly compress the feature parameters by taking the average or maximum value of each region, which reduces the feature dimensionality and the number of network nodes in the model. The fully connected layers act as a classifier: they map the learned feature parameters to the sample label space, perform classification and matching, and predict the class of the input signal. Because a convolutional neural network shares weights, it greatly reduces the number of model parameters and speeds up model training, thereby solving the problem of high decoding latency;
2) The present invention applies the residual structure in the convolutional neural network. A plain convolutional neural network directly learns the target mapping from the input data to the output labels, and once the network is deepened its training accuracy may stop rising and even fall. This phenomenon is not caused by overfitting; simply deepening the network makes the network itself hard to train. A residual network instead learns the residual between the target mapping and the original input, and adds this residual to the original input to obtain the final target mapping. This learning mechanism effectively solves the network degradation problem and, while deepening the network, alleviates overfitting and improves the speech recognition rate.
Detailed description of the invention
Fig. 1 is a schematic diagram of the residual block structure of the present invention;
Fig. 2 is the flow diagram of the method of the present invention;
Fig. 3 is the overall flow diagram of MFCC feature extraction;
Fig. 4 is a schematic diagram of the overall structure of the deep residual network.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. Obviously, the described embodiment is only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
The present invention relates to a Chinese speech recognition method based on a deep residual network, comprising the following steps:
Step 1: acquire the raw data containing speech information.
Step 2: extract the MFCC feature parameters from the raw data.
The MFCC (Mel Frequency Cepstral Coefficients) feature parameters, i.e. the Mel-frequency cepstral coefficients of shape (500, 13), are extracted from the speech through a set of Mel filters. MFCC feature extraction mainly comprises the following steps (a sketch of the cepstral-analysis step follows the list):
1) First pre-process the speech by pre-emphasis, framing and windowing, to enhance the performance of the speech signal (signal-to-noise ratio, processing accuracy, etc.).
2) For each short-time analysis window, obtain the corresponding spectrum by FFT, i.e. the spectrum of each time window along the time axis.
3) Pass the above spectrum through a Mel filter bank to obtain the Mel spectrum; this converts the linear natural spectrum into a Mel spectrum that reflects human auditory perception.
4) Perform cepstral analysis on the Mel spectrum (take the logarithm and apply an inverse transform; in practice the inverse transform is realized by the DCT, the discrete cosine transform, and the 2nd to 13th DCT coefficients are taken as the MFCC coefficients) to obtain the Mel-frequency cepstral coefficients (MFCC), which constitute the features of this frame of speech.
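The log-plus-DCT cepstral analysis of step 4) can be sketched explicitly as follows; mel_spectrum is an assumed input array and scipy is an assumed library choice:

```python
import numpy as np
from scipy.fftpack import dct

# Assumed input: (num_frames, n_mels) Mel filter-bank energies.
mel_spectrum = np.abs(np.random.randn(500, 26)) + 1e-10

log_mel = np.log(mel_spectrum)                         # take the logarithm
cepstrum = dct(log_mel, type=2, axis=1, norm='ortho')  # "inverse transform" via DCT
# The text keeps the 2nd-13th coefficients, i.e. indices 1..12 (12 values);
# the (500, 13) shape quoted above keeps 13 coefficients instead.
mfcc = cepstrum[:, 1:13]  # shape (500, 12)
```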
At this point the speech can be described by a series of cepstral vectors, each vector being the MFCC feature vector of one frame. Once the MFCC feature vectors are obtained, a speech classifier can be trained and used for recognition with these cepstral vectors.
However, MFCC is a static feature of speech; to extract the dynamic features, the first-order and second-order differences are computed as well. The first-order difference is the difference between two consecutive adjacent frames of the discrete sequence, defined by the following formula:
Y(k) = X(k+1) - X(k)
where k is the frame index, X(k) is the MFCC feature parameter of the k-th frame, and X(k+1) is the MFCC feature parameter of the (k+1)-th frame.
The second-order difference expresses the relationship between the first-order difference of the (k+1)-th frame and that of the k-th frame, defined by the following formula:
Z(k) = Y(k+1) - Y(k) = X(k+2) - 2X(k+1) + X(k)
Step 3: concatenate each frame with its first-order and second-order differences; the final feature parameters have shape (500, 39). A channel dimension is added to this two-dimensional array, converting it into a three-dimensional array of shape (500, 39, 1).
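A short numpy sketch of this splicing and channel expansion (the input arrays are assumed already padded to a common 500 frames):

```python
import numpy as np

# Assumed inputs, each padded/truncated to 500 frames x 13 coefficients.
mfcc   = np.random.randn(500, 13)
delta1 = np.random.randn(500, 13)
delta2 = np.random.randn(500, 13)

features = np.concatenate([mfcc, delta1, delta2], axis=1)  # shape (500, 39)
features = features[..., np.newaxis]                       # shape (500, 39, 1)
```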
Step 4: feed all the computed feature parameters into the deep residual network and train it repeatedly, reducing the loss of the neural network by backpropagation, until a good recognition rate is obtained.
The residual block consists of two convolutional layers and one dropout layer. The output of the dropout layer is added directly to the input taken after one convolutional layer, giving the final target mapping. The kernel size in the residual block is 3x3, the dropout rate is set to 0.2, and the dropout layer responds to the input selectively, which can improve the learning accuracy.
The deep residual network of the present invention consists of several convolutional layers, four residual blocks, two pooling layers, two fully connected layers and a softmax layer. The first fully connected layer has 512 neurons and the second fully connected layer has 1422 neurons. The kernel size of all convolutional layers is 3x3; the first and second convolutional layers and the first residual block have 32 kernels each; the stride of the first pooling layer is 2x2; the third convolutional layer and the second residual block have 64 kernels; the fourth convolutional layer and the third residual block have 128 kernels; the fifth convolutional layer and the fourth residual block have 256 kernels; the stride of the second pooling layer is 1x2; and the last convolutional layer has 512 kernels. The neural network transforms the input feature sequence (x1, x2, ..., xT), through the convolutional layers, pooling layers, fully connected layers and softmax layer, into an output sequence (y1, y2, ..., yT). CTC (Connectionist Temporal Classification) computes from (y1, y2, ..., yT) the posterior probability p(l1, l2, ..., lm | x1, x2, ..., xT) of the actual sequence. Training the neural network means adjusting its parameters, given the input and the actual phoneme sequence, so that p(l1, l2, ..., lm | x1, x2, ..., xT) over the training set is maximized; CTC decoding then finds the sequence with the maximum posterior probability given the input, i.e. l* = argmax_l p(l | x1, x2, ..., xT). Here l1, l2, ..., lm is the label sequence, T is the number of frames, and m is the number of labels.
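A minimal PyTorch sketch of the CTC objective described above; the tensor shapes and label length are assumptions, and torch.nn.CTCLoss computes -log p(l1, ..., lm | x1, ..., xT):

```python
import torch
import torch.nn as nn

T, num_classes = 500, 1422  # frames; size of the softmax layer per the text

# Stand-in for the network output (y1..yT): log-probabilities of shape (T, N, C).
log_probs = torch.randn(T, 1, num_classes, requires_grad=True).log_softmax(2)

targets = torch.randint(1, num_classes, (1, 30), dtype=torch.long)  # labels l1..lm
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([30])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # training adjusts parameters to maximize p(l | x)

# Greedy CTC decoding sketch: argmax per frame, collapse repeats, drop blanks.
best = log_probs.argmax(2).squeeze(1)
decoded = [int(c) for i, c in enumerate(best)
           if c != 0 and (i == 0 or c != best[i - 1])]
```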
The recognition rate is measured by the phoneme error rate of speech recognition. After many experiments, when the loss of the deep residual network hardly decreases any further, i.e. when the model reaches a phoneme error rate of 15.42%, the training result is judged to have reached the required recognition rate.
This embodiment carries out actual experiments on the THCHS30 Chinese data set. Compared with the traditional BLSTM (bidirectional long short-term memory) framework in speech recognition, training with the method of the present invention converges 3 times faster than the BLSTM network, and the speech recognition rate is improved by 3%.
Step 5: test the trained model. The speech to be tested undergoes feature extraction by the same method used during training; the extracted feature parameters are fed into the trained model, and the output of the model is the recognized text.
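An illustrative test-time sketch (model, extract_features and id2label are hypothetical names, not part of the patent):

```python
# Hypothetical inference sketch: feature extraction must match training.
import torch

def transcribe(wav_path, model, extract_features, id2label):
    feats = extract_features(wav_path)   # (500, 39, 1), as in training
    x = torch.from_numpy(feats).float()
    x = x.permute(2, 0, 1).unsqueeze(0)  # (1, 1, 500, 39) for Conv2d input
    with torch.no_grad():
        log_probs = model(x).log_softmax(-1)  # assumed output (T', 1, classes)
    best = log_probs.argmax(-1).squeeze(1)
    out, prev = [], 0
    for c in best.tolist():              # collapse repeats, drop blank index 0
        if c != 0 and c != prev:
            out.append(id2label[c])
        prev = c
    return "".join(out)
```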
The above description is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A Chinese speech recognition method based on a deep residual network, characterized in that the method comprises the following steps:
1) acquiring raw data containing speech information;
2) extracting MFCC feature parameters from the raw data, and computing the first-order and second-order differences of the MFCC feature parameters;
3) concatenating the current frame with its first-order and second-order differences to obtain the final feature parameters, and adding a channel dimension to the two-dimensional array of these feature parameters to obtain the final three-dimensional feature array;
4) feeding all the final three-dimensional feature parameters of step 3) into the deep residual network, and training the deep residual network repeatedly until a satisfactory recognition rate is obtained;
5) testing the trained deep residual network model and outputting the recognized text.
2. The Chinese speech recognition method based on a deep residual network according to claim 1, characterized in that in step 2), extracting the MFCC feature parameters specifically comprises the following steps:
21) pre-processing the speech by pre-emphasis, framing and windowing;
22) for each short-time analysis window, obtaining the corresponding spectrum by FFT;
23) passing the spectrum obtained in step 22) through a Mel filter bank to obtain the Mel spectrum, which converts the linear natural spectrum into a Mel spectrum reflecting human auditory perception;
24) performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are used as the speech features.
3. The Chinese speech recognition method based on a deep residual network according to claim 2, characterized in that in step 2), the first-order difference of the MFCC feature parameters is the difference between two consecutive adjacent frames of the discrete sequence, expressed as:
Y(k) = X(k+1) - X(k)
where k is the frame index, X(k) is the MFCC feature parameter of the k-th frame, and X(k+1) is the MFCC feature parameter of the (k+1)-th frame.
4. The Chinese speech recognition method based on a deep residual network according to claim 3, characterized in that in step 2), the second-order difference expresses the relationship between the first-order difference of the (k+1)-th frame and that of the k-th frame, expressed as:
Z(k) = Y(k+1) - Y(k) = X(k+2) - 2X(k+1) + X(k).
5. The Chinese speech recognition method based on a deep residual network according to claim 1, characterized in that in step 3), the deep residual network consists of several convolutional layers, four residual blocks, two pooling layers, two fully connected layers and a softmax layer; the first fully connected layer has 512 neurons and the second fully connected layer has 1422 neurons; the kernel size of all convolutional layers is 3x3; the first and second convolutional layers and the first residual block have 32 kernels each; the stride of the first pooling layer is 2x2; the third convolutional layer and the second residual block have 64 kernels; the fourth convolutional layer and the third residual block have 128 kernels; the fifth convolutional layer and the fourth residual block have 256 kernels; the stride of the second pooling layer is 1x2; and the last convolutional layer has 512 kernels.
6. The Chinese speech recognition method based on a deep residual network according to claim 5, characterized in that in step 3), the residual block comprises two convolutional layers and one dropout layer, and the output of the dropout layer is added directly to the input taken after one convolutional layer to obtain the final target mapping.
7. The Chinese speech recognition method based on a deep residual network according to claim 6, characterized in that the kernel size in the residual block is 3x3, the dropout rate of the dropout layer is set to 0.2, and the dropout layer responds to the input selectively.
8. The Chinese speech recognition method based on a deep residual network according to claim 1, characterized in that the recognition rate is measured by the phoneme error rate of speech recognition, and if the trained model reaches a phoneme error rate of 15.42%, the result of model training is judged to have reached the satisfactory recognition rate.
CN201910458947.XA 2019-05-29 2019-05-29 Chinese speech recognition method based on a deep residual network Pending CN110148408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910458947.XA CN110148408A (en) 2019-05-29 2019-05-29 Chinese speech recognition method based on a deep residual network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910458947.XA CN110148408A (en) 2019-05-29 2019-05-29 Chinese speech recognition method based on a deep residual network

Publications (1)

Publication Number Publication Date
CN110148408A (en) 2019-08-20

Family

ID=67592187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910458947.XA Pending CN110148408A (en) Chinese speech recognition method based on a deep residual network

Country Status (1)

Country Link
CN (1) CN110148408A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN108847223A (en) * 2018-06-20 2018-11-20 陕西科技大学 A kind of audio recognition method based on depth residual error neural network
CN109460774A (en) * 2018-09-18 2019-03-12 华中科技大学 A kind of birds recognition methods based on improved convolutional neural networks
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Audio recognition method based on convolutional neural networks
CN109272988A (en) * 2018-09-30 2019-01-25 江南大学 Audio recognition method based on multichannel convolutional neural networks
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition methods based on modified CLDNN structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIAN GUO: "Depth dropout: efficient training of residual convolutional neural networks", International Conference on Digital Image Computing: Techniques & Applications *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614483A (en) * 2019-09-18 2021-04-06 珠海格力电器股份有限公司 Modeling method based on residual convolutional network, voice recognition method and electronic equipment
CN110909601A (en) * 2019-10-18 2020-03-24 武汉虹识技术有限公司 Beautiful pupil identification method and system based on deep learning
CN110909601B (en) * 2019-10-18 2022-12-09 武汉虹识技术有限公司 Beautiful pupil identification method and system based on deep learning
CN112951277A (en) * 2019-11-26 2021-06-11 新东方教育科技集团有限公司 Method and device for evaluating speech
CN112951277B (en) * 2019-11-26 2023-01-13 新东方教育科技集团有限公司 Method and device for evaluating speech
CN111276125A (en) * 2020-02-11 2020-06-12 华南师范大学 Lightweight speech keyword recognition method facing edge calculation
CN111276125B (en) * 2020-02-11 2023-04-07 华南师范大学 Lightweight speech keyword recognition method facing edge calculation
CN111402901A (en) * 2020-03-27 2020-07-10 广东外语外贸大学 CNN voiceprint recognition method and system based on RGB mapping characteristics of color image
CN111402901B (en) * 2020-03-27 2023-04-18 广东外语外贸大学 CNN voiceprint recognition method and system based on RGB mapping characteristics of color image
CN111401530A (en) * 2020-04-22 2020-07-10 上海依图网络科技有限公司 Recurrent neural network and training method thereof
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof
WO2022237053A1 (en) * 2021-05-11 2022-11-17 Huawei Technologies Co.,Ltd. Methods and systems for computing output of neural network layer
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call

Similar Documents

Publication Publication Date Title
CN110148408A (en) Chinese speech recognition method based on a deep residual network
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN107818164A (en) A kind of intelligent answer method and its system
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
CN106952643A (en) A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering
CN108766419A (en) A kind of abnormal speech detection method based on deep learning
CN109119072A (en) Civil aviaton's land sky call acoustic model construction method based on DNN-HMM
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN109272988A (en) Audio recognition method based on multichannel convolutional neural networks
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN108986798B (en) Processing method, device and the equipment of voice data
CN113539232B (en) Voice synthesis method based on lesson-admiring voice data set
CN113111786B (en) Underwater target identification method based on small sample training diagram convolutional network
CN109192192A (en) A kind of Language Identification, device, translator, medium and equipment
CN109671423A (en) Non-parallel text compressing method under the limited situation of training data
CN107293290A (en) The method and apparatus for setting up Speech acoustics model
CN110473571A (en) Emotion identification method and device based on short video speech
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111341294A (en) Method for converting text into voice with specified style
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190820)