CN108269568A - A CTC-based acoustic model training method - Google Patents

A CTC-based acoustic model training method Download PDF

Info

Publication number
CN108269568A
CN108269568A CN201710002096.9A CN201710002096A CN108269568A
Authority
CN
China
Prior art keywords
phoneme
ctc
time
blank
symbol
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710002096.9A
Other languages
Chinese (zh)
Other versions
CN108269568B (en)
Inventor
张鹏远
王智超
潘接林
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201710002096.9A priority Critical patent/CN108269568B/en
Publication of CN108269568A publication Critical patent/CN108269568A/en
Application granted granted Critical
Publication of CN108269568B publication Critical patent/CN108269568B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • G10L15/144Training of HMMs
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The present invention provides a CTC-based acoustic model training method. The method comprises: Step 1, training an initial GMM model and using it to force-align the text transcription of the training data in time, obtaining the time region corresponding to each phoneme; Step 2, inserting after each phoneme a "blank" symbol associated with that phoneme, so that each phoneme has its own dedicated "blank" symbol; Step 3, using a finite-state transducer to construct, from the phoneme label sequence augmented with "blank" symbols, the search graph for the CTC forward-backward computation; Step 4, according to the time alignment result, restricting the time range in which each phoneme may appear and pruning the search graph accordingly, cutting away paths in which a phoneme's position exceeds the time restriction, to obtain the final search graph needed when CTC computes the network error; Step 5, performing acoustic model training with a time-delay neural network (TDNN) structure combined with the CTC method, obtaining the final TDNN-CTC acoustic model.

Description

A CTC-based acoustic model training method
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a CTC-based acoustic model training method.
Background technology
In recent years, the introduction of deep neural networks (Deep Neural Network, DNN) for acoustic modeling has achieved great success in speech recognition systems. Owing to the excellent classification ability of DNNs, they can replace the Gaussian mixture model (Gaussian Mixture Model, GMM) used to generate posterior probabilities in the traditional hidden Markov model (Hidden Markov Model, HMM) framework. However, this new HMM/DNN framework is extremely complex to train. Researchers have therefore begun to explore end-to-end learning methods, in which a sequence of acoustic features is input and its text sequence is obtained directly. In this setting, the Connectionist Temporal Classification (CTC) criterion combined with recurrent neural networks (Recurrent Neural Network, RNN) has received growing attention.
CTC differs from the traditional method of training neural networks with the cross-entropy (Cross-entropy, CE) criterion in two main respects. First, an additional output node is added to the network to represent a "blank" symbol. In speech recognition, each output node of the neural network represents an acoustic modeling unit, which, depending on the modeling granularity, may be a monophone or a triphone unit, and the network output at each moment represents the posterior probability of each unit at that moment. The "blank" symbol is added to represent the states in which the network output is uncertain: when the input is noise or an unrecognizable feature, or lies at the boundary between two different phonemes, the network can output the "blank" symbol instead of being forced to output a definite phoneme. Second, CTC optimizes the network output over the whole utterance: its objective is to maximize the output probability of the entire correct text sequence of the sentence, rather than maximizing the per-frame output probability as cross-entropy does. Using the forward-backward algorithm, CTC finds, in the output matrix of the network, all paths that can be mapped to the correct text sequence, computes the sum of their probabilities, and from this computes the network error; the error is then back-propagated and the network parameters are updated by gradient descent.
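The many-to-one mapping described above — every frame-level path that collapses to the correct text sequence counts toward the objective — can be sketched as follows. This is a minimal illustration of standard shared-blank CTC collapsing (merge consecutive repeats, then drop blanks); the symbol names are made up for the example.

```python
def collapse_path(path, blank="<b>"):
    """Collapse a frame-level CTC path to its label sequence:
    first merge consecutive repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev:
            if sym != blank:
                out.append(sym)
            prev = sym
    return out

# Two different frame-level paths map to the same phoneme sequence:
print(collapse_path(["k", "k", "<b>", "ae", "ae", "t"]))   # ['k', 'ae', 't']
print(collapse_path(["<b>", "k", "ae", "<b>", "t", "t"]))  # ['k', 'ae', 't']
```

The CTC objective sums the probabilities of all such equivalent paths, which is what the forward-backward search graph below enumerates.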
The CTC loss function is:

L(S) = \sum_{(x,z)\in S} L(x,z) = -\sum_{(x,z)\in S} \ln p(z|x)
where S denotes the training data set; x denotes the input features; z denotes the set of paths that map to the correct text sequence; L(S) denotes the error between the network output and the labels; \ln p(z|x) denotes the natural logarithm of the likelihood; and L(x, z) denotes the error of an individual training sample.
The likelihood p(z|x) can be computed with the forward-backward algorithm:

p(z|x) = \sum_{u=1}^{|U'|} \alpha(t,u)\,\beta(t,u)
where |U'| is the length of the label sequence after the "blank" symbols have been inserted, \alpha(t,u) is the forward variable, \beta(t,u) is the backward variable, and \alpha(t,u)\beta(t,u) represents the probability of passing through phoneme u of z at time t.
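The recursion behind p(z|x) = \sum_u \alpha(t,u)\beta(t,u) can be sketched as below. This is a naive probability-domain implementation for standard shared-blank CTC (the patent's per-phoneme blanks and time pruning change the search graph, not the recursion itself); \beta is defined without the emission at time t, so the sum over u yields the same p(z|x) at every frame, matching the formula above. All names are illustrative.

```python
import numpy as np

def ctc_forward_backward(probs, labels, blank=0):
    """probs: (T, K) per-frame posteriors; labels: target sequence (no blanks).
    Returns alpha, beta over the extended sequence U' with interleaved blanks."""
    T = probs.shape[0]
    ext = [blank]                       # U' = [b, l1, b, l2, b, ...]
    for l in labels:
        ext += [l, blank]
    U = len(ext)

    alpha = np.zeros((T, U))            # alpha includes the emission at time t
    alpha[0, 0] = probs[0, ext[0]]
    alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for u in range(U):
            a = alpha[t - 1, u]
            if u > 0:
                a += alpha[t - 1, u - 1]
            if u > 1 and ext[u] != blank and ext[u] != ext[u - 2]:
                a += alpha[t - 1, u - 2]          # skip over a blank
            alpha[t, u] = a * probs[t, ext[u]]

    beta = np.zeros((T, U))             # beta excludes the emission at time t
    beta[T - 1, U - 1] = 1.0
    beta[T - 1, U - 2] = 1.0
    for t in range(T - 2, -1, -1):
        for u in range(U):
            b = beta[t + 1, u] * probs[t + 1, ext[u]]
            if u + 1 < U:
                b += beta[t + 1, u + 1] * probs[t + 1, ext[u + 1]]
            if u + 2 < U and ext[u + 2] != blank and ext[u + 2] != ext[u]:
                b += beta[t + 1, u + 2] * probs[t + 1, ext[u + 2]]
            beta[t, u] = b
    return alpha, beta

rng = np.random.default_rng(0)
raw = rng.random((6, 4))
probs = raw / raw.sum(axis=1, keepdims=True)    # valid per-frame posteriors
alpha, beta = ctc_forward_backward(probs, [1, 2, 1])
p_per_t = (alpha * beta).sum(axis=1)
# p(z|x) is identical no matter at which frame t the sum over u is taken
```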
Finally, the network error can be expressed by the following formula:

\frac{\partial L(x,z)}{\partial a_k^t} = y_k^t - \frac{1}{p(z|x)} \sum_{u \in B(z,k)} \alpha(t,u)\,\beta(t,u)

where y_k^t denotes the value of the k-th output node of the network at time t, a_k^t is the corresponding network output value before the activation function, and B(z,k) denotes the set of positions in U' at which node k appears.
During the training of a CTC model, the forward-backward search includes every path that can be mapped to the correct text sequence, among them many severely skewed paths, i.e., paths in which the positions at which phonemes appear are significantly delayed or advanced compared with reality; these paths make model training unstable. In addition, the traditional CTC framework is trained with RNNs. The long-span temporal modeling ability of RNNs greatly improves the performance of CTC models, but certain characteristics of RNNs make parallel training difficult, so training is very slow and training efficiency is low.
Although the training procedure of CTC models is simpler, their recognition accuracy is not competitive with CE models: it is somewhat lower than that of the traditional cross-entropy (Cross-entropy, CE) method, and the degradation is especially severe on small and medium-sized data sets, where the performance of CTC acoustic models is usually inferior to that of CE models. Moreover, CTC model training is extremely unstable and diverges easily.
Summary of the invention
The object of the present invention is to solve the above problems in existing acoustic model training methods. The present invention provides a CTC-based acoustic model training method, which comprises:
Step 1: train an initial GMM model and use it to force-align the text transcription of the training data in time, obtaining the time region corresponding to each phoneme in the phoneme label sequence of the training data;
Step 2: in the phoneme label sequence of the training data, insert after each phoneme a "blank" symbol associated with that phoneme, so that each phoneme has its own dedicated "blank" symbol; i.e., the number of "blank" symbols equals the number of phonemes in the phoneme label sequence of the training data;
Step 3: using a finite-state transducer (Finite-State Transducer, FST), construct from the phoneme label sequence augmented with "blank" symbols the search graph for the CTC forward-backward computation;
Step 4: according to the time alignment result of step 1, restrict the time range in which each phoneme may appear: taking the time at which each phoneme appears in the alignment result as reference, set a "time tolerance" parameter, i.e., the allowed time of occurrence of each phoneme, and prune the search graph built in step 3 according to this restriction, cutting away paths in which a phoneme's position exceeds the time restriction, to obtain the final search graph needed when CTC computes the network error.
Step 5: perform acoustic model training using a time-delay neural network (Time-delay Neural Network, TDNN) structure combined with the CTC method, obtaining the final TDNN-CTC acoustic model; ReLU is used as the activation function in the TDNN. The ReLU activation function is:
g(y) = max(0, y)
where g(y) denotes the value of a neuron node after the activation function, and y denotes the value of the neuron node output by the network before the activation function;
In the search graph of the forward-backward computation, each "blank" is allowed to repeat consecutively an arbitrary number of times, while the modeled phonemes between two adjacent "blank" symbols may not repeat consecutively. The state-transition topology is: if the current state is a phoneme, the next state may jump to its "blank" state or to the next phoneme; if the current state is a "blank" state, the next state may jump to the "blank" itself or to the next phoneme.
" blank " symbol is shared using all phonemes in multiple independent " blank " symbol substitution original CTC models.
The allowed time of occurrence of each phoneme ranges from 50 to 300 milliseconds.
The advantages of the invention are as follows. Replacing the single "blank" symbol shared by all phonemes in the original CTC model with multiple independent "blank" symbols improves the independence and discriminability of the "blank" symbols and also assists the network in judging the output phoneme, improving model accuracy. Adding time-point restrictions to the CTC search paths reduces the number of paths in the forward-backward search, improving training speed, and by removing the interfering paths that deviate strongly from reality, makes model training more stable and yields higher final recognition accuracy. The TDNN-CTC structure has long-span temporal modeling ability and can fully exploit the contextual information of the input; it greatly shortens the training period while achieving the same recognition accuracy as RNN-CTC models, and, like an ordinary DNN, it is easy to parallelize, improving the training speed of CTC models by about a factor of 3. Compared with the original CTC baseline model, the CTC model trained with the method proposed in this patent achieves a relative reduction of about 10% in word error rate.
Description of the drawings
Fig. 1 is the state-transition topology diagram of the CTC-based acoustic model training method of the present invention;
Fig. 2 is a flowchart of the CTC-based acoustic model training method of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 2, the present invention provides a CTC-based acoustic model training method. The method first replaces the single "blank" symbol shared by all phonemes in the original CTC model with multiple independent "blank" symbols; it then time-aligns the phoneme label sequence of the training data with an initial GMM model to obtain the approximate position at which each phoneme appears, and constructs from the phoneme label sequence augmented with "blank" symbols the search graph for the CTC forward-backward computation. A configurable "time tolerance" parameter then controls, within the tolerance, how much earlier or later a phoneme may appear on a search path; the "time tolerance" is the allowed range of each phoneme's time of occurrence and is usually set to 50-300 milliseconds. In the present embodiment, the "time tolerance" is set to 50 milliseconds, restricting the time points at which phonemes may appear on the paths of the CTC forward-backward search. The method specifically comprises:
Step 1: train an initial GMM model and use it to force-align the text transcription of the training data in time, obtaining the time region corresponding to each phoneme in the phoneme label sequence of the training data;
Step 2: in the phoneme label sequence of the training data, insert after each phoneme a "blank" symbol associated with that phoneme, so that each phoneme has its own dedicated "blank" symbol; i.e., the number of "blank" symbols equals the number of phonemes in the phoneme label sequence of the training data. The transitions between phonemes on each path of the path graph are built as shown in Fig. 1; removing all "blank" symbols from any path yields the corresponding phoneme sequence. The path graph therefore represents the set of all equal-length paths that carry "blank" symbols and map to the correct phoneme label sequence, the paths differing in the positions and numbers of consecutive repetitions of phonemes and "blank" symbols.
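The blank insertion of step 2 can be sketched as follows. The "blk_" prefix is a hypothetical naming convention for the phoneme-specific blanks; it is not prescribed by the patent.

```python
def insert_phoneme_blanks(phones):
    """After each phoneme, insert a blank tied to that phoneme's identity,
    so the number of blanks equals the number of phonemes in the sequence."""
    out = []
    for p in phones:
        out.append(p)
        out.append("blk_" + p)       # phoneme-specific blank
    return out

seq = insert_phoneme_blanks(["k", "ae", "t"])
# Each phoneme type contributes one blank symbol to the output inventory;
# with 39 English phonemes this gives the 78 network output nodes of the
# embodiment (39 phonemes plus 39 blanks).
inventory = {"k", "ae", "t"} | {"blk_k", "blk_ae", "blk_t"}
```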
Step 3: using a finite-state transducer (Finite-State Transducer, FST), construct from the phoneme label sequence augmented with "blank" symbols the search graph for the CTC forward-backward computation;
Step 4: according to the time alignment result of step 1, restrict the time range in which each phoneme may appear: taking the time at which each phoneme appears in the alignment result as reference, set the "time tolerance" parameter, setting the allowed time of occurrence of each phoneme to 50 milliseconds, and prune the search graph built in step 3 according to this restriction, cutting away paths in which a phoneme's position exceeds the time restriction, to obtain the final search graph needed when CTC computes the network error.
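The time restriction of step 4 can be sketched as below. This is a hypothetical helper, not code from the patent: it assumes frame-level GMM alignments and 10 ms frames (so the embodiment's 50 ms tolerance is 5 frames), and records at which frames each phoneme of the sequence is still permitted; lattice states outside these windows would be pruned.

```python
def allowed_frames(segments, tol_frames, num_frames):
    """segments[i] = (phone, start, end) from the GMM forced alignment,
    in frames, end exclusive. A state for phoneme i survives pruning only
    at frames within [start - tol, end + tol)."""
    allowed = [set() for _ in range(num_frames)]
    for i, (_, start, end) in enumerate(segments):
        lo = max(0, start - tol_frames)
        hi = min(num_frames, end + tol_frames)
        for t in range(lo, hi):
            allowed[t].add(i)
    return allowed

# Illustrative alignment for "k ae t" over 30 frames (300 ms):
segs = [("k", 0, 8), ("ae", 8, 20), ("t", 20, 30)]
ok = allowed_frames(segs, tol_frames=5, num_frames=30)
# Paths placing "k" later than frame 12 (80 ms + 50 ms tolerance) are cut.
```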
Step 5: perform acoustic model training using a time-delay neural network (Time-delay Neural Network, TDNN) structure combined with the CTC method, obtaining the final TDNN-CTC acoustic model; ReLU is used as the activation function in the TDNN. The ReLU activation function is:
g(y) = max(0, y)
where g(y) denotes the value of a neuron node after the activation function, and y denotes the value of the neuron node output by the network before the activation function.
In the present embodiment, a time-delay neural network (Time-delay Neural Network, TDNN) structure is used. The TDNN has seven layers, each hidden layer has 576 output nodes, and ReLU activation functions are used. The per-layer splicing configurations are: {-1,0,1} {-1,0,1,2} {-3,0,3} {-3,0,3} {-6,3,0} {0}, where {-1,0,1} means that the first layer splices together the current frame of the input layer and the input features of the preceding and following frames as its input; by analogy, each layer splices together the outputs of its preceding layer at several time offsets as its input. The network has 78 output nodes (39 English phonemes and the corresponding 39 "blank" symbols).
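The splicing offsets above determine how wide an input context each TDNN output sees, which can be sketched as follows. The fifth layer is assumed here to be {-6, -3, 0} (the printed "{-6, 3, 0}" looks like a transcription slip, since every other layer lists its offsets in ascending order); this is an assumption, not a statement of the patent.

```python
def receptive_field(contexts):
    """Total input context of a TDNN whose layer l splices the previous
    layer's outputs at the given offsets: offsets add up across layers."""
    lo = sum(min(c) for c in contexts)
    hi = sum(max(c) for c in contexts)
    return lo, hi

# Per-layer splicing from the embodiment (fifth layer assumed {-6, -3, 0}):
contexts = [[-1, 0, 1], [-1, 0, 1, 2], [-3, 0, 3], [-3, 0, 3], [-6, -3, 0], [0]]
lo, hi = receptive_field(contexts)
# Under this assumption, the top-layer output at frame t depends on input
# frames t-14 .. t+9, i.e. roughly 140 ms of left and 90 ms of right context.
```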
In the search graph of the forward-backward computation, each "blank" is allowed to repeat consecutively an arbitrary number of times, while the modeled phonemes between two adjacent "blank" symbols may not repeat consecutively. The state-transition topology is: if the current state is a phoneme, the next state may jump to its "blank" state or to the next phoneme; if the current state is a "blank" state, the next state may jump to the "blank" itself or to the next phoneme.
" blank " symbol is shared using all phonemes in multiple independent " blank " symbol substitution original CTC models.
Finally, it should be noted that the above embodiments are merely intended to illustrate, and not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art will understand that modifications or equivalent replacements of the technical solution of the present invention that do not depart from its spirit and scope shall all be covered by the scope of the claims of the present invention.

Claims (3)

1. A CTC-based acoustic model training method, characterized in that the method comprises:
    Step 1: train an initial GMM model and use it to force-align the text transcription of the training data in time, obtaining the time region corresponding to each phoneme in the phoneme label sequence of the training data;
    Step 2: in the phoneme label sequence of the training data, insert after each phoneme a "blank" symbol associated with that phoneme, so that each phoneme has its own dedicated "blank" symbol; i.e., the number of "blank" symbols equals the number of phonemes in the phoneme label sequence of the training data;
    Step 3: using a finite-state machine, construct from the phoneme label sequence augmented with "blank" symbols the search graph for the CTC forward-backward computation;
    Step 4: according to the time alignment result of step 1, restrict the time range in which each phoneme may appear: taking the time at which each phoneme appears in the alignment result as reference, set a "time tolerance" parameter specifying the allowed time of occurrence of each phoneme, and prune the search graph built in step 3 according to this restriction, cutting away paths in which a phoneme's position exceeds the time restriction, to obtain the final search graph needed when CTC computes the network error;
    Step 5: perform acoustic model training using a time-delay neural network (TDNN) structure combined with the CTC method, obtaining the final TDNN-CTC acoustic model, with ReLU as the activation function in the TDNN; the ReLU activation function is:
    g(y) = max(0, y)
    where g(y) denotes the value of a neuron node after the activation function, and y denotes the value of the neuron node output by the network before the activation function.
  2. The CTC-based acoustic model training method according to claim 1, characterized in that, in step 3, in the search graph of the forward-backward computation, each "blank" is allowed to repeat consecutively an arbitrary number of times, and the modeled phonemes between two adjacent "blank" symbols do not repeat consecutively.
  3. The CTC-based acoustic model training method according to claim 1, characterized in that the allowed time of occurrence of each phoneme ranges from 50 to 300 milliseconds.
CN201710002096.9A 2017-01-03 2017-01-03 Acoustic model training method based on CTC Active CN108269568B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710002096.9A CN108269568B (en) 2017-01-03 2017-01-03 Acoustic model training method based on CTC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710002096.9A CN108269568B (en) 2017-01-03 2017-01-03 Acoustic model training method based on CTC

Publications (2)

Publication Number Publication Date
CN108269568A true CN108269568A (en) 2018-07-10
CN108269568B CN108269568B (en) 2021-07-30

Family

ID=62770976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710002096.9A Active CN108269568B (en) 2017-01-03 2017-01-03 Acoustic model training method based on CTC

Country Status (1)

Country Link
CN (1) CN108269568B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147774A (en) * 2018-09-19 2019-01-04 华南理工大学 An improved time-delay neural network acoustic model
CN110349570A (en) * 2019-08-16 2019-10-18 问问智能信息科技有限公司 Speech recognition model training method, readable storage medium and electronic device
CN110349571A (en) * 2019-08-23 2019-10-18 北京声智科技有限公司 Training method based on connectionist temporal classification and related apparatus
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system
CN110556100A (en) * 2019-09-10 2019-12-10 苏州思必驰信息科技有限公司 Training method and system of end-to-end speech recognition model
CN110706695A (en) * 2019-10-17 2020-01-17 北京声智科技有限公司 Data labeling method and device
CN111312227A (en) * 2018-12-11 2020-06-19 上海元趣信息技术有限公司 Structure model of speech recognition technology
CN112233655A (en) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 Neural network training method for improving voice command word recognition performance
CN113362803A (en) * 2021-05-31 2021-09-07 杭州芯声智能科技有限公司 ARM side off-line voice synthesis method, device and storage medium
WO2021198838A1 (en) * 2020-04-03 2021-10-07 International Business Machines Corporation Training of model for processing sequence data
CN113707137A (en) * 2021-08-30 2021-11-26 普强时代(珠海横琴)信息技术有限公司 Decoding implementation method and device
WO2022134894A1 (en) * 2020-12-23 2022-06-30 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, computer device, and storage medium
CN115910044A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN118101632A (en) * 2024-04-22 2024-05-28 安徽声讯信息技术有限公司 Voice low-delay signal transmission method and system based on artificial intelligence
CN118101632B (en) * 2024-04-22 2024-06-21 安徽声讯信息技术有限公司 Voice low-delay signal transmission method and system based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105529027A (en) * 2015-12-14 2016-04-27 百度在线网络技术(北京)有限公司 Voice identification method and apparatus
US20160351188A1 (en) * 2015-05-26 2016-12-01 Google Inc. Learning pronunciations from acoustic sequences
CN106251859A (en) * 2016-07-22 2016-12-21 百度在线网络技术(北京)有限公司 Voice recognition processing method and apparatus
US20160372119A1 (en) * 2015-06-19 2016-12-22 Google Inc. Speech recognition with acoustic models

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160351188A1 (en) * 2015-05-26 2016-12-01 Google Inc. Learning pronunciations from acoustic sequences
US20160372119A1 (en) * 2015-06-19 2016-12-22 Google Inc. Speech recognition with acoustic models
CN105529027A (en) * 2015-12-14 2016-04-27 百度在线网络技术(北京)有限公司 Voice identification method and apparatus
CN106251859A (en) * 2016-07-22 2016-12-21 百度在线网络技术(北京)有限公司 Voice recognition processing method and apparatus

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANDREW SENIOR: "Acoustic modelling with CD-CTC-SMBR LSTM RNNS", 《2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU) 》 *
HAŞIM SAK: "Fast and accurate recurrent neural network acoustic models for speech recognition", 《THE INTERSPEECH 2015 PROCEEDINGS》 *
姚海涛: "Research on acoustic modeling methods for multilingual speech recognition", 《Proceedings of the 11th Youth Academic Conference of the Acoustical Society of China》 *
李杰: "基于深度学***台》 *
雷鸣: "Research on acoustic modeling methods in statistical parametric speech synthesis", 《China Doctoral Dissertations Full-text Database》 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147774A (en) * 2018-09-19 2019-01-04 华南理工大学 An improved time-delay neural network acoustic model
CN109147774B (en) * 2018-09-19 2021-07-20 华南理工大学 Improved time-delay neural network acoustic model
CN111312227A (en) * 2018-12-11 2020-06-19 上海元趣信息技术有限公司 Structure model of speech recognition technology
CN110349570A (en) * 2019-08-16 2019-10-18 问问智能信息科技有限公司 Speech recognition model training method, readable storage medium and electronic device
CN110349571A (en) * 2019-08-23 2019-10-18 北京声智科技有限公司 Training method based on connectionist temporal classification and related apparatus
CN110349571B (en) * 2019-08-23 2021-09-07 北京声智科技有限公司 Training method based on connection time sequence classification and related device
CN110556100B (en) * 2019-09-10 2021-09-17 思必驰科技股份有限公司 Training method and system of end-to-end speech recognition model
CN110556100A (en) * 2019-09-10 2019-12-10 苏州思必驰信息科技有限公司 Training method and system of end-to-end speech recognition model
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system
CN110556093B (en) * 2019-09-17 2021-12-10 浙江同花顺智富软件有限公司 Voice marking method and system
CN110706695B (en) * 2019-10-17 2022-02-18 北京声智科技有限公司 Data labeling method and device
CN110706695A (en) * 2019-10-17 2020-01-17 北京声智科技有限公司 Data labeling method and device
WO2021198838A1 (en) * 2020-04-03 2021-10-07 International Business Machines Corporation Training of model for processing sequence data
GB2609157A (en) * 2020-04-03 2023-01-25 Ibm Training of model for processing sequence data
CN112233655A (en) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 Neural network training method for improving voice command word recognition performance
WO2022134894A1 (en) * 2020-12-23 2022-06-30 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, computer device, and storage medium
CN113362803A (en) * 2021-05-31 2021-09-07 杭州芯声智能科技有限公司 ARM side off-line voice synthesis method, device and storage medium
CN113362803B (en) * 2021-05-31 2023-04-25 杭州芯声智能科技有限公司 ARM side offline speech synthesis method, ARM side offline speech synthesis device and storage medium
CN113707137A (en) * 2021-08-30 2021-11-26 普强时代(珠海横琴)信息技术有限公司 Decoding implementation method and device
CN113707137B (en) * 2021-08-30 2024-02-20 普强时代(珠海横琴)信息技术有限公司 Decoding realization method and device
CN115910044A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN118101632A (en) * 2024-04-22 2024-05-28 安徽声讯信息技术有限公司 Voice low-delay signal transmission method and system based on artificial intelligence
CN118101632B (en) * 2024-04-22 2024-06-21 安徽声讯信息技术有限公司 Voice low-delay signal transmission method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN108269568B (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN108269568A (en) A CTC-based acoustic model training method
KR102033411B1 (en) Apparatus and Method for Recognizing speech By Using Attention-based Context-Dependent Acoustic Model
Chen et al. Pipelined Back-Propagation for Context-Dependent Deep Neural Networks.
US6128606A (en) Module for constructing trainable modular network in which each module inputs and outputs data structured as a graph
WO2016101688A1 (en) Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network
JP7070894B2 (en) Time series information learning system, method and neural network model
CN109859743A (en) Audio identification methods, system and machinery equipment
US10825445B2 (en) Method and apparatus for training acoustic model
CN104137178B (en) Acoustic treatment unit interface
CN109346064A (en) Training method and system for end-to-end speech identification model
CN111916058A (en) Voice recognition method and system based on incremental word graph re-scoring
CN108389575A (en) Audio data recognition methods and system
JP2019159058A (en) Speech recognition system, speech recognition method, learned model
Ström Sparse connection and pruning in large dynamic artificial neural networks
JP2018060047A (en) Learning device for acoustic model and computer program therefor
CN110047462A (en) A kind of phoneme synthesizing method, device and electronic equipment
CN108461080A (en) A kind of Acoustic Modeling method and apparatus based on HLSTM models
Gopalakrishnan et al. Sentiment analysis using simplified long short-term memory recurrent neural networks
Khursheed et al. Tiny-crnn: Streaming wakeword detection in a low footprint setting
Pitler et al. A linear-time transition system for crossing interval trees
KR20180068475A (en) Method and device to recognize based on recurrent model and to train recurrent model
Lee High-order hidden Markov model and application to continuous mandarin digit recognition
CN110047463A (en) A kind of phoneme synthesizing method, device and electronic equipment
CN110188355A (en) A kind of segmenting method based on WFST technology, system, equipment and medium
Ruan et al. An improved *** lhasa speech recognition method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant