CN111653275B - Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method - Google Patents


Info

Publication number
CN111653275B
CN111653275B (application CN202010253075.6A)
Authority
CN
China
Prior art keywords
model
output
input
sequence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010253075.6A
Other languages
Chinese (zh)
Other versions
CN111653275A (en
Inventor
高戈
曾邦
杨玉红
陈怡�
尹文兵
王霄
方依云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010253075.6A priority Critical patent/CN111653275B/en
Publication of CN111653275A publication Critical patent/CN111653275A/en
Application granted granted Critical
Publication of CN111653275B publication Critical patent/CN111653275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training


Abstract

The invention discloses a method and a device for constructing a speech recognition model based on LSTM-CTC tail convolution, and a speech recognition method. An LSTM is used to train the speech recognition model, CTC serves as the loss function, and a convolution layer parallelizes the computation that the original fully-connected layer had to perform all at once. By exploiting the parallelism of convolution kernels, the convolution-layer-based LSTM-CTC network no longer requires all inputs of the former fully-connected layer to reside in memory simultaneously, which accelerates optimization of the network. Compared with the prior art, the method and the device accelerate training of the speech model, reduce developers' time cost, and to some extent lower the hardware requirements.

Description

Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method
Technical Field
The invention relates to the field of voice recognition, in particular to a method and a device for constructing a voice recognition model based on LSTM-CTC tail convolution and a voice recognition method.
Background
Speech recognition technology lets a machine convert a speech signal into the corresponding text or command through recognition and understanding. In recent years, with the boom in artificial intelligence technology, speech recognition has also developed rapidly, and speech recognition models have been updated and optimized several times; typical models include the Hidden Markov Model (HMM), the Deep Neural Network (DNN), the Recurrent Neural Network (RNN) and the Long Short-Term Memory network (LSTM).
Among them, the long short-term memory network with CTC as its loss function (LSTM-CTC) is widely used in speech recognition because it is easy to train, efficient to decode and performs well.
The inventor of the present application, in the process of implementing the present invention, found that the prior-art method has at least the following technical problem:
although LSTM-CTC has many advantages, the temporal nature of LSTM makes it difficult to parallelize during network training, so training is very time-consuming, and the hardware requirements on the machine are also raised to some extent.
The prior art therefore suffers from the technical problem of long model training time.
Disclosure of Invention
The invention provides a method and a device for constructing a speech recognition model based on LSTM-CTC tail convolution and a speech recognition method, which are used for solving or at least partially solving the technical problem of long model training time in the method in the prior art.
In order to solve the above technical problems, a first aspect of the present invention provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, including:
s1: acquiring training data;
s2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as an input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
s3: inputting the obtained training data into a neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model which is used as a voice recognition model.
In one embodiment, S1 specifically includes:
FBank features extracted from the speech data are used as training data.
In an embodiment, S3 specifically includes:
S3.1: calculating the forward variable α(t, u), which is the total probability of all paths whose outputs up to time t map onto the first u symbols of the extended label sequence l' (the target sequence l with blank characters inserted between labels and at both ends), as follows:
α(t, u) = y^t_{l'_u} · Σ_{i=f(u)}^{u} α(t-1, i)
where f(u) = u - 1 if l'_u is the blank (space) character or l'_{u-2} = l'_u, and f(u) = u - 2 otherwise; u denotes a position in the extended sequence l', y^t_k denotes the probability that the output is character k at time t, and l'_u denotes the label at position u of l';
S3.2: calculating the backward variable β(t, u), which is the total probability of all path suffixes π' that, appended from time t + 1 onwards to the paths counted in the forward variable α(t, u), make the complete path map onto the sequence l, as follows:
β(t, u) = Σ_{i=u}^{g(u)} β(t+1, i) · y^{t+1}_{l'_i}
where g(u) = u + 1 if l'_u is the blank character or l'_{u+2} = l'_u, and g(u) = u + 2 otherwise, and y^{t+1}_{l'_i} denotes the probability of outputting l'_i at time t + 1;
S3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables, valid at any time step t, as follows:
L(x, z) = -ln p(z|x) = -ln Σ_{u=1}^{|l'|} α(t, u) β(t, u)
S3.4: training the model with a stochastic gradient descent algorithm, the gradient of the loss function with respect to the network output y^t_k being:
∂L(x, z)/∂y^t_k = -(1 / (p(z|x) · y^t_k)) · Σ_{u ∈ B(z, k)} α(t, u) β(t, u)
where B(z, k) = {u : z'_u = k} is the set of positions at which label k appears in the extended sequence z', p(z|x) = Σ_u α(t, u) β(t, u) represents the posterior probability of the label z with respect to the input x, x represents the training data, and z represents the text information corresponding to the voice, i.e., the label;
s3.5: and judging whether the model reaches the optimum according to the output of the loss function, and stopping training when the model reaches the optimum to obtain a trained model.
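As an illustration of steps S3.1 and S3.3, the following is a minimal NumPy sketch of the CTC forward recursion and the resulting loss -ln p(z|x). The function name, shapes and the tiny example are illustrative assumptions, not part of the patent.

```python
import numpy as np

def ctc_loss(y, label, blank=0):
    """CTC negative log-likelihood -ln p(z|x) via the forward recursion.
    y: (T, K) per-frame output probabilities; label: non-empty target indices."""
    T = y.shape[0]
    ext = [blank]                      # extended sequence l': blanks between
    for c in label:                    # labels and at both ends
        ext += [c, blank]
    U = len(ext)
    alpha = np.zeros((T, U))
    alpha[0, 0] = y[0, blank]          # a path may start with a blank ...
    alpha[0, 1] = y[0, ext[1]]         # ... or with the first label
    for t in range(1, T):
        for u in range(U):
            s = alpha[t - 1, u]
            if u > 0:
                s += alpha[t - 1, u - 1]
            # skip a blank only between two distinct labels
            if u > 1 and ext[u] != blank and ext[u] != ext[u - 2]:
                s += alpha[t - 1, u - 2]
            alpha[t, u] = s * y[t, ext[u]]
    p = alpha[T - 1, U - 1] + alpha[T - 1, U - 2]   # p(z|x)
    return -np.log(p)
```

With two frames of uniform probabilities over {blank, label 1} and target [1], the three valid paths (1,1), (blank,1) and (1,blank) each have probability 0.25, so p(z|x) = 0.75 and the loss is -ln 0.75.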
Based on the same inventive concept, the second aspect of the present invention provides an apparatus for constructing a speech recognition model based on LSTM-CTC tail convolution, comprising:
the training data acquisition module is used for acquiring training data;
the model building module is used for building a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as the input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
and the model training module is used for inputting the acquired training data into the neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model serving as a voice recognition model.
Based on the same inventive concept, a third aspect of the present invention provides a speech recognition method, comprising:
After feature extraction is performed on the speech data to be recognized, the extracted features are input into the speech recognition model constructed in the first aspect to obtain a speech recognition result.
In one embodiment, the recognition process of the speech recognition model includes:
s1: extracting a hidden state sequence with the same length as the input characteristic sequence through an LSTM layer;
s2: performing rank reduction and classification on the input hidden state sequence through the full convolution layer;
s3: the output of the full convolution layer is mapped by the Softmax layer to obtain a class prediction.
In one embodiment, the LSTM layer maintains, at each time step t, the input x_t, the cell state C_t, the temporary (candidate) cell state C̃_t, the hidden state h_t, the forget gate f_t, the input gate i_t and the output gate o_t. Extracting a hidden state sequence of the same length as the input feature sequence through the LSTM layer comprises the following steps:
S1.1: calculating the forget gate, which selects the information to be forgotten:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
where the inputs are the hidden state h_{t-1} at the previous time and the input x_t at the current time, the output is f_t, and W_f and b_f are the weight matrix and bias of the forget gate;
S1.2: calculating the input gate, which selects the information to be memorized:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
where the inputs are the hidden state h_{t-1} at the previous time and the input x_t at the current time, the outputs are the input gate value i_t and the temporary cell state C̃_t; W_i and b_i are the weight matrix and bias of the input gate, and W_C and b_C are the weight matrix and bias of the temporary cell state;
S1.3: calculating the cell state at the current time:
C_t = f_t * C_{t-1} + i_t * C̃_t (element-wise products)
where the inputs are the input gate value i_t, the forget gate value f_t, the temporary cell state C̃_t and the cell state C_{t-1} at the previous time, and the output is the cell state C_t at the current time;
S1.4: calculating the output gate and the hidden state at the current time:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where the inputs are the hidden state h_{t-1} at the previous time, the input x_t at the current time and the cell state C_t at the current time, and the outputs are the output gate value o_t and the hidden state h_t;
S1.5: finally obtaining, through this computation, a hidden state sequence {h_0, h_1, ..., h_{n-1}} of the same length as the input feature sequence.
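Steps S1.1-S1.5 can be sketched in NumPy as follows. The stacked-gate weight layout and the random initialization are implementation assumptions for illustration only, not something the patent prescribes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step (S1.1-S1.4). W maps the concatenated
    [h_{t-1}, x_t] to the four stacked gate pre-activations (f, i, C~, o)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b      # (4H,)
    f_t = sigmoid(z[0:H])                          # S1.1: forget gate
    i_t = sigmoid(z[H:2 * H])                      # S1.2: input gate
    C_tilde = np.tanh(z[2 * H:3 * H])              # S1.2: temporary cell state
    o_t = sigmoid(z[3 * H:4 * H])                  # S1.4: output gate
    C_t = f_t * C_prev + i_t * C_tilde             # S1.3: current cell state
    h_t = o_t * np.tanh(C_t)                       # S1.4: current hidden state
    return h_t, C_t

def lstm_sequence(xs, H, seed=0):
    """S1.5: hidden state sequence {h_0, ..., h_{n-1}}, same length as xs."""
    D = xs.shape[1]
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((4 * H, H + D))  # illustrative random weights
    b = np.zeros(4 * H)
    h, C = np.zeros(H), np.zeros(H)
    hs = []
    for x_t in xs:
        h, C = lstm_step(x_t, h, C, W, b)
        hs.append(h)
    return np.stack(hs)                            # shape (n, H)
```

Because h_t = o_t * tanh(C_t) with o_t in (0, 1), every hidden state component stays strictly inside (-1, 1).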
In an embodiment, S3 specifically includes: characterizing the features output by the full convolution layer as relative probabilities among the different classes to obtain the final class prediction:
S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
where i denotes the i-th class, N denotes the total number of classes, V_i denotes the score of the i-th class output by the full convolution layer, and S_i denotes the probability of the i-th class after softmax processing.
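The softmax mapping above can be written directly. Subtracting max(V) before exponentiating is a standard numerical-stability trick, not part of the formula.

```python
import math

def softmax(V):
    """S_i = exp(V_i) / sum_j exp(V_j); the max-shift leaves the
    result unchanged but avoids overflow for large scores."""
    m = max(V)
    exps = [math.exp(v - m) for v in V]
    s = sum(exps)
    return [e / s for e in exps]
```

For example, two equal scores map to equal probabilities of 0.5 each.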
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, wherein a constructed neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the full convolution layer is adopted to replace a full connection layer between the LSTM layer and the Softmax layer in the traditional scheme, compared with the existing full connection layer, a convolution kernel is used for calculating in the convolution layer, and the calculation of the convolution kernel is parallel, so that the training time of the model can be reduced.
Based on the constructed speech recognition model, the invention also provides a speech recognition method based on the model, thereby improving the speech recognition efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative work.
FIG. 1 is a schematic flow chart of an implementation of a method for constructing a speech recognition model based on LSTM-CTC tail convolution according to the present invention;
FIG. 2 is a flow chart of a LSTM-CTC model provided by an embodiment of the present invention;
FIG. 3 is a block diagram of the construction device of the speech recognition model based on LSTM-CTC tail convolution according to the present invention;
FIG. 4 is a flow chart of the operation of speech recognition using the speech recognition model of the present invention.
Detailed Description
The inventor of the application has found through extensive research and practice that: by construction, a long short-term memory network depends, during backpropagation, on the prediction at the previous time step, and therefore the three gates and the memory cell cannot be computed in parallel. This makes LSTM training very time-consuming, and the temporal nature of LSTM makes the network very difficult to parallelize. Based on this, the present invention aims to reduce the training time of the speech recognition model by modifying the network structure of LSTM-CTC.
In order to achieve the above object, the main concept of the present invention is as follows:
the invention provides a method for constructing a speech recognition model based on LSTM-CTC (Long Short Term Memory connection termination) tail convolution, which replaces a full connection layer between a BilSTM layer and a softmax layer by a full convolution layer to achieve the effect of accelerating network training. The LSTM is used for training a voice recognition model, the CTC is used as a loss function, and the convolutional layer is used for parallelizing calculation which needs to be carried out simultaneously by an original full-connection layer. The LSTM-CTC network based on the convolutional layer utilizes the characteristic of parallel computation of convolutional kernels, so that the original computation of the fully-connected layer does not need to be input into a memory at the same time, and the optimization of the network is accelerated. Compared with the prior art, the method accelerates the training of the voice model, reduces the time cost of a developer, and reduces the requirement standard of hardware to a certain extent.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example one
The embodiment provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, please refer to fig. 1, and the method includes:
s1: acquiring training data;
s2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as an input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
s3: inputting the obtained training data into a neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model which is used as a voice recognition model.
Specifically, the training data in S1 may be acquired from recorded speech data.
In S2, the neural network model framework is constructed. The invention innovatively replaces the fully-connected layer between the LSTM layers and the softmax layer with a convolution layer, and improves the efficiency of model training through the parallel computation of the convolution layer.
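A kernel-size-1 convolution applied along the time axis is mathematically a per-frame linear map, which is why every frame of the hidden state sequence can be classified in parallel with one matrix product. A minimal NumPy sketch of this tail head, with illustrative function names and shapes:

```python
import numpy as np

def tail_convolution(hidden_seq, kernel, bias):
    """Kernel-size-1 1-D convolution plus Softmax over the LSTM hidden
    state sequence. hidden_seq: (n, H) hidden states; kernel: (K, H), one
    weight vector per output class; bias: (K,). The single matmul classifies
    all n frames at once, i.e. in parallel."""
    logits = hidden_seq @ kernel.T + bias            # (n, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)          # per-frame class probs
```

With all-zero weights every frame receives the uniform distribution over the K classes, which makes the shapes easy to check.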
CTC (Connectionist Temporal Classification) in S3 can be trained directly on sequences: it introduces a loss function that allows training directly on unsegmented sequence data.
In one embodiment, S1 specifically includes:
FBank features extracted from the speech data are used as training data.
Specifically, the FBank feature of the audio can be obtained by acquiring voice data through an audio input device and then by audio front-end processing.
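For concreteness, a minimal NumPy sketch of FBank (log mel filterbank) extraction. The 25 ms frame / 10 ms hop at 16 kHz, Hamming window, FFT size and filter count are common defaults assumed here; the patent does not specify them.

```python
import numpy as np

def fbank(signal, sr=16000, n_fft=512, n_mels=8, frame_len=400, hop=160):
    """Log mel filterbank energies, shape (n_frames, n_mels)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft  # power spectrum

    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # triangular filters spaced evenly on the mel scale
    pts = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fb.T + 1e-10)
```

One second of 16 kHz audio yields 98 frames of 8 filterbank energies under these assumed settings.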
In one embodiment, S3 specifically includes:
S3.1: calculating the forward variable α(t, u), which is the total probability of all paths whose outputs up to time t map onto the first u symbols of the extended label sequence l' (the target sequence l with blank characters inserted between labels and at both ends), as follows:
α(t, u) = y^t_{l'_u} · Σ_{i=f(u)}^{u} α(t-1, i)
where f(u) = u - 1 if l'_u is the blank (space) character or l'_{u-2} = l'_u, and f(u) = u - 2 otherwise; u denotes a position in the extended sequence l', y^t_k denotes the probability that the output is character k at time t, and l'_u denotes the label at position u of l';
S3.2: calculating the backward variable β(t, u), which is the total probability of all path suffixes π' that, appended from time t + 1 onwards to the paths counted in the forward variable α(t, u), make the complete path map onto the sequence l, as follows:
β(t, u) = Σ_{i=u}^{g(u)} β(t+1, i) · y^{t+1}_{l'_i}
where g(u) = u + 1 if l'_u is the blank character or l'_{u+2} = l'_u, and g(u) = u + 2 otherwise, and y^{t+1}_{l'_i} denotes the probability of outputting l'_i at time t + 1;
S3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables, valid at any time step t, as follows:
L(x, z) = -ln p(z|x) = -ln Σ_{u=1}^{|l'|} α(t, u) β(t, u)
S3.4: training the model with a stochastic gradient descent algorithm, the gradient of the loss function with respect to the network output y^t_k being:
∂L(x, z)/∂y^t_k = -(1 / (p(z|x) · y^t_k)) · Σ_{u ∈ B(z, k)} α(t, u) β(t, u)
where B(z, k) = {u : z'_u = k} is the set of positions at which label k appears in the extended sequence z', p(z|x) = Σ_u α(t, u) β(t, u) represents the posterior probability of the label z with respect to the input x, x represents the training data, and z represents the text information corresponding to the voice, i.e., the label;
s3.5: and judging whether the model reaches the optimum according to the output of the loss function, and stopping training when the model reaches the optimum to obtain a trained model.
Specifically, CTC is used as the loss function and the network is trained with a stochastic gradient descent (SGD) algorithm. Whether the model is optimal is measured by the loss function: if it is optimal, training stops; if not, the loss guides the next round of training and optimization of the network in combination with the stochastic gradient descent algorithm.
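The forward and backward variables of S3.1-S3.2 combine into the gradient of S3.4, which is what each stochastic gradient descent step consumes. A self-contained NumPy sketch of that computation, with illustrative names and a non-empty target assumed:

```python
import numpy as np

def ctc_grad(y, label, blank=0):
    """Gradient of L(x, z) = -ln p(z|x) w.r.t. the per-frame outputs
    y[t, k], via forward variables alpha (S3.1) and backward variables
    beta (S3.2), as in S3.4. y: (T, K) probabilities."""
    T, _ = y.shape
    ext = [blank]                                   # extended sequence z'
    for c in label:
        ext += [c, blank]
    U = len(ext)

    alpha = np.zeros((T, U))                        # forward pass (S3.1)
    alpha[0, 0], alpha[0, 1] = y[0, ext[0]], y[0, ext[1]]
    for t in range(1, T):
        for u in range(U):
            s = alpha[t - 1, u] + (alpha[t - 1, u - 1] if u > 0 else 0.0)
            if u > 1 and ext[u] != blank and ext[u] != ext[u - 2]:
                s += alpha[t - 1, u - 2]
            alpha[t, u] = s * y[t, ext[u]]

    beta = np.zeros((T, U))                         # backward pass (S3.2)
    beta[T - 1, U - 1] = beta[T - 1, U - 2] = 1.0
    for t in range(T - 2, -1, -1):
        for u in range(U):
            s = beta[t + 1, u] * y[t + 1, ext[u]]
            if u + 1 < U:
                s += beta[t + 1, u + 1] * y[t + 1, ext[u + 1]]
            if u + 2 < U and ext[u] != blank and ext[u + 2] != ext[u]:
                s += beta[t + 1, u + 2] * y[t + 1, ext[u + 2]]
            beta[t, u] = s

    p = alpha[T - 1, U - 1] + alpha[T - 1, U - 2]   # p(z|x)
    grad = np.zeros_like(y)                         # S3.4: dL/dy[t, k]
    for t in range(T):
        for u in range(U):
            grad[t, ext[u]] += alpha[t, u] * beta[t, u] / y[t, ext[u]]
    return -grad / p
```

An SGD step would then simply move the network outputs (via the chain rule through the Softmax layer) against this gradient.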
Referring to fig. 2, which is a flow chart of the speech recognition model: training data is first input, and the network structure is then constructed from two LSTM layers (LSTM1 and LSTM2), a full convolution layer and a Softmax layer. After the structure of the model is determined, the model is trained with the CTC loss function to finally obtain the speech recognition model.
Compared with the prior art, the invention has the following advantages and beneficial effects: the time cost of network training is saved, and the hardware requirement of the network training is reduced to a certain extent.
Example two
Based on the same inventive concept, the embodiment provides a device for constructing a speech recognition model based on LSTM-CTC tail convolution, please refer to fig. 3, the device includes:
a training data acquisition module 201, configured to acquire training data;
the model building module 202 is configured to build a neural network model, where the neural network model includes two LSTM layers, a full convolution layer and a Softmax layer, where the LSTM layer is used to extract a hidden state sequence with the same length as an input feature sequence, the full convolution layer is used to reduce the rank and classify the input hidden state sequence, and the Softmax layer is used to map the output of the full convolution layer to obtain a category prediction;
and the model training module 203 is used for inputting the acquired training data into the neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model serving as a voice recognition model.
Since the apparatus introduced in the second embodiment of the present invention is an apparatus used for implementing the method for constructing the speech recognition model based on the LSTM-CTC tail convolution in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and deformation of the apparatus based on the method introduced in the first embodiment of the present invention, and thus details are not described herein. All the devices adopted in the method of the first embodiment of the present invention belong to the protection scope of the present invention.
EXAMPLE III
Based on the same inventive concept, the embodiment provides a speech recognition method, which comprises the following steps:
After feature extraction is performed on the speech data to be recognized, the extracted features are input into the speech recognition model constructed in the first embodiment to obtain a speech recognition result.
In one embodiment, the recognition process of the speech recognition model includes:
s1: extracting a hidden state sequence with the same length as the input characteristic sequence through an LSTM layer;
s2: performing rank reduction and classification on the input hidden state sequence through the full convolution layer;
s3: the output of the full convolution layer is mapped by the Softmax layer to obtain a class prediction.
In one embodiment, the LSTM layer maintains, at each time step t, the input x_t, the cell state C_t, the temporary (candidate) cell state C̃_t, the hidden state h_t, the forget gate f_t, the input gate i_t and the output gate o_t. Extracting a hidden state sequence of the same length as the input feature sequence through the LSTM layer comprises the following steps:
S1.1: calculating the forget gate, which selects the information to be forgotten:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
where the inputs are the hidden state h_{t-1} at the previous time and the input x_t at the current time, the output is f_t, and W_f and b_f are the weight matrix and bias of the forget gate;
S1.2: calculating the input gate, which selects the information to be memorized:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
where the inputs are the hidden state h_{t-1} at the previous time and the input x_t at the current time, the outputs are the input gate value i_t and the temporary cell state C̃_t; W_i and b_i are the weight matrix and bias of the input gate, and W_C and b_C are the weight matrix and bias of the temporary cell state;
S1.3: calculating the cell state at the current time:
C_t = f_t * C_{t-1} + i_t * C̃_t (element-wise products)
where the inputs are the input gate value i_t, the forget gate value f_t, the temporary cell state C̃_t and the cell state C_{t-1} at the previous time, and the output is the cell state C_t at the current time;
S1.4: calculating the output gate and the hidden state at the current time:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where the inputs are the hidden state h_{t-1} at the previous time, the input x_t at the current time and the cell state C_t at the current time, and the outputs are the output gate value o_t and the hidden state h_t;
S1.5: finally obtaining, through this computation, a hidden state sequence {h_0, h_1, ..., h_{n-1}} of the same length as the input feature sequence.
Specifically, S1.1-S1.5 describe the implementation of the LSTM layer in detail. The two LSTM layers are functionally identical; stacking multiple LSTM layers deepens the network and enhances the expressive power of the model, but because of the vanishing-gradient problem, two LSTM layers are selected for training and prediction.
In one embodiment, S3 specifically includes: characterizing the features output by the full convolution layer as relative probabilities among the different classes to obtain the final class prediction:
S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
where i denotes the i-th class, N denotes the total number of classes, V_i denotes the score of the i-th class output by the full convolution layer, and S_i denotes the probability of the i-th class after softmax processing.
Referring to fig. 4, which is a flowchart of performing speech recognition by using a speech recognition model, the Fbank feature extracted from the training speech is used for model training, the obtained decoding model is the final speech recognition model, and the speech to be recognized or the test speech is input into the decoding model to obtain the final recognition result, i.e., the recognition text.
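At recognition time the frame-level Softmax outputs still have to be mapped back to text. A common choice is greedy (best-path) CTC decoding, which collapses consecutive repeats and deletes blanks; the patent does not specify its decoding algorithm, so this is a hedged illustration:

```python
def ctc_greedy_decode(probs, blank=0):
    """Greedy CTC decoding: pick the highest-probability class at every
    frame, collapse consecutive repeats, then delete blank symbols."""
    best = [max(range(len(p)), key=p.__getitem__) for p in probs]
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out
```

For example, the frame-wise best path (1, 1, blank, 1) decodes to the label sequence [1, 1]: the first two frames collapse into one symbol, and the blank separates it from the final one.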
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass these modifications and variations.

Claims (8)

1. The method for constructing the voice recognition model based on the LSTM-CTC tail convolution is characterized by comprising the following steps of:
s1: acquiring training data;
s2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as an input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
s3: inputting the obtained training data into a neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model which is used as a voice recognition model.
2. The method of claim 1, wherein S1 specifically comprises:
FBank features extracted from the speech data are used as training data.
3. The method of claim 1, wherein S3 specifically comprises:
s3.1: calculating the forward variable α(t, u), where α(t, u) is the sum of the probabilities of all paths of length t whose mapping yields the first u symbols of the extended sequence l' (the target sequence l with blank characters inserted between labels and at both ends), as follows:
α(t, u) = y^t_{l'_u} · Σ_{i=f(u)}^{u} α(t−1, i)
with f(u) = u−1 if l'_u = b or l'_{u−2} = l'_u, and f(u) = u−2 otherwise, wherein u denotes a position in the extended sequence l', y^t_b denotes the probability of outputting the blank character b at time t, and l'_u denotes the label output at position u;
s3.2: calculating the backward variable β(t, u), which is the sum of the probabilities of all path suffixes π' appended from time t+1 onwards to a prefix counted in the forward variable α(t, u), such that the complete path maps to the sequence l, as follows:
β(t, u) = Σ_{i=u}^{g(u)} y^{t+1}_{l'_i} · β(t+1, i)
with g(u) = u+1 if l'_u = b or l'_{u+2} = l'_u, and g(u) = u+2 otherwise, wherein y^{t+1}_b denotes the probability of outputting the blank character b at time t+1, and l'_u denotes the label output at position u;
s3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables; since for any time t the posterior satisfies p(z | x) = Σ_{u=1}^{|l'|} α(t, u)·β(t, u), the loss is
L(x, z) = −ln p(z | x) = −ln Σ_{u=1}^{|l'|} α(t, u)·β(t, u);
s3.4: training the model by a stochastic gradient descent algorithm, the gradient of the loss function with respect to the network output a^t_k (the unnormalized score for label k at time t) being:
∂L(x, z)/∂a^t_k = y^t_k − (1 / p(z | x)) · Σ_{u ∈ B(z, k)} α(t, u)·β(t, u)
where B(z, k) = {u : z'_u = k} is the set of positions at which label k appears in the extended sequence z', y^t_k denotes the probability of outputting label k at time t, p(z | x) denotes the posterior probability of the label z given the input x, x denotes the training data, and z denotes the text corresponding to the speech, i.e. the label;
s3.5: judging whether the model is optimal according to the output of the loss function, and stopping training when the optimum is reached, so as to obtain the trained model.
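The forward recursion and loss of steps S3.1 and S3.3 can be sketched as follows (an illustrative Python sketch, not the patented implementation; the toy frame probabilities and three-symbol vocabulary are invented for the example):

```python
import math

def ctc_forward_loss(probs, labels, blank=0):
    """Minimal CTC loss via the forward variable alpha(t, u).

    probs:  per-frame probability distributions y^t (already softmaxed)
    labels: target label sequence l (without blanks)
    Returns -ln p(l | x), where p(l | x) is the sum of alpha over the
    last two positions of the extended sequence at the final frame.
    """
    # Extended sequence l': blanks inserted between labels and at both ends.
    ext = [blank]
    for k in labels:
        ext += [k, blank]
    T, U = len(probs), len(ext)

    alpha = [[0.0] * U for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]          # start with a blank ...
    alpha[0][1] = probs[0][ext[1]]          # ... or with the first label
    for t in range(1, T):
        for u in range(U):
            s = alpha[t - 1][u]
            if u >= 1:
                s += alpha[t - 1][u - 1]
            # Skip the preceding blank only when the two labels differ
            # (the f(u) = u - 2 case of the recursion).
            if u >= 2 and ext[u] != blank and ext[u] != ext[u - 2]:
                s += alpha[t - 1][u - 2]
            alpha[t][u] = s * probs[t][ext[u]]

    # Valid paths end in the last label or the trailing blank.
    p = alpha[T - 1][U - 1] + alpha[T - 1][U - 2]
    return -math.log(p)

# Toy example: 3 frames, vocabulary {0: blank, 1, 2}, target sequence [1, 2].
probs = [[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
loss = ctc_forward_loss(probs, [1, 2])
```

In practice the recursion is carried out in log space to avoid underflow on long utterances; the plain-probability form above mirrors the formulas in S3.1 and S3.3 directly.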
4. An apparatus for constructing a voice recognition model based on LSTM-CTC tail convolution, characterized by comprising:
the training data acquisition module is used for acquiring training data;
the model building module is used for building a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as the input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
and the model training module is used for inputting the acquired training data into the neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model serving as a voice recognition model.
5. A speech recognition method, comprising:
the speech recognition result is obtained by inputting the speech data to be recognized into the speech recognition model according to any one of claims 1 to 3 after feature extraction.
6. The method of claim 5, wherein the recognition process of the speech recognition model comprises:
s1: extracting a hidden state sequence with the same length as the input characteristic sequence through an LSTM layer;
s2: performing rank reduction and classification on the input hidden state sequence through the full convolution layer;
s3: the output of the full convolution layer is mapped by the Softmax layer to obtain a class prediction.
7. The method of claim 6, wherein the LSTM layer involves, at each time instant t, an input word x_t, a cell state C_t, a temporary cell state C̃_t, a hidden state h_t, a forget gate f_t, an input gate i_t and an output gate o_t, and extracting a hidden state sequence with the same length as the input feature sequence through the LSTM layer comprises:
s1.1: calculating the forget gate, which selects the information to be forgotten: f_t = σ(W_f·[h_{t−1}, x_t] + b_f)
wherein the inputs are the hidden state h_{t−1} at the previous time and the input word x_t at the current time, the output is f_t, and W_f, b_f are respectively the weight matrix and the bias of the forget gate;
s1.2: calculating the input gate, which selects the information to be memorized:
i_t = σ(W_i·[h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C·[h_{t−1}, x_t] + b_C)
wherein the inputs are the hidden state h_{t−1} at the previous time and the input word x_t at the current time, the outputs are the input-gate value i_t and the temporary cell state C̃_t, W_i, b_i are respectively the weight matrix and the bias of the input gate, and W_C, b_C are respectively the weight matrix and the bias of the candidate cell state;
s1.3: calculating the cell state at the current time:
C_t = f_t * C_{t−1} + i_t * C̃_t
wherein the inputs are the input-gate value i_t, the forget-gate value f_t, the temporary cell state C̃_t and the cell state C_{t−1} at the previous time, and the output is the cell state C_t at the current time;
S1.4: compute output gate and current time hidden state
ot=σ(Wo[ht-1,xt]+bo)
ht=ot*tanh(Ct)
Wherein the input is the hidden state h of the previous momentt-1Input word x at the present timetAnd the current time cell status CtThe output is the value o of the output gatetAnd hidden state ht
S1.5: finally, a hidden state sequence { h) with the same length as the input characteristic sequence is obtained through calculation0,h1,...,hn-1}。
8. The method of claim 6, wherein S3 specifically comprises: converting the features output by the full convolution layer into relative probabilities among the different classes to obtain the final class prediction:
S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
wherein i denotes the i-th class, N denotes the total number of classes, V_i denotes the score of the i-th class, and S_i denotes the probability value of the i-th class after softmax processing.
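The mapping in claim 8 is the standard softmax; a minimal sketch (the class scores are invented for illustration, and the max-subtraction is a common numerical-stability trick, not part of the formula itself):

```python
import math

def softmax(v):
    """Map class scores V_i to probabilities S_i = e^{V_i} / sum_j e^{V_j}."""
    m = max(v)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]              # V_i output by the full convolution layer
probs = softmax(scores)               # S_i: a valid probability distribution
```

The predicted class is simply the argmax of the resulting probabilities, which coincides with the argmax of the raw scores since softmax is monotone.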
CN202010253075.6A 2020-04-02 2020-04-02 Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method Active CN111653275B (en)

Publications (2)

Publication Number Publication Date
CN111653275A CN111653275A (en) 2020-09-11
CN111653275B true CN111653275B (en) 2022-06-03


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112235470B (en) * 2020-09-16 2021-11-23 重庆锐云科技有限公司 Incoming call client follow-up method, device and equipment based on voice recognition
CN112233655A (en) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 Neural network training method for improving voice command word recognition performance
CN112802491B (en) * 2021-02-07 2022-06-14 武汉大学 Voice enhancement method for generating confrontation network based on time-frequency domain
CN113192489A (en) * 2021-05-16 2021-07-30 金陵科技学院 Paint spraying robot voice recognition method based on multi-scale enhancement BiLSTM model
CN113808581B (en) * 2021-08-17 2024-03-12 山东大学 Chinese voice recognition method based on acoustic and language model training and joint optimization
CN115563508A (en) * 2022-11-08 2023-01-03 北京百度网讯科技有限公司 Model training method, device and equipment

Citations (2)

Publication number Priority date Publication date Assignee Title
CN109710922A (en) * 2018-12-06 2019-05-03 深港产学研基地产业发展中心 Text recognition method, device, computer equipment and storage medium
CN110633646A (en) * 2019-08-21 2019-12-31 数字广东网络建设有限公司 Method and device for detecting image sensitive information, computer equipment and storage medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US10762637B2 (en) * 2017-10-27 2020-09-01 Siemens Healthcare Gmbh Vascular segmentation using fully convolutional and recurrent neural networks
EP3724819A4 (en) * 2017-12-13 2022-06-22 Cognizant Technology Solutions U.S. Corporation Evolutionary architectures for evolution of deep neural networks
US11315570B2 (en) * 2018-05-02 2022-04-26 Facebook Technologies, Llc Machine learning-based speech-to-text transcription cloud intermediary


Non-Patent Citations (2)

Title
Wu Bangyu et al., "A Chinese Dialogue Model Using Pinyin Dimensionality Reduction," Journal of Chinese Information Processing, 2019, No. 05. *
Yang Yanfang et al., "Accelerometer-Based Gesture Recognition Using Deep Convolutional Long Short-Term Memory Networks," Electronic Measurement Technology, 2019, No. 21. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant