CN111653275B - Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method - Google Patents


Info

Publication number
CN111653275B
CN111653275B (application CN202010253075.6A)
Authority
CN
China
Prior art keywords
model
output
input
sequence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010253075.6A
Other languages
Chinese (zh)
Other versions
CN111653275A (en
Inventor
高戈
曾邦
杨玉红
陈怡�
尹文兵
王霄
方依云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010253075.6A priority Critical patent/CN111653275B/en
Publication of CN111653275A publication Critical patent/CN111653275A/en
Application granted granted Critical
Publication of CN111653275B publication Critical patent/CN111653275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training


Abstract

The invention discloses a method and a device for constructing a speech recognition model based on LSTM-CTC tail convolution, and a speech recognition method. An LSTM is used to train the speech recognition model, CTC serves as the loss function, and a convolution layer parallelizes the computation that the original fully-connected layer had to perform all at once. By exploiting the parallelism of convolution kernels, the convolution-layer-based LSTM-CTC network no longer requires all inputs of the former fully-connected layer to reside in memory simultaneously, which accelerates optimization of the network. Compared with the prior art, the method and the device accelerate training of the speech model, reduce developers' time cost, and to some extent lower the hardware requirements.

Description

Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method
Technical Field
The invention relates to the field of voice recognition, in particular to a method and a device for constructing a voice recognition model based on LSTM-CTC tail convolution and a voice recognition method.
Background
Speech recognition technology lets a machine convert a speech signal into the corresponding text or command through recognition and understanding. In recent years, with the boom in artificial intelligence technology, speech recognition has also developed rapidly, and speech recognition models have been updated and optimized several times; typical models include the Hidden Markov Model (HMM), the Deep Neural Network (DNN), the Recurrent Neural Network (RNN) and the Long Short-Term Memory network (LSTM).
Among them, the long short-term memory network with CTC as its loss function (LSTM-CTC) is widely used in speech recognition because it is easy to train, efficient to decode and performs well.
The inventor of the present application, in the process of implementing the present invention, found that the prior-art method has at least the following technical problem:
although LSTM-CTC has many advantages, the temporal nature of LSTM makes it difficult to parallelize during network training, so training is very time-consuming, and the hardware requirements on the machine are also raised to some extent.
The prior art therefore suffers from the technical problem of long model training time.
Disclosure of Invention
The invention provides a method and a device for constructing a speech recognition model based on LSTM-CTC tail convolution and a speech recognition method, which are used for solving or at least partially solving the technical problem of long model training time in the method in the prior art.
In order to solve the above technical problems, a first aspect of the present invention provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, including:
s1: acquiring training data;
s2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as an input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
s3: inputting the obtained training data into a neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model which is used as a voice recognition model.
In one embodiment, S1 specifically includes:
FBank features extracted from the speech data are used as training data.
In an embodiment, S3 specifically includes:
S3.1: calculating the forward variable α(t, u), which is the total probability of all paths whose outputs up to time t map onto the first u symbols of the extended label sequence l' (the target sequence l with blank characters inserted between labels and at both ends), as follows:
α(t, u) = y^t_{l'_u} · Σ_{i=f(u)}^{u} α(t-1, i)
where f(u) = u - 1 if l'_u is the blank (space) character or l'_{u-2} = l'_u, and f(u) = u - 2 otherwise; u denotes a position in the extended sequence l', y^t_k denotes the probability that the output is character k at time t, and l'_u denotes the label at position u of l';
S3.2: calculating the backward variable β(t, u), which is the total probability of all path suffixes π' that, appended from time t + 1 onwards to the paths counted in the forward variable α(t, u), make the complete path map onto the sequence l, as follows:
β(t, u) = Σ_{i=u}^{g(u)} β(t+1, i) · y^{t+1}_{l'_i}
where g(u) = u + 1 if l'_u is the blank character or l'_{u+2} = l'_u, and g(u) = u + 2 otherwise, and y^{t+1}_{l'_i} denotes the probability of outputting l'_i at time t + 1;
S3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables, valid at any time step t, as follows:
L(x, z) = -ln p(z|x) = -ln Σ_{u=1}^{|l'|} α(t, u) β(t, u)
S3.4: training the model with a stochastic gradient descent algorithm, the gradient of the loss function with respect to the network output y^t_k being:
∂L(x, z)/∂y^t_k = -(1 / (p(z|x) · y^t_k)) · Σ_{u ∈ B(z, k)} α(t, u) β(t, u)
where B(z, k) = {u : z'_u = k} is the set of positions at which label k appears in the extended sequence z', p(z|x) = Σ_u α(t, u) β(t, u) represents the posterior probability of the label z with respect to the input x, x represents the training data, and z represents the text information corresponding to the voice, i.e., the label;
s3.5: and judging whether the model reaches the optimum according to the output of the loss function, and stopping training when the model reaches the optimum to obtain a trained model.
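As an illustration of steps S3.1 and S3.3, the following is a minimal NumPy sketch of the CTC forward recursion and the resulting loss -ln p(z|x). The function name, shapes and the tiny example are illustrative assumptions, not part of the patent.

```python
import numpy as np

def ctc_loss(y, label, blank=0):
    """CTC negative log-likelihood -ln p(z|x) via the forward recursion.
    y: (T, K) per-frame output probabilities; label: non-empty target indices."""
    T = y.shape[0]
    ext = [blank]                      # extended sequence l': blanks between
    for c in label:                    # labels and at both ends
        ext += [c, blank]
    U = len(ext)
    alpha = np.zeros((T, U))
    alpha[0, 0] = y[0, blank]          # a path may start with a blank ...
    alpha[0, 1] = y[0, ext[1]]         # ... or with the first label
    for t in range(1, T):
        for u in range(U):
            s = alpha[t - 1, u]
            if u > 0:
                s += alpha[t - 1, u - 1]
            # skip a blank only between two distinct labels
            if u > 1 and ext[u] != blank and ext[u] != ext[u - 2]:
                s += alpha[t - 1, u - 2]
            alpha[t, u] = s * y[t, ext[u]]
    p = alpha[T - 1, U - 1] + alpha[T - 1, U - 2]   # p(z|x)
    return -np.log(p)
```

With two frames of uniform probabilities over {blank, label 1} and target [1], the three valid paths (1,1), (blank,1) and (1,blank) each have probability 0.25, so p(z|x) = 0.75 and the loss is -ln 0.75.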
Based on the same inventive concept, the second aspect of the present invention provides an apparatus for constructing a speech recognition model based on LSTM-CTC tail convolution, comprising:
the training data acquisition module is used for acquiring training data;
the model building module is used for building a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as the input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
and the model training module is used for inputting the acquired training data into the neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model serving as a voice recognition model.
Based on the same inventive concept, a third aspect of the present invention provides a speech recognition method, comprising:
After feature extraction is performed on the speech data to be recognized, the extracted features are input into the speech recognition model constructed in the first aspect to obtain a speech recognition result.
In one embodiment, the recognition process of the speech recognition model includes:
s1: extracting a hidden state sequence with the same length as the input characteristic sequence through an LSTM layer;
s2: performing rank reduction and classification on the input hidden state sequence through the full convolution layer;
s3: the output of the full convolution layer is mapped by the Softmax layer to obtain a class prediction.
In one embodiment, the LSTM layer maintains, at each time step t, the input x_t, the cell state C_t, the temporary (candidate) cell state C̃_t, the hidden state h_t, the forget gate f_t, the input gate i_t and the output gate o_t. Extracting a hidden state sequence of the same length as the input feature sequence through the LSTM layer comprises the following steps:
S1.1: calculating the forget gate, which selects the information to be forgotten:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
where the inputs are the hidden state h_{t-1} at the previous time and the input x_t at the current time, the output is f_t, and W_f and b_f are the weight matrix and bias of the forget gate;
S1.2: calculating the input gate, which selects the information to be memorized:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
where the inputs are the hidden state h_{t-1} at the previous time and the input x_t at the current time, the outputs are the input gate value i_t and the temporary cell state C̃_t; W_i and b_i are the weight matrix and bias of the input gate, and W_C and b_C are the weight matrix and bias of the temporary cell state;
S1.3: calculating the cell state at the current time:
C_t = f_t * C_{t-1} + i_t * C̃_t (element-wise products)
where the inputs are the input gate value i_t, the forget gate value f_t, the temporary cell state C̃_t and the cell state C_{t-1} at the previous time, and the output is the cell state C_t at the current time;
S1.4: calculating the output gate and the hidden state at the current time:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where the inputs are the hidden state h_{t-1} at the previous time, the input x_t at the current time and the cell state C_t at the current time, and the outputs are the output gate value o_t and the hidden state h_t;
S1.5: finally obtaining, through this computation, a hidden state sequence {h_0, h_1, ..., h_{n-1}} of the same length as the input feature sequence.
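Steps S1.1-S1.5 can be sketched in NumPy as follows. The stacked-gate weight layout and the random initialization are implementation assumptions for illustration only, not something the patent prescribes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step (S1.1-S1.4). W maps the concatenated
    [h_{t-1}, x_t] to the four stacked gate pre-activations (f, i, C~, o)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b      # (4H,)
    f_t = sigmoid(z[0:H])                          # S1.1: forget gate
    i_t = sigmoid(z[H:2 * H])                      # S1.2: input gate
    C_tilde = np.tanh(z[2 * H:3 * H])              # S1.2: temporary cell state
    o_t = sigmoid(z[3 * H:4 * H])                  # S1.4: output gate
    C_t = f_t * C_prev + i_t * C_tilde             # S1.3: current cell state
    h_t = o_t * np.tanh(C_t)                       # S1.4: current hidden state
    return h_t, C_t

def lstm_sequence(xs, H, seed=0):
    """S1.5: hidden state sequence {h_0, ..., h_{n-1}}, same length as xs."""
    D = xs.shape[1]
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((4 * H, H + D))  # illustrative random weights
    b = np.zeros(4 * H)
    h, C = np.zeros(H), np.zeros(H)
    hs = []
    for x_t in xs:
        h, C = lstm_step(x_t, h, C, W, b)
        hs.append(h)
    return np.stack(hs)                            # shape (n, H)
```

Because h_t = o_t * tanh(C_t) with o_t in (0, 1), every hidden state component stays strictly inside (-1, 1).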
In an embodiment, S3 specifically includes: characterizing the features output by the full convolution layer as relative probabilities among the different classes to obtain the final class prediction:
S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
where i denotes the i-th class, N denotes the total number of classes, V_i denotes the score of the i-th class output by the full convolution layer, and S_i denotes the probability of the i-th class after softmax processing.
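The softmax mapping above can be written directly. Subtracting max(V) before exponentiating is a standard numerical-stability trick, not part of the formula.

```python
import math

def softmax(V):
    """S_i = exp(V_i) / sum_j exp(V_j); the max-shift leaves the
    result unchanged but avoids overflow for large scores."""
    m = max(V)
    exps = [math.exp(v - m) for v in V]
    s = sum(exps)
    return [e / s for e in exps]
```

For example, two equal scores map to equal probabilities of 0.5 each.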
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, wherein a constructed neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the full convolution layer is adopted to replace a full connection layer between the LSTM layer and the Softmax layer in the traditional scheme, compared with the existing full connection layer, a convolution kernel is used for calculating in the convolution layer, and the calculation of the convolution kernel is parallel, so that the training time of the model can be reduced.
Based on the constructed speech recognition model, the invention also provides a speech recognition method based on the model, thereby improving the speech recognition efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative work.
FIG. 1 is a schematic flow chart of an implementation of a method for constructing a speech recognition model based on LSTM-CTC tail convolution according to the present invention;
FIG. 2 is a flow chart of a LSTM-CTC model provided by an embodiment of the present invention;
FIG. 3 is a block diagram of the construction device of the speech recognition model based on LSTM-CTC tail convolution according to the present invention;
FIG. 4 is a flow chart of the operation of speech recognition using the speech recognition model of the present invention.
Detailed Description
The inventor of the application has found through extensive research and practice that: by construction, a long short-term memory network depends, during backpropagation, on the prediction at the previous time step, and therefore the three gates and the memory cell cannot be computed in parallel. This makes LSTM training very time-consuming, and the temporal nature of LSTM makes the network very difficult to parallelize. Based on this, the present invention aims to reduce the training time of the speech recognition model by modifying the network structure of LSTM-CTC.
In order to achieve the above object, the main concept of the present invention is as follows:
the invention provides a method for constructing a speech recognition model based on LSTM-CTC (Long Short Term Memory connection termination) tail convolution, which replaces a full connection layer between a BilSTM layer and a softmax layer by a full convolution layer to achieve the effect of accelerating network training. The LSTM is used for training a voice recognition model, the CTC is used as a loss function, and the convolutional layer is used for parallelizing calculation which needs to be carried out simultaneously by an original full-connection layer. The LSTM-CTC network based on the convolutional layer utilizes the characteristic of parallel computation of convolutional kernels, so that the original computation of the fully-connected layer does not need to be input into a memory at the same time, and the optimization of the network is accelerated. Compared with the prior art, the method accelerates the training of the voice model, reduces the time cost of a developer, and reduces the requirement standard of hardware to a certain extent.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example one
The embodiment provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, please refer to fig. 1, and the method includes:
s1: acquiring training data;
s2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as an input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
s3: inputting the obtained training data into a neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model which is used as a voice recognition model.
Specifically, the training data in S1 may be acquired from recorded speech data.
In S2, the neural network model framework is constructed. The invention innovatively replaces the fully-connected layer between the LSTM layers and the softmax layer with a convolution layer, and improves the efficiency of model training through the parallel computation of the convolution layer.
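A kernel-size-1 convolution applied along the time axis is mathematically a per-frame linear map, which is why every frame of the hidden state sequence can be classified in parallel with one matrix product. A minimal NumPy sketch of this tail head, with illustrative function names and shapes:

```python
import numpy as np

def tail_convolution(hidden_seq, kernel, bias):
    """Kernel-size-1 1-D convolution plus Softmax over the LSTM hidden
    state sequence. hidden_seq: (n, H) hidden states; kernel: (K, H), one
    weight vector per output class; bias: (K,). The single matmul classifies
    all n frames at once, i.e. in parallel."""
    logits = hidden_seq @ kernel.T + bias            # (n, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)          # per-frame class probs
```

With all-zero weights every frame receives the uniform distribution over the K classes, which makes the shapes easy to check.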
CTC (Connectionist Temporal Classification) in S3 can be trained directly on sequences: it introduces a loss function that allows training directly on unsegmented sequence data.
In one embodiment, S1 specifically includes:
FBank features extracted from the speech data are used as training data.
Specifically, the FBank feature of the audio can be obtained by acquiring voice data through an audio input device and then by audio front-end processing.
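For concreteness, a minimal NumPy sketch of FBank (log mel filterbank) extraction. The 25 ms frame / 10 ms hop at 16 kHz, Hamming window, FFT size and filter count are common defaults assumed here; the patent does not specify them.

```python
import numpy as np

def fbank(signal, sr=16000, n_fft=512, n_mels=8, frame_len=400, hop=160):
    """Log mel filterbank energies, shape (n_frames, n_mels)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft  # power spectrum

    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # triangular filters spaced evenly on the mel scale
    pts = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fb.T + 1e-10)
```

One second of 16 kHz audio yields 98 frames of 8 filterbank energies under these assumed settings.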
In one embodiment, S3 specifically includes:
S3.1: calculating the forward variable α(t, u), which is the total probability of all paths whose outputs up to time t map onto the first u symbols of the extended label sequence l' (the target sequence l with blank characters inserted between labels and at both ends), as follows:
α(t, u) = y^t_{l'_u} · Σ_{i=f(u)}^{u} α(t-1, i)
where f(u) = u - 1 if l'_u is the blank (space) character or l'_{u-2} = l'_u, and f(u) = u - 2 otherwise; u denotes a position in the extended sequence l', y^t_k denotes the probability that the output is character k at time t, and l'_u denotes the label at position u of l';
S3.2: calculating the backward variable β(t, u), which is the total probability of all path suffixes π' that, appended from time t + 1 onwards to the paths counted in the forward variable α(t, u), make the complete path map onto the sequence l, as follows:
β(t, u) = Σ_{i=u}^{g(u)} β(t+1, i) · y^{t+1}_{l'_i}
where g(u) = u + 1 if l'_u is the blank character or l'_{u+2} = l'_u, and g(u) = u + 2 otherwise, and y^{t+1}_{l'_i} denotes the probability of outputting l'_i at time t + 1;
S3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables, valid at any time step t, as follows:
L(x, z) = -ln p(z|x) = -ln Σ_{u=1}^{|l'|} α(t, u) β(t, u)
S3.4: training the model with a stochastic gradient descent algorithm, the gradient of the loss function with respect to the network output y^t_k being:
∂L(x, z)/∂y^t_k = -(1 / (p(z|x) · y^t_k)) · Σ_{u ∈ B(z, k)} α(t, u) β(t, u)
where B(z, k) = {u : z'_u = k} is the set of positions at which label k appears in the extended sequence z', p(z|x) = Σ_u α(t, u) β(t, u) represents the posterior probability of the label z with respect to the input x, x represents the training data, and z represents the text information corresponding to the voice, i.e., the label;
s3.5: and judging whether the model reaches the optimum according to the output of the loss function, and stopping training when the model reaches the optimum to obtain a trained model.
Specifically, CTC is used as the loss function and the network is trained with a stochastic gradient descent (SGD) algorithm. Whether the model is optimal is measured by the loss function: if it is optimal, training stops; if not, the loss guides the next round of training and optimization of the network in combination with the stochastic gradient descent algorithm.
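The forward and backward variables of S3.1-S3.2 combine into the gradient of S3.4, which is what each stochastic gradient descent step consumes. A self-contained NumPy sketch of that computation, with illustrative names and a non-empty target assumed:

```python
import numpy as np

def ctc_grad(y, label, blank=0):
    """Gradient of L(x, z) = -ln p(z|x) w.r.t. the per-frame outputs
    y[t, k], via forward variables alpha (S3.1) and backward variables
    beta (S3.2), as in S3.4. y: (T, K) probabilities."""
    T, _ = y.shape
    ext = [blank]                                   # extended sequence z'
    for c in label:
        ext += [c, blank]
    U = len(ext)

    alpha = np.zeros((T, U))                        # forward pass (S3.1)
    alpha[0, 0], alpha[0, 1] = y[0, ext[0]], y[0, ext[1]]
    for t in range(1, T):
        for u in range(U):
            s = alpha[t - 1, u] + (alpha[t - 1, u - 1] if u > 0 else 0.0)
            if u > 1 and ext[u] != blank and ext[u] != ext[u - 2]:
                s += alpha[t - 1, u - 2]
            alpha[t, u] = s * y[t, ext[u]]

    beta = np.zeros((T, U))                         # backward pass (S3.2)
    beta[T - 1, U - 1] = beta[T - 1, U - 2] = 1.0
    for t in range(T - 2, -1, -1):
        for u in range(U):
            s = beta[t + 1, u] * y[t + 1, ext[u]]
            if u + 1 < U:
                s += beta[t + 1, u + 1] * y[t + 1, ext[u + 1]]
            if u + 2 < U and ext[u] != blank and ext[u + 2] != ext[u]:
                s += beta[t + 1, u + 2] * y[t + 1, ext[u + 2]]
            beta[t, u] = s

    p = alpha[T - 1, U - 1] + alpha[T - 1, U - 2]   # p(z|x)
    grad = np.zeros_like(y)                         # S3.4: dL/dy[t, k]
    for t in range(T):
        for u in range(U):
            grad[t, ext[u]] += alpha[t, u] * beta[t, u] / y[t, ext[u]]
    return -grad / p
```

An SGD step would then simply move the network outputs (via the chain rule through the Softmax layer) against this gradient.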
Referring to fig. 2, which is a flow chart of the speech recognition model: training data is first input, and the network structure is then constructed from two LSTM layers (LSTM1 and LSTM2), a full convolution layer and a Softmax layer. After the structure of the model is determined, the model is trained with the CTC loss function to finally obtain the speech recognition model.
Compared with the prior art, the invention has the following advantages and beneficial effects: the time cost of network training is saved, and the hardware requirement of the network training is reduced to a certain extent.
Example two
Based on the same inventive concept, the embodiment provides a device for constructing a speech recognition model based on LSTM-CTC tail convolution, please refer to fig. 3, the device includes:
a training data acquisition module 201, configured to acquire training data;
the model building module 202 is configured to build a neural network model, where the neural network model includes two LSTM layers, a full convolution layer and a Softmax layer, where the LSTM layer is used to extract a hidden state sequence with the same length as an input feature sequence, the full convolution layer is used to reduce the rank and classify the input hidden state sequence, and the Softmax layer is used to map the output of the full convolution layer to obtain a category prediction;
and the model training module 203 is used for inputting the acquired training data into the neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model serving as a voice recognition model.
Since the apparatus introduced in the second embodiment of the present invention is an apparatus used for implementing the method for constructing the speech recognition model based on the LSTM-CTC tail convolution in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and deformation of the apparatus based on the method introduced in the first embodiment of the present invention, and thus details are not described herein. All the devices adopted in the method of the first embodiment of the present invention belong to the protection scope of the present invention.
EXAMPLE III
Based on the same inventive concept, the embodiment provides a speech recognition method, which comprises the following steps:
After feature extraction is performed on the speech data to be recognized, the extracted features are input into the speech recognition model constructed in the first embodiment to obtain a speech recognition result.
In one embodiment, the recognition process of the speech recognition model includes:
s1: extracting a hidden state sequence with the same length as the input characteristic sequence through an LSTM layer;
s2: performing rank reduction and classification on the input hidden state sequence through the full convolution layer;
s3: the output of the full convolution layer is mapped by the Softmax layer to obtain a class prediction.
In one embodiment, the LSTM layer maintains, at each time step t, the input x_t, the cell state C_t, the temporary (candidate) cell state C̃_t, the hidden state h_t, the forget gate f_t, the input gate i_t and the output gate o_t. Extracting a hidden state sequence of the same length as the input feature sequence through the LSTM layer comprises the following steps:
S1.1: calculating the forget gate, which selects the information to be forgotten:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
where the inputs are the hidden state h_{t-1} at the previous time and the input x_t at the current time, the output is f_t, and W_f and b_f are the weight matrix and bias of the forget gate;
S1.2: calculating the input gate, which selects the information to be memorized:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
where the inputs are the hidden state h_{t-1} at the previous time and the input x_t at the current time, the outputs are the input gate value i_t and the temporary cell state C̃_t; W_i and b_i are the weight matrix and bias of the input gate, and W_C and b_C are the weight matrix and bias of the temporary cell state;
S1.3: calculating the cell state at the current time:
C_t = f_t * C_{t-1} + i_t * C̃_t (element-wise products)
where the inputs are the input gate value i_t, the forget gate value f_t, the temporary cell state C̃_t and the cell state C_{t-1} at the previous time, and the output is the cell state C_t at the current time;
S1.4: calculating the output gate and the hidden state at the current time:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where the inputs are the hidden state h_{t-1} at the previous time, the input x_t at the current time and the cell state C_t at the current time, and the outputs are the output gate value o_t and the hidden state h_t;
S1.5: finally obtaining, through this computation, a hidden state sequence {h_0, h_1, ..., h_{n-1}} of the same length as the input feature sequence.
Specifically, S1.1-S1.5 describe the implementation of the LSTM layer in detail. The two LSTM layers are functionally identical; stacking multiple LSTM layers deepens the network and enhances the expressive power of the model, but because of the vanishing-gradient problem, two LSTM layers are selected for training and prediction.
In one embodiment, S3 specifically includes: characterizing the features output by the full convolution layer as relative probabilities among the different classes to obtain the final class prediction:
S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
where i denotes the i-th class, N denotes the total number of classes, V_i denotes the score of the i-th class output by the full convolution layer, and S_i denotes the probability of the i-th class after softmax processing.
Referring to fig. 4, which is a flowchart of performing speech recognition by using a speech recognition model, the Fbank feature extracted from the training speech is used for model training, the obtained decoding model is the final speech recognition model, and the speech to be recognized or the test speech is input into the decoding model to obtain the final recognition result, i.e., the recognition text.
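At recognition time the frame-level Softmax outputs still have to be mapped back to text. A common choice is greedy (best-path) CTC decoding, which collapses consecutive repeats and deletes blanks; the patent does not specify its decoding algorithm, so this is a hedged illustration:

```python
def ctc_greedy_decode(probs, blank=0):
    """Greedy CTC decoding: pick the highest-probability class at every
    frame, collapse consecutive repeats, then delete blank symbols."""
    best = [max(range(len(p)), key=p.__getitem__) for p in probs]
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out
```

For example, the frame-wise best path (1, 1, blank, 1) decodes to the label sequence [1, 1]: the first two frames collapse into one symbol, and the blank separates it from the final one.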
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass these modifications and variations.

Claims (8)

1. The method for constructing the voice recognition model based on the LSTM-CTC tail convolution is characterized by comprising the following steps of:
s1: acquiring training data;
s2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as an input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
s3: inputting the obtained training data into a neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model which is used as a voice recognition model.
2. The method of claim 1, wherein S1 specifically comprises:
FBank features extracted from the speech data are used as training data.
3. The method of claim 1, wherein S3 specifically comprises:
s3.1: calculating the forward variable α(t, u), where α(t, u) is the sum of the probabilities of all paths of length t whose mapping yields the first u symbols of the extended sequence l' (the target sequence l with blank characters inserted between labels and at both ends), as follows:
α(t, u) = y^t_{l'_u} · Σ_{i=f(u)}^{u} α(t−1, i)
with f(u) = u−1 if l'_u = b or l'_{u−2} = l'_u, and f(u) = u−2 otherwise, wherein u denotes a position in the extended sequence l', y^t_b denotes the probability of outputting the blank character b at time t, and l'_u denotes the label output at position u;
s3.2: calculating the backward variable β(t, u), which is the sum of the probabilities of all path suffixes π' appended from time t+1 onwards to a prefix counted in the forward variable α(t, u), such that the complete path maps to the sequence l, as follows:
β(t, u) = Σ_{i=u}^{g(u)} y^{t+1}_{l'_i} · β(t+1, i)
with g(u) = u+1 if l'_u = b or l'_{u+2} = l'_u, and g(u) = u+2 otherwise, wherein y^{t+1}_b denotes the probability of outputting the blank character b at time t+1, and l'_u denotes the label output at position u;
s3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables; since for any time t the posterior satisfies p(z | x) = Σ_{u=1}^{|l'|} α(t, u)·β(t, u), the loss is
L(x, z) = −ln p(z | x) = −ln Σ_{u=1}^{|l'|} α(t, u)·β(t, u);
s3.4: training the model by a stochastic gradient descent algorithm, the gradient of the loss function with respect to the network output a^t_k (the unnormalized score for label k at time t) being:
∂L(x, z)/∂a^t_k = y^t_k − (1 / p(z | x)) · Σ_{u ∈ B(z, k)} α(t, u)·β(t, u)
where B(z, k) = {u : z'_u = k} is the set of positions at which label k appears in the extended sequence z', y^t_k denotes the probability of outputting label k at time t, p(z | x) denotes the posterior probability of the label z given the input x, x denotes the training data, and z denotes the text corresponding to the speech, i.e. the label;
s3.5: judging whether the model is optimal according to the output of the loss function, and stopping training when the optimum is reached, so as to obtain the trained model.
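The forward recursion and loss of steps S3.1 and S3.3 can be sketched as follows (an illustrative Python sketch, not the patented implementation; the toy frame probabilities and three-symbol vocabulary are invented for the example):

```python
import math

def ctc_forward_loss(probs, labels, blank=0):
    """Minimal CTC loss via the forward variable alpha(t, u).

    probs:  per-frame probability distributions y^t (already softmaxed)
    labels: target label sequence l (without blanks)
    Returns -ln p(l | x), where p(l | x) is the sum of alpha over the
    last two positions of the extended sequence at the final frame.
    """
    # Extended sequence l': blanks inserted between labels and at both ends.
    ext = [blank]
    for k in labels:
        ext += [k, blank]
    T, U = len(probs), len(ext)

    alpha = [[0.0] * U for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]          # start with a blank ...
    alpha[0][1] = probs[0][ext[1]]          # ... or with the first label
    for t in range(1, T):
        for u in range(U):
            s = alpha[t - 1][u]
            if u >= 1:
                s += alpha[t - 1][u - 1]
            # Skip the preceding blank only when the two labels differ
            # (the f(u) = u - 2 case of the recursion).
            if u >= 2 and ext[u] != blank and ext[u] != ext[u - 2]:
                s += alpha[t - 1][u - 2]
            alpha[t][u] = s * probs[t][ext[u]]

    # Valid paths end in the last label or the trailing blank.
    p = alpha[T - 1][U - 1] + alpha[T - 1][U - 2]
    return -math.log(p)

# Toy example: 3 frames, vocabulary {0: blank, 1, 2}, target sequence [1, 2].
probs = [[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
loss = ctc_forward_loss(probs, [1, 2])
```

In practice the recursion is carried out in log space to avoid underflow on long utterances; the plain-probability form above mirrors the formulas in S3.1 and S3.3 directly.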
4. An apparatus for constructing a voice recognition model based on LSTM-CTC tail convolution, characterized by comprising:
the training data acquisition module is used for acquiring training data;
the model building module is used for building a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as the input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
and the model training module is used for inputting the acquired training data into the neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model serving as a voice recognition model.
5. A speech recognition method, comprising:
the speech recognition result is obtained by inputting the speech data to be recognized into the speech recognition model according to any one of claims 1 to 3 after feature extraction.
6. The method of claim 5, wherein the recognition process of the speech recognition model comprises:
s1: extracting a hidden state sequence with the same length as the input characteristic sequence through an LSTM layer;
s2: performing rank reduction and classification on the input hidden state sequence through the full convolution layer;
s3: the output of the full convolution layer is mapped by the Softmax layer to obtain a class prediction.
7. The method of claim 6, wherein the LSTM layer involves, at each time instant t, an input word x_t, a cell state C_t, a temporary cell state C̃_t, a hidden state h_t, a forget gate f_t, an input gate i_t and an output gate o_t, and extracting a hidden state sequence with the same length as the input feature sequence through the LSTM layer comprises:
s1.1: calculating the forget gate, which selects the information to be forgotten: f_t = σ(W_f·[h_{t−1}, x_t] + b_f)
wherein the inputs are the hidden state h_{t−1} at the previous time and the input word x_t at the current time, the output is f_t, and W_f, b_f are respectively the weight matrix and the bias of the forget gate;
s1.2: calculating the input gate, which selects the information to be memorized:
i_t = σ(W_i·[h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C·[h_{t−1}, x_t] + b_C)
wherein the inputs are the hidden state h_{t−1} at the previous time and the input word x_t at the current time, the outputs are the input-gate value i_t and the temporary cell state C̃_t, W_i, b_i are respectively the weight matrix and the bias of the input gate, and W_C, b_C are respectively the weight matrix and the bias of the candidate cell state;
s1.3: calculating the cell state at the current time:
C_t = f_t * C_{t−1} + i_t * C̃_t
wherein the inputs are the input-gate value i_t, the forget-gate value f_t, the temporary cell state C̃_t and the cell state C_{t−1} at the previous time, and the output is the cell state C_t at the current time;
S1.4: compute output gate and current time hidden state
ot=σ(Wo[ht-1,xt]+bo)
ht=ot*tanh(Ct)
Wherein the input is the hidden state h of the previous momentt-1Input word x at the present timetAnd the current time cell status CtThe output is the value o of the output gatetAnd hidden state ht
S1.5: finally, a hidden state sequence { h) with the same length as the input characteristic sequence is obtained through calculation0,h1,...,hn-1}。
8. The method of claim 6, wherein S3 specifically comprises: converting the features output by the full convolution layer into relative probabilities among the different classes to obtain the final class prediction:
S_i = e^{V_i} / Σ_{j=1}^{N} e^{V_j}
wherein i denotes the i-th class, N denotes the total number of classes, V_i denotes the score of the i-th class, and S_i denotes the probability value of the i-th class after softmax processing.
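The mapping in claim 8 is the standard softmax; a minimal sketch (the class scores are invented for illustration, and the max-subtraction is a common numerical-stability trick, not part of the formula itself):

```python
import math

def softmax(v):
    """Map class scores V_i to probabilities S_i = e^{V_i} / sum_j e^{V_j}."""
    m = max(v)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]              # V_i output by the full convolution layer
probs = softmax(scores)               # S_i: a valid probability distribution
```

The predicted class is simply the argmax of the resulting probabilities, which coincides with the argmax of the raw scores since softmax is monotone.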
CN202010253075.6A 2020-04-02 2020-04-02 Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method Active CN111653275B (en)

Publications (2)

Publication Number Publication Date
CN111653275A CN111653275A (en) 2020-09-11
CN111653275B true CN111653275B (en) 2022-06-03


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112235470B (en) * 2020-09-16 2021-11-23 重庆锐云科技有限公司 Incoming call client follow-up method, device and equipment based on voice recognition
CN112233655A (en) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 Neural network training method for improving voice command word recognition performance
CN112802491B (en) * 2021-02-07 2022-06-14 武汉大学 Voice enhancement method for generating confrontation network based on time-frequency domain
CN113192489A (en) * 2021-05-16 2021-07-30 金陵科技学院 Paint spraying robot voice recognition method based on multi-scale enhancement BiLSTM model
CN113808581B (en) * 2021-08-17 2024-03-12 山东大学 Chinese voice recognition method based on acoustic and language model training and joint optimization
CN115563508A (en) * 2022-11-08 2023-01-03 北京百度网讯科技有限公司 Model training method, device and equipment

Citations (2)

Publication number Priority date Publication date Assignee Title
CN109710922A (en) * 2018-12-06 2019-05-03 深港产学研基地产业发展中心 Text recognition method, device, computer equipment and storage medium
CN110633646A (en) * 2019-08-21 2019-12-31 数字广东网络建设有限公司 Method and device for detecting image sensitive information, computer equipment and storage medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US10762637B2 (en) * 2017-10-27 2020-09-01 Siemens Healthcare Gmbh Vascular segmentation using fully convolutional and recurrent neural networks
EP3724819A4 (en) * 2017-12-13 2022-06-22 Cognizant Technology Solutions U.S. Corporation Evolutionary architectures for evolution of deep neural networks
US11315570B2 (en) * 2018-05-02 2022-04-26 Facebook Technologies, Llc Machine learning-based speech-to-text transcription cloud intermediary


Non-Patent Citations (2)

Title
Wu Bangyu et al., "A Chinese Dialogue Model Using Pinyin Dimensionality Reduction," Journal of Chinese Information Processing, 2019, No. 05. *
Yang Yanfang et al., "Accelerometer-Based Gesture Recognition Using Deep Convolutional Long Short-Term Memory Networks," Electronic Measurement Technology, 2019, No. 21. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant