CN111063336A - End-to-end voice recognition system based on deep learning - Google Patents
- Publication number: CN111063336A
- Application number: CN201911391159.XA
- Authority: CN (China)
- Prior art keywords: layer, model, VGG, deep learning, recognition system
- Prior art date: 2019-12-30
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
(all within G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING)
- G10L15/02 — Speech recognition: feature extraction for speech recognition; selection of recognition unit
- G10L15/005 — Speech recognition: language recognition
- G10L15/063 — Speech recognition: training
- G10L15/1815 — Speech recognition: semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G10L15/26 — Speech recognition: speech to text systems
- G10L25/69 — Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals
Abstract
The invention discloses an end-to-end speech recognition system based on deep learning, comprising: an acoustic model, which consists in sequence of a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, extracts two-dimensional FBank features from the audio, obtains a probability distribution for each time step through the network, and outputs candidate pinyin sequences according to the entropy of the per-time-step distributions; and a language model, connected to the acoustic model, comprising a Transformer encoder and an n-gram model. The Transformer encoder outputs a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model processes the output character sequences and selects the target Chinese text to output. The invention obtains a final recognition result that best fits the current context and human expression habits.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to an end-to-end speech recognition system based on deep learning.
Background
Speech recognition converts speech into the corresponding text and generally comprises two basic modules: an acoustic module and a language module. For an input speech signal, the acoustic module extracts features from the signal and computes the probability of mapping the speech to syllables (or other minimal units), while the language module uses a language model to convert those minimal units into complete natural language that a human or a computer can understand.
Current speech recognition methods fall into two categories: probabilistic-model methods and deep learning methods. The most typical of the former is the speech recognition model based on a hidden Markov model (HMM) and a Gaussian mixture model (GMM), i.e. HMM-GMM. It first splits the audio into frames on a millisecond scale and extracts acoustic features (such as FBank or MFCC) for each frame; it then estimates the means and covariances of the mixture components with the GMM for each frame, thereby obtaining the probability of each HMM state for each frame, and computes the transition probabilities between the different HMM states.
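As a hedged illustration of the GMM emission computation just described (this code is not from the patent; the mixture count, feature dimension and parameter values are invented for the example), the per-frame likelihood of one HMM state under a diagonal-covariance GMM can be sketched as:

```python
# Sketch: P(frame | state) for one HMM state modeled as a GMM.
# All dimensions and parameters below are illustrative, not from the patent.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_frame_likelihood(frame, weights, means, variances):
    """frame: (D,) feature vector; weights: (K,); means/variances: (K, D)."""
    return sum(
        w * multivariate_normal.pdf(frame, mean=m, cov=np.diag(v))
        for w, m, v in zip(weights, means, variances)
    )

rng = np.random.default_rng(0)
K, D = 3, 39                        # e.g. 3 mixture components, 39-dim MFCCs
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))
frame = rng.normal(size=D)
print(gmm_frame_likelihood(frame, weights, means, variances))
```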
For the deep learning methods, the classic Deep Speech 2 model is divided into an acoustic model and a language model. In the acoustic model, a CNN and an RNN are used to learn, respectively, the pronunciation characteristics and the static and dynamic characteristics of the signal, and the posterior probability of the minimal unit is finally output through a fully connected network trained with CTC as the objective. For the language model, n-grams are added directly to the loss function so that the model learns the context of the target language.
Both of the above solutions have drawbacks. The former cannot exploit the context of each frame, i.e., it cannot use historical information to assist the current task; moreover, it assumes that frames and states follow Gaussian distributions, which simplifies the model but is very restrictive. The latter can converge well, but because of the recurrent structure of the RNN, the many RNN units make training slow and difficult to parallelize.
Disclosure of Invention
The invention aims to provide an end-to-end speech recognition system based on deep learning that addresses the above technical shortcomings of the prior art.
The technical solution adopted to achieve the object of the invention is as follows:
A deep-learning-based end-to-end speech recognition system, comprising:
an acoustic model, comprising in sequence a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, which extracts two-dimensional FBank features from the audio, processes them through these layers to obtain a normalized probability distribution for each time step, and outputs candidate pinyin sequences according to the entropy of the per-time-step normalized probability distributions;
a language model, connected to the acoustic model, comprising a Transformer encoder and an n-gram model connected in sequence; the Transformer encoder outputs a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model processes the character sequences output by the Transformer encoder and selects the target Chinese text to output.
The VGG-Net layer comprises a plurality of VGG blocks. Each of the earlier VGG blocks consists of two 3 × 3 convolutional layers followed by a 2 × 2 max-pooling layer, and the number of channels is doubled after each block of convolutions to reduce the information loss caused by the max-pooling downsampling. The last two VGG blocks consist of two 3 × 3 convolutional layers without a max-pooling layer, so that the depth of the model is increased to learn deeper information in the acoustic signal.
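A minimal PyTorch sketch of such a VGG block follows; the channel counts and the block arrangement are illustrative assumptions, not values disclosed by the patent:

```python
# Sketch of one VGG block as described above: two 3x3 convolutions,
# with a 2x2 max-pooling layer in the earlier blocks only.
import torch
import torch.nn as nn

class VGGBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, use_pool: bool = True):
        super().__init__()
        layers = [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ]
        if use_pool:                      # earlier blocks: 2x2 max pooling
            layers.append(nn.MaxPool2d(kernel_size=2))
        self.block = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# Earlier blocks pool and double the channels; the last two blocks do not pool.
vgg_net = nn.Sequential(
    VGGBlock(1, 64),
    VGGBlock(64, 128),
    VGGBlock(128, 128, use_pool=False),
    VGGBlock(128, 128, use_pool=False),
)
x = torch.randn(8, 1, 80, 200)            # (batch, channel, FBank bins, frames)
print(vgg_net(x).shape)                   # torch.Size([8, 128, 20, 50])
```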
Preferably, the bidirectional RNN layer adopts a GRU structure.
After processing by the Softmax layer and the CTC layer, the output at each time step is a normalized probability distribution whose length equals the number of minimal acoustic units; each component of the distribution represents the probability that the time step corresponds to a particular character.
The acoustic model adopts a CNN-RNN-CTC architecture and therefore requires no Gaussian assumption on its posterior probabilities. At the same time, the CNN-RNN architecture has a strong ability to learn context. In addition, the entropy of each time step is computed from the output probability distribution to determine the candidate pinyin sequences, which gives the model a self-repairing capability.
The language model of the invention adopts a Transformer encoder structure. Through the Transformer's self-attention, it can efficiently learn and make maximal use of the context of the text, and it has a strong ability to mine semantic information. At the same time, because there is no dependency between time steps, computation across different time steps can be parallelized through matrix multiplication, which greatly reduces the training time of the language model.
Finally, the invention fuses the Transformer and the n-gram model into a single language model that serves as the last unit before the output and is responsible for scoring the multiple candidate outputs, thereby obtaining the final recognition result that best fits the current context and human expression habits.
Drawings
FIG. 1 is a schematic diagram of an end-to-end speech recognition system architecture based on deep learning;
FIG. 2 is a schematic diagram of the structure of a VGG block;
FIG. 3 is a schematic diagram of an acoustic model output;
FIG. 4 is a signal processing diagram of the language model;
FIG. 5 is a schematic diagram of an end-to-end speech recognition system process based on deep learning.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, the deep-learning-based end-to-end speech recognition system of the invention comprises:
an acoustic model, comprising in sequence a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, which extracts two-dimensional FBank features from the audio and processes them through these layers to obtain a normalized probability distribution for each time step; the entropy of each time step's distribution is then computed, and the candidate pinyin sequences to output are determined from the resulting entropy values;
a language model, connected to the acoustic model, comprising a Transformer encoder and an n-gram model connected in sequence; the Transformer encoder outputs a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model processes the character sequences output by the Transformer encoder and selects the target Chinese text to output.
For the acoustic model, the input in the invention is an audio file in WAV format, from which the model extracts two-dimensional FBank features using the FBank algorithm. The front part of the model adopts the VGG-Net (Visual Geometry Group Network) architecture from CNNs, which is very simple, repeatedly applying 3 × 3 convolution kernels and 2 × 2 max-pooling layers. The VGG-Net is composed of several VGG blocks, whose basic structure is shown in FIG. 2. In the first few VGG blocks, the number of channels is doubled after each block of convolutions to reduce the information loss caused by the max-pooling downsampling. The last two VGG blocks use convolutional layers without max-pooling, with the goal of increasing the depth of the model so as to learn the textual information hidden deeper in the acoustic signal.
After the VGG-Net structure outputs a multi-channel three-dimensional tensor, the channels are first merged into a single channel by concatenating the per-channel data, reducing the three-dimensional output to a two-dimensional matrix in which each row is treated as one time step. The data of each time step then passes through a fully connected layer behind the VGG-Net structure, which reduces its dimensionality and thus the computation of the subsequent model. Next, a bidirectional RNN layer (using a GRU structure) takes the input of each time step and learns the deeper context in the audio data. A second fully connected layer then outputs the logits of each time step, mapping the RNN output at each time step to the number of minimal acoustic units (the minimal units being Chinese pinyin syllables plus an extra blank character representing the interval between syllables). After Softmax normalization, the output at each time step is a normalized probability distribution whose length equals the number of minimal acoustic units, in which each component represents the probability that the time step corresponds to a particular character (i.e., a pinyin syllable or the blank character). Finally, a CTC structure uses the difference between the per-time-step outputs and the ground truth as the final loss function of the acoustic model.
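Assembling the pipeline just described into a runnable sketch (with a toy single-block VGG front end; none of the sizes below, including the unit inventory of 1210, are disclosed in the patent and all are illustrative assumptions):

```python
# Sketch of the acoustic pipeline described above:
# VGG front end -> FC -> bidirectional GRU -> FC -> log-softmax -> CTC loss.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_mels=80, n_units=1210, hidden=256):
        super().__init__()
        self.vgg = nn.Sequential(                 # stand-in for the VGG-Net layer
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halves time and frequency
        )
        self.fc1 = nn.Linear(32 * (n_mels // 2), hidden)  # first FC: dim. reduction
        self.rnn = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.fc2 = nn.Linear(2 * hidden, n_units)         # logits: pinyin units + blank

    def forward(self, fbank):                     # fbank: (batch, frames, n_mels)
        x = self.vgg(fbank.unsqueeze(1))          # (batch, ch, frames/2, n_mels/2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)    # merge channels per step
        x, _ = self.rnn(self.fc1(x))
        return self.fc2(x).log_softmax(-1)        # (batch, t, n_units)

model = AcousticModel()
log_probs = model(torch.randn(4, 200, 80)).transpose(0, 1)  # CTC wants (T, N, C)
targets = torch.randint(1, 1210, (4, 30))                   # index 0 is the blank
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((4,), 100, dtype=torch.long),  # 200 frames pooled to 100
    target_lengths=torch.full((4,), 30, dtype=torch.long),
)
print(loss.item())
```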
Because the acoustic model is trained with deep learning, it avoids the Gaussian assumption; the VGG-Net architecture from CNNs keeps the model structure simple while improving performance by progressively deepening the network. In addition, the bidirectional GRU structure makes full use of the context dependencies in the speech, and entropy values are finally used to determine all the candidate pinyin sequences to output.
In the invention, the output of the acoustic model is, for each time step, a normalized probability distribution over the minimal acoustic units, from which the corresponding pinyin sequence can be obtained.
For each time step, the entropy of its probability distribution is computed to measure the degree of confusion (i.e., the uncertainty) at that step. As shown in FIG. 3, the entropy of the distribution at the first time step from the left is significantly greater than the entropies at the other time steps. By setting a threshold, the most probable and the second most probable pinyin predictions are output simultaneously for every time step with low confidence (i.e., high confusion), and the candidate pinyin sequences are obtained by permutation and combination; the number of candidate sequences is 2^n, where n is the number of time steps whose distribution entropy exceeds the threshold.
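The candidate-expansion rule can be sketched as follows; the unit inventory, the distributions and the threshold below are made-up numbers for illustration only:

```python
# Sketch of the entropy-based candidate expansion: uncertain time steps
# contribute their top-2 pinyin hypotheses, giving 2**n candidates.
import itertools
import numpy as np

def candidate_sequences(probs, units, threshold):
    """probs: (T, U) per-step distributions; units: U pinyin labels."""
    per_step = []
    for p in probs:
        entropy = -np.sum(p * np.log(p + 1e-12))
        top2 = np.argsort(p)[::-1][:2]
        if entropy > threshold:          # low confidence: keep the top two
            per_step.append([units[top2[0]], units[top2[1]]])
        else:                            # high confidence: keep only the best
            per_step.append([units[top2[0]]])
    return [list(seq) for seq in itertools.product(*per_step)]

units = ["yu3", "yin1", "shi2", "shi4", "_"]
probs = np.array([
    [0.35, 0.30, 0.15, 0.15, 0.05],      # high-entropy (uncertain) step
    [0.01, 0.94, 0.02, 0.02, 0.01],
    [0.02, 0.02, 0.90, 0.04, 0.02],
])
for cand in candidate_sequences(probs, units, threshold=1.0):
    print(cand)                          # 2**1 = 2 candidate sequences
```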
For the language model, the invention adopts a Transformer encoder structure based on the self-attention mechanism. It takes the output of the acoustic model (the pinyin sequence) as its input and, through multi-head self-attention, outputs a Chinese character sequence of the same length as the pinyin sequence. Because self-attention has strong context-learning ability and fast computation, it can efficiently learn the context of the text, giving the model stronger inference ability and faster convergence. Moreover, since the Transformer's self-attention involves no dependency between time steps, computation across different time steps can be parallelized through matrix multiplication, greatly reducing the training time of the language model.
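A minimal sketch of such an equal-length pinyin-to-character encoder in PyTorch (the vocabulary sizes, width and depth are illustrative assumptions, and positional encoding is omitted for brevity):

```python
# Sketch: pinyin ids in, one character prediction per pinyin position out.
import torch
import torch.nn as nn

class PinyinToCharEncoder(nn.Module):
    def __init__(self, n_pinyin=1200, n_chars=5000, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_pinyin, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, n_chars)    # one character per position

    def forward(self, pinyin_ids):                # (batch, seq_len)
        h = self.encoder(self.embed(pinyin_ids))  # all steps attended in parallel
        return self.out(h)                        # (batch, seq_len, n_chars)

lm = PinyinToCharEncoder()
pinyin_ids = torch.randint(0, 1200, (2, 6))       # two candidate pinyin sequences
char_logits = lm(pinyin_ids)
print(char_logits.argmax(-1).shape)               # equal-length output: (2, 6)
```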
In addition, the language model takes each of the acoustic model's candidate outputs as an input and obtains a corresponding language model output for each. All of these outputs are scored by an n-gram model whose statistics were computed in advance over massive data; the candidates are ranked and the highest-scoring one is taken as the final output, yielding the most fluent natural-language text, as shown in FIG. 4.
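This reranking step can be illustrated with a toy bigram scorer with add-one smoothing; the two-sentence "corpus" and the candidate texts are stand-ins, since the patent's n-gram statistics come from massive data and are not disclosed:

```python
# Toy bigram reranker: score each candidate character sequence and keep the best.
import math
from collections import Counter

corpus = [list("语音识别"), list("语音世界")]     # "speech recognition", "speech world"
unigrams = Counter(c for s in corpus for c in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
vocab = len(unigrams)

def score(sentence):
    """Smoothed bigram log-probability of a candidate character sequence."""
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(sentence, sentence[1:])
    )

# 语音 ("speech") vs 雨音 ("rain sound"): both are read yu3 yin1
candidates = [list("语音识别"), list("雨音识别")]
best = max(candidates, key=score)
print("".join(best))                               # 语音识别
```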
As shown in FIG. 5, suppose the acoustic model outputs the characters "yu3, yin1, shi2, shi4, _, bie2", numbered 0 to 5 in sequence, where the digits denote tones and the underscore denotes the blank character.
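The description breaks off at this point, so the downstream handling of this example is not spelled out in the text; under the standard CTC collapsing rule implied by the blank character, the step would look like this sketch:

```python
# Assumption (not stated in the truncated example above): merge repeated
# units and drop the blank "_", per the usual CTC collapsing rule.
def collapse(units, blank="_"):
    out = []
    for u in units:
        if u != blank and (not out or out[-1] != u):
            out.append(u)
    return out

print(collapse(["yu3", "yin1", "shi2", "shi4", "_", "bie2"]))
# -> ['yu3', 'yin1', 'shi2', 'shi4', 'bie2']
```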
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and such modifications and improvements should also be regarded as falling within the protection scope of the invention.
Claims (4)
1. An end-to-end speech recognition system based on deep learning, comprising:
an acoustic model, comprising in sequence a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, configured to extract two-dimensional FBank features from the audio, process them through these layers to obtain a normalized probability distribution for each time step, and output candidate pinyin sequences according to the entropy of the per-time-step normalized probability distributions;
a language model, connected to the acoustic model, comprising a Transformer encoder and an n-gram model connected in sequence; wherein the Transformer encoder is configured to output a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model is configured to process the character sequences output by the Transformer encoder and select the target Chinese text for output.
2. The deep-learning-based end-to-end speech recognition system of claim 1, wherein the VGG-Net layer comprises a plurality of VGG blocks; each of the earlier VGG blocks comprises two 3 × 3 convolutional layers and a 2 × 2 max-pooling layer, and the number of channels is doubled after each block of convolutions to reduce the information loss caused by the max-pooling downsampling; the last two VGG blocks comprise two 3 × 3 convolutional layers without a max-pooling layer, so that the depth of the model is increased to learn deeper information in the acoustic signal.
3. The deep learning-based end-to-end speech recognition system of claim 1, wherein the bidirectional RNN layer is a GRU structure.
4. The deep-learning-based end-to-end speech recognition system of claim 1, wherein the output at each time step, after processing by the Softmax layer and the CTC layer, is a normalized probability distribution whose length equals the number of minimal acoustic units, in which each component represents the probability that the time step corresponds to a particular character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911391159.XA | 2019-12-30 | 2019-12-30 | End-to-end voice recognition system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111063336A (en) | 2020-04-24 |
Family
ID=70304577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911391159.XA | End-to-end voice recognition system based on deep learning | 2019-12-30 | 2019-12-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111063336A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107408384A (en) * | 2015-11-25 | 2017-11-28 | 百度(美国)有限责任公司 | The end-to-end speech recognition of deployment |
JP2019159058A (en) * | 2018-03-12 | 2019-09-19 | 国立研究開発法人情報通信研究機構 | Speech recognition system, speech recognition method, learned model |
CN110415683A (en) * | 2019-07-10 | 2019-11-05 | 上海麦图信息科技有限公司 | A kind of air control voice instruction recognition method based on deep learning |
Non-Patent Citations (3)
Title |
---|
LEI KANG ET AL.: "Convolve, Attend and Spell: An Attention-based Sequence-to-Sequence Model for Handwritten Word Recognition", GCPR 2018: Pattern Recognition |
SHIYU ZHOU ET AL.: "Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages", arXiv:1806.05059v2 |
XINPEI ZHOU ET AL.: "Cascaded CNN-resBiLSTM-CTC: An End-to-End Acoustic Model for Speech Recognition", arXiv:1810.12001v2 |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112086087A (en) * | 2020-09-14 | 2020-12-15 | 广州市百果园信息技术有限公司 | Speech recognition model training method, speech recognition method and device |
CN112086087B (en) * | 2020-09-14 | 2024-03-12 | 广州市百果园信息技术有限公司 | Speech recognition model training method, speech recognition method and device |
CN112116907A (en) * | 2020-10-22 | 2020-12-22 | 浙江同花顺智能科技有限公司 | Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium |
CN113011127A (en) * | 2021-02-08 | 2021-06-22 | 杭州网易云音乐科技有限公司 | Text phonetic notation method and device, storage medium and electronic equipment |
CN113160798A (en) * | 2021-04-28 | 2021-07-23 | 厦门大学 | Chinese civil aviation air traffic control voice recognition method and system |
CN113160798B (en) * | 2021-04-28 | 2024-04-16 | 厦门大学 | Chinese civil aviation air traffic control voice recognition method and system |
CN113255888A (en) * | 2021-05-26 | 2021-08-13 | 东南大学 | End-to-end hand-sending equal-amplitude telegraph decoding system based on deep learning |
CN113327585A (en) * | 2021-05-31 | 2021-08-31 | 杭州芯声智能科技有限公司 | Automatic voice recognition method based on deep neural network |
CN113763519A (en) * | 2021-11-09 | 2021-12-07 | 江苏原力数字科技股份有限公司 | Voice-driven 3D character facial expression method based on deep learning |
CN114758649A (en) * | 2022-04-06 | 2022-07-15 | 北京百度网讯科技有限公司 | Voice recognition method, device, equipment and medium |
CN114758649B (en) * | 2022-04-06 | 2024-04-19 | 北京百度网讯科技有限公司 | Voice recognition method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111063336A (en) | End-to-end voice recognition system based on deep learning | |
US11314921B2 (en) | Text error correction method and apparatus based on recurrent neural network of artificial intelligence | |
CN111199727B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN111145728B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
KR102423302B1 (en) | Apparatus and method for calculating acoustic score in speech recognition, apparatus and method for learning acoustic model | |
CN109065032B (en) | External corpus speech recognition method based on deep convolutional neural network | |
CN111145729B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN111429889A (en) | Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention | |
CN111210807B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
EP4018437B1 (en) | Optimizing a keyword spotting system | |
CN110070855B (en) | Voice recognition system and method based on migrating neural network acoustic model | |
US20180068652A1 (en) | Apparatus and method for training a neural network language model, speech recognition apparatus and method | |
CN108389575B (en) | Audio data identification method and system | |
CN110675859A (en) | Multi-emotion recognition method, system, medium, and apparatus combining speech and text | |
CN112242144A (en) | Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium | |
CN112184859A (en) | End-to-end virtual object animation generation method and device, storage medium and terminal | |
CN111199149A (en) | Intelligent statement clarifying method and system for dialog system | |
CN111243591B (en) | Air control voice recognition method introducing external data correction | |
CN112349294A (en) | Voice processing method and device, computer readable medium and electronic equipment | |
CN105869622B (en) | Chinese hot word detection method and device | |
CN116303966A (en) | Dialogue behavior recognition system based on prompt learning | |
CN113327585B (en) | Automatic voice recognition method based on deep neural network | |
CN114333768A (en) | Voice detection method, device, equipment and storage medium | |
CN111009236A (en) | Voice recognition method based on DBLSTM + CTC acoustic model | |
JP7445089B2 (en) | Fast-emission low-latency streaming ASR using sequence-level emission regularization |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200424 |