CN111063336A - End-to-end voice recognition system based on deep learning - Google Patents


Info

Publication number
CN111063336A
Authority
CN
China
Prior art keywords
layer
model
vgg
deep learning
recognition system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911391159.XA
Other languages
Chinese (zh)
Inventor
曹琉
张大朋
孙哲南
张森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Original Assignee
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd filed Critical Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority to CN201911391159.XA
Publication of CN111063336A
Legal status: Pending (current)

Links

Images

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 - Speech recognition
                    • G10L15/005 - Language recognition
                    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
                    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063 - Training
                    • G10L15/08 - Speech classification or search
                        • G10L15/18 - Speech classification or search using natural language modelling
                            • G10L15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
                    • G10L15/26 - Speech to text systems
                • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
                        • G10L25/69 - Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep learning based end-to-end speech recognition system comprising an acoustic model and a language model. The acoustic model consists, in order, of a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer; it extracts two-dimensional FBank features from the audio, obtains a probability distribution for each time step through the network, and outputs candidate pinyin sequences according to the entropy of the per-time-step distributions. The language model is connected to the acoustic model and comprises a Transformer encoder and an n-gram model; the Transformer encoder outputs a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model processes the output character sequences and selects the target Chinese character text to output. The invention can thus obtain a final recognition result that best fits the current context and natural human expression.

Description

End-to-end voice recognition system based on deep learning
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a deep learning based end-to-end speech recognition system.
Background
Speech recognition converts speech into the corresponding text and generally comprises two basic modules: an acoustic module and a language module. For an input speech signal, the acoustic module is responsible for extracting features from the signal and computing the probability of mapping the speech to syllables (or other minimal units), while the language module uses a language model to convert the minimal units into complete natural language that can be understood by humans or computers.
Current speech recognition methods fall into two categories: probabilistic-model methods and deep learning methods. The most typical of the former is the speech recognition model based on a Hidden Markov Model (HMM) and a Gaussian Mixture Model (GMM), the HMM-GMM. It first frames the audio at the millisecond level and extracts acoustic features (such as FBank or MFCC) for each frame; for each frame, the GMM is used to estimate the means and covariances of the mixture components, giving the probability of each HMM state for that frame, and the transition probabilities between the different HMM states are also computed.
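For illustration only (not part of the patent), a minimal NumPy/SciPy sketch of the per-frame GMM state likelihood described above; the feature dimensionality, the number of states and the mixture parameters are made-up placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

def frame_state_loglik(frame, weights, means, covs):
    """Log-likelihood of one feature frame under a GMM tied to one HMM state.

    weights: (K,) mixture weights, means: (K, D), covs: (K, D, D).
    """
    comp = [np.log(w) + multivariate_normal.logpdf(frame, mean=m, cov=c)
            for w, m, c in zip(weights, means, covs)]
    return np.logaddexp.reduce(comp)   # log of the sum over mixture components

# Toy example: a 13-dim MFCC frame scored against 3 HMM states, each a 2-component GMM.
rng = np.random.default_rng(0)
D, K, S = 13, 2, 3
frame = rng.normal(size=D)
states = [(np.array([0.6, 0.4]),
           rng.normal(size=(K, D)),
           np.stack([np.eye(D)] * K)) for _ in range(S)]
logliks = [frame_state_loglik(frame, *st) for st in states]  # one score per HMM state
```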
For the deep learning methods, the classic DeepSpeech2 model is divided into an acoustic model and a language model. In the acoustic model, CNN and RNN layers are used to learn the pronunciation characteristics and the static and dynamic characteristics of the signal, respectively, and a final fully connected network trained with CTC as the objective outputs the posterior probabilities of the minimal units. In the language model, an n-gram term is added directly to the loss function so that the context information of the target language is learned.
Both of the above solutions have drawbacks. The former cannot exploit the context of each frame, i.e. it cannot use historical information to assist the current task; in addition, it assumes that frames and states follow Gaussian distributions, which simplifies the model but is very limiting. The latter can converge well, but because of the recurrent structure of the RNN and the large number of RNN units, training takes a long time and is difficult to parallelize.
Disclosure of Invention
The invention aims to provide a deep learning based end-to-end speech recognition system that overcomes the above technical defects of the prior art.
The technical solution adopted to achieve this purpose is as follows:
a deep learning based end-to-end speech recognition system comprising:
an acoustic model, which sequentially comprises a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, and which is used to extract two-dimensional FBank features of the audio, process them through these layers to obtain a normalized probability distribution for each time step, and output candidate pinyin sequences according to the entropy of the per-time-step normalized probability distributions;
a language model connected to the acoustic model, comprising a Transformer encoder and an n-gram model connected in sequence; the Transformer encoder is used to output a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model is used to process the Chinese character sequences output by the Transformer encoder and select the target Chinese character text to output.
The VGG-Net layer comprises a plurality of VGG blocks. Each of the earlier VGG blocks contains two 3 × 3 convolutional layers and a 2 × 2 max-pooling layer, and the number of channels is doubled after each convolution so as to reduce the information loss caused by max-pooling downsampling; the last two VGG blocks contain two 3 × 3 convolutional layers without a max-pooling layer, which increases the number of model layers so that deeper information in the acoustic signal can be learned.
Preferably, the bidirectional RNN layer adopts a GRU structure.
After processing by the Softmax layer and the CTC layer, the output at each time step is a normalized probability distribution whose length equals the number of minimal acoustic units; each component of this distribution represents the probability that the time step corresponds to a particular character.
The acoustic model adopts a CNN-RNN-CTC architecture, so no Gaussian assumption needs to be made about the posterior probabilities of the acoustic model. At the same time, the CNN-RNN architecture has a strong ability to learn context information. In addition, the entropy of the output probability distribution at each time step is used to determine the candidate pinyin sequences, which gives the model a self-repairing capability.
The language model of the invention adopts a Transformer encoder structure. Through the self-attention mechanism of the Transformer, it can efficiently learn and fully exploit the context information of the text and has a strong ability to mine semantic information. At the same time, because self-attention introduces no dependencies between time steps, the computation for different time steps can be parallelized through matrix multiplication, which greatly reduces the training time of the language model.
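The parallelization argument can be made concrete with a minimal single-head scaled dot-product self-attention sketch (NumPy; all dimensions are illustrative assumptions): every time step attends to every other time step through a pair of matrix multiplications, with no sequential recurrence.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a whole sequence at once.

    X: (T, d_model) -- all time steps are processed together, no recurrence.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # (T, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T) pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T, d_k)

T, d_model, d_k = 6, 32, 16
rng = np.random.default_rng(1)
X = rng.normal(size=(T, d_model))
out = self_attention(X, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
```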
Finally, the invention fuses the Transformer and the n-gram model into the language model; the n-gram model serves as the last unit before the output and is responsible for scoring the multiple candidate outputs, so that the final recognition result that best matches the current context and natural human expression is obtained.
Drawings
FIG. 1 is a schematic diagram of an end-to-end speech recognition system architecture based on deep learning;
FIG. 2 is a schematic diagram of the structure of a VGG block;
FIG. 3 is a schematic diagram of an acoustic model output;
FIG. 4 is a signal processing diagram of the language model;
FIG. 5 is a schematic diagram of an end-to-end speech recognition system process based on deep learning.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, the deep learning based end-to-end speech recognition system of the present invention comprises:
the acoustic model, which sequentially comprises a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer; after the two-dimensional FBank features of the audio are extracted and processed by these layers, a normalized probability distribution is obtained for each time step; the entropy of each time step's normalized probability distribution is then computed, and the candidate pinyin sequences to output are determined from the entropy results;
the language model, which is connected to the acoustic model and comprises a Transformer encoder and an n-gram model connected in sequence; the Transformer encoder outputs a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model processes the Chinese character sequences output by the Transformer encoder and selects the target Chinese character text to output.
For the acoustic model, the input in the invention is an audio file in wav format, and the model extracts the two-dimensional FBank features of the audio with the FBank algorithm. The front part of the model adopts the VGG-Net (Visual Geometry Group Network) architecture from CNNs; this network architecture is very simple, repeatedly using 3 × 3 convolution kernels and 2 × 2 max-pooling layers. VGG-Net is composed of several VGG blocks, whose basic structure is shown in fig. 2. In the VGG-Net structure, the number of channels is doubled after each convolution in the first VGG blocks so as to reduce the information loss caused by max-pooling downsampling. In the last two VGG blocks, convolutional layers are used instead of max-pooling layers, with the goal of increasing the number of model layers so as to learn the text information hidden deeper in the acoustic signal.
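As an illustration of the block structure of fig. 2, a minimal PyTorch sketch follows; the channel counts and the number of blocks are assumptions for demonstration, not values fixed by the patent. The earlier blocks use two 3 × 3 convolutions followed by 2 × 2 max-pooling and double the channel count, while the last two blocks omit the pooling layer.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, pool=True):
    """Two 3x3 convolutions; optional 2x2 max-pooling (omitted in the last two blocks)."""
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    ]
    if pool:
        layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

# Illustrative stack: channels double in the pooled blocks, the last two blocks keep resolution.
vgg_net = nn.Sequential(
    vgg_block(1, 32, pool=True),
    vgg_block(32, 64, pool=True),
    vgg_block(64, 128, pool=False),
    vgg_block(128, 128, pool=False),
)
fbank = torch.randn(4, 1, 400, 80)   # (batch, channel, time, FBank bins) -- toy shapes
features = vgg_net(fbank)            # (4, 128, 100, 20)
```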
After the VGG-Net structure outputs a multi-channel three-dimensional tensor, the channels of the VGG-Net output are first merged into a single channel by concatenating the per-channel data, reducing the three-dimensional information to a two-dimensional matrix in which each row is treated as one time step. The data of each time step is then passed through the fully connected layer behind the VGG-Net structure to reduce its dimensionality and thus the computational cost of the subsequent model. Next, a bidirectional RNN layer (using a GRU structure) takes the information of each time step as input to learn the deeper context information in the audio data. The second fully connected layer then outputs the probability (logits) distribution of each time step, mapping the RNN dimension of each time step to the number of minimal acoustic units (the minimal acoustic units are Chinese pinyin syllables plus an additional blank character representing the interval between pinyin). After Softmax normalization, the output of each time step is a normalized probability distribution whose length equals the number of minimal acoustic units; each component of this distribution represents the probability that the time step corresponds to a particular character (i.e. a pinyin syllable or the blank character). Finally, a CTC structure uses the difference between the output of each time step and the ground-truth result as the final loss function of the acoustic model.
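A minimal PyTorch sketch of the pipeline after the VGG-Net output (hidden sizes and the size of the pinyin inventory are assumed placeholders): per-time-step vectors are reduced by the first fully connected layer, passed through a bidirectional GRU, projected by the second fully connected layer, normalized with log-Softmax, and compared with the reference pinyin sequence through the CTC loss.

```python
import torch
import torch.nn as nn

class AcousticHead(nn.Module):
    def __init__(self, feat_dim, hidden=256, n_units=1300):  # n_units: pinyin syllables + blank (index 0)
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden)                 # first fully connected layer (dim reduction)
        self.rnn = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.fc2 = nn.Linear(2 * hidden, n_units)              # second fully connected layer (logits)

    def forward(self, x):            # x: (batch, time, feat_dim) -- flattened VGG-Net output
        h = torch.relu(self.fc1(x))
        h, _ = self.rnn(h)
        return self.fc2(h).log_softmax(dim=-1)                 # per-step normalized distribution

B, T, F = 4, 100, 128 * 20            # channels x frequency bins concatenated per time step
model = AcousticHead(F)
log_probs = model(torch.randn(B, T, F))                        # (B, T, n_units)

ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 1300, (B, 12))                      # toy pinyin label ids
loss = ctc(log_probs.transpose(0, 1),                          # CTC expects (T, B, C)
           targets,
           input_lengths=torch.full((B,), T, dtype=torch.long),
           target_lengths=torch.full((B,), 12, dtype=torch.long))
```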
Because the acoustic model is built on deep learning, the Gaussian assumption is avoided; the VGG-Net architecture from CNNs gives a relatively simple model structure while the model's performance is improved by continuously deepening the network. In addition, the bidirectional GRU structure can fully exploit the context dependencies in the speech, and finally entropy values are used to determine all the candidate pinyin sequences that are output.
In the invention, the output of the acoustic model is, for each time step, a normalized probability distribution over the minimal acoustic units, from which the corresponding pinyin sequence can be obtained.
For each time step, the entropy of its probability distribution is computed to obtain the degree of confusion (i.e. uncertainty) at that time step. As shown in fig. 3, the entropy of the probability distribution at the first time step from the left is significantly greater than that at the other time steps. By setting a threshold, both the most probable and the second most probable pinyin predictions are output for the time steps with lower confidence (i.e. higher confusion), and multiple candidate pinyin sequences are obtained by permutation and combination; the number of candidate pinyin sequences is 2 to the power n, where n is the number of time steps whose probability-distribution entropy exceeds the threshold.
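A minimal NumPy sketch of the entropy-based candidate generation (the threshold value and the toy distributions are illustrative assumptions): for time steps whose entropy exceeds the threshold, both the top-1 and top-2 pinyin predictions are kept, and the candidates are formed by taking all combinations, giving 2 to the power n sequences for n uncertain steps.

```python
import numpy as np
from itertools import product

def candidate_pinyin_sequences(probs, units, threshold=1.0):
    """probs: (T, U) per-time-step normalized distributions; units: list of U pinyin symbols."""
    per_step_options = []
    for p in probs:
        entropy = -np.sum(p * np.log(p + 1e-12))      # confusion degree of this time step
        top = np.argsort(p)[::-1]
        if entropy > threshold:                        # low confidence: keep top-1 and top-2
            per_step_options.append([units[top[0]], units[top[1]]])
        else:                                          # confident: keep only the best unit
            per_step_options.append([units[top[0]]])
    return [list(seq) for seq in product(*per_step_options)]   # 2**n candidates

units = ["_", "yu3", "yin1", "shi2", "shi4", "bie2"]
probs = np.array([[0.05, 0.35, 0.30, 0.10, 0.10, 0.10],        # ambiguous time step
                  [0.02, 0.02, 0.90, 0.02, 0.02, 0.02]])       # confident time step
print(candidate_pinyin_sequences(probs, units))                # 2 candidate sequences
```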
For the language model, the invention adopts a Transformer encoder structure based on the self-attention mechanism. The output of the acoustic model (i.e. a pinyin sequence) is used as the input of the model, and a Chinese character sequence of the same length as the pinyin sequence is output through multi-head self-attention. Because self-attention has a strong context learning ability and fast computation, the context information of the text can be learned efficiently, so the model has strong inference ability and fast convergence. Moreover, because the Transformer's self-attention requires no dependencies between time steps, the computation for different time steps can be parallelized through matrix multiplication, which greatly reduces the training time of the language model.
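A minimal PyTorch sketch of a Transformer encoder used as the pinyin-to-character model (vocabulary sizes, model dimension, number of heads and layers are assumed placeholders): each pinyin token is embedded, processed by stacked self-attention layers, and projected to a Chinese character distribution at the same position, so the output sequence has the same length as the input.

```python
import torch
import torch.nn as nn

class PinyinToHanzi(nn.Module):
    def __init__(self, n_pinyin=1300, n_hanzi=7000, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_pinyin, d_model)
        self.pos = nn.Embedding(512, d_model)          # simple learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, n_hanzi)        # one character distribution per pinyin position

    def forward(self, pinyin_ids):                      # (batch, seq_len) pinyin token ids
        pos = torch.arange(pinyin_ids.size(1), device=pinyin_ids.device)
        h = self.encoder(self.embed(pinyin_ids) + self.pos(pos))   # self-attention over the sequence
        return self.proj(h)                              # (batch, seq_len, n_hanzi) -- same length as input

model = PinyinToHanzi()
logits = model(torch.randint(0, 1300, (2, 6)))           # 6 pinyin tokens in -> 6 character slots out
hanzi_ids = logits.argmax(dim=-1)                        # greedy character sequence per candidate
```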
In addition, the language model of the invention takes the multiple outputs of the acoustic model as inputs and obtains multiple language model outputs. All of these outputs are scored with an n-gram model whose statistics have been computed in advance on massive data, the candidate outputs are ranked, and the highest-scoring one is taken as the final output, yielding the most fluent natural language text, as shown in fig. 4.
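A minimal sketch of the n-gram rescoring step, here a bigram model with add-one smoothing trained on a toy corpus (in the patent the n-gram statistics are assumed to be computed in advance on massive data): each candidate character sequence is scored and the highest-scoring candidate becomes the final output.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams from a list of character sequences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            uni[a] += 1
            bi[(a, b)] += 1
        uni[sent[-1]] += 1
    return uni, bi

def score(seq, uni, bi, vocab_size):
    """Add-one smoothed bigram log-probability of a candidate character sequence."""
    return sum(math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
               for a, b in zip(seq, seq[1:]))

corpus = ["语音识别", "语音合成", "图像识别"]          # toy corpus; real counts come from massive data
uni, bi = train_bigram(corpus)
candidates = ["语音识别", "雨音识别"]                   # illustrative Transformer outputs for two pinyin candidates
best = max(candidates, key=lambda c: score(c, uni, bi, vocab_size=len(uni)))
print(best)                                             # highest-scoring candidate is the final output
```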
As shown in fig. 5, assume that the acoustic model outputs the characters "yu3, yin1, shi2, shi4, _, bie2", numbered 0 to 5 in sequence, where the numerals represent tones and the underscore represents a blank character.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and such modifications and improvements should also be regarded as falling within the protection scope of the invention.

Claims (4)

1. An end-to-end speech recognition system based on deep learning, comprising:
an acoustic model, which sequentially comprises a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, and which is configured to extract two-dimensional FBank features of the audio, process them through these layers to obtain a normalized probability distribution for each time step, and output candidate pinyin sequences according to the entropy of the per-time-step normalized probability distributions;
a language model connected to the acoustic model, comprising a Transformer encoder and an n-gram model connected in sequence, wherein the Transformer encoder is configured to output a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model is configured to process the Chinese character sequences output by the Transformer encoder and select the target Chinese character text to output.
2. The deep learning based end-to-end speech recognition system of claim 1, wherein the VGG-Net layer comprises a plurality of VGG blocks; each of the earlier VGG blocks contains two 3 × 3 convolutional layers and a 2 × 2 max-pooling layer, and the number of channels is doubled after each convolution to reduce the information loss caused by max-pooling downsampling; the last two VGG blocks contain two 3 × 3 convolutional layers without a max-pooling layer, which increases the number of model layers so that deeper information in the acoustic signal can be learned.
3. The deep learning-based end-to-end speech recognition system of claim 1, wherein the bidirectional RNN layer is a GRU structure.
4. The deep learning based end-to-end speech recognition system of claim 1, wherein the output at each time step after processing by the Softmax layer and the CTC layer is a normalized probability distribution whose length equals the number of minimal acoustic units, each component of which represents the probability that the time step corresponds to a particular character.
CN201911391159.XA 2019-12-30 2019-12-30 End-to-end voice recognition system based on deep learning Pending CN111063336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911391159.XA CN111063336A (en) 2019-12-30 2019-12-30 End-to-end voice recognition system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911391159.XA CN111063336A (en) 2019-12-30 2019-12-30 End-to-end voice recognition system based on deep learning

Publications (1)

Publication Number Publication Date
CN111063336A true CN111063336A (en) 2020-04-24

Family

ID=70304577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911391159.XA Pending CN111063336A (en) 2019-12-30 2019-12-30 End-to-end voice recognition system based on deep learning

Country Status (1)

Country Link
CN (1) CN111063336A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112086087A (en) * 2020-09-14 2020-12-15 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and device
CN112116907A (en) * 2020-10-22 2020-12-22 浙江同花顺智能科技有限公司 Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium
CN113011127A (en) * 2021-02-08 2021-06-22 杭州网易云音乐科技有限公司 Text phonetic notation method and device, storage medium and electronic equipment
CN113160798A (en) * 2021-04-28 2021-07-23 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113255888A (en) * 2021-05-26 2021-08-13 东南大学 End-to-end hand-sending equal-amplitude telegraph decoding system based on deep learning
CN113327585A (en) * 2021-05-31 2021-08-31 杭州芯声智能科技有限公司 Automatic voice recognition method based on deep neural network
CN113763519A (en) * 2021-11-09 2021-12-07 江苏原力数字科技股份有限公司 Voice-driven 3D character facial expression method based on deep learning
CN114758649A (en) * 2022-04-06 2022-07-15 北京百度网讯科技有限公司 Voice recognition method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107408384A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 The end-to-end speech recognition of deployment
JP2019159058A (en) * 2018-03-12 2019-09-19 国立研究開発法人情報通信研究機構 Speech recognition system, speech recognition method, learned model
CN110415683A (en) * 2019-07-10 2019-11-05 上海麦图信息科技有限公司 A kind of air control voice instruction recognition method based on deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107408384A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 The end-to-end speech recognition of deployment
JP2019159058A (en) * 2018-03-12 2019-09-19 国立研究開発法人情報通信研究機構 Speech recognition system, speech recognition method, learned model
CN110415683A (en) * 2019-07-10 2019-11-05 上海麦图信息科技有限公司 A kind of air control voice instruction recognition method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LEI KANG ET AL.: "《Convolve,Attend and Spell:An Attention-based Sequence-to-Sequence Model for Handwritten Word Recognition》", 《GCPR 2018:PATTERN RECOGNITION》 *
SHIYU ZHOU ET AL.: "《Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages》", 《ARXIV:1806.05059V2》 *
XINPEI ZHOU ET AL.: "《Cascaded CNN-resBiLSTM-CTC:An End-to-End Acoustic Model For Speech Recognition》", 《ARXIV:1810.12001V2》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112086087A (en) * 2020-09-14 2020-12-15 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and device
CN112086087B (en) * 2020-09-14 2024-03-12 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and device
CN112116907A (en) * 2020-10-22 2020-12-22 浙江同花顺智能科技有限公司 Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium
CN113011127A (en) * 2021-02-08 2021-06-22 杭州网易云音乐科技有限公司 Text phonetic notation method and device, storage medium and electronic equipment
CN113160798A (en) * 2021-04-28 2021-07-23 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113160798B (en) * 2021-04-28 2024-04-16 厦门大学 Chinese civil aviation air traffic control voice recognition method and system
CN113255888A (en) * 2021-05-26 2021-08-13 东南大学 End-to-end hand-sending equal-amplitude telegraph decoding system based on deep learning
CN113327585A (en) * 2021-05-31 2021-08-31 杭州芯声智能科技有限公司 Automatic voice recognition method based on deep neural network
CN113763519A (en) * 2021-11-09 2021-12-07 江苏原力数字科技股份有限公司 Voice-driven 3D character facial expression method based on deep learning
CN114758649A (en) * 2022-04-06 2022-07-15 北京百度网讯科技有限公司 Voice recognition method, device, equipment and medium
CN114758649B (en) * 2022-04-06 2024-04-19 北京百度网讯科技有限公司 Voice recognition method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111063336A (en) End-to-end voice recognition system based on deep learning
US11314921B2 (en) Text error correction method and apparatus based on recurrent neural network of artificial intelligence
CN111199727B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN111145728B (en) Speech recognition model training method, system, mobile terminal and storage medium
KR102423302B1 (en) Apparatus and method for calculating acoustic score in speech recognition, apparatus and method for learning acoustic model
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
CN111145729B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN111429889A (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN111210807B (en) Speech recognition model training method, system, mobile terminal and storage medium
EP4018437B1 (en) Optimizing a keyword spotting system
CN110070855B (en) Voice recognition system and method based on migrating neural network acoustic model
US20180068652A1 (en) Apparatus and method for training a neural network language model, speech recognition apparatus and method
CN108389575B (en) Audio data identification method and system
CN110675859A (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN112242144A (en) Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium
CN112184859A (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN111199149A (en) Intelligent statement clarifying method and system for dialog system
CN111243591B (en) Air control voice recognition method introducing external data correction
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN105869622B (en) Chinese hot word detection method and device
CN116303966A (en) Dialogue behavior recognition system based on prompt learning
CN113327585B (en) Automatic voice recognition method based on deep neural network
CN114333768A (en) Voice detection method, device, equipment and storage medium
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
JP7445089B2 (en) Fast-emission low-latency streaming ASR using sequence-level emission regularization

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination
WD01 - Invention patent application deemed withdrawn after publication (application publication date: 20200424)