CN111063336A - End-to-end voice recognition system based on deep learning - Google Patents
- Publication number: CN111063336A
- Application number: CN201911391159.XA
- Authority: CN (China)
- Prior art keywords: layer, model, VGG, deep learning, recognition system
- Prior art date: 2019-12-30
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
(all within G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING)
- G10L15/02 — Speech recognition: feature extraction for speech recognition; selection of recognition unit
- G10L15/005 — Speech recognition: language recognition
- G10L15/063 — Speech recognition: training
- G10L15/1815 — Speech recognition: semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G10L15/26 — Speech recognition: speech to text systems
- G10L25/69 — Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals
Abstract
The invention discloses an end-to-end speech recognition system based on deep learning, comprising: an acoustic model, which consists in sequence of a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, extracts two-dimensional FBank features from the audio, obtains a probability distribution for each time step through the network, and outputs candidate pinyin sequences according to the entropy of the per-time-step distributions; and a language model, connected to the acoustic model, comprising a Transformer encoder and an n-gram model. The Transformer encoder outputs a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model processes the output character sequences and selects the target Chinese text to output. The invention obtains a final recognition result that best fits the current context and human expression habits.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to an end-to-end speech recognition system based on deep learning.
Background
Speech recognition converts speech into the corresponding text and generally comprises two basic modules: an acoustic module and a language module. For an input speech signal, the acoustic module extracts features from the signal and computes the probability of mapping the speech to syllables (or other minimal units), while the language module uses a language model to convert those minimal units into complete natural language that a human or a computer can understand.
Current speech recognition methods fall into two categories: probabilistic-model methods and deep learning methods. The most typical of the former is the speech recognition model based on a hidden Markov model (HMM) and a Gaussian mixture model (GMM), i.e. HMM-GMM. It first splits the audio into frames on a millisecond scale and extracts acoustic features (such as FBank or MFCC) for each frame; it then estimates the means and covariances of the mixture components with the GMM for each frame, thereby obtaining the probability of each HMM state for each frame, and computes the transition probabilities between the different HMM states.
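As a hedged illustration of the GMM emission computation just described (this code is not from the patent; the mixture count, feature dimension and parameter values are invented for the example), the per-frame likelihood of one HMM state under a diagonal-covariance GMM can be sketched as:

```python
# Sketch: P(frame | state) for one HMM state modeled as a GMM.
# All dimensions and parameters below are illustrative, not from the patent.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_frame_likelihood(frame, weights, means, variances):
    """frame: (D,) feature vector; weights: (K,); means/variances: (K, D)."""
    return sum(
        w * multivariate_normal.pdf(frame, mean=m, cov=np.diag(v))
        for w, m, v in zip(weights, means, variances)
    )

rng = np.random.default_rng(0)
K, D = 3, 39                        # e.g. 3 mixture components, 39-dim MFCCs
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))
frame = rng.normal(size=D)
print(gmm_frame_likelihood(frame, weights, means, variances))
```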
For the deep learning methods, the classic Deep Speech 2 model is divided into an acoustic model and a language model. In the acoustic model, a CNN and an RNN are used to learn, respectively, the pronunciation characteristics and the static and dynamic characteristics of the signal, and the posterior probability of the minimal unit is finally output through a fully connected network trained with CTC as the objective. For the language model, n-grams are added directly to the loss function so that the model learns the context of the target language.
Both of the above solutions have drawbacks. The former cannot exploit the context of each frame, i.e., it cannot use historical information to assist the current task; moreover, it assumes that frames and states follow Gaussian distributions, which simplifies the model but is very restrictive. The latter can converge well, but because of the recurrent structure of the RNN, the many RNN units make training slow and difficult to parallelize.
Disclosure of Invention
The invention aims to provide an end-to-end speech recognition system based on deep learning that addresses the above technical shortcomings of the prior art.
The technical solution adopted to achieve the object of the invention is as follows:
A deep-learning-based end-to-end speech recognition system, comprising:
an acoustic model, comprising in sequence a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, which extracts two-dimensional FBank features from the audio, processes them through these layers to obtain a normalized probability distribution for each time step, and outputs candidate pinyin sequences according to the entropy of the per-time-step normalized probability distributions;
a language model, connected to the acoustic model, comprising a Transformer encoder and an n-gram model connected in sequence; the Transformer encoder outputs a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model processes the character sequences output by the Transformer encoder and selects the target Chinese text to output.
The VGG-Net layer comprises a plurality of VGG blocks. Each of the earlier VGG blocks consists of two 3 × 3 convolutional layers followed by a 2 × 2 max-pooling layer, and the number of channels is doubled after each block of convolutions to reduce the information loss caused by the max-pooling downsampling. The last two VGG blocks consist of two 3 × 3 convolutional layers without a max-pooling layer, so that the depth of the model is increased to learn deeper information in the acoustic signal.
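A minimal PyTorch sketch of such a VGG block follows; the channel counts and the block arrangement are illustrative assumptions, not values disclosed by the patent:

```python
# Sketch of one VGG block as described above: two 3x3 convolutions,
# with a 2x2 max-pooling layer in the earlier blocks only.
import torch
import torch.nn as nn

class VGGBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, use_pool: bool = True):
        super().__init__()
        layers = [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ]
        if use_pool:                      # earlier blocks: 2x2 max pooling
            layers.append(nn.MaxPool2d(kernel_size=2))
        self.block = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# Earlier blocks pool and double the channels; the last two blocks do not pool.
vgg_net = nn.Sequential(
    VGGBlock(1, 64),
    VGGBlock(64, 128),
    VGGBlock(128, 128, use_pool=False),
    VGGBlock(128, 128, use_pool=False),
)
x = torch.randn(8, 1, 80, 200)            # (batch, channel, FBank bins, frames)
print(vgg_net(x).shape)                   # torch.Size([8, 128, 20, 50])
```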
Preferably, the bidirectional RNN layer adopts a GRU structure.
After processing by the Softmax layer and the CTC layer, the output at each time step is a normalized probability distribution whose length equals the number of minimal acoustic units; each component of the distribution represents the probability that the time step corresponds to a particular character.
The acoustic model adopts a CNN-RNN-CTC architecture and therefore requires no Gaussian assumption on its posterior probabilities. At the same time, the CNN-RNN architecture has a strong ability to learn context. In addition, the entropy of each time step is computed from the output probability distribution to determine the candidate pinyin sequences, which gives the model a self-repairing capability.
The language model of the invention adopts a Transformer encoder structure. Through the Transformer's self-attention, it can efficiently learn and make maximal use of the context of the text, and it has a strong ability to mine semantic information. At the same time, because there is no dependency between time steps, computation across different time steps can be parallelized through matrix multiplication, which greatly reduces the training time of the language model.
Finally, the invention fuses the Transformer and the n-gram model into a single language model that serves as the last unit before the output and is responsible for scoring the multiple candidate outputs, thereby obtaining the final recognition result that best fits the current context and human expression habits.
Drawings
FIG. 1 is a schematic diagram of an end-to-end speech recognition system architecture based on deep learning;
FIG. 2 is a schematic diagram of the structure of a VGG block;
FIG. 3 is a schematic diagram of an acoustic model output;
FIG. 4 is a signal processing diagram of the language model;
FIG. 5 is a schematic diagram of an end-to-end speech recognition system process based on deep learning.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, the deep-learning-based end-to-end speech recognition system of the invention comprises:
an acoustic model, comprising in sequence a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, which extracts two-dimensional FBank features from the audio and processes them through these layers to obtain a normalized probability distribution for each time step; the entropy of each time step's distribution is then computed, and the candidate pinyin sequences to output are determined from the resulting entropy values;
a language model, connected to the acoustic model, comprising a Transformer encoder and an n-gram model connected in sequence; the Transformer encoder outputs a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model processes the character sequences output by the Transformer encoder and selects the target Chinese text to output.
For the acoustic model, the input in the invention is an audio file in WAV format, from which the model extracts two-dimensional FBank features using the FBank algorithm. The front part of the model adopts the VGG-Net (Visual Geometry Group Network) architecture from CNNs, which is very simple, repeatedly applying 3 × 3 convolution kernels and 2 × 2 max-pooling layers. The VGG-Net is composed of several VGG blocks, whose basic structure is shown in FIG. 2. In the first few VGG blocks, the number of channels is doubled after each block of convolutions to reduce the information loss caused by the max-pooling downsampling. The last two VGG blocks use convolutional layers without max-pooling, with the goal of increasing the depth of the model so as to learn the textual information hidden deeper in the acoustic signal.
After the VGG-Net structure outputs a multi-channel three-dimensional tensor, the channels are first merged into a single channel by concatenating the per-channel data, reducing the three-dimensional output to a two-dimensional matrix in which each row is treated as one time step. The data of each time step then passes through a fully connected layer behind the VGG-Net structure, which reduces its dimensionality and thus the computation of the subsequent model. Next, a bidirectional RNN layer (using a GRU structure) takes the input of each time step and learns the deeper context in the audio data. A second fully connected layer then outputs the logits of each time step, mapping the RNN output at each time step to the number of minimal acoustic units (the minimal units being Chinese pinyin syllables plus an extra blank character representing the interval between syllables). After Softmax normalization, the output at each time step is a normalized probability distribution whose length equals the number of minimal acoustic units, in which each component represents the probability that the time step corresponds to a particular character (i.e., a pinyin syllable or the blank character). Finally, a CTC structure uses the difference between the per-time-step outputs and the ground truth as the final loss function of the acoustic model.
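Assembling the pipeline just described into a runnable sketch (with a toy single-block VGG front end; none of the sizes below, including the unit inventory of 1210, are disclosed in the patent and all are illustrative assumptions):

```python
# Sketch of the acoustic pipeline described above:
# VGG front end -> FC -> bidirectional GRU -> FC -> log-softmax -> CTC loss.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_mels=80, n_units=1210, hidden=256):
        super().__init__()
        self.vgg = nn.Sequential(                 # stand-in for the VGG-Net layer
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halves time and frequency
        )
        self.fc1 = nn.Linear(32 * (n_mels // 2), hidden)  # first FC: dim. reduction
        self.rnn = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.fc2 = nn.Linear(2 * hidden, n_units)         # logits: pinyin units + blank

    def forward(self, fbank):                     # fbank: (batch, frames, n_mels)
        x = self.vgg(fbank.unsqueeze(1))          # (batch, ch, frames/2, n_mels/2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)    # merge channels per step
        x, _ = self.rnn(self.fc1(x))
        return self.fc2(x).log_softmax(-1)        # (batch, t, n_units)

model = AcousticModel()
log_probs = model(torch.randn(4, 200, 80)).transpose(0, 1)  # CTC wants (T, N, C)
targets = torch.randint(1, 1210, (4, 30))                   # index 0 is the blank
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((4,), 100, dtype=torch.long),  # 200 frames pooled to 100
    target_lengths=torch.full((4,), 30, dtype=torch.long),
)
print(loss.item())
```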
Because the acoustic model is trained with deep learning, it avoids the Gaussian assumption; the VGG-Net architecture from CNNs keeps the model structure simple while improving performance by progressively deepening the network. In addition, the bidirectional GRU structure makes full use of the context dependencies in the speech, and entropy values are finally used to determine all the candidate pinyin sequences to output.
In the invention, the output of the acoustic model is, for each time step, a normalized probability distribution over the minimal acoustic units, from which the corresponding pinyin sequence can be obtained.
For each time step, the entropy of its probability distribution is computed to measure the degree of confusion (i.e., the uncertainty) at that step. As shown in FIG. 3, the entropy of the distribution at the first time step from the left is significantly greater than the entropies at the other time steps. By setting a threshold, the most probable and the second most probable pinyin predictions are output simultaneously for every time step with low confidence (i.e., high confusion), and the candidate pinyin sequences are obtained by permutation and combination; the number of candidate sequences is 2^n, where n is the number of time steps whose distribution entropy exceeds the threshold.
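The candidate-expansion rule can be sketched as follows; the unit inventory, the distributions and the threshold below are made-up numbers for illustration only:

```python
# Sketch of the entropy-based candidate expansion: uncertain time steps
# contribute their top-2 pinyin hypotheses, giving 2**n candidates.
import itertools
import numpy as np

def candidate_sequences(probs, units, threshold):
    """probs: (T, U) per-step distributions; units: U pinyin labels."""
    per_step = []
    for p in probs:
        entropy = -np.sum(p * np.log(p + 1e-12))
        top2 = np.argsort(p)[::-1][:2]
        if entropy > threshold:          # low confidence: keep the top two
            per_step.append([units[top2[0]], units[top2[1]]])
        else:                            # high confidence: keep only the best
            per_step.append([units[top2[0]]])
    return [list(seq) for seq in itertools.product(*per_step)]

units = ["yu3", "yin1", "shi2", "shi4", "_"]
probs = np.array([
    [0.35, 0.30, 0.15, 0.15, 0.05],      # high-entropy (uncertain) step
    [0.01, 0.94, 0.02, 0.02, 0.01],
    [0.02, 0.02, 0.90, 0.04, 0.02],
])
for cand in candidate_sequences(probs, units, threshold=1.0):
    print(cand)                          # 2**1 = 2 candidate sequences
```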
For the language model, the invention adopts a Transformer encoder structure based on the self-attention mechanism. It takes the output of the acoustic model (the pinyin sequence) as its input and, through multi-head self-attention, outputs a Chinese character sequence of the same length as the pinyin sequence. Because self-attention has strong context-learning ability and fast computation, it can efficiently learn the context of the text, giving the model stronger inference ability and faster convergence. Moreover, since the Transformer's self-attention involves no dependency between time steps, computation across different time steps can be parallelized through matrix multiplication, greatly reducing the training time of the language model.
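A minimal sketch of such an equal-length pinyin-to-character encoder in PyTorch (the vocabulary sizes, width and depth are illustrative assumptions, and positional encoding is omitted for brevity):

```python
# Sketch: pinyin ids in, one character prediction per pinyin position out.
import torch
import torch.nn as nn

class PinyinToCharEncoder(nn.Module):
    def __init__(self, n_pinyin=1200, n_chars=5000, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_pinyin, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, n_chars)    # one character per position

    def forward(self, pinyin_ids):                # (batch, seq_len)
        h = self.encoder(self.embed(pinyin_ids))  # all steps attended in parallel
        return self.out(h)                        # (batch, seq_len, n_chars)

lm = PinyinToCharEncoder()
pinyin_ids = torch.randint(0, 1200, (2, 6))       # two candidate pinyin sequences
char_logits = lm(pinyin_ids)
print(char_logits.argmax(-1).shape)               # equal-length output: (2, 6)
```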
In addition, the language model takes each of the acoustic model's candidate outputs as an input and obtains a corresponding language model output for each. All of these outputs are scored by an n-gram model whose statistics were computed in advance over massive data; the candidates are ranked and the highest-scoring one is taken as the final output, yielding the most fluent natural-language text, as shown in FIG. 4.
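This reranking step can be illustrated with a toy bigram scorer with add-one smoothing; the two-sentence "corpus" and the candidate texts are stand-ins, since the patent's n-gram statistics come from massive data and are not disclosed:

```python
# Toy bigram reranker: score each candidate character sequence and keep the best.
import math
from collections import Counter

corpus = [list("语音识别"), list("语音世界")]     # "speech recognition", "speech world"
unigrams = Counter(c for s in corpus for c in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
vocab = len(unigrams)

def score(sentence):
    """Smoothed bigram log-probability of a candidate character sequence."""
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(sentence, sentence[1:])
    )

# 语音 ("speech") vs 雨音 ("rain sound"): both are read yu3 yin1
candidates = [list("语音识别"), list("雨音识别")]
best = max(candidates, key=score)
print("".join(best))                               # 语音识别
```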
As shown in FIG. 5, suppose the acoustic model outputs the characters "yu3, yin1, shi2, shi4, _, bie2", numbered 0 to 5 in sequence, where the digits denote tones and the underscore denotes the blank character.
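The description breaks off at this point, so the downstream handling of this example is not spelled out in the text; under the standard CTC collapsing rule implied by the blank character, the step would look like this sketch:

```python
# Assumption (not stated in the truncated example above): merge repeated
# units and drop the blank "_", per the usual CTC collapsing rule.
def collapse(units, blank="_"):
    out = []
    for u in units:
        if u != blank and (not out or out[-1] != u):
            out.append(u)
    return out

print(collapse(["yu3", "yin1", "shi2", "shi4", "_", "bie2"]))
# -> ['yu3', 'yin1', 'shi2', 'shi4', 'bie2']
```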
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the invention, and such modifications and improvements should also be regarded as falling within the protection scope of the invention.
Claims (4)
1. An end-to-end speech recognition system based on deep learning, comprising:
an acoustic model, comprising in sequence a VGG-Net layer, a first fully connected layer, a bidirectional RNN layer, a second fully connected layer, a Softmax layer and a CTC layer, configured to extract two-dimensional FBank features from the audio, process them through these layers to obtain a normalized probability distribution for each time step, and output candidate pinyin sequences according to the entropy of the per-time-step normalized probability distributions;
a language model, connected to the acoustic model, comprising a Transformer encoder and an n-gram model connected in sequence; wherein the Transformer encoder is configured to output a Chinese character sequence of the same length as the input candidate pinyin sequence, and the n-gram model is configured to process the character sequences output by the Transformer encoder and select the target Chinese text for output.
2. The deep-learning-based end-to-end speech recognition system of claim 1, wherein the VGG-Net layer comprises a plurality of VGG blocks; each of the earlier VGG blocks comprises two 3 × 3 convolutional layers and a 2 × 2 max-pooling layer, and the number of channels is doubled after each block of convolutions to reduce the information loss caused by the max-pooling downsampling; the last two VGG blocks comprise two 3 × 3 convolutional layers without a max-pooling layer, so that the depth of the model is increased to learn deeper information in the acoustic signal.
3. The deep learning-based end-to-end speech recognition system of claim 1, wherein the bidirectional RNN layer is a GRU structure.
4. The deep-learning-based end-to-end speech recognition system of claim 1, wherein the output at each time step, after processing by the Softmax layer and the CTC layer, is a normalized probability distribution whose length equals the number of minimal acoustic units, in which each component represents the probability that the time step corresponds to a particular character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911391159.XA | 2019-12-30 | 2019-12-30 | End-to-end voice recognition system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111063336A (en) | 2020-04-24 |
Family
ID=70304577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911391159.XA | End-to-end voice recognition system based on deep learning | 2019-12-30 | 2019-12-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111063336A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107408384A (en) * | 2015-11-25 | 2017-11-28 | 百度(美国)有限责任公司 | The end-to-end speech recognition of deployment |
JP2019159058A (en) * | 2018-03-12 | 2019-09-19 | 国立研究開発法人情報通信研究機構 | Speech recognition system, speech recognition method, learned model |
CN110415683A (en) * | 2019-07-10 | 2019-11-05 | 上海麦图信息科技有限公司 | A kind of air control voice instruction recognition method based on deep learning |
Non-Patent Citations (3)
Title |
---|
LEI KANG ET AL.: "Convolve, Attend and Spell: An Attention-based Sequence-to-Sequence Model for Handwritten Word Recognition", GCPR 2018: Pattern Recognition |
SHIYU ZHOU ET AL.: "Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages", arXiv:1806.05059v2 |
XINPEI ZHOU ET AL.: "Cascaded CNN-resBiLSTM-CTC: An End-to-End Acoustic Model for Speech Recognition", arXiv:1810.12001v2 |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112086087A (en) * | 2020-09-14 | 2020-12-15 | 广州市百果园信息技术有限公司 | Speech recognition model training method, speech recognition method and device |
CN112086087B (en) * | 2020-09-14 | 2024-03-12 | 广州市百果园信息技术有限公司 | Speech recognition model training method, speech recognition method and device |
CN112116907A (en) * | 2020-10-22 | 2020-12-22 | 浙江同花顺智能科技有限公司 | Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium |
CN113011127A (en) * | 2021-02-08 | 2021-06-22 | 杭州网易云音乐科技有限公司 | Text phonetic notation method and device, storage medium and electronic equipment |
CN113160798A (en) * | 2021-04-28 | 2021-07-23 | 厦门大学 | Chinese civil aviation air traffic control voice recognition method and system |
CN113160798B (en) * | 2021-04-28 | 2024-04-16 | 厦门大学 | Chinese civil aviation air traffic control voice recognition method and system |
CN113255888A (en) * | 2021-05-26 | 2021-08-13 | 东南大学 | End-to-end hand-sending equal-amplitude telegraph decoding system based on deep learning |
CN113327585A (en) * | 2021-05-31 | 2021-08-31 | 杭州芯声智能科技有限公司 | Automatic voice recognition method based on deep neural network |
CN113763519A (en) * | 2021-11-09 | 2021-12-07 | 江苏原力数字科技股份有限公司 | Voice-driven 3D character facial expression method based on deep learning |
CN114758649A (en) * | 2022-04-06 | 2022-07-15 | 北京百度网讯科技有限公司 | Voice recognition method, device, equipment and medium |
CN114758649B (en) * | 2022-04-06 | 2024-04-19 | 北京百度网讯科技有限公司 | Voice recognition method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111063336A (en) | End-to-end voice recognition system based on deep learning | |
US11314921B2 (en) | Text error correction method and apparatus based on recurrent neural network of artificial intelligence | |
CN111199727B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN111145728B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
KR102423302B1 (en) | Apparatus and method for calculating acoustic score in speech recognition, apparatus and method for learning acoustic model | |
CN109065032B (en) | External corpus speech recognition method based on deep convolutional neural network | |
CN111145729B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN111429889A (en) | Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention | |
CN111210807B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
EP4018437B1 (en) | Optimizing a keyword spotting system | |
CN110070855B (en) | Voice recognition system and method based on migrating neural network acoustic model | |
US20180068652A1 (en) | Apparatus and method for training a neural network language model, speech recognition apparatus and method | |
CN108389575B (en) | Audio data identification method and system | |
CN110675859A (en) | Multi-emotion recognition method, system, medium, and apparatus combining speech and text | |
CN112242144A (en) | Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium | |
CN112184859A (en) | End-to-end virtual object animation generation method and device, storage medium and terminal | |
CN111199149A (en) | Intelligent statement clarifying method and system for dialog system | |
CN111243591B (en) | Air control voice recognition method introducing external data correction | |
CN112349294A (en) | Voice processing method and device, computer readable medium and electronic equipment | |
CN105869622B (en) | Chinese hot word detection method and device | |
CN116303966A (en) | Dialogue behavior recognition system based on prompt learning | |
CN113327585B (en) | Automatic voice recognition method based on deep neural network | |
CN114333768A (en) | Voice detection method, device, equipment and storage medium | |
CN111009236A (en) | Voice recognition method based on DBLSTM + CTC acoustic model | |
JP7445089B2 (en) | Fast-emission low-latency streaming ASR using sequence-level emission regularization |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200424 |