WO2018227780A1 - Speech recognition method, apparatus, computer device and storage medium - Google Patents

Speech recognition method, apparatus, computer device and storage medium

Info

Publication number
WO2018227780A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
voice data
filter bank
probability matrix
model
Prior art date
Application number
PCT/CN2017/100043
Other languages
English (en)
French (fr)
Inventor
梁浩
王健宗
程宁
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Priority to US16/348,807 (US11062699B2)
Publication of WO2018227780A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the field of computer processing, and in particular, to a voice recognition method, apparatus, computer device, and storage medium.
  • Speech recognition, also known as Automatic Speech Recognition (ASR), aims to let machines turn speech signals into text through recognition and understanding, and is an important branch of modern artificial intelligence.
  • Speech recognition technology is a prerequisite for natural language processing; it can effectively promote the development of voice-interaction-related fields and greatly facilitate daily life, for example in smart homes and voice input.
  • the accuracy of speech recognition directly determines the effectiveness of technical applications.
  • Traditional speech recognition technology builds the acoustic model on GMM-HMM (a Gaussian mixture model combined with a hidden Markov model).
  • In recent years, with the development of deep learning, building the acoustic model on DNN-HMM (a deep neural network combined with a hidden Markov model) has greatly improved recognition accuracy over GMM-HMM, but the accuracy of speech recognition still needs to be improved further.
  • According to various embodiments of the present application, a voice recognition method, apparatus, computer device, and storage medium are provided.
  • a speech recognition method comprising:
  • a speech recognition device comprising:
  • An acquiring module, configured to acquire voice data to be identified;
  • An extraction module, configured to extract a Filter Bank feature and an MFCC feature from the voice data;
  • A first output module, configured to use the MFCC feature as input data of the trained GMM-HMM model and obtain a first likelihood probability matrix output by the trained GMM-HMM model;
  • A posterior probability matrix output module, configured to use the Filter Bank feature as an input feature of the trained LSTM model with connection units and obtain a posterior probability matrix output by the LSTM model with connection units, the connection unit being used to control the flow of information between layers in the LSTM model;
  • A second output module, configured to use the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model to obtain an output second likelihood probability matrix; and
  • A decoding module, configured to acquire, in the phoneme decoding network, a target word sequence corresponding to the voice data to be identified according to the second likelihood probability matrix.
  • A computer device comprising a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the steps of the method above.
  • One or more non-transitory readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method above, beginning with acquiring the voice data to be identified.
  • FIG. 1 is a block diagram showing the internal structure of a computer device in an embodiment;
  • FIG. 2 is an architecture diagram of speech recognition in an embodiment;
  • FIG. 3 is a flowchart of a voice recognition method in an embodiment;
  • FIG. 4 is a flowchart of a method for obtaining a posterior probability matrix through the LSTM model with connection units in an embodiment;
  • FIG. 5 is a flowchart of a method for extracting Filter Bank features and MFCC features from voice data in an embodiment;
  • FIG. 6 is a flowchart of a method for obtaining a posterior probability matrix through the LSTM model with connection units in another embodiment;
  • FIG. 7 is a flowchart of a method for establishing the GMM-HMM model and the LSTM model with connection units in an embodiment;
  • FIG. 8 is a flowchart of a speech recognition method in another embodiment;
  • FIG. 9 is a block diagram showing the structure of a voice recognition apparatus in an embodiment;
  • FIG. 10 is a structural block diagram of a posterior probability matrix output module in an embodiment;
  • FIG. 11 is a block diagram showing the structure of a voice recognition apparatus in another embodiment;
  • FIG. 12 is a block diagram showing the structure of a speech recognition apparatus in yet another embodiment.
  • FIG. 1 is a schematic diagram of the internal structure of a computer device in one embodiment.
  • The computer device can be a terminal or a server.
  • The computer device includes a processor, a non-volatile storage medium, an internal memory, a network interface, a display screen, and an input device, which are connected through a system bus.
  • the non-volatile storage medium of the computer device can store an operating system and computer readable instructions that, when executed, cause the processor to perform a speech recognition method.
  • the processor of the computer device is used to provide computing and control capabilities to support the operation of the entire computer device.
  • the internal memory can store computer readable instructions that, when executed by the processor, cause the processor to perform a speech recognition method.
  • the network interface of the computer device is used for network communication.
  • The display screen of the computer device may be a liquid crystal display or an electronic ink display.
  • The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the casing of the computer device, or an external keyboard, touchpad, or mouse.
  • The touch layer and the display screen form a touch screen.
  • First, the framework of speech recognition is introduced. As shown in FIG. 2, speech recognition mainly consists of two parts, an acoustic model and a language model, which are then combined with a dictionary to form the framework of speech recognition.
  • The process of speech recognition is the process of converting an input sequence of speech features into a sequence of characters according to the dictionary, the acoustic model, and the language model.
  • the role of the acoustic model is to obtain the mapping between phonetic features and phonemes.
  • the role of the language model is to obtain the mapping between words and words, words and sentences.
  • the role of the dictionary is to obtain the mapping between words and phonemes.
  • the process of specific speech recognition can be divided into three steps.
  • the first step is to identify the speech frame into a phoneme state, that is, to align the speech frame and the phoneme state.
  • the second step is to combine the phonemes according to the phoneme state.
  • the third step is to combine the phonemes into words.
  • The first step is the role of the acoustic model; it is both the key point and the difficulty: the more accurate the alignment between speech frames and phoneme states, the better the speech recognition result.
  • The phoneme state is a finer-grained speech unit than the phoneme; usually one phoneme consists of three phoneme states.
  • a voice recognition method is proposed, which can be applied to a terminal or a server, and specifically includes the following steps:
  • Step 302 Acquire voice data to be identified.
  • In this embodiment, the voice data to be identified is usually audio data input by the user and obtained through an interactive application, including audio of digits and audio of words.
  • Step 304: the Filter Bank feature and the MFCC feature in the voice data are extracted.
  • In this embodiment, the Filter Bank feature and the MFCC (Mel frequency cepstrum coefficient) feature are both parameters used in speech recognition to represent speech characteristics; the Filter Bank feature is used for the deep learning model, while the MFCC feature is used for the Gaussian mixture model.
  • Specifically, the input voice data is first preprocessed. Pre-emphasis is applied with a high-pass filter to boost the high-frequency part of the speech signal and flatten its spectrum; the pre-emphasized voice data is then divided into frames and windowed, converting the non-stationary speech signal into short-time stationary signals; and endpoint detection is used to distinguish speech from noise and extract the valid speech portion.
  • To extract the Filter Bank feature and the MFCC feature from the voice data, the preprocessed speech data first undergoes a fast Fourier transform, converting the time-domain speech signal into a frequency-domain energy spectrum for analysis.
  • The energy spectrum is then passed through a set of Mel-scale triangular filter banks, which highlight the formant characteristics of the speech, and the logarithmic energy output by each filter bank is calculated.
  • The features output by the filter banks are the Filter Bank features.
  • Further, the calculated logarithmic energy undergoes a discrete cosine transform to obtain the MFCC coefficients, i.e., the MFCC features.
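  • To make the feature extraction pipeline above concrete, the following is a minimal NumPy/SciPy sketch of the FFT, Mel-scale triangular filter bank, log-energy, and DCT steps. The parameter values (512-point FFT, 26 filters, 13 coefficients, 16 kHz sampling) are illustrative assumptions rather than values taken from this application, and the input is assumed to be already pre-emphasized, framed, and windowed as described above.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the Mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz = 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def fbank_and_mfcc(frames, sr=16000, n_fft=512, n_mels=26, n_mfcc=13):
    """frames: (T, frame_len) pre-emphasized, windowed speech frames."""
    # Fast Fourier transform -> power (energy) spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log energy of each Mel-scale triangular filter -> Filter Bank feature
    fbank = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # Discrete cosine transform decorrelates the dimensions -> MFCC feature
    mfcc = dct(fbank, type=2, axis=1, norm="ortho")[:, :n_mfcc]
    return fbank, mfcc
```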
  • Step 306 The MFCC feature is used as input data of the trained GMM-HMM model, and the first likelihood probability matrix of the output of the trained GMM-HMM model is obtained.
  • the acoustic model and the language model collectively realize the recognition of the voice.
  • the role of the acoustic model is to identify the alignment relationship between the speech frame and the phoneme state.
  • the GMM-HMM model is part of the acoustic model and is used to initially align the speech frame with the phoneme state.
  • Specifically, the extracted MFCC features of the voice data to be recognized are used as the input data of the trained GMM-HMM model, and the likelihood probability matrix output by the model is then obtained; for convenience and to distinguish it from what follows, it is referred to herein as the "first likelihood probability matrix".
  • The likelihood probability matrix represents the alignment relationship between speech frames and phoneme states; that is, this alignment can be obtained from the calculated likelihood probability matrix. However, the alignment obtained by GMM-HMM training is not very accurate, so the first likelihood probability matrix amounts to a preliminary alignment of speech frames and phoneme states.
  • The specific calculation formula of the GMM model is as follows:

$$b(x) = \frac{1}{(2\pi)^{K/2}\,|D|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{\top} D^{-1}(x-\mu)\right)$$

  • where x represents the extracted speech feature (MFCC) vector, μ and D are the mean and the variance matrix respectively, and K represents the order of the MFCC coefficients.
  • Step 308: the Filter Bank feature is used as an input feature of the trained LSTM model with connection units, and the posterior probability matrix output by the LSTM model with connection units is obtained; the connection unit is used to control the information flow between the layers in the LSTM model.
  • the LSTM model belongs to the deep learning model and is also part of the acoustic model.
  • The LSTM with connection units is an innovative model built on the traditional LSTM model: a connection unit is added between the layers of the traditional LSTM model, and through it the flow of information between layers can be controlled, so effective information can be filtered. The connection unit also allows the LSTM model to be trained with deeper layers, and the more layers there are, the better the feature expression obtained and the better the recognition result. The LSTM model with connection units can therefore improve not only the speed of speech recognition but also its accuracy.
  • the connection unit is implemented by a sigmoid function.
  • The principle is that the output of the previous layer passes through a gate composed of a sigmoid function, which controls the information flowing into the next layer; that is, the gated output serves as the input of the next LSTM layer.
  • The value of this sigmoid function is determined jointly by the state of the neuron nodes of the previous layer, the output of the neuron nodes of the previous layer, and the input of the neuron nodes of the next layer.
  • Neuron nodes are responsible for the computational expression of the neural network model; each node contains certain computational relationships, which can be understood as a kind of formula and may be the same or different across nodes. The number of neuron nodes in each LSTM layer is determined by the number of input frames and the feature vectors; for example, if the input splices the 5 frames before and after the current frame, there are 11 input frame vectors in total, and if the extracted Filter Bank feature is an 83-dimensional vector, each layer of the trained LSTM model accordingly has 11 x 83 = 913 neuron nodes.
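  • As a concrete illustration of this gating, the sketch below implements one plausible form of the connection unit. The published formula for the gate is reproduced only as an image, so the exact weighted combination shown here, and names such as connection_unit, Wx, Wc, and Wl, are assumptions based on the description above (gate value from previous-layer state, previous-layer output, and next-layer input).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def connection_unit(h_prev, c_prev, x_next, Wc, Wl, Wx, b):
    """Inter-layer connection unit (hypothetical form).

    h_prev: output of the previous LSTM layer's neuron nodes
    c_prev: cell state of the previous layer's neuron nodes
    x_next: input of the next layer's neuron nodes
    """
    # Gate value determined jointly by previous-layer state, previous-layer
    # output, and next-layer input, via a sigmoid threshold
    d = sigmoid(Wc @ c_prev + Wl @ h_prev + Wx @ x_next + b)
    # Element-wise product: only the gated fraction of the previous layer's
    # output flows on as the next layer's input
    return d * h_prev
```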
  • Step 310 The posterior probability matrix and the first likelihood probability matrix are used as input data of the trained HMM model, and the second likelihood probability matrix of the output is obtained.
  • The HMM (hidden Markov model) is a statistical model used to describe a Markov process with hidden, unknown parameters; its role is to determine the hidden parameters of the process from the observable parameters.
  • the HMM model mainly involves five parameters, which are two state sets and three probability sets.
  • The two state sets are the hidden states and the observation states, and the three probability sets are the initial matrix, the transition matrix, and the confusion matrix.
  • The transition matrix is obtained by training; that is, once training of the HMM model is completed, the transition matrix is determined.
  • In this embodiment, the observable speech features (the Filter Bank features) are mainly used as the observation states to compute the correspondence between phoneme states and speech frames (i.e., the hidden states). To determine this correspondence, two further parameters must be determined: the initial matrix and the confusion matrix.
  • The posterior probability matrix calculated by the LSTM model with connection units is the confusion matrix to be determined in the HMM model, and the first likelihood probability matrix is the initial matrix to be determined. Therefore, by using the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model, the output second likelihood probability matrix can be obtained.
  • The second likelihood probability matrix represents the final alignment relationship between phoneme states and speech frames. Subsequently, according to this second likelihood probability matrix, the target word sequence corresponding to the voice data to be recognized can be acquired in the phoneme decoding network.
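  • For illustration, the sketch below shows the standard HMM forward recursion, the kind of computation that combines an initial matrix, a transition matrix, and per-frame emission (confusion) scores into frame-by-frame likelihoods. The application does not spell out the exact recursion used to produce the second likelihood probability matrix, so this is an assumption based on standard HMM practice.

```python
import numpy as np

def hmm_forward(log_init, log_trans, log_obs):
    """Forward recursion over an HMM in log space.

    log_init: (S,) log initial state probabilities
    log_trans: (S, S) log transition matrix
    log_obs: (T, S) per-frame log emission scores
    Returns the (T, S) matrix of forward log-probabilities.
    """
    T, S = log_obs.shape
    alpha = np.empty((T, S))
    alpha[0] = log_init + log_obs[0]
    for t in range(1, T):
        # Sum over predecessor states, then add the emission score
        alpha[t] = np.logaddexp.reduce(
            alpha[t - 1][:, None] + log_trans, axis=0) + log_obs[t]
    return alpha
```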
  • Step 312 Acquire a target word sequence corresponding to the voice data to be recognized in the phoneme decoding network according to the second likelihood probability matrix.
  • In this embodiment, the speech recognition process includes two parts: an acoustic model and a language model. Before recognition, a phoneme-level decoding network is first built according to the trained acoustic model, the language model, and the dictionary, and a search algorithm is used to find the best path in this network.
  • The search algorithm may be the Viterbi algorithm. This path is the word string that can be output with the maximum probability for the voice data to be recognized, which determines the text contained in the voice data.
  • The phoneme-level decoding network (i.e., the phoneme decoding network) is built with finite state transducer (FST) algorithms, such as the determinization and minimization algorithms: a sentence is split into words, the words are split into phonemes (such as Chinese initials and finals, or English phonetic symbols), and the phonemes are then aligned with the pronunciation dictionary, the grammar, and so on by the above methods to obtain the output phoneme decoding network.
  • The phoneme decoding network contains all possible recognition paths. Decoding prunes the paths of this huge network according to the input voice data to obtain one or more candidate paths, which are stored in a word-lattice data structure; the final recognition step scores the candidate paths, and the highest-scoring path is the recognition result.
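  • Since the Viterbi algorithm is named as the search algorithm, a minimal sketch follows. A real decoder searches the WFST-based phoneme decoding network described above rather than a dense state space, so the dense transition matrix here is an illustrative simplification.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most probable state path through an HMM (illustrative only).

    log_init: (S,) log initial probabilities
    log_trans: (S, S) log transition matrix
    log_obs: (T, S) per-frame log observation scores
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)         # best predecessor of each j
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]                # best final state
    for t in range(T - 1, 0, -1):               # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```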
  • In this embodiment, speech recognition combines the Gaussian mixture model GMM with the long short-term memory recurrent neural network LSTM from deep learning. The GMM-HMM model is first used to calculate the first likelihood probability matrix from the extracted MFCC features; this matrix represents the alignment of the voice data on phoneme states. Further alignment is then performed on top of this preliminary result using the LSTM, and the LSTM adopts the innovative LSTM model with connection units.
  • This model adds connection units between the layers of the traditional LSTM model; the connection units can control the flow of information between layers and filter the effective information, which can improve not only the speed of recognition but also its accuracy.
  • As shown in FIG. 4, in one embodiment, the connection unit is implemented by a sigmoid function, and using the Filter Bank feature as an input feature of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection unit being used to control the flow of information between the layers in the LSTM model, includes:
  • Step 308a: use the Filter Bank feature as an input feature of the trained LSTM model with connection units.
  • Step 308b: determine the sigmoid function value corresponding to the connection unit between layers according to the state and output of the neuron nodes of the previous layer in the LSTM model and the input of the neuron nodes of the next layer.
  • Step 308c: output the posterior probability matrix corresponding to the Filter Bank feature according to the sigmoid function values corresponding to the connection units between layers.
  • In this embodiment, the connection unit is implemented by a sigmoid function, which in the LSTM model controls the flow of information between layers, for example whether information flows and how much. The value of the sigmoid function is determined by the state of the neuron nodes of the previous layer, the output of the neuron nodes of the previous layer, and the input of the neuron nodes of the next layer. Specifically, the sigmoid function is σ(x) = 1/(1 + e^{-x}).
  • The LSTM has three gate controls, an input gate, a forget gate, and an output gate; the function of the output gate is to control the output flow of the neuron node. In the gate computation, ⊙ is an operator denoting element-wise multiplication of corresponding entries of two matrices.
  • The values of the bias term b and the weight matrix W are determined once the model has finished training, so from the input it is possible to determine how much information flows between the layers; once the information flow between layers is determined, the output posterior probability matrix corresponding to the Filter Bank feature can be obtained.
  • As shown in FIG. 5, in one embodiment, the step 304 of extracting the Filter Bank feature and the MFCC feature from the voice data includes:
  • Step 304A: convert the voice data to be recognized into an energy spectrum in the frequency domain by Fourier transform.
  • In this embodiment, because the time-domain waveform of a speech signal rarely reveals the signal's characteristics, it usually needs to be converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech. The voice data to be recognized therefore needs to undergo a fast Fourier transform to obtain the energy distribution over the spectrum.
  • Specifically, the spectrum of each frame is obtained by applying a fast Fourier transform to each frame of the speech signal, and the power spectrum (i.e., the energy spectrum) of the speech signal is obtained by taking the squared modulus of the spectrum.
  • Step 304B: use the energy spectrum in the frequency domain as the input feature of the Mel-scale triangular filter banks, and calculate the Filter Bank feature of the voice data to be recognized.
  • In this embodiment, to obtain the Filter Bank feature of the voice data to be recognized, the obtained frequency-domain energy spectrum is used as the input feature of the Mel-scale triangular filter banks, and the logarithmic energy output by each triangular filter bank is calculated, yielding the Filter Bank feature of the voice data to be recognized.
  • The Filter Bank feature is likewise obtained per frame: the energy spectrum corresponding to each frame of the speech signal is used as the input feature of the Mel-scale triangular filter banks, giving the Filter Bank feature corresponding to each frame of the speech signal.
  • Step 304C: apply a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • In this embodiment, to obtain the MFCC feature of the voice data to be recognized, the logarithmic energy output by the filter banks also needs to undergo a discrete cosine transform to obtain the corresponding MFCC feature.
  • The MFCC feature corresponding to each frame of the speech signal is obtained by applying the discrete cosine transform to the Filter Bank feature corresponding to that frame.
  • The difference between the Filter Bank feature and the MFCC feature is that the Filter Bank feature retains data correlation between different feature dimensions, whereas the MFCC feature is obtained by using the discrete cosine transform to remove the data correlation of the Filter Bank feature.
  • As shown in FIG. 6, in one embodiment, step 308 of using the Filter Bank feature as an input feature of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection unit being used to control the flow of information between layers in the LSTM model, includes:
  • Step 308A: acquire the Filter Bank feature corresponding to each frame of voice data in the voice data to be identified, and sort the features by time.
  • In this embodiment, the voice data is first divided into frames, the Filter Bank feature corresponding to each frame of voice data is then extracted, and the features are sorted chronologically; that is, the per-frame Filter Bank features are ordered according to the order in which the frames appear in the voice data to be recognized.
  • Step 308B: use the Filter Bank feature of each frame of voice data, together with the Filter Bank features of a preset number of frames before and after each frame, as the input features of the trained LSTM model with connection units; control the flow of information between layers through the connection units; and obtain the output posterior probability over phoneme states corresponding to each frame of voice data.
  • In this embodiment, the input of the deep learning model uses multi-frame features, which is more advantageous than the traditional Gaussian mixture model with only single-frame input, because splicing the speech frames before and after helps capture the influence of contextual information on the current frame. Therefore, the Filter Bank feature of each frame of voice data and the Filter Bank features of a preset number of surrounding frames are generally used as the input features of the trained LSTM model with connection units. For example, the current frame is spliced with the 5 frames before and after it, so that 11 frames of data in total serve as the input features of the trained LSTM model with connection units; this 11-frame speech feature sequence passes through the nodes of the LSTM with connection units, which output the posterior probability over the phoneme states corresponding to the frame of voice data.
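  • A minimal sketch of the splicing described above follows: the current frame plus 5 frames of context on each side yields an 11-frame input vector per time step. Padding the edge frames by repetition is an assumption, since the text does not say how utterance boundaries are handled.

```python
import numpy as np

def splice_frames(feats, context=5):
    """feats: (T, D) time-ordered Filter Bank features.
    Returns (T, (2*context + 1) * D) spliced input features."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])
```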
  • Step 308C Determine a posterior probability matrix corresponding to the to-be-identified voice data according to a posterior probability corresponding to each frame of voice data.
  • In this embodiment, once the posterior probability corresponding to each frame of voice data has been obtained, the posterior probability matrix corresponding to the voice data to be identified is determined; the posterior probability matrix is composed of the individual posterior probabilities.
  • Because the LSTM model with connection units can contain information in both the time dimension and the layer dimension, it can obtain the posterior probability matrix corresponding to the voice data better than traditional models that only contain time-dimension information.
  • As shown in FIG. 7, in one embodiment, before the step of acquiring the voice data to be identified, the method further includes step 301: establishing the GMM-HMM model and the LSTM model with connection units. This specifically includes:
  • Step 301A: train the GMM-HMM model using the training corpus, determine the variance and mean of the GMM-HMM model through continuous iterative training, and generate the trained GMM-HMM model from the variance and mean.
  • In this embodiment, the GMM-HMM acoustic model is established using monophone training followed by triphone training.
  • Triphone training takes into account the influence of the phonemes surrounding the current phoneme, so it achieves a more accurate alignment effect and therefore better recognition results.
  • Depending on the features and purpose, triphone training generally uses triphone training based on delta+delta-delta features, and triphone training based on linear discriminant analysis plus maximum likelihood linear feature transformation.
  • Specifically, the speech features in the input training corpus are first normalized; by default the variance is normalized.
  • Speech feature normalization serves to eliminate the bias introduced into the feature extraction computation by convolutional noise such as the telephone channel.
  • An initial GMM-HMM model is then quickly obtained using a small amount of feature data, after which the variance and mean of the Gaussian mixture model GMM-HMM are determined through continuous iterative training; once the variance and mean are determined, the corresponding GMM-HMM model is determined accordingly.
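  • The text only states that the variance and mean are determined through continuous iterative training. Assuming the standard expectation-maximization (EM) procedure commonly used for GMM training, one iteration for a diagonal-covariance GMM might look like the following sketch.

```python
import numpy as np

def gmm_em_step(x, mu, var, w):
    """One EM iteration for a diagonal-covariance GMM (sketch).

    x: (N, K) MFCC feature vectors; mu, var: (M, K); w: (M,) mixture weights.
    """
    # E-step: per-frame responsibilities of each of the M Gaussians
    log_p = -0.5 * (((x[:, None] - mu) ** 2) / var
                    + np.log(2.0 * np.pi * var)).sum(-1) + np.log(w)
    log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)           # (N, M)
    # M-step: re-estimate means, variances, and mixture weights
    Nk = r.sum(axis=0)
    mu_new = (r.T @ x) / Nk[:, None]
    var_new = (r.T @ x ** 2) / Nk[:, None] - mu_new ** 2
    return mu_new, var_new, Nk / len(x)
```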
  • Step 301B: according to the MFCC features extracted from the training corpus, obtain the likelihood probability matrix corresponding to the training corpus using the trained GMM-HMM model.
  • In this embodiment, the voice data in the training corpus is used for training: the MFCC features of the speech in the training corpus are extracted and used as input features of the trained GMM-HMM model, and the likelihood probability matrix corresponding to the speech in the training corpus is obtained as output.
  • The likelihood probability matrix represents the alignment relationship between speech frames and phoneme states.
  • The likelihood probability matrix output by the trained GMM-HMM serves as the initial alignment for subsequently training the deep learning model, which helps the deep learning model achieve better results.
  • Step 301C: train the LSTM model with connection units according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix; determine the weight matrices and bias matrices corresponding to the LSTM model with connection units; and generate the trained LSTM model with connection units from the weight matrices and bias matrices.
  • In this embodiment, the alignment result calculated by the GMM-HMM above (i.e., the likelihood probability matrix) is used together with the original speech features, here the Filter Bank features: relative to the MFCC feature, the Filter Bank feature retains data correlation and therefore provides a better representation of the speech.
  • By training the LSTM model with connection units, the weight matrix and bias matrix corresponding to each layer of the LSTM are determined.
  • The LSTM with connection units is also a deep neural network model; the layers of a neural network generally fall into three categories, an input layer, hidden layers, and an output layer, where there may be multiple hidden layers.
  • The purpose of training the LSTM model with connection units is to determine all the weight matrices and bias matrices in each layer and the corresponding number of layers. The training algorithm can adopt existing algorithms such as the forward propagation algorithm and the Viterbi algorithm; the specific training algorithm is not limited herein.
  • As shown in FIG. 8, in one embodiment, a speech recognition method is proposed, comprising the following steps:
  • Step 802: acquire the voice data to be identified.
  • Step 804: extract the Filter Bank features and MFCC features from the voice data.
  • Step 806: use the MFCC feature as the input data of the trained GMM-HMM model, and obtain the first likelihood probability matrix output by the trained GMM-HMM model.
  • Step 808: use the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, and obtain the second likelihood probability matrix output by the trained DNN-HMM model.
  • Step 810: use the Filter Bank feature as an input feature of the trained LSTM model with connection units, and obtain the posterior probability matrix output by the LSTM model with connection units; the connection unit is used to control the information flow between the layers in the LSTM model.
  • Step 812: use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and obtain the output third likelihood probability matrix.
  • Step 814: acquire, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the third likelihood probability matrix.
  • In this embodiment, the preliminary alignment result (the first likelihood probability matrix) is obtained through the trained GMM-HMM model, and further alignment is then performed by the trained DNN-HMM model to obtain a better alignment. Because a deep neural network model can obtain a better speech feature representation than the traditional Gaussian mixture model, using the deep neural network model for further forced alignment can further improve accuracy.
  • The result of this further alignment (the second likelihood probability matrix) is then fed into the innovative LSTM-HMM model with connection units, and the final alignment result (the third likelihood probability matrix) is obtained.
  • The alignment result here refers to the alignment relationship between speech frames and phoneme states. The Gaussian mixture models and deep learning models above are all part of the acoustic model, whose role is to acquire the alignment relationship between speech frames and phoneme states; this facilitates subsequently acquiring, together with the language model, the target word sequence corresponding to the voice data to be recognized in the phoneme decoding network.
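  • Putting the steps of this embodiment together, the control flow can be summarized as below. Every identifier is a hypothetical placeholder for the corresponding trained model or module; none of these names are APIs defined by this application.

```python
def recognize(audio, extract_features, gmm_hmm, dnn_hmm, lstm_cu, hmm, decode):
    """End-to-end flow of the FIG. 8 embodiment (placeholder callables)."""
    fbank, mfcc = extract_features(audio)     # step 804
    first = gmm_hmm(mfcc)                     # step 806: 1st likelihood matrix
    second = dnn_hmm(fbank, first)            # step 808: 2nd likelihood matrix
    posterior = lstm_cu(fbank)                # step 810: posterior matrix
    third = hmm(posterior, second)            # step 812: 3rd likelihood matrix
    return decode(third)                      # step 814: target word sequence
```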
  • a voice recognition apparatus comprising:
  • the obtaining module 902 is configured to acquire voice data to be identified.
  • the extraction module 904 is configured to extract the Filter Bank feature and the MFCC feature in the voice data.
  • the first output module 906 is configured to use the MFCC feature as input data of the trained GMM-HMM model, and obtain a first likelihood probability matrix of the output of the trained GMM-HMM model.
  • The posterior probability matrix output module 908 is configured to use the Filter Bank feature as an input feature of the trained LSTM model with connection units and to obtain the posterior probability matrix output by the LSTM model with connection units, the connection unit being used to control the flow of information between layers in the LSTM model.
  • the second output module 910 is configured to use the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model to obtain an output second likelihood probability matrix.
  • the decoding module 912 is configured to acquire, in the phoneme decoding network, a target word sequence corresponding to the voice data to be identified according to the second likelihood probability matrix.
  • In one embodiment, the extraction module is further configured to: convert the voice data to be recognized into an energy spectrum in the frequency domain by Fourier transform; use the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks to calculate the Filter Bank feature of the voice data to be recognized; and apply a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • In one embodiment, the connection unit is implemented by a sigmoid function, and the posterior probability matrix output module 908 is further configured to: use the Filter Bank feature as an input feature of the trained LSTM model with connection units; determine the sigmoid function value corresponding to the connection unit between layers according to the state and output of the neuron nodes of the previous layer in the LSTM model and the input of the neuron nodes of the next layer; and output the posterior probability matrix corresponding to the Filter Bank feature according to the sigmoid function values corresponding to the connection units between layers.
  • the posterior probability matrix output module 908 includes:
  • the sorting module 908A is configured to acquire Filter Bank features corresponding to each frame of voice data in the to-be-identified voice data and sort them by time.
  • The posterior probability output module 908B is configured to use the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames corresponding to each frame, as the input features of the trained LSTM model with connection units, to control the flow of information between layers through the connection units, and to obtain the output posterior probability over phoneme states corresponding to each frame of voice data.
  • the determining module 908C is configured to determine a posterior probability matrix corresponding to the to-be-identified voice data according to a posterior probability corresponding to each frame of voice data.
  • the voice recognition apparatus further includes:
  • The GMM-HMM model training module 914 is configured to train the GMM-HMM model using the training corpus, determine the variance and mean of the GMM-HMM model through continuous iterative training, and generate the trained GMM-HMM model from the variance and mean.
  • The likelihood probability matrix obtaining module 916 is configured to obtain the likelihood probability matrix corresponding to the training corpus according to the MFCC features extracted from the training corpus and the trained GMM-HMM model.
  • The LSTM model training module 918 is configured to train the LSTM model with connection units according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, determine the weight matrices and bias matrices corresponding to the LSTM model with connection units, and generate the trained LSTM model with connection units from the weight matrices and bias matrices.
  • a voice recognition apparatus comprising:
  • the obtaining module 1202 is configured to acquire voice data to be identified.
  • the extracting module 1204 is configured to extract the Filter Bank feature and the MFCC feature in the voice data.
  • the first output module 1206 is configured to use the MFCC feature as input data of the trained GMM-HMM model, and obtain a first likelihood probability matrix of the output of the trained GMM-HMM model.
  • The second output module 1208 is configured to use the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, and acquire the second likelihood probability matrix output by the trained DNN-HMM model.
  • The posterior probability matrix output module 1210 is configured to use the Filter Bank feature as an input feature of the trained LSTM model with connection units and acquire the posterior probability matrix output by the LSTM model with connection units, the connection unit being configured to control the flow of information between layers in the LSTM model.
  • the third output module 1212 is configured to use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model to obtain an output third likelihood probability matrix.
  • the decoding module 1214 is configured to acquire, in the phoneme decoding network, a target word sequence corresponding to the voice data to be identified according to the third likelihood probability matrix.
  • the network interface may be an Ethernet card or a wireless network card.
  • The above modules may be embedded in hardware in the processor, or stored in software in the memory of the server, so that the processor can call the operations corresponding to the above modules.
  • the processor can be a central processing unit (CPU), a microprocessor, a microcontroller, or the like.
  • The speech recognition apparatus described above can be implemented in the form of a computer program that can run on a computer device as shown in FIG. 1.
  • In one embodiment, a computer device is proposed. The internal structure of the computer device may correspond to the structure shown in FIG. 1; that is, the computer device may be a server or a terminal, and includes a memory, a processor, and a computer program stored on the memory and executable on the processor. When executing the computer program, the processor implements the following steps: acquiring voice data to be identified; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as input data of the trained GMM-HMM model and acquiring the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as an input feature of the trained LSTM model with connection units and acquiring the posterior probability matrix output by the LSTM model with connection units, the connection unit being configured to control the flow of information between layers in the LSTM model; using the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model and acquiring the output second likelihood probability matrix; and acquiring, in the phoneme decoding network, the target word sequence corresponding to the voice data to be identified according to the second likelihood probability matrix.
  • In one embodiment, the connection unit is implemented by a sigmoid function, and the processor's performing of using the Filter Bank feature as an input feature of the trained LSTM model with connection units and acquiring the posterior probability matrix output by the LSTM model with connection units, the connection unit being configured to control the flow of information between layers in the LSTM model, includes: using the Filter Bank feature as an input feature of the trained LSTM model with connection units; determining the sigmoid function value corresponding to the connection unit between layers according to the state and output of the neuron nodes of the previous layer in the LSTM model and the input of the neuron nodes of the next layer; and outputting the posterior probability matrix corresponding to the Filter Bank feature according to the sigmoid function values corresponding to the connection units between layers.
  • In one embodiment, the processor's performing of extracting the Filter Bank feature and the MFCC feature from the voice data includes: converting the voice data to be identified into an energy spectrum in the frequency domain by Fourier transform; using the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks and calculating the Filter Bank feature of the voice data to be recognized; and applying a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • In one embodiment, the processor's performing of using the Filter Bank feature as an input feature of the trained LSTM model with connection units and acquiring the posterior probability matrix output by the LSTM model with connection units, the connection unit being configured to control the flow of information between layers in the LSTM model, includes: acquiring the Filter Bank feature corresponding to each frame of voice data in the voice data to be identified and sorting the features by time; using the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames corresponding to each frame, as input features of the trained LSTM model with connection units, controlling the flow of information between layers through the connection units, and obtaining the output posterior probability over phoneme states corresponding to each frame of voice data; and determining the posterior probability matrix corresponding to the voice data to be identified according to the posterior probability corresponding to each frame of voice data.
  • In one embodiment, when executing the computer program the processor is further configured to implement the following steps: training the GMM-HMM model using a training corpus and determining the variance and mean of the GMM-HMM model through continuous iterative training; generating the trained GMM-HMM model according to the variance and mean; obtaining, according to the MFCC features extracted from the training corpus, the likelihood probability matrix corresponding to the training corpus using the trained GMM-HMM model; training the LSTM model with connection units according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices corresponding to the LSTM model with connection units; and generating the trained LSTM model with connection units according to the weight matrices and bias matrices.
  • In one embodiment, a computer readable storage medium is provided, having computer instructions stored thereon that, when executed by a processor, implement the following steps: acquiring voice data to be recognized; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as input data of the trained GMM-HMM model and acquiring the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as an input feature of the trained LSTM model with connection units and acquiring the posterior probability matrix output by the LSTM model with connection units, the connection unit being configured to control the flow of information between layers in the LSTM model; using the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model and acquiring the output second likelihood probability matrix; and acquiring, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
  • In one embodiment, the connection unit is implemented by a sigmoid function, and the performing of using the Filter Bank feature as an input feature of the trained LSTM model with connection units and acquiring the posterior probability matrix output by the LSTM model with connection units, the connection unit being configured to control the flow of information between layers in the LSTM model, includes: using the Filter Bank feature as an input feature of the trained LSTM model with connection units; determining the sigmoid function value corresponding to the connection unit between layers according to the state and output of the neuron nodes of the previous layer in the LSTM model and the input of the neuron nodes of the next layer; and outputting the posterior probability matrix corresponding to the Filter Bank feature according to the sigmoid function values corresponding to the connection units between layers.
  • In one embodiment, the performing of extracting the Filter Bank feature and the MFCC feature from the voice data includes: converting the voice data to be identified into an energy spectrum in the frequency domain by Fourier transform; using the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks and calculating the Filter Bank feature of the voice data to be recognized; and applying a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • In one embodiment, the performing of using the Filter Bank feature as an input feature of the trained LSTM model with connection units and acquiring the posterior probability matrix output by the LSTM model with connection units, the connection unit being configured to control the flow of information between layers in the LSTM model, includes: acquiring the Filter Bank feature corresponding to each frame of voice data in the voice data to be identified and sorting the features by time; using the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames corresponding to each frame, as input features of the trained LSTM model with connection units, controlling the flow of information between layers through the connection units, and obtaining the output posterior probability over phoneme states corresponding to each frame of voice data; and determining the posterior probability matrix corresponding to the voice data to be identified according to the posterior probability corresponding to each frame of voice data.
  • In one embodiment, the execution of the computer program is further used to implement the following steps: training the GMM-HMM model using a training corpus and determining the variance and mean of the GMM-HMM model through continuous iterative training; generating the trained GMM-HMM model according to the variance and mean; obtaining, according to the MFCC features extracted from the training corpus, the likelihood probability matrix corresponding to the training corpus using the trained GMM-HMM model; training the LSTM model with connection units according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices corresponding to the LSTM model with connection units; and generating the trained LSTM model with connection units according to the weight matrices and bias matrices.
  • The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Operations Research (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition method, comprising: acquiring voice data to be recognized (302); extracting Filter Bank features and MFCC features from the voice data (304); using the MFCC features as input data of a trained GMM-HMM model and acquiring a first likelihood probability matrix output by the trained GMM-HMM model (306); using the Filter Bank features as input features of a trained LSTM model with connection units and acquiring a posterior probability matrix output by the LSTM model with connection units, the connection units being used to control the flow of information between layers in the LSTM model (308); using the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model and acquiring an output second likelihood probability matrix (310); and acquiring, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix (312).

Description

Speech recognition method, apparatus, computer device and storage medium
This application claims priority to Chinese Patent Application No. 2017104450769, entitled "Speech recognition method, apparatus, computer device and storage medium" and filed with the Chinese Patent Office on June 12, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer processing, and in particular to a speech recognition method, apparatus, computer device and storage medium.
Background
Speech recognition, also known as Automatic Speech Recognition (ASR), aims to let machines turn speech signals into text through recognition and understanding, and is an important branch of the development of modern artificial intelligence. Speech recognition technology is a prerequisite for natural language processing; it can effectively promote the development of voice-interaction-related fields and greatly facilitate daily life, for example in smart homes and voice input. The accuracy of speech recognition directly determines the effectiveness of the technology's applications.
Traditional speech recognition technology builds the acoustic model on GMM-HMM (a Gaussian mixture model combined with a hidden Markov model). In recent years, with the development of deep learning, building the acoustic model on DNN-HMM (a deep neural network combined with a hidden Markov model) has greatly improved recognition accuracy over GMM-HMM, but the accuracy of speech recognition still needs to be improved further.
Summary
According to various embodiments of the present application, a speech recognition method, apparatus, computer device and storage medium are provided.
A speech recognition method, comprising:
acquiring voice data to be recognized;
extracting Filter Bank features and MFCC features from the voice data;
using the MFCC features as input data of a trained GMM-HMM model, and acquiring a first likelihood probability matrix output by the trained GMM-HMM model;
using the Filter Bank features as input features of a trained LSTM model with connection units, and acquiring a posterior probability matrix output by the LSTM model with connection units, the connection units being used to control the flow of information between layers in the LSTM model;
using the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model, and acquiring an output second likelihood probability matrix; and
acquiring, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
A speech recognition apparatus, comprising:
an acquiring module, configured to acquire voice data to be recognized;
an extraction module, configured to extract Filter Bank features and MFCC features from the voice data;
a first output module, configured to use the MFCC features as input data of a trained GMM-HMM model and acquire a first likelihood probability matrix output by the trained GMM-HMM model;
a posterior probability matrix output module, configured to use the Filter Bank features as input features of a trained LSTM model with connection units and acquire a posterior probability matrix output by the LSTM model with connection units, the connection units being used to control the flow of information between layers in the LSTM model;
a second output module, configured to use the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model and acquire an output second likelihood probability matrix; and
a decoding module, configured to acquire, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
A computer device, comprising a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
acquiring voice data to be recognized;
extracting Filter Bank features and MFCC features from the voice data;
using the MFCC features as input data of a trained GMM-HMM model, and acquiring a first likelihood probability matrix output by the trained GMM-HMM model;
using the Filter Bank features as input features of a trained LSTM model with connection units, and acquiring a posterior probability matrix output by the LSTM model with connection units, the connection units being used to control the flow of information between layers in the LSTM model;
using the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model, and acquiring an output second likelihood probability matrix; and
acquiring, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
One or more non-volatile readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring voice data to be recognized;
extracting Filter Bank features and MFCC features from the voice data;
using the MFCC features as input data of a trained GMM-HMM model, and acquiring a first likelihood probability matrix output by the trained GMM-HMM model;
using the Filter Bank features as input features of a trained LSTM model with connection units, and acquiring a posterior probability matrix output by the LSTM model with connection units, the connection units being used to control the flow of information between layers in the LSTM model;
using the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model, and acquiring an output second likelihood probability matrix; and
acquiring, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objects and advantages of the present application will become apparent from the specification, the drawings and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings needed for describing the embodiments. Apparently, the drawings in the following description are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a block diagram of the internal structure of a computer device in an embodiment;
FIG. 2 is an architecture diagram of speech recognition in an embodiment;
FIG. 3 is a flowchart of a speech recognition method in an embodiment;
FIG. 4 is a flowchart of a method for obtaining a posterior probability matrix through the LSTM model with connection units in an embodiment;
FIG. 5 is a flowchart of a method for extracting Filter Bank features and MFCC features from voice data in an embodiment;
FIG. 6 is a flowchart of a method for obtaining a posterior probability matrix through the LSTM model with connection units in another embodiment;
FIG. 7 is a flowchart of a method for establishing the GMM-HMM model and the LSTM model with connection units in an embodiment;
FIG. 8 is a flowchart of a speech recognition method in another embodiment;
FIG. 9 is a structural block diagram of a speech recognition apparatus in an embodiment;
FIG. 10 is a structural block diagram of a posterior probability matrix output module in an embodiment;
FIG. 11 is a structural block diagram of a speech recognition apparatus in another embodiment;
FIG. 12 is a structural block diagram of a speech recognition apparatus in yet another embodiment.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely used to explain the present application and are not intended to limit it.
As shown in FIG. 1, which is a schematic diagram of the internal structure of a computer device in one embodiment, the computer device may be a terminal or a server. Referring to FIG. 1, the computer device includes a processor, a non-volatile storage medium, an internal memory, a network interface, a display screen and an input device connected through a system bus. The non-volatile storage medium of the computer device can store an operating system and computer readable instructions which, when executed, cause the processor to perform a speech recognition method. The processor of the computer device is used to provide computing and control capabilities and supports the operation of the entire computer device. The internal memory can store computer readable instructions which, when executed by the processor, cause the processor to perform a speech recognition method. The network interface of the computer device is used for network communication. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device of the computer device may be a touch layer covering the display screen, a button, trackball or touchpad provided on the casing of the computer device, or an external keyboard, touchpad or mouse. The touch layer and the display screen form a touch screen. A person skilled in the art can understand that the structure shown in FIG. 1 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
First, the framework of speech recognition is introduced. As shown in FIG. 2, speech recognition mainly consists of two parts, an acoustic model and a language model, which together with a dictionary form the framework of speech recognition. The process of speech recognition is the process of converting an input sequence of speech features into a sequence of characters according to the dictionary, the acoustic model and the language model. The role of the acoustic model is to obtain the mapping between speech features and phonemes; the role of the language model is to obtain the mappings between words and between words and sentences; the role of the dictionary is to obtain the mapping between words and phonemes. The specific process of speech recognition can be divided into three steps. The first step is to recognize speech frames as phoneme states, that is, to align the speech frames with the phoneme states. The second step is to combine phoneme states into phonemes. The third step is to combine the phonemes into words. The first step is the role of the acoustic model and is both the key point and the difficulty: the more accurate the alignment between speech frames and phoneme states, the better the speech recognition result. A phoneme state is a finer-grained speech unit than a phoneme; usually one phoneme consists of three phoneme states.
As shown in FIG. 3, in one embodiment, a speech recognition method is proposed, which can be applied to a terminal or a server and specifically includes the following steps:
Step 302: acquire the voice data to be recognized.
In this embodiment, the voice data to be recognized is usually audio data input by the user and obtained through an interactive application, including audio of digits and audio of words.
Step 304: extract the Filter Bank features and MFCC features from the voice data.
In this embodiment, the Filter Bank feature and the MFCC (Mel frequency cepstrum coefficient) feature are both parameters used in speech recognition to represent speech characteristics; the Filter Bank feature is used for the deep learning model and the MFCC feature is used for the Gaussian mixture model. Specifically, before the features are extracted, the input voice data is preprocessed. First, pre-emphasis is applied: a high-pass filter boosts the high-frequency part of the speech signal and flattens its spectrum. The pre-emphasized voice data is then divided into frames and windowed, converting the non-stationary speech signal into short-time stationary signals, and endpoint detection is used to distinguish speech from noise and extract the valid speech portion. To extract the Filter Bank features and MFCC features from the voice data, the preprocessed voice data first undergoes a fast Fourier transform, converting the time-domain speech signal into a frequency-domain energy spectrum for analysis. The energy spectrum is then passed through a set of Mel-scale triangular filter banks, which highlight the formant characteristics of the speech, and the logarithmic energy output by each filter bank is calculated; the features output by the filter banks are the Filter Bank features. Further, the calculated logarithmic energy undergoes a discrete cosine transform to obtain the MFCC coefficients, i.e., the MFCC features.
Step 306: use the MFCC features as input data of the trained GMM-HMM model, and acquire the first likelihood probability matrix output by the trained GMM-HMM model.
In this embodiment, the acoustic model and the language model jointly realize the recognition of speech, where the role of the acoustic model is to recognize the alignment relationship between speech frames and phoneme states. The GMM-HMM model is part of the acoustic model and is used to preliminarily align speech frames with phoneme states. Specifically, the extracted MFCC features of the voice data to be recognized are used as the input data of the trained GMM-HMM model, and the likelihood probability matrix output by the model is then obtained; for convenience and to distinguish it from what follows, it is referred to here as the "first likelihood probability matrix". The likelihood probability matrix represents the alignment relationship between speech frames and phoneme states, that is, this alignment can be obtained from the calculated likelihood probability matrix; however, the alignment obtained by GMM-HMM training is not very accurate, so the first likelihood probability matrix here amounts to a preliminary alignment of speech frames and phoneme states. The specific calculation formula of the GMM model is as follows:

$$b(x) = \frac{1}{(2\pi)^{K/2}\,|D|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{\top} D^{-1}(x-\mu)\right)$$

where x denotes the extracted speech feature (MFCC) vector, μ and D are the mean and the variance matrix respectively, and K denotes the order of the MFCC coefficients.
Step 308: use the Filter Bank features as input features of the trained LSTM model with connection units, and acquire the posterior probability matrix output by the LSTM model with connection units, the connection units being used to control the flow of information between layers in the LSTM model.
In this embodiment, the LSTM model is a deep learning model and is also part of the acoustic model. The LSTM with connection units is an innovative model proposed on the basis of the traditional LSTM model: a connection unit is added between the layers of the traditional LSTM model, and through this connection unit the flow of information between layers can be controlled, so effective information can be filtered. The connection unit also allows the LSTM model to be trained with deeper layers, and the more layers there are, the better the obtained feature expression and the better the recognition result. The LSTM model with connection units can therefore improve not only the speed of speech recognition but also its accuracy. Specifically, the connection unit is implemented by a sigmoid function; the principle is that the output of the previous layer passes through a gate composed of a sigmoid function, which controls the information flowing into the next layer, that is, the gated output serves as the input of the next LSTM layer. The value of this sigmoid function is jointly determined by the state of the neuron nodes of the previous layer, the output of the neuron nodes of the previous layer, and the input of the neuron nodes of the next layer. Neuron nodes are responsible for the computational expression of the neural network model; each node contains certain computational relationships, which can be understood as a kind of formula and may be the same or different. The number of neuron nodes in each LSTM layer is determined by the number of input frames and the feature vectors; for example, if the input splices the 5 frames before and after the current frame, there are 11 input frame vectors in total, and the feature vector of each frame is determined by the extracted speech features; for example, if the extracted Filter Bank feature is an 83-dimensional feature vector, each layer of the correspondingly trained LSTM model has 11 x 83 = 913 neuron nodes.
Step 310: use the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model, and acquire the output second likelihood probability matrix.
In this embodiment, the HMM (hidden Markov model) is a statistical model that describes a Markov process with hidden, unknown parameters; its role is to determine the hidden parameters of the process from the observable parameters. The HMM model mainly involves five parameters: two state sets and three probability sets. The two state sets are the hidden states and the observation states; the three probability sets are the initial matrix, the transition matrix and the confusion matrix. The transition matrix is obtained by training, that is, once training of the HMM model is completed, the transition matrix is determined. In this embodiment, the observable speech features (the Filter Bank features) are mainly used as the observation states to compute the correspondence between phoneme states and speech frames (i.e., the hidden states). To determine this correspondence, two more parameters must be determined: the initial matrix and the confusion matrix. The posterior probability matrix calculated by the LSTM model with connection units is the confusion matrix to be determined in the HMM model, and the first likelihood probability matrix is the initial matrix to be determined. Therefore, by using the posterior probability matrix and the first likelihood probability matrix as the input data of the trained HMM model, the output second likelihood probability matrix can be obtained. The second likelihood probability matrix represents the final alignment relationship between phoneme states and speech frames. Subsequently, according to this determined second likelihood probability matrix, the target word sequence corresponding to the voice data to be recognized can be acquired in the phoneme decoding network.
Step 312: obtain, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
In this embodiment, the speech recognition process comprises two parts, an acoustic model and a language model. Before recognition, a phoneme-level decoding network is first built from the trained acoustic model, the language model and the dictionary, and a search algorithm finds the best path through this network; the search algorithm may be the Viterbi algorithm. This path is the one that outputs, with maximum probability, the word string corresponding to the voice data to be recognized, which determines the text contained in the voice data. The phoneme-level decoding network (i.e., the phoneme decoding network) is constructed with finite state transducer (FST) algorithms, such as the determinization and minimization algorithms: sentences are split into words, words are split into phonemes (e.g., Chinese initials and finals, English phonetic symbols), and the phonemes are aligned against the pronunciation dictionary, grammar and so on by the above methods to produce the output phoneme decoding network. The network contains all possible recognition paths; decoding prunes this large network according to the input voice data to obtain one or more candidate paths, which are stored in a word-lattice data structure, and the final recognition scores the candidate paths, the highest-scoring path being the recognition result.
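A toy Viterbi search over a three-state graph illustrates the path search; the transition and emission values are invented, and a real phoneme decoding network built with FST tools is far larger:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """log_emit: (T, S) frame-wise state scores; returns the best state path."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans          # (S, S): prev -> cur
        back[t] = np.argmax(cand, axis=0)          # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                  # backtrace the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
log_init = np.log(np.full(3, 1 / 3))
log_trans = np.log(np.array([[0.8, 0.2, 0.0],
                             [0.0, 0.8, 0.2],
                             [0.2, 0.0, 0.8]]) + 1e-10)
log_emit = rng.normal(size=(10, 3))               # stand-in for log posteriors
print(viterbi(log_init, log_trans, log_emit))
```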
In this embodiment, speech recognition combines the Gaussian mixture model GMM with the long short-term memory recurrent neural network LSTM from deep learning. The GMM-HMM model first computes a first likelihood probability matrix from the extracted MFCC features; this matrix expresses the alignment of the voice data on phoneme states. The LSTM then performs a further alignment on top of this preliminary result, using the innovative LSTM model with connection units: by adding connection units between the layers of the conventional LSTM, the inter-layer flow of information can be controlled and valid information filtered, which improves both the speed and the accuracy of recognition.
As shown in FIG. 4, in one embodiment, the connection unit is implemented with a sigmoid function, and the step of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model, includes:
Step 308a: use the Filter Bank features as input features of the trained LSTM model with connection units.
Step 308b: determine the sigmoid function value corresponding to the connection unit between layers according to the state and output of the neuron nodes in the previous layer of the LSTM model and the input of the neuron nodes in the next layer.
Step 308c: output the posterior probability matrix corresponding to the Filter Bank features according to the sigmoid function values corresponding to the connection units between layers.
In this embodiment, the connection unit is implemented with a sigmoid function; in the LSTM model the sigmoid function controls the flow of information between layers, for example whether information flows and how much. The value of the sigmoid function is determined by the state of the neuron nodes in the previous layer, the output of the neuron nodes in the previous layer and the input of the neuron nodes in the next layer. Specifically, the sigmoid function is expressed as $\sigma(x) = \frac{1}{1+e^{-x}}$, and, in a form consistent with the variable definitions below, the output of the connection unit can be written as:
$$d_t^{\,l+1} = \sigma\!\left(W_x\,x_t^{\,l+1} + W_c \odot c_t^{\,l} + W_l\,d_t^{\,l} + b\right)$$
where X denotes the input of the connection unit at that layer, t denotes time t, d denotes the output of the connection unit, l denotes the layer preceding the connection unit, l+1 denotes the layer following it, b denotes the bias term, and W denotes a weight matrix: Wx is the weight matrix related to the input, Wc the weight matrix related to the output, and Wl the weight matrix related to the layer. c denotes the output of the LSTM output gate; an LSTM has three gates, the input gate, the forget gate and the output gate, and the output gate controls how much of the neuron node's output flows onward. ⊙ is an operator denoting element-wise multiplication of two matrices. The values of the bias term b and the weight matrices W are fixed once the model is trained, so the amount of information flowing between layers can be determined from the input; once the inter-layer flow is determined, the output posterior probability matrix corresponding to the Filter Bank features can be obtained.
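A small numeric sketch of this gate computation, with invented dimensions and randomly initialized stand-ins for the trained parameters b and W:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H = 4                                      # hidden size (illustrative)
rng = np.random.default_rng(1)
Wx, Wl = rng.normal(size=(H, H)), rng.normal(size=(H, H))
Wc, b = rng.normal(size=H), rng.normal(size=H)

x_next = rng.normal(size=H)                # next-layer input, x_t^{l+1}
c_prev = rng.normal(size=H)                # output-gate output of layer l, c_t^l
d_prev = rng.normal(size=H)                # previous connection-unit output, d_t^l

d_next = sigmoid(Wx @ x_next + Wc * c_prev + Wl @ d_prev + b)
print(d_next)                              # values in (0, 1): flow proportions
```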
As shown in FIG. 5, in one embodiment, step 304 of extracting Filter Bank features and MFCC features from the voice data includes:
Step 304A: Fourier-transform the voice data to be recognized into a frequency-domain energy spectrum.
In this embodiment, the characteristics of a speech signal are usually hard to discern from its time-domain form, so the signal is generally converted into an energy distribution over the frequency domain for observation; different energy distributions represent the characteristics of different speech. The voice data to be recognized therefore undergoes a fast Fourier transform to obtain the energy distribution over the spectrum: each frame of the speech signal is fast-Fourier-transformed to obtain the spectrum of that frame, and taking the squared magnitude of the spectrum yields the power spectrum (i.e., the energy spectrum) of the speech signal.
Step 304B: use the frequency-domain energy spectrum as the input features of a Mel-scale triangular filter bank, and compute the Filter Bank features of the voice data to be recognized.
In this embodiment, to obtain the Filter Bank features of the voice data to be recognized, the frequency-domain energy spectrum is used as the input features of a Mel-scale triangular filter bank, and the log energy output by each triangular filter bank is computed, yielding the Filter Bank features. The Filter Bank feature of each frame is likewise obtained by feeding the energy spectrum corresponding to that frame into the Mel-scale triangular filter bank.
Step 304C: apply a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
In this embodiment, to obtain the MFCC features of the voice data to be recognized, the log energies output by the filter banks further undergo a discrete cosine transform to produce the corresponding MFCC features; the MFCC feature of each frame is obtained by applying the discrete cosine transform to the Filter Bank feature of that frame. The difference between Filter Bank features and MFCC features is that Filter Bank features are correlated across feature dimensions, whereas MFCC features are obtained by using the discrete cosine transform to remove the data correlation of the Filter Bank features.
As shown in FIG. 6, in one embodiment, step 308 of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model, includes:
Step 308A: obtain the Filter Bank feature corresponding to each frame of the voice data to be recognized and sort the features by time.
In this embodiment, Filter Bank features are extracted from the voice data to be recognized by first dividing the voice data into frames, then extracting the Filter Bank feature corresponding to each frame, and sorting the features chronologically, i.e., sorting the Filter Bank features of the frames in the order in which the frames occur in the voice data to be recognized.
Step 308B: use the Filter Bank features of each frame of voice data together with the Filter Bank features of a preset number of frames before and after it as input features of the trained LSTM model with connection units, control the information flow between layers through the connection units, and obtain output posterior probabilities over phoneme states corresponding to each frame of voice data.
In this embodiment, the deep learning model takes multi-frame features as input, which is an advantage over the traditional Gaussian mixture model with only single-frame input, because splicing preceding and following speech frames helps capture the influence of contextual information on the current frame. Generally, the Filter Bank features of each frame and of a preset number of frames before and after it are used as the input features of the trained LSTM model with connection units. For example, the current frame is spliced with the 5 frames before and after it, and the resulting 11 frames of data serve as the input features of the trained LSTM model with connection units; this 11-frame feature sequence passes through the nodes of the LSTM with connection units, which outputs the posterior probabilities over phoneme states corresponding to that frame of voice data.
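A short sketch of the ±5-frame splicing; edge frames are padded by repetition, which is an assumption rather than a detail specified here:

```python
import numpy as np

def splice(features, context=5):
    """features: (T, D) per-frame vectors -> (T, (2*context+1)*D) spliced."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(features)]
                      for i in range(2 * context + 1)])

fbank = np.random.randn(100, 83)        # 100 frames of 83-dim Filter Bank
spliced = splice(fbank)
print(spliced.shape)                    # (100, 913) = 11 x 83 per frame
```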
Step 308C: determine the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
In this embodiment, once the posterior probability corresponding to each frame of voice data is obtained, the posterior probability matrix corresponding to the voice data to be recognized is determined; the posterior probability matrix is composed of the individual posterior probabilities. Because the LSTM model with connection units carries information along both the time dimension and the layer dimension, it obtains the posterior probability matrix of the voice data to be recognized better than earlier, traditional models that carry only time-dimension information.
As shown in FIG. 7, in one embodiment, before the step of obtaining the voice data to be recognized, the method further includes step 301, building the GMM-HMM model and the LSTM model with connection units, which specifically includes:
Step 301A: train the GMM-HMM model with a training corpus, determine the variances and means corresponding to the GMM-HMM model through continuous iterative training, and generate the trained GMM-HMM model from the variances and means.
In this embodiment, the GMM-HMM acoustic model is built with monophone training followed by triphone training. Triphone training takes into account the influence of the phonemes adjacent to the current phoneme and thus yields a more accurate alignment and better recognition results. Depending on the features and purposes used, triphone training generally adopts triphone training based on delta+delta-delta features and triphone training with linear discriminant analysis plus maximum likelihood linear feature transformation. Specifically, the speech features in the input training corpus are first normalized, with variance normalization by default; speech feature normalization removes the bias introduced into feature extraction by convolutional noise such as the telephone channel. An initial GMM-HMM model is then quickly obtained from a small amount of feature data, and continuous iterative training determines the variances and means corresponding to the GMM-HMM; once the variances and means are determined, the corresponding GMM-HMM model is determined.
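A minimal sketch of the normalization step (cepstral mean and variance normalization), assuming per-utterance statistics:

```python
import numpy as np

def cmvn(features, norm_var=True):
    """features: (T, D) MFCC matrix -> zero-mean (and unit-variance) features."""
    out = features - features.mean(axis=0)
    if norm_var:
        out /= features.std(axis=0) + 1e-10
    return out

mfcc = np.random.randn(200, 13) * 3.0 + 1.5     # dummy utterance features
normed = cmvn(mfcc)
print(normed.mean(axis=0).round(6), normed.std(axis=0).round(6))
```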
Step 301B: obtain, with the trained GMM-HMM model, the likelihood probability matrix corresponding to the training corpus according to the MFCC features extracted from the training corpus.
In this embodiment, the voice data in the training corpus is used for training: the MFCC features of the corpus speech are extracted and fed as input features to the trained GMM-HMM model above, and the output likelihood probability matrix corresponding to the corpus speech is obtained. The likelihood probability matrix represents the alignment between speech frames and phoneme states; outputting it from the trained GMM-HMM provides the initial alignment for subsequently training the deep learning model, so that the deep learning model can achieve a better deep-learning result.
Step 301C: train the LSTM model with connection units according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, determine the weight matrices and bias matrices corresponding to the LSTM model with connection units, and generate the trained LSTM model with connection units from the weight matrices and bias matrices.
In this embodiment, the alignment result computed by the GMM-HMM above (i.e., the likelihood probability matrix) and the raw speech features together serve as the input features for training the LSTM model with connection units. The raw speech features here are Filter Bank features, which, compared with MFCC features, retain data correlation and therefore express the speech characteristics better. Training the LSTM model with connection units determines the weight matrix and bias matrix corresponding to each LSTM layer. Specifically, the LSTM with connection units is also a kind of deep neural network model; neural network layers generally fall into three classes, the input layer, the hidden layers and the output layer, there being multiple hidden layers. The goal of training the LSTM model with connection units is to determine all the weight matrices and bias matrices in each layer as well as the corresponding number of layers; the training may use existing algorithms such as the forward propagation algorithm or the Viterbi algorithm, and the specific training algorithm is not limited here.
As shown in FIG. 8, in one embodiment, a speech recognition method is provided, including the following steps:
Step 802: obtain voice data to be recognized.
Step 804: extract Filter Bank features and MFCC features from the voice data.
Step 806: use the MFCC features as input data of the trained GMM-HMM model, and obtain the first likelihood probability matrix output by the trained GMM-HMM model.
Step 808: use the Filter Bank features and the first likelihood probability matrix as input data of the trained DNN-HMM model, and obtain the second likelihood probability matrix output by the trained DNN-HMM model.
Step 810: use the Filter Bank features as input features of the trained LSTM model with connection units, and obtain the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model.
Step 812: use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and obtain the output third likelihood probability matrix.
Step 814: obtain, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the third likelihood probability matrix.
In this embodiment, to obtain a more accurate recognition result, after the preliminary alignment result (the first likelihood probability matrix) is obtained through the trained GMM-HMM model, a further alignment is performed through the trained DNN-HMM, which yields a better alignment. Because a deep neural network model expresses speech features better than the traditional Gaussian mixture model, using the deep neural network model for a further forced alignment further improves accuracy. The further-aligned result (the second likelihood probability matrix) is then fed into the innovative LSTM-HMM model with connection units to obtain the final alignment result (the third likelihood probability matrix). It should be noted that the alignment result here means the alignment between speech frames and phoneme states. All of the above, whether the Gaussian mixture model or the deep learning models, are parts of the acoustic model, whose role is to obtain the alignment between speech frames and phoneme states, so that the target word sequence corresponding to the voice data to be recognized can subsequently be obtained in the phoneme decoding network in combination with the language model.
As shown in FIG. 9, in one embodiment, a speech recognition apparatus is provided, including:
an obtaining module 902, configured to obtain voice data to be recognized;
an extracting module 904, configured to extract Filter Bank features and MFCC features from the voice data;
a first output module 906, configured to use the MFCC features as input data of the trained GMM-HMM model and obtain the first likelihood probability matrix output by the trained GMM-HMM model;
a posterior probability matrix output module 908, configured to use the Filter Bank features as input features of the trained LSTM model with connection units and obtain the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model;
a second output module 910, configured to use the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model and obtain the output second likelihood probability matrix; and
a decoding module 912, configured to obtain, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
In one embodiment, the extracting module is further configured to Fourier-transform the voice data to be recognized into a frequency-domain energy spectrum, use the frequency-domain energy spectrum as the input features of a Mel-scale triangular filter bank, compute the Filter Bank features of the voice data to be recognized, and apply a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
In one embodiment, the connection unit is implemented with a sigmoid function; the posterior probability matrix output module 908 is further configured to use the Filter Bank features as input features of the trained LSTM model with connection units; determine the sigmoid function value corresponding to the connection unit between layers according to the state and output of the neuron nodes in the previous layer of the LSTM model and the input of the neuron nodes in the next layer; and output the posterior probability matrix corresponding to the Filter Bank features according to the sigmoid function values corresponding to the connection units between layers.
As shown in FIG. 10, in one embodiment, the posterior probability matrix output module 908 includes:
a sorting module 908A, configured to obtain the Filter Bank feature corresponding to each frame of the voice data to be recognized and sort the features by time;
a posterior probability output module 908B, configured to use the Filter Bank features of each frame of voice data together with the Filter Bank features of a preset number of frames before and after it as input features of the trained LSTM model with connection units, control the information flow between layers through the connection units, and obtain output posterior probabilities over phoneme states corresponding to each frame of voice data; and
a determining module 908C, configured to determine the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
As shown in FIG. 11, in one embodiment, the above speech recognition apparatus further includes:
a GMM-HMM model training module 914, configured to train the GMM-HMM model with a training corpus, determine the variances and means corresponding to the GMM-HMM model through continuous iterative training, and generate the trained GMM-HMM model from the variances and means;
a likelihood probability matrix obtaining module 916, configured to obtain, with the trained GMM-HMM model, the likelihood probability matrix corresponding to the training corpus according to the MFCC features extracted from the training corpus; and
an LSTM model training module 918, configured to train the LSTM model with connection units according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, determine the weight matrices and bias matrices corresponding to the LSTM model with connection units, and generate the trained LSTM model with connection units from the weight matrices and bias matrices.
As shown in FIG. 12, in one embodiment, a speech recognition apparatus is provided, including:
an obtaining module 1202, configured to obtain voice data to be recognized;
an extracting module 1204, configured to extract Filter Bank features and MFCC features from the voice data;
a first output module 1206, configured to use the MFCC features as input data of the trained GMM-HMM model and obtain the first likelihood probability matrix output by the trained GMM-HMM model;
a second output module 1208, configured to use the Filter Bank features and the first likelihood probability matrix as input data of the trained DNN-HMM model and obtain the second likelihood probability matrix output by the trained DNN-HMM;
a posterior probability matrix output module 1210, configured to use the Filter Bank features as input features of the trained LSTM model with connection units and obtain the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model;
a third output module 1212, configured to use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model and obtain the output third likelihood probability matrix; and
a decoding module 1214, configured to obtain, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the third likelihood probability matrix.
Each module in the above speech recognition apparatus may be implemented wholly or partly by software, hardware or a combination thereof. The network interface may be an Ethernet card or a wireless network card, among others. The above modules may be embedded in, or independent of, a processor in the server in hardware form, or stored in a memory in the server in software form, so that the processor can invoke and execute the operations corresponding to the modules. The processor may be a central processing unit (CPU), a microprocessor, a microcontroller, etc.
The above speech recognition apparatus may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 1.
In one embodiment, a computer device is provided whose internal structure may correspond to the structure shown in FIG. 1; that is, the computer device may be a server or a terminal and includes a memory, a processor and a computer program stored on the memory and runnable on the processor, the processor implementing the following steps when executing the computer program: obtaining voice data to be recognized; extracting Filter Bank features and MFCC features from the voice data; using the MFCC features as input data of the trained GMM-HMM model and obtaining the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model; using the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model and obtaining the output second likelihood probability matrix; and obtaining, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
In one embodiment, the connection unit is implemented with a sigmoid function; the step, executed by the processor, of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model, includes: using the Filter Bank features as input features of the trained LSTM model with connection units; determining the sigmoid function value corresponding to the connection unit between layers according to the state and output of the neuron nodes in the previous layer of the LSTM model and the input of the neuron nodes in the next layer; and outputting the posterior probability matrix corresponding to the Filter Bank features according to the sigmoid function values corresponding to the connection units between layers.
In one embodiment, the step, executed by the processor, of extracting Filter Bank features and MFCC features from the voice data includes: Fourier-transforming the voice data to be recognized into a frequency-domain energy spectrum; using the frequency-domain energy spectrum as the input features of a Mel-scale triangular filter bank and computing the Filter Bank features of the voice data to be recognized; and applying a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
In one embodiment, the step, executed by the processor, of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model, includes: obtaining the Filter Bank feature corresponding to each frame of the voice data to be recognized and sorting the features by time; using the Filter Bank features of each frame of voice data together with the Filter Bank features of a preset number of frames before and after it as input features of the trained LSTM model with connection units, controlling the information flow between layers through the connection units, and obtaining output posterior probabilities over phoneme states corresponding to each frame of voice data; and determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
In one embodiment, before the step of obtaining the voice data to be recognized, the processor, when executing the computer program, is further configured to implement the following steps: training the GMM-HMM model with a training corpus and determining the variances and means corresponding to the GMM-HMM model through continuous iterative training; generating the trained GMM-HMM model from the variances and means; obtaining, with the trained GMM-HMM model, the likelihood probability matrix corresponding to the training corpus according to the MFCC features extracted from the training corpus; training the LSTM model with connection units according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices corresponding to the LSTM model with connection units; and generating the trained LSTM model with connection units from the weight matrices and bias matrices.
In one embodiment, a computer-readable storage medium is provided, storing computer instructions which, when executed by a processor, implement the following steps: obtaining voice data to be recognized; extracting Filter Bank features and MFCC features from the voice data; using the MFCC features as input data of the trained GMM-HMM model and obtaining the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model; using the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model and obtaining the output second likelihood probability matrix; and obtaining, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
In one embodiment, the connection unit is implemented with a sigmoid function; the step, executed by the processor, of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model, includes: using the Filter Bank features as input features of the trained LSTM model with connection units; determining the sigmoid function value corresponding to the connection unit between layers according to the state and output of the neuron nodes in the previous layer of the LSTM model and the input of the neuron nodes in the next layer; and outputting the posterior probability matrix corresponding to the Filter Bank features according to the sigmoid function values corresponding to the connection units between layers.
In one embodiment, the step, executed by the processor, of extracting Filter Bank features and MFCC features from the voice data includes: Fourier-transforming the voice data to be recognized into a frequency-domain energy spectrum; using the frequency-domain energy spectrum as the input features of a Mel-scale triangular filter bank and computing the Filter Bank features of the voice data to be recognized; and applying a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
In one embodiment, the step, executed by the processor, of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units controlling the flow of information between layers of the LSTM model, includes: obtaining the Filter Bank feature corresponding to each frame of the voice data to be recognized and sorting the features by time; using the Filter Bank features of each frame of voice data together with the Filter Bank features of a preset number of frames before and after it as input features of the trained LSTM model with connection units, controlling the information flow between layers through the connection units, and obtaining output posterior probabilities over phoneme states corresponding to each frame of voice data; and determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
In one embodiment, before the step of obtaining the voice data to be recognized, the processor, when executing the computer-readable instructions, is further configured to implement the following steps: training the GMM-HMM model with a training corpus and determining the variances and means corresponding to the GMM-HMM model through continuous iterative training; generating the trained GMM-HMM model from the variances and means; obtaining, with the trained GMM-HMM model, the likelihood probability matrix corresponding to the training corpus according to the MFCC features extracted from the training corpus; training the LSTM model with connection units according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices corresponding to the LSTM model with connection units; and generating the trained LSTM model with connection units from the weight matrices and bias matrices.
A person of ordinary skill in the art may understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they shall be considered within the scope of this specification.
The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be pointed out that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A speech recognition method, comprising:
    obtaining voice data to be recognized;
    extracting Filter Bank features and MFCC features from the voice data;
    using the MFCC features as input data of a trained GMM-HMM model, and obtaining a first likelihood probability matrix output by the trained GMM-HMM model;
    using the Filter Bank features as input features of a trained LSTM model with connection units, and obtaining a posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model;
    using the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model, and obtaining an output second likelihood probability matrix; and
    obtaining, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
  2. The method according to claim 1, wherein the connection unit is implemented with a sigmoid function, and the using of the Filter Bank features as input features of the trained LSTM model with connection units and obtaining of the posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model, comprises:
    using the Filter Bank features as input features of the trained LSTM model with connection units;
    determining a sigmoid function value corresponding to the connection unit between layers according to the state and output of neuron nodes in a previous layer of the LSTM model and the input of neuron nodes in a next layer; and
    outputting the posterior probability matrix corresponding to the Filter Bank features according to the sigmoid function values corresponding to the connection units between layers.
  3. The method according to claim 1, wherein the step of extracting Filter Bank features and MFCC features from the voice data comprises:
    performing Fourier transform on the voice data to be recognized to convert it into a frequency-domain energy spectrum;
    using the frequency-domain energy spectrum as input features of a Mel-scale triangular filter bank, and computing the Filter Bank features of the voice data to be recognized; and
    applying discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
  4. The method according to claim 1, wherein the step of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model, comprises:
    obtaining the Filter Bank feature corresponding to each frame of the voice data to be recognized and sorting the features by time; using the Filter Bank features of each frame of voice data together with the Filter Bank features of a preset number of frames before and after the frame as input features of the trained LSTM model with connection units, controlling the information flow between layers through the connection units, and obtaining output posterior probabilities over phoneme states corresponding to each frame of voice data; and determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
  5. The method according to claim 1, further comprising, before the step of obtaining the voice data to be recognized:
    training a GMM-HMM model with a training corpus, and determining variances and means corresponding to the GMM-HMM model through continuous iterative training;
    generating the trained GMM-HMM model according to the variances and means;
    obtaining, with the trained GMM-HMM model, a likelihood probability matrix corresponding to the training corpus according to MFCC features extracted from the training corpus;
    training the LSTM model with connection units according to Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining weight matrices and bias matrices corresponding to the LSTM model with connection units; and
    generating the trained LSTM model with connection units according to the weight matrices and bias matrices.
  6. A speech recognition apparatus, comprising:
    an obtaining module, configured to obtain voice data to be recognized;
    an extracting module, configured to extract Filter Bank features and MFCC features from the voice data;
    a first output module, configured to use the MFCC features as input data of a trained GMM-HMM model and obtain a first likelihood probability matrix output by the trained GMM-HMM model;
    a posterior probability matrix output module, configured to use the Filter Bank features as input features of a trained LSTM model with connection units and obtain a posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model;
    a second output module, configured to use the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model and obtain an output second likelihood probability matrix; and
    a decoding module, configured to obtain, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
  7. The apparatus according to claim 6, wherein the connection unit is implemented with a sigmoid function, and the posterior probability matrix output module is further configured to use the Filter Bank features as input features of the trained LSTM model with connection units; determine a sigmoid function value corresponding to the connection unit between layers according to the state and output of neuron nodes in a previous layer of the LSTM model and the input of neuron nodes in a next layer; and output the posterior probability matrix corresponding to the Filter Bank features according to the sigmoid function values corresponding to the connection units between layers.
  8. The apparatus according to claim 6, wherein the extracting module comprises:
    a conversion module, configured to perform Fourier transform on the voice data to be recognized to convert it into a frequency-domain energy spectrum;
    a computing module, configured to use the frequency-domain energy spectrum as input features of a Mel-scale triangular filter bank and compute the Filter Bank features of the voice data to be recognized; and
    a transform module, configured to apply discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
  9. The apparatus according to claim 6, wherein the posterior probability matrix output module comprises:
    a sorting module, configured to obtain the Filter Bank feature corresponding to each frame of the voice data to be recognized and sort the features by time;
    a posterior probability output module, configured to use the Filter Bank features of each frame of voice data together with the Filter Bank features of a preset number of frames before and after the frame as input features of the trained LSTM model with connection units, control the information flow between layers through the connection units, and obtain output posterior probabilities over phoneme states corresponding to each frame of voice data; and
    a determining module, configured to determine the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
  10. The apparatus according to claim 6, further comprising:
    a GMM-HMM model training module, configured to train a GMM-HMM model with a training corpus, determine variances and means corresponding to the GMM-HMM model through continuous iterative training, and generate the trained GMM-HMM model according to the variances and means;
    a likelihood probability matrix obtaining module, configured to obtain, with the trained GMM-HMM model, a likelihood probability matrix corresponding to the training corpus according to MFCC features extracted from the training corpus; and
    an LSTM model training module, configured to train the LSTM model with connection units according to Filter Bank features extracted from the training corpus and the likelihood probability matrix, determine weight matrices and bias matrices corresponding to the LSTM model with connection units, and generate the trained LSTM model with connection units according to the weight matrices and bias matrices.
  11. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
    obtaining voice data to be recognized;
    extracting Filter Bank features and MFCC features from the voice data;
    using the MFCC features as input data of a trained GMM-HMM model, and obtaining a first likelihood probability matrix output by the trained GMM-HMM model;
    using the Filter Bank features as input features of a trained LSTM model with connection units, and obtaining a posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model;
    using the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model, and obtaining an output second likelihood probability matrix; and
    obtaining, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
  12. The computer device according to claim 11, wherein the connection unit is implemented with a sigmoid function; and
    the step, performed by the processor, of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model, comprises: using the Filter Bank features as input features of the trained LSTM model with connection units; determining a sigmoid function value corresponding to the connection unit between layers according to the state and output of neuron nodes in a previous layer of the LSTM model and the input of neuron nodes in a next layer; and outputting the posterior probability matrix corresponding to the Filter Bank features according to the sigmoid function values corresponding to the connection units between layers.
  13. The computer device according to claim 11, wherein the step, performed by the processor, of extracting Filter Bank features and MFCC features from the voice data comprises:
    performing Fourier transform on the voice data to be recognized to convert it into a frequency-domain energy spectrum;
    using the frequency-domain energy spectrum as input features of a Mel-scale triangular filter bank, and computing the Filter Bank features of the voice data to be recognized; and
    applying discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
  14. The computer device according to claim 11, wherein the step, performed by the processor, of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model, comprises:
    obtaining the Filter Bank feature corresponding to each frame of the voice data to be recognized and sorting the features by time;
    using the Filter Bank features of each frame of voice data together with the Filter Bank features of a preset number of frames before and after the frame as input features of the trained LSTM model with connection units, controlling the information flow between layers through the connection units, and obtaining output posterior probabilities over phoneme states corresponding to each frame of voice data; and
    determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
  15. The computer device according to claim 11, wherein before the step of obtaining the voice data to be recognized, the processor, when executing the computer-readable instructions, is further configured to implement the following steps:
    training a GMM-HMM model with a training corpus, and determining variances and means corresponding to the GMM-HMM model through continuous iterative training;
    generating the trained GMM-HMM model according to the variances and means; obtaining, with the trained GMM-HMM model, a likelihood probability matrix corresponding to the training corpus according to MFCC features extracted from the training corpus;
    training the LSTM model with connection units according to Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining weight matrices and bias matrices corresponding to the LSTM model with connection units; and
    generating the trained LSTM model with connection units according to the weight matrices and bias matrices.
  16. One or more non-volatile readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining voice data to be recognized;
    extracting Filter Bank features and MFCC features from the voice data;
    using the MFCC features as input data of a trained GMM-HMM model, and obtaining a first likelihood probability matrix output by the trained GMM-HMM model;
    using the Filter Bank features as input features of a trained LSTM model with connection units, and obtaining a posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model;
    using the posterior probability matrix and the first likelihood probability matrix as input data of a trained HMM model, and obtaining an output second likelihood probability matrix; and
    obtaining, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
  17. The storage medium according to claim 16, wherein the connection unit is implemented with a sigmoid function; and
    the step, performed by the processor, of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model, comprises:
    using the Filter Bank features as input features of the trained LSTM model with connection units; and
    determining a sigmoid function value corresponding to the connection unit between layers according to the state and output of neuron nodes in a previous layer of the LSTM model and the input of neuron nodes in a next layer, and outputting the posterior probability matrix corresponding to the Filter Bank features according to the sigmoid function values corresponding to the connection units between layers.
  18. The storage medium according to claim 16, wherein the step, performed by the processor, of extracting Filter Bank features and MFCC features from the voice data comprises:
    performing Fourier transform on the voice data to be recognized to convert it into a frequency-domain energy spectrum;
    using the frequency-domain energy spectrum as input features of a Mel-scale triangular filter bank, and computing the Filter Bank features of the voice data to be recognized; and
    applying discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
  19. The storage medium according to claim 16, wherein the step, performed by the processor, of using the Filter Bank features as input features of the trained LSTM model with connection units and obtaining the posterior probability matrix output by the LSTM model with connection units, the connection units being configured to control information flow between layers of the LSTM model, comprises:
    obtaining the Filter Bank feature corresponding to each frame of the voice data to be recognized and sorting the features by time;
    using the Filter Bank features of each frame of voice data together with the Filter Bank features of a preset number of frames before and after the frame as input features of the trained LSTM model with connection units, controlling the information flow between layers through the connection units, and obtaining output posterior probabilities over phoneme states corresponding to each frame of voice data; and
    determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
  20. The storage medium according to claim 16, wherein before the step of obtaining the voice data to be recognized, the processor, when executing the computer-readable instructions, is further configured to implement the following steps:
    training a GMM-HMM model with a training corpus, and determining variances and means corresponding to the GMM-HMM model through continuous iterative training;
    generating the trained GMM-HMM model according to the variances and means;
    obtaining, with the trained GMM-HMM model, a likelihood probability matrix corresponding to the training corpus according to MFCC features extracted from the training corpus;
    training the LSTM model with connection units according to Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining weight matrices and bias matrices corresponding to the LSTM model with connection units; and
    generating the trained LSTM model with connection units according to the weight matrices and bias matrices.