WO2018227781A1 - Speech recognition method, apparatus, computer device and storage medium - Google Patents

Speech recognition method, apparatus, computer device and storage medium

Info

Publication number
WO2018227781A1
WO2018227781A1 (PCT/CN2017/100049)
Authority
WO
WIPO (PCT)
Prior art keywords
probability matrix
feature
trained
voice data
filter bank
Prior art date
Application number
PCT/CN2017/100049
Other languages
English (en)
French (fr)
Inventor
梁浩
王健宗
程宁
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2018227781A1 publication Critical patent/WO2018227781A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the field of computer processing, and in particular, to a voice recognition method, apparatus, computer device, and storage medium.
  • Speech recognition, also known as Automatic Speech Recognition (ASR), aims to enable machines to turn speech signals into text through recognition and understanding, and is an important branch of modern artificial intelligence.
  • speech recognition technology is a prerequisite for natural language processing; it can effectively promote the development of voice-controlled interaction and greatly facilitate people's lives, as in smart homes and voice input.
  • the accuracy of speech recognition directly determines the effectiveness of technical applications.
  • traditional speech recognition technology builds its acoustic model on GMM-HMM (Gaussian mixture model and hidden Markov model); in recent years, with the development of deep learning, acoustic models built on DNN-HMM (deep learning model and hidden Markov model) have greatly improved recognition accuracy over GMM-HMM, but the accuracy of speech recognition still needs further improvement.
  • according to various embodiments of the present application, a voice recognition method, apparatus, computer device, and storage medium are provided.
  • a speech recognition method comprising:
  • the Filter Bank feature is used as an input feature of the trained two-dimensional LSTM model, computation is performed in the time dimension and the hierarchical dimension respectively, and the output posterior probability matrix containing time-dimension and hierarchical-dimension information is obtained; according to the posterior probability matrix and the first likelihood probability matrix, a target likelihood probability matrix is computed with the trained HMM model;
  • a speech recognition device comprising:
  • An acquiring module configured to acquire voice data to be identified
  • An extraction module configured to extract a Filter Bank feature and an MFCC feature in the voice data
  • An output module configured to use the MFCC feature as input data of the trained GMM-HMM model, and obtain a first likelihood probability matrix of the output of the trained GMM-HMM model;
  • a first calculation module configured to use the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, perform computation in the time dimension and the hierarchical dimension respectively, and obtain an output posterior probability matrix containing time-dimension and hierarchical-dimension information;
  • a second calculating module configured to calculate a target likelihood probability matrix by using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix
  • a decoding module configured to acquire, in the phoneme decoding network, a target word sequence corresponding to the to-be-identified voice data according to the target likelihood probability matrix.
  • a computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of: acquiring voice data to be recognized;
  • the Filter Bank feature is used as an input feature of the trained two-dimensional LSTM model, computation is performed in the time dimension and the hierarchical dimension respectively, and the output posterior probability matrix containing time-dimension and hierarchical-dimension information is obtained.
  • one or more non-volatile readable storage media storing computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of: acquiring voice data to be recognized;
  • the Filter Bank feature is used as an input feature of the trained two-dimensional LSTM model, computation is performed in the time dimension and the hierarchical dimension respectively, and the output posterior probability matrix containing time-dimension and hierarchical-dimension information is obtained.
  • FIG. 1 is a block diagram showing the internal structure of a computer device in an embodiment;
  • FIG. 2 is an architecture diagram of speech recognition in an embodiment;
  • FIG. 3 is a flow chart of a voice recognition method in an embodiment;
  • FIG. 4 is a schematic structural diagram of a two-dimensional LSTM in one embodiment.
  • FIG. 5 is a flowchart of a method for calculating a target likelihood probability matrix by using a trained HMM model according to a posterior probability matrix and a first likelihood probability matrix in one embodiment
  • FIG. 6 is a flow chart of a method for extracting Filter Bank features and MFCC features in voice data in an embodiment
  • FIG. 7 is a flow chart of a method for obtaining a posterior probability matrix by a two-dimensional LSTM model in one embodiment
  • FIG. 8 is a flow chart of a method for establishing a GMM-HMM model and a two-dimensional LSTM model in one embodiment
  • Figure 9 is a block diagram showing the structure of a voice recognition apparatus in an embodiment
  • FIG. 10 is a structural block diagram of a first computing module in an embodiment
  • Figure 11 is a block diagram showing the structure of a voice recognition apparatus in another embodiment.
  • FIG. 1 is a schematic diagram of the internal structure of a computer device in one embodiment.
  • the computer device can be a terminal or a server.
  • the computer device includes a processor, a non-volatile storage medium, an internal memory, a network interface, a display screen and an input device connected through a system bus.
  • the non-volatile storage medium of the computer device can store an operating system and computer readable instructions that, when executed, cause the processor to perform a speech recognition method.
  • the processor of the computer device is used to provide computing and control capabilities to support the operation of the entire computer device.
  • the internal memory can store computer readable instructions that, when executed by the processor, cause the processor to perform a speech recognition method.
  • the network interface of the computer device is used for network communication.
  • the display screen of the computer device may be a liquid crystal display or an electronic ink display screen
  • the input device of the computer device may be a touch layer covering the display screen, or a button, a trackball or a touchpad provided on the computer device casing, or an external keyboard, touchpad or mouse.
  • the touch layer and display form a touch screen.
  • speech recognition mainly consists of two parts: acoustic model and language model, and then combined with the dictionary to form the framework of speech recognition.
  • the process of speech recognition is the process of converting a sequence of input speech features into a sequence of characters based on a dictionary, an acoustic model, and a language model.
  • the role of the acoustic model is to obtain the mapping between phonetic features and phonemes.
  • the role of the language model is to obtain the mapping between words and words, words and sentences.
  • the role of the dictionary is to obtain the mapping between words and phonemes.
  • the process of specific speech recognition can be divided into three steps.
  • the first step is to identify the speech frame into a phoneme state, that is, to align the speech frame and the phoneme state.
  • the second step is to combine the phoneme states into phonemes.
  • the third step is to combine the phonemes into words.
  • the first step is the job of the acoustic model; it is both the key point and the difficulty: the more accurate the alignment of speech frames with phoneme states, the better the effect of speech recognition.
  • a phoneme state is a finer-grained speech unit than the phoneme; usually one phoneme is composed of three phoneme states.
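  • as a small illustration of this phoneme/state relationship (the identifiers below are hypothetical, not from the patent), a phoneme sequence can be expanded into its HMM state sequence like this:

```python
# Hypothetical phoneme and state identifiers, for illustration only:
# each phoneme is modeled by three phoneme states (begin / middle / end).
PHONEME_STATES = {
    "a": ["a_0", "a_1", "a_2"],
    "b": ["b_0", "b_1", "b_2"],
}

def states_for(phonemes):
    """Expand a phoneme sequence into its HMM state sequence."""
    return [s for p in phonemes for s in PHONEME_STATES[p]]

print(states_for(["b", "a"]))  # ['b_0', 'b_1', 'b_2', 'a_0', 'a_1', 'a_2']
```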
  • a voice recognition method is proposed, which can be applied to a terminal or a server, and specifically includes the following steps:
  • Step 302 Acquire voice data to be identified.
  • the voice data to be recognized here is usually audio data input by the user and captured by an interactive application, including audio of digits and audio of text.
  • step 304 the Filter Bank feature and the MFCC feature in the voice data are extracted.
  • the Filter Bank feature and the MFCC (Mel frequency cepstrum coefficient) feature are parameters used in speech recognition to represent speech features.
  • the Filter Bank feature is used for the deep learning model, while the MFCC feature is used for the Gaussian mixture model.
  • to pre-process the data, pre-emphasis is first performed on the input voice data: a high-pass filter boosts the high-frequency part of the speech signal to flatten the spectrum; the pre-emphasized voice data is then framed and windowed, converting the non-stationary speech signal into short-time stationary segments; finally, endpoint detection distinguishes speech from noise and extracts the valid speech portion.
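  • the pre-processing just described can be sketched in Python as follows; this is a minimal illustration assuming common parameter choices (16 kHz audio, a 0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop, Hamming windows), and it omits endpoint detection:

```python
import numpy as np

def preprocess(signal, sr=16000, alpha=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis, then framing and Hamming windowing (endpoint detection omitted)."""
    # High-pass pre-emphasis filter boosts the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    # Split the signal into overlapping short-time frames.
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # window each frame
```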
  • to extract the Filter Bank and MFCC features, the pre-processed speech data is first subjected to a fast Fourier transform, converting the time-domain speech signal into a frequency-domain energy spectrum for analysis.
  • the energy spectrum is then passed through a set of Mel-scale triangular filter banks, which highlight the formant characteristics of the speech, and the logarithmic energy output by each filter bank is computed.
  • the features output by the filter banks are the Filter Bank features.
  • applying a discrete cosine transform to the computed logarithmic energies yields the MFCC coefficients, i.e., the MFCC features.
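  • the FFT / Mel filter bank / log / DCT chain described above can be sketched as follows; the filter and coefficient counts (26 filters, 13 cepstra, 512-point FFT) are common defaults assumed here, not values taken from the patent:

```python
import numpy as np
from scipy.fftpack import dct

def fbank_and_mfcc(frames, sr=16000, nfft=512, n_filters=26, n_ceps=13):
    """Power spectrum -> Mel filter-bank log energies (Filter Bank feature)
    -> discrete cosine transform (MFCC feature)."""
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft            # energy spectrum
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)                    # Hz -> Mel
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):                                # triangular filters
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_fbank = np.log(power @ fb.T + 1e-10)                         # Filter Bank feature
    mfcc = dct(log_fbank, type=2, axis=1, norm='ortho')[:, :n_ceps]  # MFCC feature
    return log_fbank, mfcc
```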
  • Step 306 The MFCC feature is used as input data of the trained GMM-HMM model, and the first likelihood probability matrix of the output of the trained GMM-HMM model is obtained.
  • the acoustic model and the language model collectively realize the recognition of the voice.
  • the role of the acoustic model is to identify the alignment relationship between the speech frame and the phoneme state.
  • the GMM-HMM model is part of the acoustic model and is used to initially align the speech frame with the phoneme state.
  • specifically, the extracted MFCC features of the voice data to be recognized are used as input data of the trained GMM-HMM model, and the likelihood probability matrix output by the model is acquired; for ease of later distinction it is referred to here as the "first likelihood probability matrix".
  • the likelihood probability matrix represents the alignment relationship between speech frames and phoneme states, i.e., this alignment can be obtained from the computed likelihood probability matrix; however, the alignment obtained by GMM-HMM training is not very accurate, so the first likelihood probability matrix here amounts to a preliminary alignment of speech frames with phoneme states.
  • the specific calculation formula of the GMM model is as follows (the formula image is not reproduced in this text; a standard K-dimensional diagonal Gaussian density consistent with the symbols defined below is): b(x) = (2π)^(-K/2) |D|^(-1/2) exp(-(x-μ)ᵀ D⁻¹ (x-μ) / 2)
  • where x represents the extracted speech feature (MFCC) vector, μ and D are respectively the mean and the variance matrix, and K represents the order of the MFCC coefficients.
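  • for illustration, the density above can be evaluated in the log domain as follows; the mixture weights and the per-component loop are an assumption about how a GMM built from such components would be scored, not code from the patent:

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    """Log-density of a K-dimensional Gaussian with diagonal covariance,
    matching the b(x) form above."""
    K = x.shape[-1]
    return -0.5 * (K * np.log(2 * np.pi) + np.sum(np.log(var))
                   + np.sum((x - mu) ** 2 / var, axis=-1))

def gmm_log_likelihood(x, weights, mus, vars_):
    """log sum_k w_k N(x; mu_k, D_k), computed stably with log-sum-exp."""
    comps = np.log(weights) + np.stack(
        [log_gauss_diag(x, m, v) for m, v in zip(mus, vars_)], axis=-1)
    m = comps.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(comps - m).sum(axis=-1, keepdims=True))).squeeze(-1)
```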
  • Step 308 The Filter Bank feature is used as an input feature of the trained two-dimensional LSTM model, and the time dimension and the hierarchical dimension are respectively calculated, and the output posterior probability matrix including the time dimension and the hierarchical dimension information is obtained.
  • the LSTM model belongs to the deep learning model and is also part of the acoustic model.
  • Two-dimensional LSTM is an innovative model based on the traditional LSTM model, which includes not only the time dimension but also the hierarchical dimension. Therefore, the model has a better recognition effect than the traditional LSTM model.
  • with the Filter Bank feature as the input feature of the trained two-dimensional LSTM model, the use of the time dimension and the hierarchical dimension is achieved by computing both dimensions over the same input (the speech features) and finally fusing the results.
  • in each LSTM layer, the time-dimension computation is performed first, and its output serves as the input of the hierarchical-dimension computation.
  • each LSTM neuron node has both time and level information.
  • FIG. 4 is a schematic structural diagram of a two-dimensional LSTM in one embodiment; referring to FIG. 4, the input is defined first (formula image in the original, not reproduced here; in a Grid-LSTM-style form consistent with the symbols given, the input of the node at time t and layer l can be written as x_t^l = [h_(t-1)^(l,T); h_t^(l-1,D)]).
  • where t represents time, l represents the layer, T refers to the time dimension (TimeLSTM) and D refers to the hierarchical dimension (DepthLSTM).
  • the output is (again a formula image in the original; in the same hedged style it is a pair of standard LSTM updates, one per dimension): (h_t^(l,T), c_t^(l,T)) = LSTM(x_t^l, c_(t-1)^(l,T); θ) and (h_t^(l,D), c_t^(l,D)) = LSTM(x_t^l, c_t^(l-1,D); θ), where c represents the state of a node and θ refers to all other LSTM parameters.
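  • the patent does not spell out the exact update equations, so the following numpy sketch is only one plausible reading of the structure above: each layer runs a time-dimension LSTM, a depth-dimension LSTM carries state across layers at each time step, and both see the same fused input; the parameter layout is hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One standard LSTM step; W maps [x; h] onto the four gates."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c_new), c_new

def two_dim_lstm(features, params, n_layers, hidden):
    """Sketch of a two-dimensional LSTM. `features` is a list of per-frame
    vectors; `params[l]` holds hypothetical weights (W_T, b_T, W_D, b_D)."""
    T = len(features)
    h_depth = [np.zeros(hidden) for _ in range(T)]  # depth-dimension states
    c_depth = [np.zeros(hidden) for _ in range(T)]
    layer_in = features
    for l in range(n_layers):
        W_T, b_T, W_D, b_D = params[l]
        h_t, c_t = np.zeros(hidden), np.zeros(hidden)
        out = []
        for t in range(T):
            x = np.concatenate([layer_in[t], h_depth[t]])    # fuse both dimensions
            h_t, c_t = lstm_step(x, h_t, c_t, W_T, b_T)      # time dimension
            h_depth[t], c_depth[t] = lstm_step(x, h_depth[t], c_depth[t],
                                               W_D, b_D)     # depth dimension
            out.append(h_t)
        layer_in = out
    return layer_in  # per-frame outputs; a softmax layer would give posteriors
```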
  • Step 310: compute a target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix.
  • the HMM (Hidden Markov) model is a statistical model used to describe a Markov process with implicit unknown parameters, the role of which is to determine the implicit parameters in the process from observable parameters.
  • the HMM model mainly involves five parameters, which are two state sets and three probability sets.
  • the two state sets are the hidden state and the observed state, and the three probability sets are the initial matrix, the transition matrix and the confusion matrix.
  • the transfer matrix is obtained by training, that is, once the training of the HMM model is completed, the transfer matrix is determined.
  • in this embodiment, the observable speech feature (the Filter Bank feature) is mainly used as the observation state to compute and determine the correspondence between phoneme states and speech frames (i.e., the hidden state); to determine this correspondence, two more parameters must be determined: the initial matrix and the confusion matrix.
  • the posterior probability matrix calculated by the two-dimensional LSTM model is the confusion matrix to be determined in the HMM model
  • the first likelihood probability matrix is the initial matrix to be determined. Therefore, using the posterior probability matrix and the first likelihood probability matrix as the input data of the trained HMM model, the target likelihood probability matrix of the output can be obtained.
  • the target likelihood probability matrix represents the final alignment relationship between the phoneme state and the speech frame. Subsequently, according to the determined target likelihood probability matrix, the target word sequence corresponding to the voice data to be recognized can be acquired in the phoneme decoding network.
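  • the patent feeds the posterior probability matrix into the trained HMM as its confusion matrix; in conventional hybrid NN/HMM systems the closely related computation rescales posteriors into pseudo-likelihoods by dividing by the state priors, shown below as an assumption about one common way such matrices are combined, not as the patent's exact computation:

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, state_priors, eps=1e-10):
    """Hybrid-ASR convention: p(x|s) is proportional to p(s|x) / p(s),
    applied in the log domain for numerical stability."""
    return np.log(posteriors + eps) - np.log(state_priors + eps)
```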
  • Step 312 Acquire a target word sequence corresponding to the voice data to be recognized in the phoneme decoding network according to the target likelihood probability matrix.
  • in this embodiment, the speech recognition process includes two parts: an acoustic model and a language model.
  • before recognition, a phoneme-level decoding network is first built from the trained acoustic model, the language model and the dictionary, and a search algorithm finds the best path through this network; the search algorithm can use the Viterbi algorithm. This path outputs, with maximum probability, the word string corresponding to the voice data to be recognized, thereby determining the text contained in the voice data.
  • the phoneme-level decoding network (i.e., the phoneme decoding network) is built with finite state transducer (FST) algorithms, such as the determinization and minimization algorithms: sentences are split into words, words are split into phonemes (e.g., Chinese initials and finals, English phonetic symbols), and the phonemes, the pronunciation dictionary, the grammar and so on are then aligned by the above methods to obtain the output phoneme decoding network.
  • the phoneme decoding network contains representations of all possible recognition paths; decoding prunes the paths of this huge network according to the input voice data, obtaining one or more candidate paths stored in a word-lattice data structure; the final recognition then scores the candidate paths, and the highest-scoring path is the recognition result.
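  • real decoders search a weighted FST; as a much-reduced sketch, the Viterbi search named above can be written over a plain state-level HMM as follows (the matrix names are assumptions tying it to the matrices discussed in this document):

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Best state path through an HMM.
    log_init: (S,) initial log-probabilities; log_trans: (S, S) transitions;
    log_obs: (T, S) per-frame log-likelihoods (e.g. a likelihood probability matrix)."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (prev state, current state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```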
  • the above speech recognition method combines the Gaussian mixture model (GMM) with the long short-term memory (LSTM) recurrent neural network from deep learning: the GMM-HMM model first computes a first likelihood probability matrix from the extracted MFCC features, the first likelihood probability matrix representing the alignment of the voice data on phoneme states; the LSTM then aligns further on the basis of this preliminary alignment result, which helps improve the accuracy of speech recognition. Moreover, the LSTM used is an innovative two-dimensional LSTM that contains both time-dimension and hierarchical-dimension information, so it has a better speech feature representation than the traditional LSTM with only time-dimension information, further improving the accuracy of speech recognition.
  • the step 310 of calculating the target likelihood probability matrix by using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix comprises:
  • Step 310A The Filter Bank feature and the first likelihood probability matrix are used as input data of the trained DNN-HMM model, and the second likelihood probability matrix of the trained DNN-HMM output is obtained.
  • Step 310B The posterior probability matrix and the second likelihood probability matrix are used as input data of the trained HMM model, and the target likelihood probability matrix is calculated.
  • in this embodiment, to obtain a more accurate recognition result, the preliminary alignment result (the first likelihood probability matrix) obtained through the trained GMM-HMM model is further aligned by the trained DNN-HMM, which yields a better alignment. Since a deep neural network model obtains a better speech feature representation than the traditional Gaussian mixture model, using the deep neural network model for further forced alignment can further improve accuracy. The further alignment result (the second likelihood probability matrix) is then fed into the innovative two-dimensional LSTM-HMM model, and the final alignment result (the target likelihood probability matrix) can be obtained. It should be noted that the alignment result here refers to the alignment relationship between speech frames and phoneme states.
  • whether Gaussian mixture model or deep learning model, the models above are all part of the acoustic model, and the function of the acoustic model is to obtain the alignment relationship between speech frames and phoneme states, so as to facilitate subsequently acquiring, in combination with the language model, the target word sequence corresponding to the voice data to be recognized in the phoneme decoding network.
  • in this embodiment, speech recognition is performed by combining the GMM-HMM model, the DNN-HMM model and the LSTM model: the GMM-HMM model first computes the first likelihood probability matrix from the extracted MFCC features, the first likelihood probability matrix representing the preliminary alignment of speech frames with phoneme states; the DNN-HMM model then aligns further on this basis; and the LSTM finally performs the last alignment step on the basis of the previous alignment result. Combining the GMM-HMM, DNN-HMM and LSTM models improves speech recognition, and the LSTM used is an innovative two-dimensional LSTM that contains both time-dimension and hierarchical-dimension information; compared with the traditional LSTM with only time-dimension information, it has a better speech feature representation, which helps further improve the effect of speech recognition.
  • the step 304 of extracting the Filter Bank feature and the MFCC feature in the voice data includes:
  • Step 304A Perform Fourier transform on the speech data to be identified into an energy spectrum in the frequency domain.
  • in this embodiment, since the characteristics of a speech signal are generally hard to see from its time-domain waveform, the signal usually needs to be converted into an energy distribution in the frequency domain for observation; different energy distributions represent different speech characteristics. Therefore, the voice data to be recognized needs to undergo a fast Fourier transform to obtain the energy distribution over the spectrum.
  • the spectrum of each frame is obtained by performing a fast Fourier transform on each frame of the speech signal, and the power spectrum (i.e., the energy spectrum) of the speech signal is obtained by taking the squared modulus of the spectrum.
  • Step 304B: the frequency-domain energy spectrum is used as the input feature of the Mel-scale triangular filter banks, and the Filter Bank feature of the voice data to be recognized is computed.
  • in this embodiment, the obtained frequency-domain energy spectrum is used as the input feature of the Mel-scale triangular filter banks, and the logarithmic energy output by each triangular filter bank is computed, giving the Filter Bank feature of the voice data to be recognized.
  • the Filter Bank feature is likewise obtained by using the energy spectrum corresponding to each frame of the speech signal as the input feature of the Mel-scale triangular filter banks, yielding the Filter Bank feature corresponding to each frame of the speech signal.
  • step 304C the Filter Bank feature is subjected to discrete cosine transform to obtain the MFCC feature of the voice data to be recognized.
  • in this embodiment, to obtain the MFCC feature of the voice data to be recognized, the logarithmic energy output by the filter banks is further subjected to a discrete cosine transform to obtain the corresponding MFCC feature.
  • the MFCC feature corresponding to each frame of the speech signal is obtained by discrete Cosine transforming the Filter Bank feature corresponding to each frame of the speech signal.
  • the difference between the Filter Bank feature and the MFCC feature is that the Filter Bank feature has data correlation between different feature dimensions, while the MFCC feature is a feature obtained by using discrete cosine transform to remove the data correlation of the Filter Bank feature.
  • Step 308 of using the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and hierarchical-dimension information includes:
  • Step 308A Acquire a Filter Bank feature corresponding to each frame of voice data in the to-be-identified voice data, and sort according to time.
  • the voice data is first framed, and then the Filter Bank features corresponding to each frame of the voice data are extracted, and sorted according to the time sequence. That is, the Filter Bank features of each frame are sorted according to the order in which each frame of the speech data to be recognized appears.
  • Step 308B: the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after each frame, is used as input features of the trained two-dimensional LSTM model, computation is performed in the time dimension and the hierarchical dimension respectively, and the posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information, is obtained.
  • in this embodiment, the input of the deep learning model uses multi-frame features, which is more advantageous than the traditional Gaussian mixture model with only single-frame input, because splicing the speech frames before and after helps capture the influence of context-related information on the current frame. Therefore, the Filter Bank features of each frame of voice data and of the preset number of frames before and after it are generally used as input features of the trained two-dimensional LSTM model. For example, the current frame is spliced with the five frames before and after it, and the resulting 11 frames of data are used as input features of the trained two-dimensional LSTM model; this 11-frame speech feature sequence passes through the nodes of the two-dimensional LSTM, which output the posterior probability of the phoneme state corresponding to that frame of voice data.
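  • the 11-frame splicing described above (current frame plus five frames of context on each side) can be sketched as follows; padding edge frames by repetition is a common convention assumed here:

```python
import numpy as np

def splice(features, context=5):
    """Splice each frame with `context` frames on each side;
    context=5 yields the 11-frame input described above."""
    padded = np.pad(features, ((context, context), (0, 0)), mode='edge')
    return np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                     for i in range(len(features))])
```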
  • Step 308C Determine a posterior probability matrix corresponding to the to-be-identified voice data according to a posterior probability corresponding to each frame of voice data.
  • in this embodiment, once the posterior probability corresponding to each frame of voice data is obtained, the posterior probability matrix corresponding to the voice data to be recognized is determined.
  • the posterior probability matrix is composed of the individual posterior probabilities. Since the two-dimensional LSTM model can contain both time-dimension and hierarchical-dimension information, it can obtain the posterior probability matrix corresponding to the voice data better than traditional models with only time-dimension information.
  • in one embodiment, before the step of acquiring the voice data to be recognized, the method further includes: step 301, building the GMM-HMM model and building the two-dimensional LSTM model, which specifically includes:
  • Step 301A: the GMM-HMM model is trained with a training corpus, the variance and mean of the GMM-HMM model are determined through continual iterative training, and the trained GMM-HMM model is generated according to the variance and the mean.
  • in this embodiment, the GMM-HMM acoustic model is built using monophone training followed by triphone training.
  • triphone training takes into account the influence of the phonemes adjacent to the current phoneme, can obtain a more accurate alignment, and can therefore produce better recognition results.
  • depending on the features and purpose, triphone training generally uses triphone training based on delta+delta-delta features, and triphone training with linear discriminant analysis plus maximum likelihood linear transform of features.
  • specifically, the speech features in the input training corpus are first normalized; by default the variance is normalized.
  • the purpose of speech feature normalization is to eliminate the bias introduced into feature extraction by convolutional noise such as the telephone channel.
  • then an initial GMM-HMM model is quickly obtained using a small amount of feature data, and the variance and mean of the GMM-HMM Gaussian mixture model are determined through continual iterative training; once the variance and mean are determined, the corresponding GMM-HMM model is determined accordingly.
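  • the feature normalization described above can be sketched as per-utterance cepstral mean and variance normalization; treating it as CMVN is an assumption consistent with, but not spelled out by, the text:

```python
import numpy as np

def cmvn(features, norm_var=True):
    """Subtract the per-dimension mean and, by default, divide by the
    per-dimension standard deviation over the utterance."""
    out = features - features.mean(axis=0)
    if norm_var:
        out /= features.std(axis=0) + 1e-10
    return out
```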
  • Step 301B: according to the MFCC features extracted from the training corpus, the likelihood probability matrix corresponding to the training corpus is obtained with the trained GMM-HMM model.
  • in this embodiment, the voice data in the training corpus is used for training: the MFCC features of the speech in the training corpus are extracted and then used as input features of the trained GMM-HMM model, and the likelihood probability matrix corresponding to the speech in the training corpus is obtained as output.
  • the likelihood probability matrix represents the alignment relationship between speech frames and phoneme states; the purpose of outputting it from the trained GMM-HMM is to use it as the initial alignment for subsequently training the deep learning model, which helps the deep learning model obtain better results.
  • Step 301C: the two-dimensional LSTM model is trained according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, the weight matrices and bias matrices corresponding to the two-dimensional LSTM model are determined, and the trained two-dimensional LSTM model is generated according to the weight matrices and bias matrices.
  • in this embodiment, the alignment result computed by the GMM-HMM described above (i.e., the likelihood probability matrix) and the original speech features are used together as input features for training the two-dimensional LSTM model, where the original speech features are Filter Bank features; compared with MFCC features, Filter Bank features retain data correlation and therefore have a better speech feature representation.
  • by training the two-dimensional LSTM model, the weight matrix and bias matrix corresponding to each LSTM layer are determined.
  • the two-dimensional LSTM also belongs to one of the deep neural network models, and the neural network layer generally falls into three categories: an input layer, a hidden layer, and an output layer.
  • the purpose of training the two-dimensional LSTM model is to determine all the weight matrices and bias matrices in each layer as well as the corresponding number of layers.
  • the training algorithm can use existing algorithms such as the forward propagation algorithm and the Viterbi algorithm; the specific training algorithm is not limited here.
  • a voice recognition apparatus comprising:
  • the obtaining module 902 is configured to acquire voice data to be identified.
  • the extraction module 904 is configured to extract the Filter Bank feature and the MFCC feature in the voice data.
  • the output module 906 is configured to use the MFCC feature as input data of the trained GMM-HMM model, and obtain a first likelihood probability matrix of the output of the trained GMM-HMM model.
  • the first calculation module 908 is configured to use the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, and perform time dimension and hierarchical dimension calculation respectively, and obtain an output posterior probability matrix including time dimension and hierarchical dimension information. .
  • the second calculating module 910 is configured to calculate the target likelihood probability matrix by using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix.
  • the decoding module 912 is configured to acquire, in the phoneme decoding network, a target word sequence corresponding to the voice data to be identified according to the second likelihood probability matrix.
  • in one embodiment, the second calculating module 910 is further configured to use the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, acquire the second likelihood probability matrix output by the trained DNN-HMM, use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and compute the target likelihood probability matrix.
  • in one embodiment, the extraction module 904 is further configured to convert the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform, use the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks, compute the Filter Bank feature of the voice data to be recognized, and apply a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • the first computing module 908 includes:
  • the sorting module 908A is configured to acquire Filter Bank features corresponding to each frame of voice data in the to-be-identified voice data and sort them by time.
  • the posterior probability calculation module 908B is configured to use the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after each frame, as input features of the trained two-dimensional LSTM model, perform computation in the time dimension and the hierarchical dimension respectively, and obtain the posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information.
  • the determining module 908C is configured to determine a posterior probability matrix corresponding to the to-be-identified voice data according to a posterior probability corresponding to each frame of voice data.
  • the voice recognition apparatus further includes:
  • the GMM-HMM model training module 914 is used to train the GMM-HMM model by using the training corpus, and determine the variance and mean of the GMM-HMM model through continuous iterative training, and generate the trained GMM-HMM model according to the variance and the mean.
  • the likelihood probability matrix obtaining module 916 is configured to obtain the likelihood probability matrix corresponding to the training corpus according to the MFCC feature extracted from the training corpus and the trained GMM-HMM model.
  • the two-dimensional LSTM model training module 918 is configured to train the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, determine the weight matrices and bias matrices corresponding to the two-dimensional LSTM model, and generate the trained two-dimensional LSTM model according to the weight matrices and bias matrices.
  • each module in the above speech recognition apparatus may be implemented in whole or in part by software, hardware or a combination thereof; the network interface may be an Ethernet card or a wireless network card.
  • the above modules may be embedded in hardware form in, or independent of, the processor in the server, or may be stored in software form in the memory of the server, so that the processor can invoke and execute the operations corresponding to each of the above modules.
  • the processor can be a central processing unit (CPU), a microprocessor, a microcontroller, or the like.
  • in one embodiment, a computer device is proposed; the internal structure of the computer device may correspond to the structure shown in FIG. 1, that is, the computer device may be a server or a terminal, and includes a memory, a processor, and a computer program stored on the memory and operable on the processor, the processor implementing the following steps when executing the computer program: acquiring voice data to be recognized; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as input data of the trained GMM-HMM model, and acquiring a first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and acquiring an output posterior probability matrix containing time-dimension and hierarchical-dimension information; computing a target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and acquiring, in the phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  • the computing, by the processor, of the target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix includes: using the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, acquiring the second likelihood probability matrix output by the trained DNN-HMM, using the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and computing the target likelihood probability matrix.
  • the extracting, by the processor, of the Filter Bank feature and the MFCC feature from the voice data includes: converting the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform; using the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks and computing the Filter Bank feature of the voice data to be recognized; and applying a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • the using, by the processor, of the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively and acquiring the output posterior probability matrix containing time-dimension and hierarchical-dimension information includes: acquiring the Filter Bank feature corresponding to each frame of voice data in the voice data to be recognized and sorting by time; using the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after each frame, as input features of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and obtaining the posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information; and determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
  • before the step of acquiring the voice data to be recognized, the processor, when executing the computer program, is further used to implement the following steps: training the GMM-HMM model with a training corpus, and determining the variance and mean of the GMM-HMM model through continual iterative training; generating the trained GMM-HMM model according to the variance and the mean; obtaining, according to the MFCC features extracted from the training corpus, the likelihood probability matrix corresponding to the training corpus with the trained GMM-HMM model; training the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices corresponding to the two-dimensional LSTM model; and generating the trained two-dimensional LSTM model according to the weight matrices and bias matrices.
  • a computer readable storage medium is provided, having computer instructions stored thereon that, when executed by a processor, implement the following steps: acquiring voice data to be recognized; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as input data of the trained GMM-HMM model, and acquiring a first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and acquiring an output posterior probability matrix containing time-dimension and hierarchical-dimension information; computing a target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and acquiring, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  • the computing, by the processor, of the target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix includes: using the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, acquiring the second likelihood probability matrix output by the trained DNN-HMM, using the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and computing the target likelihood probability matrix.
  • the extracting, by the processor, of the Filter Bank feature and the MFCC feature from the voice data includes: converting the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform; using the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks and computing the Filter Bank feature of the voice data to be recognized; and applying a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • the using, by the processor, of the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively and acquiring the output posterior probability matrix containing time-dimension and hierarchical-dimension information includes: acquiring the Filter Bank feature corresponding to each frame of voice data in the voice data to be recognized and sorting by time; using the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after each frame, as input features of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and obtaining the posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information; and determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
  • before the step of acquiring the voice data to be recognized, the processor, when executing the computer program, is further used to implement the following steps: training the GMM-HMM model with a training corpus, and determining the variance and mean of the GMM-HMM model through continual iterative training; generating the trained GMM-HMM model according to the variance and the mean; obtaining, according to the MFCC features extracted from the training corpus, the likelihood probability matrix corresponding to the training corpus with the trained GMM-HMM model; training the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices corresponding to the two-dimensional LSTM model; and generating the trained two-dimensional LSTM model according to the weight matrices and bias matrices.
  • the storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

This application proposes a speech recognition method, the method comprising: acquiring voice data to be recognized; extracting Filter Bank features and MFCC features from the voice data; using the MFCC features as input data of a GMM-HMM model to acquire a first likelihood probability matrix; using the Filter Bank features as input features of a two-dimensional LSTM model to acquire a posterior probability matrix; using the posterior probability matrix and the first likelihood probability matrix as input data of an HMM model to acquire a second likelihood probability matrix; and acquiring, in a phoneme decoding network, a corresponding target word sequence according to the second likelihood probability matrix.

Description

Speech recognition method, apparatus, computer device and storage medium
This application claims priority to Chinese Patent Application No. 2017104387727, entitled "Speech recognition method, apparatus, computer device and storage medium" and filed with the Chinese Patent Office on June 12, 2017, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of computer processing, and in particular to a speech recognition method, apparatus, computer device and storage medium.
BACKGROUND
Speech recognition, also known as Automatic Speech Recognition (ASR), aims to enable machines to turn speech signals into text through recognition and understanding, and is an important branch of the development of modern artificial intelligence. Speech recognition technology is a prerequisite for natural language processing; it can effectively promote the development of fields related to voice-controlled interaction and greatly facilitate people's lives, as in smart homes and voice input. The accuracy of speech recognition directly determines the effectiveness of the technology's applications.
Traditional speech recognition technology builds its acoustic model on GMM-HMM (Gaussian mixture model and hidden Markov model). In recent years, with the development of deep learning technology, building the acoustic model on DNN-HMM (deep learning model and hidden Markov model) has greatly improved recognition accuracy compared with GMM-HMM, but the accuracy of speech recognition still awaits further improvement.
SUMMARY
According to various embodiments of the present application, a speech recognition method, apparatus, computer device and storage medium are provided.
A speech recognition method, comprising:
acquiring voice data to be recognized;
extracting Filter Bank features and MFCC features from the voice data;
using the MFCC features as input data of a trained GMM-HMM model, and acquiring a first likelihood probability matrix output by the trained GMM-HMM model;
using the Filter Bank features as input features of a trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and acquiring an output posterior probability matrix containing time-dimension and hierarchical-dimension information; computing a target likelihood probability matrix with a trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and
acquiring, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
A speech recognition apparatus, comprising:
an acquiring module, configured to acquire voice data to be recognized;
an extraction module, configured to extract Filter Bank features and MFCC features from the voice data;
an output module, configured to use the MFCC features as input data of a trained GMM-HMM model, and acquire a first likelihood probability matrix output by the trained GMM-HMM model;
a first computing module, configured to use the Filter Bank features as input features of a trained two-dimensional LSTM model, perform computation in the time dimension and the hierarchical dimension respectively, and acquire an output posterior probability matrix containing time-dimension and hierarchical-dimension information;
a second computing module, configured to compute a target likelihood probability matrix with a trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and
a decoding module, configured to acquire, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
A computer device, comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the following steps: acquiring voice data to be recognized;
extracting Filter Bank features and MFCC features from the voice data;
using the MFCC features as input data of a trained GMM-HMM model, and acquiring a first likelihood probability matrix output by the trained GMM-HMM model;
using the Filter Bank features as input features of a trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and acquiring an output posterior probability matrix containing time-dimension and hierarchical-dimension information;
computing a target likelihood probability matrix with a trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and
acquiring, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
One or more non-volatile readable storage media storing computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring voice data to be recognized;
extracting Filter Bank features and MFCC features from the voice data;
using the MFCC features as input data of a trained GMM-HMM model, and acquiring a first likelihood probability matrix output by the trained GMM-HMM model;
using the Filter Bank features as input features of a trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and acquiring an output posterior probability matrix containing time-dimension and hierarchical-dimension information;
computing a target likelihood probability matrix with a trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and
acquiring, in the phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objects and advantages of the present application will become apparent from the specification, the drawings and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Apparently, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a block diagram of the internal structure of a computer device in one embodiment;
FIG. 2 is an architecture diagram of speech recognition in one embodiment;
FIG. 3 is a flowchart of a speech recognition method in one embodiment;
FIG. 4 is a schematic structural diagram of a two-dimensional LSTM in one embodiment;
FIG. 5 is a flowchart of a method for computing a target likelihood probability matrix with a trained HMM model according to a posterior probability matrix and a first likelihood probability matrix in one embodiment;
FIG. 6 is a flowchart of a method for extracting Filter Bank features and MFCC features from voice data in one embodiment;
FIG. 7 is a flowchart of a method for obtaining a posterior probability matrix through a two-dimensional LSTM model in one embodiment;
FIG. 8 is a flowchart of a method for building a GMM-HMM model and a two-dimensional LSTM model in one embodiment;
FIG. 9 is a structural block diagram of a speech recognition apparatus in one embodiment;
FIG. 10 is a structural block diagram of a first computing module in one embodiment;
FIG. 11 is a structural block diagram of a speech recognition apparatus in another embodiment.
DETAILED DESCRIPTION
To make the objectives, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and not to limit it.
As shown in FIG. 1, which is a schematic diagram of the internal structure of a computer device in one embodiment, the computer device may be a terminal or a server. Referring to FIG. 1, the computer device includes a processor, a non-volatile storage medium, an internal memory, a network interface, a display screen and an input device connected through a system bus. The non-volatile storage medium of the computer device can store an operating system and computer readable instructions; when executed, the computer readable instructions can cause the processor to perform a speech recognition method. The processor of the computer device is used to provide computing and control capabilities and supports the operation of the entire computer device. The internal memory can store computer readable instructions that, when executed by the processor, can cause the processor to perform a speech recognition method. The network interface of the computer device is used for network communication. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device of the computer device may be a touch layer covering the display screen, or a button, trackball or touchpad provided on the casing of the computer device, or an external keyboard, touchpad or mouse. The touch layer and the display screen form a touch screen. A person skilled in the art can understand that the structure shown in FIG. 1 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
First, the framework of speech recognition is introduced. As shown in FIG. 2, speech recognition mainly includes two parts, an acoustic model and a language model, which together with a dictionary constitute the framework of speech recognition. The process of speech recognition is the process of converting an input speech feature sequence into a character sequence according to the dictionary, the acoustic model and the language model. The role of the acoustic model is to obtain the mapping between speech features and phonemes; the role of the language model is to obtain the mappings between words, and between words and sentences; the role of the dictionary is to obtain the mapping between words and phonemes. The specific process of speech recognition can be divided into three steps. The first step is to recognize speech frames as phoneme states, that is, to align speech frames with phoneme states. The second step is to combine the phoneme states into phonemes. The third step is to combine the phonemes into words. Among these, the first step is the job of the acoustic model; it is both the key point and the difficulty: the more accurate the alignment of speech frames with phoneme states, the better the effect of speech recognition. A phoneme state is a finer-grained speech unit than the phoneme; usually one phoneme is composed of three phoneme states.
As shown in FIG. 3, in one embodiment a speech recognition method is proposed, which can be applied to a terminal or a server and specifically includes the following steps:
Step 302: acquire voice data to be recognized.
In this embodiment, the voice data to be recognized is usually audio data input by the user and captured by an interactive application, including audio of digits and audio of text.
Step 304: extract Filter Bank features and MFCC features from the voice data.
In this embodiment, both the Filter Bank feature and the MFCC (Mel frequency cepstrum coefficient) feature are parameters used in speech recognition to represent speech characteristics; the Filter Bank feature is used for the deep learning model, while the MFCC feature is used for the Gaussian mixture model. Before the Filter Bank and MFCC features are extracted, the voice data is pre-processed: pre-emphasis is first performed on the input voice data, in which a high-pass filter boosts the high-frequency part of the speech signal to flatten the spectrum; the pre-emphasized voice data is then framed and windowed, converting the non-stationary speech signal into short-time stationary segments; next, endpoint detection distinguishes speech from noise and extracts the valid speech portion. To extract the Filter Bank and MFCC features, the pre-processed speech data is first subjected to a fast Fourier transform, converting the time-domain speech signal into a frequency-domain energy spectrum for analysis; the energy spectrum is then passed through a set of Mel-scale triangular filter banks to highlight the formant characteristics of the speech, after which the logarithmic energy output by each filter bank is computed; the features output by the filter banks are the Filter Bank features. Further, a discrete cosine transform is applied to the computed logarithmic energies to obtain the MFCC coefficients, i.e., the MFCC features.
Step 306: use the MFCC features as input data of the trained GMM-HMM model, and acquire the first likelihood probability matrix output by the trained GMM-HMM model.
In this embodiment, the acoustic model and the language model jointly realize the recognition of speech. The role of the acoustic model is to identify the alignment relationship between speech frames and phoneme states. The GMM-HMM model is part of the acoustic model and is used to perform a preliminary alignment of speech frames with phoneme states. Specifically, the extracted MFCC features of the voice data to be recognized are used as input data of the trained GMM-HMM model, and the likelihood probability matrix output by the model is acquired; for ease of later distinction it is called the "first likelihood probability matrix" here. A likelihood probability matrix represents the alignment relationship between speech frames and phoneme states, i.e., this alignment can be obtained from the computed likelihood probability matrix; however, the alignment obtained by GMM-HMM training is not very accurate, so the first likelihood probability matrix here amounts to a preliminary alignment of speech frames with phoneme states. The specific calculation formula of the GMM model is as follows (formula image in the original, not reproduced here; a standard K-dimensional diagonal Gaussian density consistent with the symbols defined below is):
b(x) = (2π)^(-K/2) |D|^(-1/2) exp(-(x-μ)ᵀ D⁻¹ (x-μ) / 2)
where x represents the extracted speech feature (MFCC) vector, μ and D are respectively the mean and the variance matrix, and K represents the order of the MFCC coefficients.
Step 308: use the Filter Bank features as input features of the trained two-dimensional LSTM model, perform computation in the time dimension and the hierarchical dimension respectively, and acquire the output posterior probability matrix containing time-dimension and hierarchical-dimension information.
In this embodiment, the LSTM model belongs to the deep learning models and is also part of the acoustic model. The two-dimensional LSTM is an innovative model proposed on the basis of the traditional LSTM model; it includes not only the time dimension but also the hierarchical dimension, so it recognizes better than the traditional LSTM model. With the Filter Bank features as the input features of the trained two-dimensional LSTM model, the use of the time dimension and the hierarchical dimension is realized by computing both dimensions over the same input (the speech features) and finally fusing the results. In each LSTM layer, the time-dimension computation is performed first, and its output serves as the input of the hierarchical-dimension computation, so that every LSTM neuron node carries both time and hierarchy information. FIG. 4 is a schematic structural diagram of a two-dimensional LSTM in one embodiment. Referring to FIG. 4, the input is defined first (formula image in the original, not reproduced here; in a Grid-LSTM-style form consistent with the symbols given, the input of the node at time t and layer l can be written as x_t^l = [h_(t-1)^(l,T); h_t^(l-1,D)]),
where t represents time, l represents the layer, T refers to the time dimension (TimeLSTM) and D refers to the hierarchical dimension (DepthLSTM). The output is (again a formula image in the original; in the same hedged style it is a pair of standard LSTM updates, one per dimension): (h_t^(l,T), c_t^(l,T)) = LSTM(x_t^l, c_(t-1)^(l,T); θ) and (h_t^(l,D), c_t^(l,D)) = LSTM(x_t^l, c_t^(l-1,D); θ),
where c represents the state of a node and θ refers to all other LSTM parameters. Put simply, the same input (the speech features) is processed twice in each computation, once focusing on the time dimension and once on the hierarchical dimension, and the output has the same form as a traditional LSTM computed over the time dimension only. After passing through the two-dimensional LSTM model, the output posterior probability matrix containing time-dimension and hierarchical-dimension information is obtained.
Step 310: compute a target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix.
In this embodiment, the HMM (hidden Markov) model is a statistical model used to describe a Markov process with hidden unknown parameters; its role is to determine the hidden parameters of the process from the observable parameters. The HMM model mainly involves five parameters: two state sets and three probability sets. The two state sets are the hidden states and the observed states; the three probability sets are the initial matrix, the transition matrix and the confusion matrix. The transition matrix is obtained by training; that is, once training of the HMM model is completed, the transition matrix is determined. In this embodiment, the observable speech features (the Filter Bank features) are mainly used as the observed states to compute and determine the correspondence between phoneme states and speech frames (i.e., the hidden states). To determine this correspondence, two more parameters need to be determined: the initial matrix and the confusion matrix. The posterior probability matrix computed by the two-dimensional LSTM model is the confusion matrix to be determined in the HMM model, and the first likelihood probability matrix is the initial matrix to be determined. Therefore, by using the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model, the output target likelihood probability matrix can be obtained. The target likelihood probability matrix represents the final alignment relationship between phoneme states and speech frames. Subsequently, according to this determined target likelihood probability matrix, the target word sequence corresponding to the voice data to be recognized can be acquired in the phoneme decoding network.
Step 312: acquire, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
In this embodiment, the speech recognition process includes two parts: an acoustic model and a language model. Before recognition, a phoneme-level decoding network needs to be built from the trained acoustic model, the language model and the dictionary, and a search algorithm is used to find the best path in this network; the search algorithm can use the Viterbi algorithm. This path outputs, with maximum probability, the word string corresponding to the voice data to be recognized, thereby determining the text contained in the voice data. The phoneme-level decoding network (i.e., the phoneme decoding network) is built with finite state transducer (FST, Finite State Transducer) algorithms, such as the determinization and minimization algorithms: sentences are split into words, words are split into phonemes (e.g., Chinese initials and finals, English phonetic symbols), and the phonemes, the pronunciation dictionary, the grammar and so on are then aligned and computed by the above methods to obtain the output phoneme decoding network. The phoneme decoding network contains representations of all possible recognition paths; decoding prunes the paths of this huge network according to the input voice data, obtaining one or more candidate paths stored in a word-lattice data structure; the final recognition then scores the candidate paths, and the highest-scoring path is the recognition result.
The above speech recognition method combines the Gaussian mixture model (GMM) with the long short-term memory (LSTM) recurrent neural network from deep learning. The GMM-HMM model is first used to compute a first likelihood probability matrix from the extracted MFCC features, the first likelihood probability matrix representing the alignment result of the voice data on phoneme states; the LSTM is then used to align further on the basis of this preliminary alignment result, which helps improve the accuracy of speech recognition. Moreover, the LSTM used is an innovative two-dimensional LSTM containing both time-dimension and hierarchical-dimension information, so compared with the traditional LSTM with only time-dimension information it has a better speech feature representation, further improving the accuracy of speech recognition.
As shown in FIG. 5, in one embodiment, step 310 of computing the target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix includes:
Step 310A: use the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, and acquire the second likelihood probability matrix output by the trained DNN-HMM.
Step 310B: use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and compute the target likelihood probability matrix.
In this embodiment, to obtain a more accurate recognition result, the preliminary alignment result (the first likelihood probability matrix) obtained through the trained GMM-HMM model is further aligned by the trained DNN-HMM, which yields a better alignment. Since a deep neural network model obtains a better speech feature representation than the traditional Gaussian mixture model, using the deep neural network model for further forced alignment can further improve accuracy. The further alignment result (the second likelihood probability matrix) is then fed into the innovative two-dimensional LSTM-HMM model, and the final alignment result (the target likelihood probability matrix) can be obtained. It should be noted that the alignment result here refers to the alignment relationship between speech frames and phoneme states. Whether Gaussian mixture model or deep learning model, the models above are all part of the acoustic model, and the function of the acoustic model is to obtain the alignment relationship between speech frames and phoneme states, so as to facilitate subsequently acquiring, in combination with the language model, the target word sequence corresponding to the voice data to be recognized in the phoneme decoding network.
In this embodiment, speech recognition is performed by combining the GMM-HMM Gaussian mixture model, the DNN-HMM deep learning model and the LSTM long short-term memory recurrent neural network. The GMM-HMM model is first used to compute the first likelihood probability matrix from the extracted MFCC features, the first likelihood probability matrix representing the preliminary alignment result of speech frames with phoneme states; the DNN-HMM model then aligns further on this basis; the LSTM finally performs the last alignment step on the basis of the previous alignment result. Combining the GMM-HMM, DNN-HMM and LSTM models improves the effect of speech recognition, and the LSTM used is an innovative two-dimensional LSTM containing both time-dimension and hierarchical-dimension information; compared with the traditional LSTM with only time-dimension information, it has a better speech feature representation, which helps further improve the effect of speech recognition.
As shown in FIG. 6, in one embodiment, step 304 of extracting the Filter Bank features and MFCC features from the voice data includes:
Step 304A: convert the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform.
In this embodiment, since the characteristics of a speech signal are generally hard to see from its time-domain transformation, the signal usually needs to be converted into an energy distribution in the frequency domain for observation; different energy distributions represent different speech characteristics. Therefore, the voice data to be recognized needs to undergo a fast Fourier transform to obtain the energy distribution over the spectrum. Specifically, the spectrum of each frame is obtained by performing a fast Fourier transform on each frame of the speech signal, and the power spectrum (i.e., the energy spectrum) of the speech signal is obtained by taking the squared modulus of the spectrum.
Step 304B: use the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks, and compute the Filter Bank feature of the voice data to be recognized.
In this embodiment, to obtain the Filter Bank feature of the voice data to be recognized, the obtained frequency-domain energy spectrum is used as the input feature of the Mel-scale triangular filter banks, and the logarithmic energy output by each triangular filter bank is computed, giving the Filter Bank feature of the voice data to be recognized. The Filter Bank feature is likewise obtained by using the energy spectrum corresponding to each frame of the speech signal as the input feature of the Mel-scale triangular filter banks, yielding the Filter Bank feature corresponding to each frame of the speech signal.
Step 304C: apply a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
In this embodiment, to obtain the MFCC feature of the voice data to be recognized, the logarithmic energy output by the filter banks is further subjected to a discrete cosine transform to obtain the corresponding MFCC feature. The MFCC feature corresponding to each frame of the speech signal is obtained by applying a discrete cosine transform to the Filter Bank feature corresponding to that frame. The difference between the Filter Bank feature and the MFCC feature is that the Filter Bank feature has data correlation between different feature dimensions, while the MFCC feature is obtained by using the discrete cosine transform to remove the data correlation of the Filter Bank feature.
As shown in FIG. 7, in one embodiment, step 308 of using the Filter Bank features as input features of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and acquiring the output posterior probability matrix containing time-dimension and hierarchical-dimension information includes:
Step 308A: acquire the Filter Bank feature corresponding to each frame of voice data in the voice data to be recognized, and sort by time.
In this embodiment, when the Filter Bank features are extracted from the voice data to be recognized, the voice data is first divided into frames, the Filter Bank feature corresponding to each frame of voice data is then extracted, and the features are sorted in chronological order, i.e., the Filter Bank features of the frames are sorted according to the order in which each frame appears in the voice data to be recognized.
Step 308B: use the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after each frame, as input features of the trained two-dimensional LSTM model, perform computation in the time dimension and the hierarchical dimension respectively, and acquire the output posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information.
In this embodiment, the input of the deep learning model uses multi-frame features, which is more advantageous than the traditional Gaussian mixture model with only single-frame input, because splicing the speech frames before and after helps capture the influence of context-related information on the current frame. Therefore, the Filter Bank features of each frame of voice data and of the preset number of frames before and after it are generally used as input features of the trained two-dimensional LSTM model. For example, the current frame is spliced with the five frames before and after it, and the resulting 11 frames of data are used as input features of the trained two-dimensional LSTM model; this 11-frame speech feature sequence passes through the nodes of the two-dimensional LSTM, which output the posterior probability of the phoneme state corresponding to that frame of voice data.
Step 308C: determine the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
In this embodiment, once the posterior probability corresponding to each frame of voice data is obtained, the posterior probability matrix corresponding to the voice data to be recognized is determined. The posterior probability matrix is composed of the individual posterior probabilities. Since the two-dimensional LSTM model can contain both time-dimension and hierarchical-dimension information, it can obtain the posterior probability matrix corresponding to the voice data to be recognized better than previous traditional models with only time-dimension information.
As shown in FIG. 8, in one embodiment, before the step of acquiring the voice data to be recognized, the method further includes: step 301, building the GMM-HMM model and building the two-dimensional LSTM model. This specifically includes:
Step 301A: train the GMM-HMM model with a training corpus, determine the variance and mean of the GMM-HMM model through continual iterative training, and generate the trained GMM-HMM model according to the variance and the mean.
In this embodiment, the GMM-HMM acoustic model is built using monophone training followed by triphone training; triphone training takes into account the influence of the phonemes adjacent to the current phoneme, can obtain a more accurate alignment, and can therefore produce better recognition results. Depending on the features and purpose, triphone training generally uses triphone training based on delta+delta-delta features, and triphone training with linear discriminant analysis plus maximum likelihood linear transform of features. Specifically, the speech features in the input training corpus are first normalized; by default the variance is normalized. The purpose of speech feature normalization is to eliminate the bias introduced into feature extraction by convolutional noise such as the telephone channel. An initial GMM-HMM model is then quickly obtained using a small amount of feature data, and the variance and mean of the GMM-HMM Gaussian mixture model are determined through continual iterative training; once the variance and mean are determined, the corresponding GMM-HMM model is determined accordingly.
Step 301B: according to the MFCC features extracted from the training corpus, obtain the likelihood probability matrix corresponding to the training corpus with the trained GMM-HMM model.
In this embodiment, the voice data in the training corpus is used for training: the MFCC features of the speech in the training corpus are extracted and then used as input features of the trained GMM-HMM model, and the likelihood probability matrix corresponding to the speech in the training corpus is obtained as output. The likelihood probability matrix represents the alignment relationship between speech frames and phoneme states; the purpose of outputting it from the trained GMM-HMM is to use it as the initial alignment for subsequently training the deep learning model, so that the deep learning model can obtain better deep learning results.
Step 301C: train the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, determine the weight matrices and bias matrices corresponding to the two-dimensional LSTM model, and generate the trained two-dimensional LSTM model according to the weight matrices and bias matrices.
In this embodiment, the alignment result computed by the GMM-HMM described above (i.e., the likelihood probability matrix) and the original speech features are used together as input features for training the two-dimensional LSTM model, where the original speech features used here are Filter Bank features; compared with MFCC features, Filter Bank features retain data correlation and therefore have a better speech feature representation. By training the two-dimensional LSTM model, the weight matrix and bias matrix corresponding to each LSTM layer are determined. Specifically, the two-dimensional LSTM is also one of the deep neural network models; neural network layers generally fall into three categories: input layer, hidden layers and output layer. The purpose of training the two-dimensional LSTM model is to determine all the weight matrices and bias matrices in each layer as well as the corresponding number of layers; the training algorithm can use existing algorithms such as the forward propagation algorithm and the Viterbi algorithm, and the specific training algorithm is not limited here.
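The patent does not spell out the training loss. A common choice consistent with using the GMM-HMM alignment as supervision is frame-level cross-entropy against hard per-frame targets, sketched below as an assumption rather than as the patent's prescribed objective:

```python
import numpy as np

def frame_targets(likelihoods):
    """Hard alignment targets: the most likely phoneme state per frame,
    read off the GMM-HMM likelihood probability matrix (T, S)."""
    return likelihoods.argmax(axis=1)

def cross_entropy(posteriors, targets, eps=1e-10):
    """Mean per-frame cross-entropy between the 2D-LSTM output posteriors
    (T, S) and the alignment targets (T,); minimizing this loss fits the
    weight and bias matrices."""
    return -np.mean(np.log(posteriors[np.arange(len(targets)), targets] + eps))
```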
As shown in FIG. 9, in one embodiment, a speech recognition apparatus is proposed, the apparatus including:
an acquiring module 902, configured to acquire voice data to be recognized;
an extraction module 904, configured to extract Filter Bank features and MFCC features from the voice data;
an output module 906, configured to use the MFCC features as input data of the trained GMM-HMM model, and acquire the first likelihood probability matrix output by the trained GMM-HMM model;
a first computing module 908, configured to use the Filter Bank features as input features of the trained two-dimensional LSTM model, perform computation in the time dimension and the hierarchical dimension respectively, and acquire the output posterior probability matrix containing time-dimension and hierarchical-dimension information;
a second computing module 910, configured to compute the target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix;
a decoding module 912, configured to acquire, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the second likelihood probability matrix.
In one embodiment, the second computing module 910 is further configured to use the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, acquire the second likelihood probability matrix output by the trained DNN-HMM, use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and compute the target likelihood probability matrix.
In one embodiment, the extraction module 904 is further configured to convert the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform, use the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks, compute the Filter Bank feature of the voice data to be recognized, and apply a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
As shown in FIG. 10, in one embodiment, the first computing module 908 includes:
a sorting module 908A, configured to acquire the Filter Bank feature corresponding to each frame of voice data in the voice data to be recognized and sort by time;
a posterior probability computing module 908B, configured to use the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after each frame, as input features of the trained two-dimensional LSTM model, perform computation in the time dimension and the hierarchical dimension respectively, and acquire the output posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information;
a determining module 908C, configured to determine the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
As shown in FIG. 11, in one embodiment, the above speech recognition apparatus further includes:
a GMM-HMM model training module 914, configured to train the GMM-HMM model with a training corpus, determine the variance and mean of the GMM-HMM model through continual iterative training, and generate the trained GMM-HMM model according to the variance and the mean;
a likelihood probability matrix acquiring module 916, configured to obtain, according to the MFCC features extracted from the training corpus, the likelihood probability matrix corresponding to the training corpus with the trained GMM-HMM model;
a two-dimensional LSTM model training module 918, configured to train the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, determine the weight matrices and bias matrices corresponding to the two-dimensional LSTM model, and generate the trained two-dimensional LSTM model according to the weight matrices and bias matrices.
Each module in the above speech recognition apparatus may be implemented in whole or in part by software, hardware or a combination thereof. The network interface may be an Ethernet card or a wireless network card. The above modules may be embedded in hardware form in, or independent of, the processor in the server, or may be stored in software form in the memory of the server, so that the processor can invoke and execute the operations corresponding to each of the above modules. The processor may be a central processing unit (CPU), a microprocessor, a microcontroller, or the like.
In one embodiment, a computer device is proposed. The internal structure of the computer device may correspond to the structure shown in FIG. 1; that is, the computer device may be a server or a terminal, and includes a memory, a processor, and a computer program stored on the memory and operable on the processor. When executing the computer program, the processor implements the following steps: acquiring voice data to be recognized; extracting Filter Bank features and MFCC features from the voice data; using the MFCC features as input data of the trained GMM-HMM model, and acquiring the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank features as input features of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and acquiring the output posterior probability matrix containing time-dimension and hierarchical-dimension information; computing the target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and acquiring, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
In one embodiment, the computing of the target likelihood probability matrix with the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix, as performed by the processor, includes:
using the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, acquiring the second likelihood probability matrix output by the trained DNN-HMM, using the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and computing the target likelihood probability matrix.
In one embodiment, the extracting of the Filter Bank features and MFCC features from the voice data, as performed by the processor, includes: converting the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform; using the frequency-domain energy spectrum as the input feature of the Mel-scale triangular filter banks, and computing the Filter Bank feature of the voice data to be recognized; and applying a discrete cosine transform to the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
In one embodiment, the using of the Filter Bank features as input features of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively and acquiring the output posterior probability matrix containing time-dimension and hierarchical-dimension information, as performed by the processor, includes: acquiring the Filter Bank feature corresponding to each frame of voice data in the voice data to be recognized and sorting by time; using the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after each frame, as input features of the trained two-dimensional LSTM model, performing computation in the time dimension and the hierarchical dimension respectively, and acquiring the output posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information; and determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
In one embodiment, before the step of acquiring the voice data to be recognized, the processor, when executing the computer program, is further configured to implement the following steps: training the GMM-HMM model with a training corpus, and determining the variance and mean of the GMM-HMM model through continual iterative training; generating the trained GMM-HMM model according to the variance and the mean; obtaining, according to the MFCC features extracted from the training corpus, the likelihood probability matrix corresponding to the training corpus with the trained GMM-HMM model; training the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices corresponding to the two-dimensional LSTM model; and generating the trained two-dimensional LSTM model according to the weight matrices and bias matrices.
In one embodiment, a computer-readable storage medium is provided, storing computer instructions which, when executed by a processor, implement the following steps: acquiring the voice data to be recognized; extracting the Filter Bank features and MFCC features from the voice data; using the MFCC features as input data to the trained GMM-HMM model and obtaining the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank features as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and layer-dimension information; computing the target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using the trained HMM model; and obtaining, from the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
In one embodiment, computing the target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using the trained HMM model, as executed by the processor, includes: using the Filter Bank features and the first likelihood probability matrix as input data to a trained DNN-HMM model, obtaining the second likelihood probability matrix output by the trained DNN-HMM, and using the posterior probability matrix and the second likelihood probability matrix as input data to the trained HMM model to compute the target likelihood probability matrix.
In one embodiment, extracting the Filter Bank features and MFCC features from the voice data, as executed by the processor, includes: converting the voice data to be recognized into a frequency-domain energy spectrum via a Fourier transform; using the frequency-domain energy spectrum as input features to a mel-scale triangular filter bank to compute the Filter Bank features of the voice data to be recognized; and applying a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
In one embodiment, using the Filter Bank features as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and layer-dimension information, as executed by the processor, includes: obtaining the Filter Bank features corresponding to each frame of the voice data to be recognized and sorting them by time; using the Filter Bank features of each frame, together with the Filter Bank features of a preset number of preceding and following frames, as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability over phoneme states for each frame, containing time-dimension and layer-dimension information; and determining the posterior probability matrix of the voice data to be recognized from the posterior probability of each frame.
In one embodiment, before the step of acquiring the voice data to be recognized, the processor, when executing the computer instructions, further implements the following steps: training the GMM-HMM model on a training corpus and determining the variances and means of the GMM-HMM model through repeated iterative training; generating the trained GMM-HMM model from those variances and means; obtaining the likelihood probability matrix corresponding to the training corpus from the trained GMM-HMM model using the MFCC features extracted from the training corpus; training the two-dimensional LSTM model on the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices of the two-dimensional LSTM model; and generating the trained two-dimensional LSTM model from those weight and bias matrices.
Those of ordinary skill in the art will understand that all or part of the processes of the above embodiment methods can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, any combination of these technical features that involves no contradiction shall be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they shall not therefore be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be governed by the appended claims.

Claims (20)

  1. A voice recognition method, comprising:
    acquiring voice data to be recognized;
    extracting Filter Bank features and MFCC features from the voice data;
    using the MFCC features as input data to a trained GMM-HMM model and obtaining a first likelihood probability matrix output by the trained GMM-HMM model;
    using the Filter Bank features as input features to a trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining an output posterior probability matrix containing time-dimension and layer-dimension information;
    computing a target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using a trained HMM model; and
    obtaining, from a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  2. The method according to claim 1, wherein the step of computing a target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using a trained HMM model comprises:
    using the Filter Bank features and the first likelihood probability matrix as input data to a trained DNN-HMM model and obtaining a second likelihood probability matrix output by the trained DNN-HMM; and
    using the posterior probability matrix and the second likelihood probability matrix as input data to the trained HMM model to compute the target likelihood probability matrix.
  3. The method according to claim 1, wherein the step of extracting Filter Bank features and MFCC features from the voice data comprises:
    converting the voice data to be recognized into a frequency-domain energy spectrum via a Fourier transform;
    using the frequency-domain energy spectrum as input features to a mel-scale triangular filter bank to compute the Filter Bank features of the voice data to be recognized; and
    applying a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
  4. The method according to claim 1, wherein the step of using the Filter Bank features as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and layer-dimension information comprises:
    obtaining the Filter Bank features corresponding to each frame of the voice data to be recognized and sorting them by time;
    using the Filter Bank features of each frame, together with the Filter Bank features of a preset number of preceding and following frames, as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability over phoneme states for each frame, containing time-dimension and layer-dimension information; and
    determining the posterior probability matrix of the voice data to be recognized from the posterior probability of each frame.
  5. The method according to claim 1, further comprising, before the step of acquiring voice data to be recognized:
    training the GMM-HMM model on a training corpus and determining the variances and means of the GMM-HMM model through repeated iterative training;
    generating the trained GMM-HMM model from the variances and means;
    obtaining the likelihood probability matrix corresponding to the training corpus from the trained GMM-HMM model using the MFCC features extracted from the training corpus;
    training the two-dimensional LSTM model on the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices of the two-dimensional LSTM model; and
    generating the trained two-dimensional LSTM model from the weight matrices and bias matrices.
  6. A voice recognition apparatus, comprising:
    an acquisition module, configured to acquire voice data to be recognized;
    an extraction module, configured to extract Filter Bank features and MFCC features from the voice data;
    an output module, configured to use the MFCC features as input data to a trained GMM-HMM model and obtain a first likelihood probability matrix output by the trained GMM-HMM model;
    a first computation module, configured to use the Filter Bank features as input features to a trained two-dimensional LSTM model, perform computations in the time dimension and the layer dimension respectively, and obtain an output posterior probability matrix containing time-dimension and layer-dimension information;
    a second computation module, configured to compute a target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using a trained HMM model; and
    a decoding module, configured to obtain, from a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  7. The apparatus according to claim 6, wherein the second computation module is further configured to use the Filter Bank features and the first likelihood probability matrix as input data to a trained DNN-HMM model, obtain a second likelihood probability matrix output by the trained DNN-HMM, and use the posterior probability matrix and the second likelihood probability matrix as input data to the trained HMM model to compute the target likelihood probability matrix.
  8. The apparatus according to claim 6, wherein the extraction module is further configured to convert the voice data to be recognized into a frequency-domain energy spectrum via a Fourier transform, use the frequency-domain energy spectrum as input features to a mel-scale triangular filter bank to compute the Filter Bank features of the voice data to be recognized, and apply a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
  9. The apparatus according to claim 6, wherein using the Filter Bank features as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and layer-dimension information comprises:
    obtaining the Filter Bank features corresponding to each frame of the voice data to be recognized and sorting them by time;
    using the Filter Bank features of each frame, together with the Filter Bank features of a preset number of preceding and following frames, as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability over phoneme states for each frame, containing time-dimension and layer-dimension information; and
    determining the posterior probability matrix of the voice data to be recognized from the posterior probability of each frame.
  10. The apparatus according to claim 6, further comprising, before the step of acquiring the voice data to be recognized:
    training the GMM-HMM model on a training corpus and determining the variances and means of the GMM-HMM model through repeated iterative training;
    generating the trained GMM-HMM model from the variances and means;
    obtaining the likelihood probability matrix corresponding to the training corpus from the trained GMM-HMM model using the MFCC features extracted from the training corpus;
    training the two-dimensional LSTM model on the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices of the two-dimensional LSTM model; and
    generating the trained two-dimensional LSTM model from the weight matrices and bias matrices.
  11. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
    acquiring voice data to be recognized;
    extracting Filter Bank features and MFCC features from the voice data;
    using the MFCC features as input data to a trained GMM-HMM model and obtaining a first likelihood probability matrix output by the trained GMM-HMM model;
    using the Filter Bank features as input features to a trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining an output posterior probability matrix containing time-dimension and layer-dimension information;
    computing a target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using a trained HMM model; and
    obtaining, from a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  12. The computer device according to claim 11, wherein computing the target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using the trained HMM model, as performed by the processor, comprises:
    using the Filter Bank features and the first likelihood probability matrix as input data to a trained DNN-HMM model, obtaining a second likelihood probability matrix output by the trained DNN-HMM, and using the posterior probability matrix and the second likelihood probability matrix as input data to the trained HMM model to compute the target likelihood probability matrix.
  13. The computer device according to claim 11, wherein extracting the Filter Bank features and MFCC features from the voice data, as performed by the processor, comprises:
    converting the voice data to be recognized into a frequency-domain energy spectrum via a Fourier transform;
    using the frequency-domain energy spectrum as input features to a mel-scale triangular filter bank to compute the Filter Bank features of the voice data to be recognized; and
    applying a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
  14. The computer device according to claim 11, wherein using the Filter Bank features as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and layer-dimension information, as performed by the processor, comprises:
    obtaining the Filter Bank features corresponding to each frame of the voice data to be recognized and sorting them by time;
    using the Filter Bank features of each frame, together with the Filter Bank features of a preset number of preceding and following frames, as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability over phoneme states for each frame, containing time-dimension and layer-dimension information; and
    determining the posterior probability matrix of the voice data to be recognized from the posterior probability of each frame.
  15. The computer device according to claim 11, wherein, before the step of acquiring the voice data to be recognized, the processor, when executing the computer-readable instructions, further performs the following steps:
    training the GMM-HMM model on a training corpus and determining the variances and means of the GMM-HMM model through repeated iterative training;
    generating the trained GMM-HMM model from the variances and means;
    obtaining the likelihood probability matrix corresponding to the training corpus from the trained GMM-HMM model using the MFCC features extracted from the training corpus; and
    training the two-dimensional LSTM model on the Filter Bank features extracted from the training corpus and the likelihood probability matrix, determining the weight matrices and bias matrices of the two-dimensional LSTM model, and generating the trained two-dimensional LSTM model from the weight matrices and bias matrices.
  16. One or more non-volatile readable storage media storing computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring voice data to be recognized;
    extracting Filter Bank features and MFCC features from the voice data;
    using the MFCC features as input data to a trained GMM-HMM model and obtaining a first likelihood probability matrix output by the trained GMM-HMM model;
    using the Filter Bank features as input features to a trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining an output posterior probability matrix containing time-dimension and layer-dimension information;
    computing a target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using a trained HMM model; and
    obtaining, from a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  17. The storage media according to claim 16, wherein computing the target likelihood probability matrix from the posterior probability matrix and the first likelihood probability matrix using the trained HMM model, as performed by the processor, comprises:
    using the Filter Bank features and the first likelihood probability matrix as input data to a trained DNN-HMM model, obtaining a second likelihood probability matrix output by the trained DNN-HMM, and using the posterior probability matrix and the second likelihood probability matrix as input data to the trained HMM model to compute the target likelihood probability matrix.
  18. The storage media according to claim 16, wherein extracting the Filter Bank features and MFCC features from the voice data, as performed by the processor, comprises:
    converting the voice data to be recognized into a frequency-domain energy spectrum via a Fourier transform;
    using the frequency-domain energy spectrum as input features to a mel-scale triangular filter bank to compute the Filter Bank features of the voice data to be recognized; and
    applying a discrete cosine transform to the Filter Bank features to obtain the MFCC features of the voice data to be recognized.
  19. The storage media according to claim 16, wherein using the Filter Bank features as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and layer-dimension information, as performed by the processor, comprises:
    obtaining the Filter Bank features corresponding to each frame of the voice data to be recognized and sorting them by time;
    using the Filter Bank features of each frame, together with the Filter Bank features of a preset number of preceding and following frames, as input features to the trained two-dimensional LSTM model, performing computations in the time dimension and the layer dimension respectively, and obtaining the output posterior probability over phoneme states for each frame, containing time-dimension and layer-dimension information; and
    determining the posterior probability matrix of the voice data to be recognized from the posterior probability of each frame.
  20. The storage media according to claim 16, wherein, before the step of acquiring the voice data to be recognized, the processor, when executing the computer-executable instructions, further performs the following steps:
    training the GMM-HMM model on a training corpus and determining the variances and means of the GMM-HMM model through repeated iterative training;
    generating the trained GMM-HMM model from the variances and means;
    obtaining the likelihood probability matrix corresponding to the training corpus from the trained GMM-HMM model using the MFCC features extracted from the training corpus;
    training the two-dimensional LSTM model on the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrices and bias matrices of the two-dimensional LSTM model; and
    generating the trained two-dimensional LSTM model from the weight matrices and bias matrices.
PCT/CN2017/100049 2017-06-12 2017-08-31 Voice recognition method, apparatus, computer device and storage medium WO2018227781A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710438772.7A CN107331384B (zh) 2017-06-12 2017-06-12 Voice recognition method, apparatus, computer device and storage medium
CN201710438772.7 2017-06-12

Publications (1)

Publication Number Publication Date
WO2018227781A1 true WO2018227781A1 (zh) 2018-12-20

Family

ID=60194261

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/100049 WO2018227781A1 (zh) 2017-06-12 2017-08-31 Voice recognition method, apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN107331384B (zh)
WO (1) WO2018227781A1 (zh)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993651B * 2017-12-29 2021-01-19 深圳和而泰数据资源与云技术有限公司 Speech recognition method, apparatus, electronic device and storage medium
CN108154371A * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 Electronic device, identity verification method and storage medium
CN108319666B * 2018-01-19 2021-09-28 国网浙江省电力有限公司营销服务中心 Power supply service evaluation method based on multimodal public-opinion analysis
CN108417207B * 2018-01-19 2020-06-30 苏州思必驰信息科技有限公司 Adaptive method and system for a deep hybrid generative network
CN108492820B * 2018-03-20 2021-08-10 华南理工大学 Chinese speech recognition method based on a recurrent neural network language model and a deep neural network acoustic model
CN110491388A * 2018-05-15 2019-11-22 视联动力信息技术股份有限公司 Audio data processing method and terminal
CN108831445A * 2018-05-21 2018-11-16 四川大学 Sichuan dialect recognition method, acoustic model training method, apparatus and device
CN108694951B * 2018-05-22 2020-05-22 华南理工大学 Speaker identification method based on multi-stream hierarchical fusion of transformed features and long short-term memory networks
CN108805224B * 2018-05-28 2021-10-01 中国人民解放军国防科技大学 Multi-symbol hand-drawn sketch recognition method and apparatus with continual learning capability
CN109308912B * 2018-08-02 2024-02-20 平安科技(深圳)有限公司 Music style recognition method, apparatus, computer device and storage medium
CN109830277B * 2018-12-12 2024-03-15 平安科技(深圳)有限公司 Rope-skipping monitoring method, electronic device and storage medium
CN109559749B * 2018-12-24 2021-06-18 思必驰科技股份有限公司 Joint decoding method and system for speech recognition systems
CN109657874A * 2018-12-29 2019-04-19 安徽数升数据科技有限公司 Medium- and long-term power load forecasting method based on a long short-term memory model
CN109637524A * 2019-01-18 2019-04-16 徐州工业职业技术学院 Artificial intelligence interaction method and artificial intelligence interaction apparatus
CN109887484B * 2019-02-22 2023-08-04 平安科技(深圳)有限公司 Speech recognition and speech synthesis method and apparatus based on dual learning
CN110053055A * 2019-03-04 2019-07-26 平安科技(深圳)有限公司 Robot, method for answering questions therewith, and storage medium
CN110033758B * 2019-04-24 2021-09-24 武汉水象电子科技有限公司 Voice wake-up implementation method based on optimizing the decoding network with a small training set
CN110047468B * 2019-05-20 2022-01-25 北京达佳互联信息技术有限公司 Speech recognition method, apparatus and storage medium
CN110556125B * 2019-10-15 2022-06-10 出门问问信息科技有限公司 Feature extraction method and device based on speech signals, and computer storage medium
CN110992929A * 2019-11-26 2020-04-10 苏宁云计算有限公司 Neural-network-based spoken keyword detection method, apparatus and system
CN110929804B * 2019-12-03 2024-04-09 无限极(中国)有限公司 Method, apparatus, device and medium for identifying the origin of cultivated products
CN111698552A * 2020-05-15 2020-09-22 完美世界(北京)软件科技发展有限公司 Video resource generation method and apparatus
CN112435653A * 2020-10-14 2021-03-02 北京地平线机器人技术研发有限公司 Speech recognition method, apparatus and electronic device
CN112750428A * 2020-12-29 2021-05-04 平安普惠企业管理有限公司 Voice interaction method, apparatus and computer device
CN113643692B * 2021-03-25 2024-03-26 河南省机械设计研究院有限公司 Machine-learning-based PLC speech recognition method
CN113643718B * 2021-08-16 2024-06-18 贝壳找房(北京)科技有限公司 Audio data processing method and apparatus
CN113763960B * 2021-11-09 2022-04-26 深圳市友杰智新科技有限公司 Post-processing method and apparatus for model output, and computer device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105206258A * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Acoustic model generation method and apparatus, and speech synthesis method and apparatus
CN105976812A * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 Speech recognition method and device
CN106557809A * 2015-09-30 2017-04-05 富士通株式会社 Neural network system and method for training the neural network system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10540957B2 (en) * 2014-12-15 2020-01-21 Baidu Usa Llc Systems and methods for speech transcription
CN105810192B * 2014-12-31 2019-07-02 展讯通信(上海)有限公司 Speech recognition method and system
CN104900232A * 2015-04-20 2015-09-09 东南大学 Isolated word recognition method based on a two-layer GMM structure and VTS feature compensation
CN105931633A * 2016-05-30 2016-09-07 深圳市鼎盛智能科技有限公司 Speech recognition method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106557809A * 2015-09-30 2017-04-05 富士通株式会社 Neural network system and method for training the neural network system
CN105206258A * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Acoustic model generation method and apparatus, and speech synthesis method and apparatus
CN105976812A * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 Speech recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HSU, WEI-NING ET AL.: "A prioritized grid long short-term memory RNN for speech recognition", IEEE PROC. 2016 SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 13 December 2016 (2016-12-13) - 16 December 2016 (2016-12-16), San Diego, California, pages 467 - 473, XP033061780, DOI: 10.1109/SLT.2016.7846305 *
LI, JINYU ET AL.: "Exploring multidimensional LSTMs for large vocabulary ASR", IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2016, pages 4940 - 4944, XP032901543, DOI: 10.1109/ICASSP.2016.7472617 *

Also Published As

Publication number Publication date
CN107331384A (zh) 2017-11-07
CN107331384B (zh) 2018-05-04

Similar Documents

Publication Publication Date Title
WO2018227781A1 (zh) Voice recognition method, apparatus, computer device and storage medium
WO2018227780A1 (zh) Voice recognition method, apparatus, computer device and storage medium
WO2021208287A1 (zh) Voice endpoint detection method and apparatus for emotion recognition, electronic device, and storage medium
US11514891B2 (en) Named entity recognition method, named entity recognition equipment and medium
US11030998B2 (en) Acoustic model training method, speech recognition method, apparatus, device and medium
WO2021093449A1 (zh) Artificial-intelligence-based wake-up word detection method, apparatus, device and medium
US11875775B2 (en) Voice conversion system and training method therefor
CN111312245B (zh) Voice response method, apparatus and storage medium
CN110246488B (zh) Voice conversion method and apparatus based on a semi-optimized CycleGAN model
WO2020029404A1 (zh) Speech processing method and apparatus, computer apparatus and readable storage medium
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
CN109377981B (zh) Phoneme alignment method and apparatus
CN114550703A (zh) Training method and apparatus for a speech recognition system, and speech recognition method and apparatus
Peguda et al. Speech to sign language translation for Indian languages
Singh et al. An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning
CN113823265A (zh) Speech recognition method, apparatus and computer device
CN113539239B (zh) Voice conversion method, apparatus, storage medium and electronic device
Kurian et al. Connected digit speech recognition system for Malayalam language
Hao et al. Denoi-spex+: a speaker extraction network based speech dialogue system
Vyas et al. Study of Speech Recognition Technology and its Significance in Human-Machine Interface
Abudubiyaz et al. The acoustical and language modeling issues on Uyghur speech recognition
Zou et al. End to End Speech Recognition Based on ResNet-BLSTM
Wang et al. Artificial Intelligence and Machine Learning Application in NPP MCR Speech Monitoring System
Mendiratta et al. Robust feature extraction and recognition model for automatic speech recognition system on news report dataset
Huang et al. A speaker recognition method based on GMM using non-negative matrix factorization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17914080

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17914080

Country of ref document: EP

Kind code of ref document: A1