WO2018227781A1 - Speech recognition method, apparatus, computer device and storage medium - Google Patents

Speech recognition method, apparatus, computer device and storage medium

Info

Publication number
WO2018227781A1
Authority
WO
WIPO (PCT)
Prior art keywords
probability matrix
feature
trained
voice data
filter bank
Prior art date
Application number
PCT/CN2017/100049
Other languages
English (en)
Chinese (zh)
Inventor
梁浩
王健宗
程宁
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2018227781A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 - Hidden Markov Models [HMMs]
    • G10L 15/144 - Training of HMMs
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the field of computer processing, and in particular, to a voice recognition method, apparatus, computer device, and storage medium.
  • Speech recognition is also known as Automatic Speech Recognition (ASR).
  • speech recognition technology is the premise of natural language processing, and can effectively promote the development of voice-activated interaction related fields and greatly facilitate people's lives, such as smart home and voice input.
  • the accuracy of speech recognition directly determines the effectiveness of technical applications.
  • The traditional speech recognition technology is based on GMM-HMM (a Gaussian mixture model combined with a hidden Markov model), while more recent approaches are based on DNN-HMM (a deep learning model combined with a hidden Markov model).
  • According to embodiments of the present application, a speech recognition method, apparatus, computer device, and storage medium are provided.
  • A speech recognition method comprising: acquiring voice data to be recognized; extracting a Filter Bank feature and an MFCC feature from the voice data; using the MFCC feature as input data of a trained GMM-HMM model and acquiring a first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as an input feature of a trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining an output posterior probability matrix containing time-dimension and hierarchical-dimension information; calculating a target likelihood probability matrix using a trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and acquiring, in a phoneme decoding network, a target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  • a speech recognition device comprising:
  • An acquiring module configured to acquire voice data to be identified
  • An extraction module configured to extract a Filter Bank feature and an MFCC feature in the voice data
  • An output module configured to use the MFCC feature as input data of the trained GMM-HMM model, and obtain a first likelihood probability matrix of the output of the trained GMM-HMM model;
  • A first calculation module configured to use the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, perform calculations in the time dimension and the hierarchical dimension respectively, and obtain an output posterior probability matrix containing time-dimension and hierarchical-dimension information;
  • a second calculating module configured to calculate a target likelihood probability matrix by using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix
  • a decoding module configured to acquire, in the phoneme decoding network, a target word sequence corresponding to the to-be-identified voice data according to the target likelihood probability matrix.
  • A computer device comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of: acquiring voice data to be recognized; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as input data of the trained GMM-HMM model and acquiring the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as the input feature of the trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and hierarchical-dimension information; calculating the target likelihood probability matrix using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and acquiring, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  • One or more non-volatile readable storage media storing computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of: acquiring voice data to be recognized; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as input data of the trained GMM-HMM model and acquiring the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as the input feature of the trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and hierarchical-dimension information; calculating the target likelihood probability matrix using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and acquiring, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  • FIG. 1 is a block diagram showing the internal structure of a computer device in an embodiment
  • FIG. 3 is a flow chart of a voice recognition method in an embodiment
  • FIG. 4 is a schematic structural diagram of a two-dimensional LSTM in one embodiment.
  • FIG. 5 is a flowchart of a method for calculating a target likelihood probability matrix by using a trained HMM model according to a posterior probability matrix and a first likelihood probability matrix in one embodiment
  • FIG. 6 is a flow chart of a method for extracting Filter Bank features and MFCC features in voice data in an embodiment
  • FIG. 7 is a flow chart of a method for obtaining a posterior probability matrix by a two-dimensional LSTM model in one embodiment
  • FIG. 8 is a flow chart of a method for establishing a GMM-HMM model and a two-dimensional LSTM model in one embodiment
  • Figure 9 is a block diagram showing the structure of a voice recognition apparatus in an embodiment
  • FIG. 10 is a structural block diagram of a first computing module in an embodiment
  • Figure 11 is a block diagram showing the structure of a voice recognition apparatus in another embodiment.
  • FIG. 1 is a schematic diagram of the internal structure of a computer device in one embodiment.
  • the computer device can be a terminal or a server.
  • The computer device includes a processor, a non-volatile storage medium, an internal memory, a network interface, a display screen, and an input device connected through a system bus.
  • the non-volatile storage medium of the computer device can store an operating system and computer readable instructions that, when executed, cause the processor to perform a speech recognition method.
  • the processor of the computer device is used to provide computing and control capabilities to support the operation of the entire computer device.
  • the internal memory can store computer readable instructions that, when executed by the processor, cause the processor to perform a speech recognition method.
  • the network interface of the computer device is used for network communication.
  • the display screen of the computer device may be a liquid crystal display or an electronic ink display screen
  • The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the computer device casing, or an external keyboard, touchpad, or mouse.
  • the touch layer and display form a touch screen.
  • Speech recognition mainly consists of two parts, an acoustic model and a language model, which are combined with a dictionary to form the speech recognition framework.
  • the process of speech recognition is the process of converting a sequence of input speech features into a sequence of characters based on a dictionary, an acoustic model, and a language model.
  • the role of the acoustic model is to obtain the mapping between phonetic features and phonemes.
  • the role of the language model is to obtain the mapping between words and words, words and sentences.
  • the role of the dictionary is to obtain the mapping between words and phonemes.
  • the process of specific speech recognition can be divided into three steps.
  • The first step is to map the speech frames to phoneme states, that is, to align the speech frames with the phoneme states.
  • The second step is to combine the phoneme states into phonemes.
  • the third step is to combine the phonemes into words.
  • The first step is performed by the acoustic model; it is both the key point and the main difficulty. The more accurate the alignment of speech frames and phoneme states, the better the effect of speech recognition.
  • The phoneme state is a finer-grained unit than the phoneme; usually one phoneme is composed of three phoneme states.
  • a voice recognition method is proposed, which can be applied to a terminal or a server, and specifically includes the following steps:
  • Step 302 Acquire voice data to be identified.
  • The voice data to be recognized here is usually audio data input by the user and obtained through an interactive application, including audio of digits and audio of text.
  • step 304 the Filter Bank feature and the MFCC feature in the voice data are extracted.
  • The Filter Bank feature and the MFCC (Mel-frequency cepstral coefficient) feature are parameters used in speech recognition to represent speech characteristics; Filter Bank features are typically used with deep learning models, while MFCC features are used with Gaussian mixture models.
  • First, pre-emphasis processing is performed on the input voice data: a high-pass filter is used to boost the high-frequency portion of the speech signal's spectrum. The pre-emphasized voice data is then framed and windowed, converting the non-stationary speech signal into short-term stationary segments, and endpoint detection is applied to distinguish speech from noise and extract the valid speech portion.
  • Then, the preprocessed speech data is subjected to a fast Fourier transform, converting the time-domain speech signal into a frequency-domain energy spectrum for analysis. The energy spectrum is passed through a set of Mel-scale triangular filter banks, which highlight the formant characteristics of the speech, and the logarithmic energy output by each filter bank is calculated; these log energies are the Filter Bank features. Finally, a discrete cosine transform is applied to the calculated logarithmic energies to obtain the MFCC coefficients, that is, the MFCC features. A minimal end-to-end sketch of this pipeline follows.
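Below is a minimal NumPy sketch of the extraction pipeline just described (pre-emphasis, framing and windowing, FFT, Mel filter banks, log energy, DCT). Endpoint detection is omitted, and every parameter value (sample rate, frame length, filter count, and so on) is an illustrative assumption rather than a value from the patent.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(num_filters, nfft, sr):
    """Triangular filters spaced evenly on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), num_filters + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fb = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(1, num_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def fbank_and_mfcc(signal, sr=16000, frame_len=400, hop=160,
                   nfft=512, num_filters=26, num_ceps=13):
    # Pre-emphasis: high-pass filtering boosts the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing -> short-term stationary segments.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # FFT and squared magnitude -> per-frame energy (power) spectrum.
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Mel-scale triangular filter banks + log -> Filter Bank features.
    fbank = np.log(power @ mel_filterbank(num_filters, nfft, sr).T + 1e-10)
    # DCT decorrelates the Filter Bank features -> MFCC features.
    mfcc = dct(fbank, type=2, axis=1, norm='ortho')[:, :num_ceps]
    return fbank, mfcc

sig = np.random.randn(16000)           # stand-in for one second of speech
fbank, mfcc = fbank_and_mfcc(sig)
print(fbank.shape, mfcc.shape)         # (98, 26) (98, 13)
```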
  • Step 306 The MFCC feature is used as input data of the trained GMM-HMM model, and the first likelihood probability matrix of the output of the trained GMM-HMM model is obtained.
  • the acoustic model and the language model collectively realize the recognition of the voice.
  • the role of the acoustic model is to identify the alignment relationship between the speech frame and the phoneme state.
  • the GMM-HMM model is part of the acoustic model and is used to initially align the speech frame with the phoneme state.
  • The extracted MFCC feature of the voice data to be recognized is used as the input data of the trained GMM-HMM model, and the likelihood probability matrix output by the model is obtained; for convenience and to distinguish it from matrices introduced later, it is referred to as the "first" likelihood probability matrix.
  • The likelihood probability matrix represents the alignment relationship between speech frames and phoneme states; that is, the alignment can be obtained from the calculated likelihood probability matrix. However, the alignment obtained by GMM-HMM training is not very accurate, so the first likelihood probability matrix here is equivalent to a preliminary alignment of speech frames and phoneme states.
  • The specific calculation formula of the GMM model is as follows (the original formula is rendered as an image in the source; it is reconstructed here as the standard Gaussian mixture density consistent with the listed parameters):

  $$b(\mathbf{x}) = \sum_{m=1}^{M} c_m\,(2\pi)^{-K/2}\,\lvert D_m\rvert^{-1/2} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_m)^{\top} D_m^{-1}(\mathbf{x}-\boldsymbol{\mu}_m)\Big)$$

  • where x represents the extracted speech feature (MFCC) vector, c_m are the mixture weights, μ and D are the mean and the variance matrix of each component, respectively, and K represents the order of the MFCC coefficients. An illustrative evaluation follows.
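As a numeric companion to the density above, here is a hedged NumPy evaluation for a diagonal-covariance mixture; the diagonal form and all values are assumptions for illustration, not the patent's trained parameters.

```python
import numpy as np

def gmm_likelihood(x, weights, means, variances):
    """b(x) for a diagonal-covariance GMM; x is a K-dimensional vector."""
    K = x.shape[0]
    diff = x[None, :] - means                            # (M, K)
    expo = -0.5 * np.sum(diff ** 2 / variances, axis=1)  # Mahalanobis terms
    norm = (2 * np.pi) ** (-K / 2) / np.sqrt(np.prod(variances, axis=1))
    return float(np.sum(weights * norm * np.exp(expo)))

x = np.zeros(13)                     # a K=13 MFCC feature vector
w = np.array([0.5, 0.5])             # mixture weights c_m
mu = np.zeros((2, 13))               # component means
var = np.ones((2, 13))               # diagonal variances
print(gmm_likelihood(x, w, mu, var))
```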
  • Step 308: The Filter Bank feature is used as an input feature of the trained two-dimensional LSTM model; calculations are performed in the time dimension and the hierarchical dimension respectively, and the output posterior probability matrix containing time-dimension and hierarchical-dimension information is obtained.
  • the LSTM model belongs to the deep learning model and is also part of the acoustic model.
  • Two-dimensional LSTM is an innovative model based on the traditional LSTM model, which includes not only the time dimension but also the hierarchical dimension. Therefore, the model has a better recognition effect than the traditional LSTM model.
  • With the Filter Bank feature as the input feature of the trained two-dimensional LSTM model, the use of the time dimension and the hierarchical dimension is achieved by computing both dimensions over the same input (the speech features) and finally combining the results.
  • Specifically, the time dimension is calculated first, and its output then serves as the input to the hierarchical dimension.
  • In this way, each LSTM neuron node carries both time and hierarchical information.
  • FIG. 4 is a schematic structural diagram of a two-dimensional LSTM in one embodiment, where t represents the time step, l represents the layer, T refers to the time-dimension LSTM (TimeLSTM), and D refers to the hierarchical-dimension LSTM (DepthLSTM); the output of each node combines the results of the two dimensions. A sketch of this structure follows.
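The patent does not include model code; the following PyTorch sketch is one plausible reading of the time-plus-depth structure of FIG. 4, with a TimeLSTM running along frames within each layer and a DepthLSTM running across the stack of layer outputs at each frame. The class name, all sizes, and the fusion scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoDimLSTM(nn.Module):
    def __init__(self, feat_dim, hidden, layers, num_states):
        super().__init__()
        # One TimeLSTM per layer: runs over frames (the time dimension).
        self.time_lstms = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
             for i in range(layers)])
        # DepthLSTM: runs over the layer axis at each frame (the hierarchy).
        self.depth_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_states)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        layer_outs = []
        h = x
        for lstm in self.time_lstms:           # time dimension, layer by layer
            h, _ = lstm(h)
            layer_outs.append(h)
        stack = torch.stack(layer_outs, dim=2) # (batch, frames, layers, hidden)
        b, t, l, d = stack.shape
        depth_in = stack.reshape(b * t, l, d)  # depth runs over the layer axis
        _, (h_n, _) = self.depth_lstm(depth_in)
        fused = h_n[-1].reshape(b, t, d)       # combines time + depth info
        return torch.log_softmax(self.out(fused), dim=-1)  # per-state posterior

model = TwoDimLSTM(feat_dim=40, hidden=128, layers=3, num_states=9)
posterior = model(torch.randn(2, 11, 40))      # e.g. 11 spliced frames
print(posterior.shape)                         # (2, 11, 9)
```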
  • Step 310: According to the posterior probability matrix and the first likelihood probability matrix, apply the trained HMM model to calculate a target likelihood probability matrix.
  • The HMM (Hidden Markov Model) is a statistical model used to describe a Markov process with hidden, unknown parameters; its role is to determine the hidden parameters of the process from the observable parameters.
  • the HMM model mainly involves five parameters, which are two state sets and three probability sets.
  • the two state sets are the hidden state and the observed state, and the three probability sets are the initial matrix, the transition matrix and the confusion matrix.
  • The transition matrix is obtained by training; that is, once training of the HMM model is completed, the transition matrix is determined.
  • The observable speech features (the Filter Bank features) are mainly used as the observation states to calculate the correspondence between phoneme states (i.e., the hidden states) and speech frames.
  • the posterior probability matrix calculated by the two-dimensional LSTM model is the confusion matrix to be determined in the HMM model
  • the first likelihood probability matrix is the initial matrix to be determined. Therefore, using the posterior probability matrix and the first likelihood probability matrix as the input data of the trained HMM model, the target likelihood probability matrix of the output can be obtained.
  • the target likelihood probability matrix represents the final alignment relationship between the phoneme state and the speech frame. Subsequently, according to the determined target likelihood probability matrix, the target word sequence corresponding to the voice data to be recognized can be acquired in the phoneme decoding network.
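To make the role of these matrices concrete, the following NumPy sketch treats the LSTM posterior matrix as the HMM emission ("confusion") matrix and scores frame/state alignments with the standard forward algorithm. The patent does not spell out the exact combination of the matrices, so this interpretation is an assumption for illustration only.

```python
import numpy as np

def forward_scores(emission, transition, initial):
    """emission: (frames, states); transition: (states, states);
    initial: (states,). Returns scaled per-frame forward probabilities."""
    T, S = emission.shape
    alpha = np.zeros((T, S))
    alpha[0] = initial * emission[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ transition) * emission[t]
        alpha[t] /= alpha[t].sum()            # rescale to avoid underflow
    return alpha

rng = np.random.default_rng(0)
em = rng.random((11, 9)); em /= em.sum(1, keepdims=True)   # dummy posteriors
tr = rng.random((9, 9));  tr /= tr.sum(1, keepdims=True)   # dummy transitions
init = np.full(9, 1 / 9)                                   # uniform start
print(forward_scores(em, tr, init).shape)                  # (11, 9)
```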
  • Step 312 Acquire a target word sequence corresponding to the voice data to be recognized in the phoneme decoding network according to the target likelihood probability matrix.
  • The speech recognition process includes two parts: an acoustic model and a language model.
  • The search algorithm can use the Viterbi algorithm; the resulting path outputs, with maximum probability, the word string corresponding to the voice data to be recognized, thereby determining the text contained in the voice data.
  • The decoding network at the phoneme level (i.e., the phoneme decoding network) is implemented using Finite State Transducer (FST) algorithms, such as determinization and minimization. A sentence is divided into words, the words are split into phonemes (such as Chinese initials and finals, or English phonetic symbols), and the phonemes are then combined with the pronunciation dictionary, grammar, and the like, aligned by the method described above, to obtain the output phoneme decoding network.
  • The phoneme decoding network contains representations of all possible recognition paths. Decoding prunes the paths of this huge network based on the input voice data, yielding one or more candidate paths that are stored in a word-lattice data structure; the final recognition step scores the candidate paths, and the path with the highest score is the recognition result. A toy sketch of this best-path search follows.
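The maximum-probability path search mentioned above can be illustrated with a plain Viterbi sketch over a toy state graph; a real phoneme decoding network is a large WFST with pruning, which this sketch does not attempt to model.

```python
import numpy as np

def viterbi(log_emission, log_transition, log_initial):
    """Return the single best state path through a toy decoding graph."""
    T, S = log_emission.shape
    delta = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_initial + log_emission[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_transition   # (S, S) path scores
        back[t] = np.argmax(cand, axis=0)               # best predecessor
        delta[t] = cand[back[t], np.arange(S)] + log_emission[t]
    path = [int(np.argmax(delta[-1]))]                  # backtrace
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
log_em = np.log(rng.random((11, 9)))       # dummy per-frame state scores
log_tr = np.log(rng.random((9, 9)))        # dummy transition scores
log_init = np.full(9, -np.log(9))          # uniform start
print(viterbi(log_em, log_tr, log_init))   # best state index per frame
```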
  • The above speech recognition method combines the Gaussian mixture model (GMM) with the long short-term memory (LSTM) recurrent neural network from deep learning. The GMM-HMM model first calculates the first likelihood probability matrix from the extracted MFCC features; this matrix represents a preliminary alignment of the voice data to phoneme states. The LSTM then performs further alignment based on this preliminary result, which helps improve the accuracy of speech recognition. Moreover, the LSTM used is an innovative two-dimensional LSTM that contains both time-dimension and hierarchical-dimension information; compared with the traditional LSTM, which carries only time-dimension information, it provides a better speech feature representation, further improving the accuracy of speech recognition.
  • the step 310 of calculating the target likelihood probability matrix by using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix comprises:
  • Step 310A The Filter Bank feature and the first likelihood probability matrix are used as input data of the trained DNN-HMM model, and the second likelihood probability matrix of the trained DNN-HMM output is obtained.
  • Step 310B The posterior probability matrix and the second likelihood probability matrix are used as input data of the trained HMM model, and the target likelihood probability matrix is calculated.
  • In this embodiment, a preliminary alignment result (the first likelihood probability matrix) is obtained through the trained GMM-HMM model, and the trained DNN-HMM then performs further alignment to obtain a better result; since a deep neural network can obtain a better speech feature representation than the traditional Gaussian mixture model, using it for further forced alignment improves accuracy. The refined alignment result (the second likelihood probability matrix) is then fed into the innovative two-dimensional LSTM-HMM model to obtain the final alignment result (the target likelihood probability matrix). It should be noted that the alignment result here refers to the alignment relationship between speech frames and phoneme states.
  • The above Gaussian mixture model and deep learning models are parts of the acoustic model, whose function is to obtain the alignment relationship between speech frames and phoneme states, so that the target word sequence corresponding to the voice data to be recognized can subsequently be acquired in the phoneme decoding network.
  • In this embodiment, the GMM-HMM model is first used to calculate the first likelihood probability matrix from the extracted MFCC features; this matrix represents the preliminary alignment of speech frames to phoneme states. The DNN-HMM model then performs further alignment on this basis, and the LSTM performs the final alignment step based on the previous result. Combining the GMM-HMM, DNN-HMM, and LSTM models improves speech recognition, and the LSTM used is an innovative two-dimensional LSTM containing both time-dimension and hierarchical-dimension information; compared with the traditional LSTM with only time-dimension information, it provides a better speech feature representation, which helps further improve the effect of speech recognition.
  • the step 304 of extracting the Filter Bank feature and the MFCC feature in the voice data includes:
  • Step 304A: Transform the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform.
  • Since the characteristics of a speech signal are generally difficult to observe from its time-domain waveform, the signal is usually converted into an energy distribution in the frequency domain for observation; different energy distributions represent different speech characteristics. Therefore, the voice data to be recognized is subjected to a fast Fourier transform to obtain its energy distribution over the spectrum.
  • Specifically, the spectrum of each frame is obtained by performing a fast Fourier transform on each frame of the speech signal, and the power spectrum (i.e., the energy spectrum) is obtained by taking the squared magnitude of the spectrum.
  • Step 304B: The frequency-domain energy spectrum is used as the input of the Mel-scale triangular filter banks, and the Filter Bank feature of the voice data to be recognized is calculated.
  • Specifically, the obtained frequency-domain energy spectrum is fed to the Mel-scale triangular filter banks, and the logarithmic energy output by each triangular filter bank is calculated, yielding the Filter Bank feature of the voice data to be recognized.
  • The energy spectrum corresponding to each frame of the speech signal is used as the input of the Mel-scale triangular filter banks, so a Filter Bank feature is obtained for each frame of the speech signal.
  • step 304C the Filter Bank feature is subjected to discrete cosine transform to obtain the MFCC feature of the voice data to be recognized.
  • In order to obtain the MFCC feature of the voice data to be recognized, a discrete cosine transform is further applied to the logarithmic energies output by the filter banks, yielding the corresponding MFCC feature.
  • the MFCC feature corresponding to each frame of the speech signal is obtained by discrete Cosine transforming the Filter Bank feature corresponding to each frame of the speech signal.
  • the difference between the Filter Bank feature and the MFCC feature is that the Filter Bank feature has data correlation between different feature dimensions, while the MFCC feature is a feature obtained by using discrete cosine transform to remove the data correlation of the Filter Bank feature.
  • In one embodiment, step 308, in which the Filter Bank feature is used as an input feature of the trained two-dimensional LSTM model, calculations are performed in the time dimension and the hierarchical dimension respectively, and the output posterior probability matrix containing time-dimension and hierarchical-dimension information is obtained, includes:
  • Step 308A Acquire a Filter Bank feature corresponding to each frame of voice data in the to-be-identified voice data, and sort according to time.
  • the voice data is first framed, and then the Filter Bank features corresponding to each frame of the voice data are extracted, and sorted according to the time sequence. That is, the Filter Bank features of each frame are sorted according to the order in which each frame of the speech data to be recognized appears.
  • Step 308B: Use the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after it, as input features of the trained two-dimensional LSTM model; perform calculations in the time dimension and the hierarchical dimension respectively, and obtain the posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information.
  • In this embodiment, the input of the deep learning model uses multi-frame features, which is more advantageous than the single-frame input of the traditional Gaussian mixture model, because the frames before and after a speech frame carry context-related information that influences the current frame. Therefore, the Filter Bank feature of each frame of voice data, together with a preset number of frames before and after it, is generally used as the input feature of the trained two-dimensional LSTM model. For example, the current frame is spliced with the five frames before and after it, and the resulting 11 frames of data are used as input features; each node in the two-dimensional LSTM processes this 11-frame speech feature sequence and outputs the posterior probability of the phoneme state corresponding to the speech data. A minimal sketch of this splicing follows.
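A minimal NumPy sketch of the frame splicing described above (five frames of left and right context, 11 frames in total); padding edge frames by repetition is a common convention assumed here, not specified in the patent.

```python
import numpy as np

def splice(features, left=5, right=5):
    """features: (frames, dim) -> (frames, (left + 1 + right) * dim)."""
    # Repeat the first/last frame so edge frames also get full context.
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    T = features.shape[0]
    # Column blocks are ordered from t-left ... t ... t+right.
    return np.concatenate([padded[i:i + T] for i in range(left + 1 + right)],
                          axis=1)

fbank = np.random.randn(100, 40)      # 100 frames of 40-dim Filter Bank
print(splice(fbank).shape)            # (100, 440) = 11 frames x 40 dims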
  • Step 308C Determine a posterior probability matrix corresponding to the to-be-identified voice data according to a posterior probability corresponding to each frame of voice data.
  • the posterior probability matrix corresponding to the to-be-identified voice data is determined.
  • The posterior probability matrix is composed of the individual posterior probabilities. Since the two-dimensional LSTM model contains both time-dimension and hierarchical-dimension information, it can obtain a better posterior probability matrix for the voice data than traditional models that carry only time-dimension information.
  • In one embodiment, before the step of acquiring the voice data to be recognized, the method further includes step 301: establishing the GMM-HMM model and the two-dimensional LSTM model. Specifically, this includes:
  • Step 301A: Train the GMM-HMM model using the training corpus; determine the variance and mean of the GMM-HMM model through continuous iterative training, and generate the trained GMM-HMM model according to the variance and the mean.
  • Specifically, the GMM-HMM acoustic model is established by first performing monophone training and then triphone training.
  • Triphone training takes into account the influence of the phonemes adjacent to the current phoneme, so it can achieve a more accurate alignment and produce better recognition results.
  • Triphone training generally proceeds in stages: triphone training based on delta+delta-delta features, followed by triphone training with linear discriminant analysis plus maximum likelihood linear feature transformation.
  • The speech features in the input training corpus are first normalized; by default, variance normalization is applied. The purpose of speech feature normalization is to eliminate deviations caused by the convolutional noise of the telephone channel and by the feature extraction computation. A sketch of one standard normalization scheme follows.
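One standard realization of such normalization is per-utterance cepstral mean and variance normalization (CMVN); the patent does not name its exact scheme, so the following NumPy sketch is an assumption for illustration.

```python
import numpy as np

def cmvn(features, norm_var=True):
    """features: (frames, dim). Zero-mean, optionally unit-variance."""
    out = features - features.mean(axis=0, keepdims=True)
    if norm_var:
        out /= features.std(axis=0, keepdims=True) + 1e-10
    return out

feats = np.random.randn(200, 13)                 # stand-in MFCC features
print(np.abs(cmvn(feats).mean(axis=0)).max())    # ~0: zero mean per dimension
```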
  • Specifically, an initial GMM-HMM model is first obtained quickly from a small amount of feature data, and the variance and mean of the GMM-HMM model are then determined through continuous iterative training; once the variance and mean are determined, the corresponding GMM-HMM model is determined. An illustrative sketch of this iterative estimation follows.
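The iterative estimation of variances and means can be illustrated with scikit-learn's GaussianMixture, which fits the parameters by EM; note that the patent's actual training interleaves GMM updates with HMM state alignments, which this toy example omits.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

feats = np.random.randn(1000, 13)          # stand-in MFCC training features
gmm = GaussianMixture(n_components=8, covariance_type='diag', max_iter=50)
gmm.fit(feats)                             # EM iterations fix means/variances
print(gmm.means_.shape, gmm.covariances_.shape)   # (8, 13) (8, 13)
```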
  • Step 301B: According to the MFCC features extracted from the training corpus, obtain the likelihood probability matrix corresponding to the training corpus using the trained GMM-HMM model.
  • Specifically, for the voice data in the training corpus, the MFCC features of the speech are extracted and used as input features of the trained GMM-HMM model, and the likelihood probability matrix corresponding to the speech in the training corpus is obtained as output; this likelihood probability matrix represents the alignment relationship between speech frames and phoneme states.
  • The likelihood probability matrix output by the trained GMM-HMM is used as the initial alignment for the subsequent training of the deep learning model, which helps the deep learning model achieve better results.
  • Step 301C: Train the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix; determine the weight matrix and offset matrix corresponding to the two-dimensional LSTM model, and generate the trained two-dimensional LSTM model according to the weight matrix and the offset matrix.
  • In this embodiment, the alignment result calculated by the GMM-HMM described above (i.e., the likelihood probability matrix) and the original speech features are used together as input features for training the two-dimensional LSTM model, where the original speech features are Filter Bank features; compared with MFCC features, Filter Bank features retain the data correlation between feature dimensions and therefore provide a better speech feature representation.
  • the weight matrix and offset matrix corresponding to each layer of LSTM are determined.
  • The two-dimensional LSTM also belongs to the family of deep neural network models, whose layers generally fall into three categories: an input layer, hidden layers, and an output layer.
  • The purpose of training the two-dimensional LSTM model is to determine all the weight matrices and offset matrices in each layer, together with the corresponding number of layers.
  • The training algorithm can use existing algorithms such as the forward propagation algorithm and the Viterbi algorithm; the training algorithm is not limited here.
  • a voice recognition apparatus comprising:
  • the obtaining module 902 is configured to acquire voice data to be identified.
  • the extraction module 904 is configured to extract the Filter Bank feature and the MFCC feature in the voice data.
  • the output module 906 is configured to use the MFCC feature as input data of the trained GMM-HMM model, and obtain a first likelihood probability matrix of the output of the trained GMM-HMM model.
  • the first calculation module 908 is configured to use the Filter Bank feature as an input feature of the trained two-dimensional LSTM model, and perform time dimension and hierarchical dimension calculation respectively, and obtain an output posterior probability matrix including time dimension and hierarchical dimension information. .
  • the second calculating module 910 is configured to calculate the target likelihood probability matrix by using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix.
  • The decoding module 912 is configured to acquire, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  • In one embodiment, the second calculating module 910 is further configured to use the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model to obtain the second likelihood probability matrix output by the trained DNN-HMM, and to use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model to calculate the target likelihood probability matrix.
  • In one embodiment, the extraction module 904 is further configured to transform the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform, use the frequency-domain energy spectrum as the input of the Mel-scale triangular filter banks to calculate the Filter Bank feature of the voice data to be recognized, and perform a discrete cosine transform on the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • the first computing module 908 includes:
  • the sorting module 908A is configured to acquire Filter Bank features corresponding to each frame of voice data in the to-be-identified voice data and sort them by time.
  • The posterior probability calculation module 908B is configured to use the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after it, as input features of the trained two-dimensional LSTM model, perform calculations in the time dimension and the hierarchical dimension respectively, and obtain the posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information.
  • the determining module 908C is configured to determine a posterior probability matrix corresponding to the to-be-identified voice data according to a posterior probability corresponding to each frame of voice data.
  • the voice recognition apparatus further includes:
  • the GMM-HMM model training module 914 is used to train the GMM-HMM model by using the training corpus, and determine the variance and mean of the GMM-HMM model through continuous iterative training, and generate the trained GMM-HMM model according to the variance and the mean.
  • the likelihood probability matrix obtaining module 916 is configured to obtain the likelihood probability matrix corresponding to the training corpus according to the MFCC feature extracted from the training corpus and the trained GMM-HMM model.
  • The two-dimensional LSTM model training module 918 is configured to train the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, determine the weight matrix and offset matrix corresponding to the two-dimensional LSTM model, and generate the trained two-dimensional LSTM model according to the weight matrix and the offset matrix.
  • the network interface may be an Ethernet card or a wireless network card.
  • The above modules may be embedded in hardware form in the processor of the server, or stored in software form in the memory of the server, so that the processor can invoke and perform the operations corresponding to the above modules.
  • the processor can be a central processing unit (CPU), a microprocessor, a microcontroller, or the like.
  • a computer device is proposed.
  • The internal structure of the computer device may correspond to the structure shown in FIG. 1; that is, the computer device may be a server or a terminal, and includes a memory, a processor, and a computer program stored on the memory and operable on the processor. When executing the computer program, the processor implements the following steps: acquiring voice data to be recognized; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as input data of the trained GMM-HMM model and acquiring the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as the input feature of the trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and hierarchical-dimension information; calculating the target likelihood probability matrix using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and acquiring, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  • In one embodiment, when the processor performs the step of calculating the target likelihood probability matrix using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix, the step includes: using the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model to obtain the second likelihood probability matrix output by the trained DNN-HMM; and using the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model to calculate the target likelihood probability matrix.
  • In one embodiment, when the processor performs the step of extracting the Filter Bank feature and the MFCC feature from the voice data, the step comprises: transforming the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform; using the frequency-domain energy spectrum as the input of the Mel-scale triangular filter banks and calculating the Filter Bank feature of the voice data to be recognized; and performing a discrete cosine transform on the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • In one embodiment, when the processor performs the step of using the Filter Bank feature as the input feature of the trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and hierarchical-dimension information, the step includes: acquiring the Filter Bank features corresponding to each frame of voice data in the voice data to be recognized and sorting them by time; using the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after it, as input features of the trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining the posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information; and determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
  • In one embodiment, when executing the computer program, the processor further implements the following steps: training the GMM-HMM model using a training corpus and determining the variance and mean of the GMM-HMM model through continuous iterative training; generating the trained GMM-HMM model according to the variance and the mean; obtaining the likelihood probability matrix corresponding to the training corpus according to the MFCC features extracted from the training corpus, using the trained GMM-HMM model; training the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrix and offset matrix corresponding to the two-dimensional LSTM model; and generating the trained two-dimensional LSTM model according to the weight matrix and the offset matrix.
  • In one embodiment, a computer-readable storage medium is provided, on which computer instructions are stored; when executed by a processor, the instructions implement the following steps: acquiring voice data to be recognized; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as input data of the trained GMM-HMM model and acquiring the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as the input feature of the trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and hierarchical-dimension information; calculating the target likelihood probability matrix using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix; and acquiring, in the phoneme decoding network, the target word sequence corresponding to the voice data to be recognized according to the target likelihood probability matrix.
  • In one embodiment, when the instructions cause the processor to perform the step of calculating the target likelihood probability matrix using the trained HMM model according to the posterior probability matrix and the first likelihood probability matrix, the step includes: using the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model to obtain the second likelihood probability matrix output by the trained DNN-HMM; and using the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model to calculate the target likelihood probability matrix.
  • In one embodiment, when the instructions cause the processor to perform the step of extracting the Filter Bank feature and the MFCC feature from the voice data, the step comprises: transforming the voice data to be recognized into a frequency-domain energy spectrum by Fourier transform; using the frequency-domain energy spectrum as the input of the Mel-scale triangular filter banks and calculating the Filter Bank feature of the voice data to be recognized; and performing a discrete cosine transform on the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
  • In one embodiment, when the instructions cause the processor to perform the step of using the Filter Bank feature as the input feature of the trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining the output posterior probability matrix containing time-dimension and hierarchical-dimension information, the step includes: acquiring the Filter Bank features corresponding to each frame of voice data in the voice data to be recognized and sorting them by time; using the Filter Bank feature of each frame of voice data, together with the Filter Bank features of the preset number of frames before and after it, as input features of the trained two-dimensional LSTM model, performing calculations in the time dimension and the hierarchical dimension respectively, and obtaining the posterior probability of the phoneme state corresponding to each frame of voice data, containing time-dimension and hierarchical-dimension information; and determining the posterior probability matrix corresponding to the voice data to be recognized according to the posterior probability corresponding to each frame of voice data.
  • In one embodiment, when the computer instructions are executed by the processor, the following steps are further implemented: training the GMM-HMM model using a training corpus and determining the variance and mean of the GMM-HMM model through continuous iterative training; generating the trained GMM-HMM model according to the variance and the mean; obtaining the likelihood probability matrix corresponding to the training corpus according to the MFCC features extracted from the training corpus, using the trained GMM-HMM model; training the two-dimensional LSTM model according to the Filter Bank features extracted from the training corpus and the likelihood probability matrix, and determining the weight matrix and offset matrix corresponding to the two-dimensional LSTM model; and generating the trained two-dimensional LSTM model according to the weight matrix and the offset matrix.
  • the storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a speech recognition method, the method comprising: acquiring voice data to be recognized; extracting a Filter Bank feature and an MFCC feature from the voice data; using the MFCC feature as input data of a GMM-HMM model and obtaining a first likelihood probability matrix; using the Filter Bank feature as the input feature of a two-dimensional LSTM model and obtaining a posterior probability matrix; using the posterior probability matrix and the first likelihood probability matrix as input data of an HMM model, obtaining a second likelihood probability matrix, and, according to the second likelihood probability matrix, obtaining the corresponding target word sequence from a phoneme decoding network.
PCT/CN2017/100049 2017-06-12 2017-08-31 Speech recognition method, apparatus, computer device and storage medium WO2018227781A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710438772.7A CN107331384B (zh) 2017-06-12 2017-06-12 语音识别方法、装置、计算机设备及存储介质
CN201710438772.7 2017-06-12

Publications (1)

Publication Number Publication Date
WO2018227781A1 true WO2018227781A1 (fr) 2018-12-20

Family

ID=60194261

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/100049 WO2018227781A1 (fr) 2017-06-12 2017-08-31 Speech recognition method, apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN107331384B (fr)
WO (1) WO2018227781A1 (fr)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993651B (zh) * 2017-12-29 2021-01-19 深圳和而泰数据资源与云技术有限公司 一种语音识别方法、装置、电子设备及存储介质
CN108154371A (zh) * 2018-01-12 2018-06-12 平安科技(深圳)有限公司 电子装置、身份验证的方法及存储介质
CN108319666B (zh) * 2018-01-19 2021-09-28 国网浙江省电力有限公司营销服务中心 一种基于多模态舆情分析的供电服务评估方法
CN108417207B (zh) * 2018-01-19 2020-06-30 苏州思必驰信息科技有限公司 一种深度混合生成网络自适应方法及***
CN108492820B (zh) * 2018-03-20 2021-08-10 华南理工大学 基于循环神经网络语言模型和深度神经网络声学模型的中文语音识别方法
CN110491388A (zh) * 2018-05-15 2019-11-22 视联动力信息技术股份有限公司 一种音频数据的处理方法和终端
CN108831445A (zh) * 2018-05-21 2018-11-16 四川大学 四川方言识别方法、声学模型训练方法、装置及设备
CN108694951B (zh) * 2018-05-22 2020-05-22 华南理工大学 一种基于多流分层融合变换特征和长短时记忆网络的说话人辨识方法
CN108805224B (zh) * 2018-05-28 2021-10-01 中国人民解放军国防科技大学 具备可持续学习能力的多符号手绘草图识别方法及装置
CN109308912B (zh) * 2018-08-02 2024-02-20 平安科技(深圳)有限公司 音乐风格识别方法、装置、计算机设备及存储介质
CN109830277B (zh) * 2018-12-12 2024-03-15 平安科技(深圳)有限公司 一种跳绳监测方法、电子装置及存储介质
CN109559749B (zh) * 2018-12-24 2021-06-18 思必驰科技股份有限公司 用于语音识别***的联合解码方法及***
CN109657874A (zh) * 2018-12-29 2019-04-19 安徽数升数据科技有限公司 一种基于长短时记忆模型的电力中长期负荷预测方法
CN109637524A (zh) * 2019-01-18 2019-04-16 徐州工业职业技术学院 一种人工智能交互方法及人工智能交互装置
CN109887484B (zh) * 2019-02-22 2023-08-04 平安科技(深圳)有限公司 一种基于对偶学习的语音识别与语音合成方法及装置
CN110053055A (zh) * 2019-03-04 2019-07-26 平安科技(深圳)有限公司 一种机器人及其回答问题的方法、存储介质
CN110033758B (zh) * 2019-04-24 2021-09-24 武汉水象电子科技有限公司 一种基于小训练集优化解码网络的语音唤醒实现方法
CN110047468B (zh) * 2019-05-20 2022-01-25 北京达佳互联信息技术有限公司 语音识别方法、装置及存储介质
CN110556125B (zh) * 2019-10-15 2022-06-10 出门问问信息科技有限公司 基于语音信号的特征提取方法、设备及计算机存储介质
CN110992929A (zh) * 2019-11-26 2020-04-10 苏宁云计算有限公司 一种基于神经网络的语音关键词检测方法、装置及***
CN110929804B (zh) * 2019-12-03 2024-04-09 无限极(中国)有限公司 一种栽培品产地识别方法、装置、设备及介质
CN111698552A (zh) * 2020-05-15 2020-09-22 完美世界(北京)软件科技发展有限公司 一种视频资源的生成方法和装置
CN112435653A (zh) * 2020-10-14 2021-03-02 北京地平线机器人技术研发有限公司 语音识别方法、装置和电子设备
CN112750428A (zh) * 2020-12-29 2021-05-04 平安普惠企业管理有限公司 语音交互方法、装置和计算机设备
CN113643692B (zh) * 2021-03-25 2024-03-26 河南省机械设计研究院有限公司 基于机器学习的plc语音识别方法
CN113643718B (zh) * 2021-08-16 2024-06-18 贝壳找房(北京)科技有限公司 音频数据处理方法和装置
CN113763960B (zh) * 2021-11-09 2022-04-26 深圳市友杰智新科技有限公司 模型输出的后处理方法、装置和计算机设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105206258A (zh) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 声学模型的生成方法和装置及语音合成方法和装置
CN105976812A (zh) * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 一种语音识别方法及其设备
CN106557809A (zh) * 2015-09-30 2017-04-05 富士通株式会社 神经网络***及对该神经网络***进行训练的方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10540957B2 (en) * 2014-12-15 2020-01-21 Baidu Usa Llc Systems and methods for speech transcription
CN105810192B (zh) * 2014-12-31 2019-07-02 展讯通信(上海)有限公司 语音识别方法及其***
CN104900232A (zh) * 2015-04-20 2015-09-09 东南大学 一种基于双层gmm结构和vts特征补偿的孤立词识别方法
CN105931633A (zh) * 2016-05-30 2016-09-07 深圳市鼎盛智能科技有限公司 语音识别的方法及***

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106557809A (zh) * 2015-09-30 2017-04-05 富士通株式会社 神经网络***及对该神经网络***进行训练的方法
CN105206258A (zh) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 声学模型的生成方法和装置及语音合成方法和装置
CN105976812A (zh) * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 一种语音识别方法及其设备

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HSU, WEI-NING ET AL.: "A prioritized grid long short-term memory RNN for speech recognition", IEEE PROC. 2016 SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 13 December 2016 (2016-12-13) - 16 December 2016 (2016-12-16), San Diego, California, pages 467 - 473, XP033061780, DOI: 10.1109/SLT.2016.7846305 *
LI, JINYU ET AL.: "Exploring multidimensional LSTMs for large vocabulary ASR", IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2016, pages 4940 - 4944, XP032901543, DOI: 10.1109/ICASSP.2016.7472617 *

Also Published As

Publication number Publication date
CN107331384B (zh) 2018-05-04
CN107331384A (zh) 2017-11-07

Similar Documents

Publication Publication Date Title
WO2018227781A1 (fr) Speech recognition method, apparatus, computer device and storage medium
WO2018227780A1 (fr) Speech recognition method, computer device and storage medium
US11514891B2 (en) Named entity recognition method, named entity recognition equipment and medium
WO2021208287A1 (fr) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US11030998B2 (en) Acoustic model training method, speech recognition method, apparatus, device and medium
WO2021093449A1 (fr) Artificial intelligence-based wake-up word detection method and apparatus, device, and medium
US11875775B2 (en) Voice conversion system and training method therefor
WO2021051544A1 (fr) Speech recognition method and device
CN111312245B (zh) 一种语音应答方法、装置和存储介质
CN110246488B (zh) 半优化CycleGAN模型的语音转换方法及装置
WO2020029404A1 (fr) Speech processing method and device, computer device, and readable storage medium
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
CN109377981B (zh) 音素对齐的方法及装置
CN114550703A (zh) 语音识别***的训练方法和装置、语音识别方法和装置
Peguda et al. Speech to sign language translation for Indian languages
Singh et al. An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning
CN113823265A (zh) 一种语音识别方法、装置和计算机设备
CN113539239B (zh) 语音转换方法、装置、存储介质及电子设备
Kurian et al. Connected digit speech recognition system for Malayalam language
Zou et al. End to End Speech Recognition Based on ResNet-BLSTM
Hao et al. Denoi-spex+: a speaker extraction network based speech dialogue system
Zhu et al. Continuous speech recognition based on DCNN-LSTM
Vyas et al. Study of Speech Recognition Technology and its Significance in Human-Machine Interface
Abudubiyaz et al. The acoustical and language modeling issues on Uyghur speech recognition
Wang et al. Artificial Intelligence and Machine Learning Application in NPP MCR Speech Monitoring System

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17914080

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17914080

Country of ref document: EP

Kind code of ref document: A1