WO2024001662A1 - Speech recognition method, apparatus, device, and storage medium - Google Patents

Speech recognition method, apparatus, device, and storage medium

Info

Publication number
WO2024001662A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
recognized
confidence
decoding
noise
Prior art date
Application number
PCT/CN2023/097748
Other languages
English (en)
French (fr)
Inventor
雪巍
彭毅
范璐
Original Assignee
京东科技信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 京东科技信息技术有限公司
Publication of WO2024001662A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0631 Creating reference templates; Clustering
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 2025/783 Detection of presence or absence of voice signals based on threshold decision

Definitions

  • This application relates to the field of speech processing technology, for example, to speech recognition methods, apparatuses, devices, and storage media.
  • Speech recognition has been widely used in fields such as intelligent customer service, smart homes, and in-car assistants. Speech recognition systems are usually affected by noise from the environment or the telephone channel, which easily leads to speech recognition errors. For example, when noise and speech segments do not coincide in time, insertion errors occur; when speech segments are corrupted by noise, deletion or substitution errors occur. Speech recognition errors pose significant challenges for subsequent voice interaction.
  • In one approach, the speech to be recognized can be processed by a front-end noise reduction module to reduce the impact of noise on the features of the speech to be recognized, and the processed speech is then recognized by a speech recognition module to determine the speech recognition result.
  • However, the front-end noise reduction module and the speech recognition module need to be adapted to each other, which increases the cost of speech recognition.
  • This application provides speech recognition methods, apparatuses, devices, and storage media to reduce the cost of speech recognition.
  • In a first aspect, this application provides a speech recognition method, including: determining the decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized based on a plurality of first candidate words obtained by decoding the speech to be recognized; determining the decoding features of each second candidate word, and determining the decoding confidence of the speech to be recognized based on the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized based on the noise confidences of the multiple speech frames; and determining the comprehensive confidence of the speech to be recognized based on the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result based on the comprehensive confidence.
  • In a second aspect, this application also provides a speech recognition device, including:
  • a decoding module configured to determine the decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized based on the plurality of first candidate words obtained by decoding the speech to be recognized;
  • a decoding confidence determination module configured to determine the decoding features of each second candidate word, and determine the decoding confidence of the speech to be recognized based on the decoding features of the plurality of second candidate words;
  • a noise confidence determination module configured to determine the noise confidence of each speech frame contained in the speech to be recognized, and to determine the noise confidence of the speech to be recognized based on the noise confidences of the multiple speech frames;
  • an execution module configured to determine the comprehensive confidence of the speech to be recognized based on the decoding confidence, noise confidence and decoding output score of the speech to be recognized, and determine the speech recognition result based on the comprehensive confidence.
  • In a third aspect, the present application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the method described in the first aspect are implemented.
  • In a fourth aspect, the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the speech recognition method described in the first aspect.
  • Figure 1 is a schematic diagram of a speech recognition module provided by an embodiment of the present application.
  • Figure 2 is a flow chart of a speech recognition method provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of a first word graph containing first candidate words obtained by decoding the speech to be recognized in a speech recognition method provided by an embodiment of the present application.
  • Figure 4 is a flow chart of another speech recognition method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a second preset network model in a speech recognition method provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of a speech recognition system provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Figure 1 is a schematic diagram of a speech recognition module provided by an embodiment of the present application.
  • The speech recognition module 100 can include a language model 110 and an acoustic model 120.
  • During the recognition process, the speech recognition module 100 can use a decoding algorithm to perform a Viterbi search, obtain the optimal sequence, and generate the decoding output corresponding to the speech, that is, the word graph corresponding to the speech. Because speech is easily contaminated by noise, speech recognition errors readily occur. Therefore, embodiments of the present application propose a speech recognition method to improve the accuracy of speech recognition and reduce the error rate of speech recognition without increasing costs.
  • Figure 2 is a flow chart of a speech recognition method provided by an embodiment of the present application.
  • the embodiment of the present application can be applied to situations where it is necessary to improve the accuracy of speech recognition without increasing costs.
  • The method may be performed by a speech recognition device, which may be implemented in software and/or hardware. As shown in Figure 2, the method includes the following steps:
  • Step 210: Determine the decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized based on the plurality of first candidate words obtained by decoding the speech to be recognized.
  • The speech recognition module 100 including the language model 110 and the acoustic model 120, as shown in Figure 1, can be used to decode the speech to be recognized and generate the first candidate words corresponding to the speech. The speech to be recognized can therefore be input into the speech recognition module 100 shown in Figure 1, so that the speech recognition module 100 decodes it and obtains the first candidate words corresponding to the speech to be recognized. Due to noise interference, the first candidate words obtained by decoding the speech to be recognized based on the speech recognition module 100 have a high error rate, and it is necessary to determine whether the speech to be recognized contains speech.
  • the first candidate word corresponding to the speech to be recognized can be represented based on the first word graph.
  • the first word graph is a compressed representation of each first candidate word and time and other information in the process of decoding the speech to be recognized.
  • In the first word graph, the nodes represent states, and the value in brackets at a node represents the time corresponding to that state.
  • The value on a candidate path represents the score of the first candidate word, that is, the posterior probability of the first candidate word.
  • the score of each candidate path and the temporal information of each first candidate word in each candidate path can be determined from the first word graph.
  • Figure 3 is a schematic diagram of a first word graph containing a first candidate word obtained by decoding the speech to be recognized in a speech recognition method provided by an embodiment of the present application.
  • From initial state 0 to end state 4, "Beijing", "background", "mobilization", "sports", "Olympic Games" and "meeting" are the first candidate words.
  • the posterior probability of "Beijing” is 0.5
  • the posterior probability of "background” is 0.5
  • the posterior probability of "Beijing” is 0.5.
  • the posterior probability of "mobilization” is 0.5
  • the posterior probability of "sports” is 0.4
  • the posterior probability of "Olympic Games” is 0.2
  • the posterior probability of "meeting” is 0.4
  • The values at the two endpoints of the edge where "Winter Olympics" is located are 6 and 20 respectively, indicating that the speech content corresponding to the 6s-20s time period is "Winter Olympics".
  • On the one hand, the decoding output score of each first candidate word can be determined first, and the decoding output scores of the multiple first candidate words sorted; the three largest decoding output scores are then normalized, and the processing result can be determined as the decoding output score of the speech to be recognized. On the other hand, multiple second candidate words corresponding to the speech to be recognized can also be determined: on the first word graph, secondary decoding is performed with edit distance as the criterion and based on the minimum Bayesian risk to obtain the multiple second candidate words corresponding to the speech to be recognized.
  • In this way, the decoding output score of the speech to be recognized can be determined based on the first candidate words, and the first word graph containing each first candidate word can also be determined. The first word graph is decoded a second time to obtain the second candidate words. The second candidate words can be used to determine the decoding confidence of the speech to be recognized, providing a data basis for that determination.
  • Step 220: Determine the decoding features of each second candidate word, and determine the decoding confidence of the speech to be recognized based on the decoding features of the plurality of second candidate words.
  • Decoding features include the confidence score of the second candidate word, word category, probability distribution, word length, and word graph depth.
  • After determining the posterior probabilities of the multiple second candidate words, normalize them to obtain the confidence score of each second candidate word; the confidence score of the second candidate word is a one-dimensional feature of the second candidate word.
  • Determine the classification of words in the relevant field, where the number of classes is N+1; map the second candidate word to one of the N+1 classes so that its word category is represented by an (N+1)-dimensional feature. The probability distribution of the second candidate word is determined according to the number of occurrences of the second candidate word among all second candidate words corresponding to the speech to be recognized and the total number of second candidate words corresponding to the speech to be recognized; the probability distribution of the second candidate word is a one-dimensional feature of the second candidate word.
  • the word length of the second candidate word can be determined according to the number of phonemes contained in the second candidate word, and the word length of the second candidate word is a one-dimensional feature of the second candidate word.
  • The word graph depth of the second candidate word is determined based on the number of outgoing edges of all nodes in the time period corresponding to the second candidate word and the length of that time period; the word graph depth is a one-dimensional feature of the second candidate word.
  • The decoding features of a second candidate word therefore form an (N+5)-dimensional vector. The (N+5)-dimensional decoding features of the multiple second candidate words are respectively input into the pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word; the arithmetic mean of the decoding confidences of the multiple second candidate words can then be determined as the decoding confidence of the speech to be recognized.
  • the decoding confidence of the speech to be recognized reflects the reliability of the speech recognition result. Generally speaking, the decoding confidence of the speech to be recognized ranges from 0 to 1. The closer it is to 1, the more reliable the speech recognition result is.
  • The decoding features of the second candidate words can be determined based on the second word graph.
  • The decoding confidence of the speech to be recognized can indicate the degree of reliability of the speech recognition result, providing a data basis for determining the speech recognition result.
  • Step 230: Determine the noise confidence of each speech frame included in the speech to be recognized, and determine the noise confidence of the speech to be recognized based on the noise confidences of the multiple speech frames.
  • Before the noise confidence of the speech to be recognized is determined, the speech to be recognized can be divided into frames, where each speech frame has a frame length of 25 milliseconds and a frame shift of 10 milliseconds. The Mel-scale Frequency Cepstral Coefficients (MFCC) of each speech frame are then extracted as the input features for determining the frame-level noise confidence.
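  • To illustrate the framing and feature extraction just described, the following is a minimal sketch, assuming 16 kHz audio and using librosa for MFCC extraction (the library choice and the number of coefficients are assumptions; the application only specifies 25 ms frames with a 10 ms shift and Mel cepstral features).

```python
import numpy as np
import librosa  # library choice is an assumption; the application does not name one

SR = 16000                     # assumed sampling rate
FRAME_LEN = int(0.025 * SR)    # 25 ms frame length, per the description
FRAME_SHIFT = int(0.010 * SR)  # 10 ms frame shift, per the description

def frame_mfcc(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Split speech into 25 ms frames with a 10 ms shift and extract MFCCs."""
    y, _ = librosa.load(wav_path, sr=SR)
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=n_mfcc,
                                n_fft=FRAME_LEN, hop_length=FRAME_SHIFT)
    return mfcc.T  # (num_frames, n_mfcc): one feature vector per speech frame
```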
  • the segment-level noise confidence of the speech to be recognized can be determined based on the frame-level noise confidence of multiple speech frames included in the speech to be recognized.
  • Step 240: Determine the comprehensive confidence of the speech to be recognized based on the decoding confidence, noise confidence and decoding output score of the speech to be recognized, and determine the speech recognition result based on the comprehensive confidence.
  • the decoding confidence, noise confidence and decoding output score of the speech to be recognized are input into the pre-trained recognition model as the recognition features of the speech to be recognized, and the output result obtained is the comprehensive confidence of the speech to be recognized.
  • The comprehensive confidence combines the segment-level decoding confidence of the speech to be recognized, the segment-level noise confidence derived from the frame-level noise confidences, and the decoding output score of the speech to be recognized. Based on the comprehensive confidence, it can be determined whether the recognition result is valid and whether the speech to be recognized contains speech.
  • the probability that the speech to be recognized contains speech and the probability that it does not contain speech can be determined based on the comprehensive confidence level, and then the speech recognition result is determined based on the probability that the speech to be recognized contains speech.
  • The speech recognition method provided by the embodiment of the present application includes: determining, based on a plurality of first candidate words obtained by decoding the speech to be recognized, the decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized; determining the decoding features of each second candidate word, and determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized according to the noise confidences of the multiple speech frames; and determining the comprehensive confidence of the speech to be recognized according to its decoding confidence, noise confidence and decoding output score, and determining the speech recognition result based on the comprehensive confidence.
  • The above technical solution first determines the decoding output score of the speech to be recognized based on the plurality of first candidate words obtained by decoding, providing a data basis for determining the comprehensive confidence, and determines the plurality of second candidate words corresponding to the speech to be recognized based on the first candidate words. After the decoding features of the second candidate words are determined, a more accurate decoding confidence of the speech to be recognized is determined from those features; likewise, a more accurate segment-level noise confidence of the speech to be recognized is determined from the frame-level noise confidence of each speech frame it contains.
  • The comprehensive confidence of the speech to be recognized can then be determined based on the decoding confidence, the noise confidence and the decoding output score, combining the segment-level decoding confidence, the frame-level noise confidence and the previously determined decoding output score into a more accurate comprehensive confidence.
  • The above process does not require specific optimization or redesign of the speech recognition model used to decode the speech to be recognized, yet yields a more accurate comprehensive confidence, so the determined speech recognition result is also more accurate. The accuracy of speech recognition is thus improved without increasing costs.
  • Figure 4 is a flow chart of another speech recognition method provided by an embodiment of the present application.
  • The embodiment of the present application can be applied to situations where it is necessary to improve the accuracy of speech recognition without increasing costs. Explanations of terms that are the same as or correspond to those in the above embodiments are not repeated here.
  • the speech recognition method provided by the embodiment of the present application includes:
  • Step 410: Determine the decoding output score of the speech to be recognized based on a plurality of first candidate words obtained by decoding the speech to be recognized.
  • step 410 may include:
  • Based on a speech recognition module composed of a language model and an acoustic model, the speech to be recognized is decoded once to obtain the plurality of first candidate words; the language score and acoustic score of each first candidate word are determined, and the decoding output score of the speech to be recognized is determined according to the language scores and acoustic scores of the plurality of first candidate words.
  • the speech recognition module including the language model and the acoustic model can be used to decode the speech to be recognized once.
  • The first candidate words corresponding to the speech to be recognized can be obtained through one decoding pass, and the language score and acoustic score of each first candidate word can then be determined. After the language score and the acoustic score are fused, the decoding output score of each first candidate word is obtained. The decoding output scores of the multiple first candidate words can then be sorted, the three largest decoding output scores normalized, and the processing result determined as the decoding output score of the speech to be recognized.
  • the decoding output score of the speech to be recognized can be determined based on the first candidate word, providing a data basis for determining the comprehensive confidence of the speech to be recognized.
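  • As a sketch of the score computation described above, assuming per-candidate language and acoustic scores are already available from the word graph (the fusion weight and the sum-to-one normalization are assumptions; the application only states that the scores are fused, sorted, and the three largest normalized):

```python
import numpy as np

def utterance_decode_score(lang_scores, acoust_scores, lm_weight=1.0):
    """Fuse per-candidate scores, then normalize the three largest."""
    fused = np.asarray(acoust_scores) + lm_weight * np.asarray(lang_scores)
    top3 = np.sort(fused)[-3:]        # three largest decoding output scores
    return top3 / top3.sum()          # normalized; used as the utterance-level score
```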
  • Step 420: Determine a plurality of second candidate words corresponding to the speech to be recognized based on the plurality of first candidate words obtained by decoding the speech to be recognized.
  • step 420 may include:
  • The speech to be recognized is decoded once to obtain a first word graph containing the plurality of first candidate words; on the first word graph, using edit distance as the criterion, secondary decoding is performed based on the minimum Bayesian risk to obtain the plurality of second candidate words corresponding to the speech to be recognized and the posterior probability of each second candidate word.
  • That is, the first word graph can be determined based on the first candidate words, and secondary decoding on the first word graph, with edit distance as the criterion and based on the minimum Bayesian risk, yields the second candidate words corresponding to the speech to be recognized.
  • The process of secondary decoding on the first word graph is as follows: 1) select a candidate path from the initial state to the termination state of the first word graph; 2) based on the candidate path, calculate the edit distance between the candidate path and the whole first word graph (the edit distance is the minimum number of insertions, deletions and substitutions that change one text into another), and through the edit distance obtain, for the time period of each first candidate word in the candidate path, the posterior probabilities of all first candidate words in that time period; 3) select the word with the highest probability at each moment to obtain a new word sequence, that is, the second candidate words; 4) if the second candidate words differ from the first candidate words of the candidate path in 2), return to 2).
  • Otherwise, the word sequence containing the multiple second candidate words is determined to be the secondary decoding result. The time period in which each second candidate word is located carries the posterior probabilities of all second candidate words in that time period.
  • In this way, the second candidate words of the speech to be recognized can be determined based on the first candidate words, providing a data basis for determining the decoding confidence of the speech to be recognized.
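  • A simplified sketch of the iterative secondary decoding in steps 1)-4) follows; `lattice.words_overlapping` and the per-word `posterior` field are hypothetical accessors for the first word graph, and real lattice toolkits organize this computation differently:

```python
def secondary_decode(lattice, initial_path):
    """Iteratively replace each word with the highest-posterior competitor."""
    path = initial_path  # step 1: one candidate path from initial to final state
    while True:
        new_path = []
        for word in path:
            # steps 2-3: among all first candidate words overlapping this word's
            # time span, keep the one with the highest posterior probability
            rivals = lattice.words_overlapping(word.start, word.end)
            new_path.append(max(rivals, key=lambda w: w.posterior))
        if [w.label for w in new_path] == [w.label for w in path]:
            return new_path  # step 4: converged; these are the second candidate words
        path = new_path      # otherwise re-align against the word graph and repeat
```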
  • Step 430: Determine the decoding features of each second candidate word.
  • The decoding features include the confidence score, word category, probability distribution, word length and word graph depth of the second candidate word.
  • step 430 may include:
  • After secondary decoding of the first candidate words, a second word graph containing the multiple second candidate words can be generated. The second word graph also contains the posterior probability of each second candidate word; therefore, after normalizing the posterior probabilities of the multiple second candidate words, the confidence score of each second candidate word is obtained, and this confidence score can be determined as a one-dimensional feature of the second candidate word.
  • The word category can indicate the category information of the second candidate word. First, the field of the speech to be recognized and the classification of words in this field can be determined; the word classes in this field are sorted according to the word frequency of each class, the top N classes are retained, and the remaining words form one additional class, for N+1 classes in total. Each second candidate word is mapped to one of the N+1 classes, so that its word category is represented as an (N+1)-dimensional feature.
  • The probability distribution can indicate how often the second candidate word occurs among all second candidate words corresponding to the speech to be recognized. The number of occurrences of the second candidate word among all second candidate words corresponding to the speech to be recognized, relative to the total number of second candidate words corresponding to the speech to be recognized, determines the probability distribution of the second candidate word, that is, its unigram probability. The probability distribution of the second candidate word can be determined as a one-dimensional feature of the second candidate word.
  • The word length can indicate the number of phonemes contained in the second candidate word. Therefore, the word length of the second candidate word can be determined according to the number of phonemes it contains, and can be determined as a one-dimensional feature of the second candidate word.
  • The word graph depth can indicate the average lattice depth of the time period corresponding to the second candidate word. It can be determined from the number of outgoing edges of all nodes in the time period corresponding to the second candidate word and the length of that time period. The word graph depth of the second candidate word can be determined as a one-dimensional feature of the second candidate word.
  • In this way, the (N+5)-dimensional decoding features of each second candidate word obtained by secondary decoding can be determined. The decoding features of the second candidate words are used to determine the decoding confidence of the speech to be recognized, providing a data basis for that determination.
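  • The (N+5)-dimensional feature vector can be assembled as in the sketch below; the `word` record and its field names are illustrative placeholders, not identifiers from the application, and the N+1 categories are assumed to be the top-N classes plus one remainder class:

```python
import numpy as np

def decode_features(word, n_categories: int) -> np.ndarray:
    """Assemble the (N+5)-dim decoding features of one second candidate word."""
    one_hot = np.zeros(n_categories + 1)   # (N+1)-dim word category (N classes + remainder)
    one_hot[word.category_id] = 1.0
    return np.concatenate([
        [word.confidence_score],           # 1 dim: normalized posterior probability
        one_hot,                           # N+1 dims: word category
        [word.unigram_prob],               # 1 dim: probability distribution
        [word.num_phonemes],               # 1 dim: word length in phonemes
        [word.lattice_depth],              # 1 dim: average word-graph depth
    ])                                     # total: N+5 dimensions
```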
  • Step 440: Determine the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words.
  • step 440 may include:
  • The decoding features of the plurality of second candidate words are respectively input into the pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word; the decoding confidence of the speech to be recognized is determined based on the decoding confidences of the plurality of second candidate words.
  • the method further includes:
  • The first preset network model is built based on a deep neural network and a cross-entropy function; the (N+5)-dimensional decoding features of the second candidate words corresponding to annotated speech data containing noise and valid speech, together with the annotation information of the speech data, are used as the first training data to perform network training on the first preset network model and calculate the first loss function; network optimization is performed based on the backpropagation algorithm until the first loss function converges, and the decoding confidence model is obtained.
  • the first loss function may be a cross-entropy function.
  • After the first preset network model is built based on the deep neural network and the cross-entropy function, the speech data containing noise and valid speech is labeled: noise is labeled 0 and valid speech is labeled 1. The labeled speech data containing noise and valid speech is determined as the first training speech set.
  • the training speech contained in the first training speech set is decoded once to obtain the first training candidate word corresponding to the training speech, and then decoded twice to obtain the second training candidate word corresponding to the training speech.
  • The (N+5)-dimensional decoding features of the second training candidate words are determined as in the aforementioned step 430. The sigmoid activation value output by the network model, which represents the confidence score of the second candidate word, and the annotation information of the training speech are used to calculate the cross-entropy function; the network is optimized based on the backpropagation algorithm until the cross-entropy function converges, and the decoding confidence model is obtained.
  • In application, the decoding features of the multiple second candidate words can be input into the decoding confidence model respectively, the output being the decoding confidence of each second candidate word; the arithmetic mean of the decoding confidences of the multiple second candidate words corresponding to the speech to be recognized is determined as the decoding confidence of the speech to be recognized.
  • the decoding confidence of each second candidate word corresponding to the speech to be recognized can be determined based on the decoding confidence model, and then the decoding confidence of the speech to be recognized can be determined based on the decoding confidence of multiple second candidate words.
  • the decoding confidence of the speech to be recognized can indicate the reliability of the speech recognition results and provide a data basis for determining the speech recognition results.
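  • A minimal sketch of the decoding confidence model and the utterance-level aggregation follows, assuming PyTorch and assumed hidden sizes (the application specifies only a deep neural network with a sigmoid output trained with cross-entropy and backpropagation):

```python
import torch
from torch import nn

class DecodeConfidenceModel(nn.Module):
    """DNN over (N+5)-dim decoding features; sigmoid output per candidate word."""
    def __init__(self, n_categories: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_categories + 5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # confidence in [0, 1]
        )

    def forward(self, feats):                     # feats: (num_words, N+5)
        return self.net(feats).squeeze(-1)

def utterance_decode_confidence(model, feats):
    """Arithmetic mean of per-word confidences = utterance decoding confidence."""
    with torch.no_grad():
        return model(feats).mean().item()
```
  • Training such a sketch would pair the sigmoid outputs with the 0/1 labels under a binary cross-entropy loss (e.g. nn.BCELoss) and backpropagation until convergence, per the description above.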
  • Step 450: Determine the noise confidence of each speech frame included in the speech to be recognized.
  • step 450 may include:
  • A second preset network model is constructed based on Gated Recurrent Units (GRU); the Mel cepstral coefficients of the speech frames of the training speech in the second training speech set, which consists of training speech composed of pure noise and pure speech, and the annotation information of those frames are used as the second training data to perform network training on the second preset network model and calculate the second loss function; the weights of the second preset network model are iterated based on stochastic gradient descent until the second loss function converges, and the noise confidence model is obtained.
  • the second loss function can also be a cross-entropy function.
  • FIG 5 is a schematic diagram of a second preset network model in a speech recognition method provided by an embodiment of the present application.
  • The second preset network model includes a first fully connected (FC) layer, a first GRU, a second GRU, a third GRU, and a second FC layer.
  • After the second preset network model is built based on GRUs, pure noise and pure speech are collected, the pure noise is randomly added to the pure speech at a preset signal-to-noise ratio to obtain training speech, and a preset amount of training speech is determined as the second training speech set; the training speech contained in the second training speech set is divided into frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds to obtain the speech frames of the training speech.
  • A training speech frame is labeled 1 when its phoneme is non-silent, and labeled 0 when its phoneme is silent.
  • the Mel cepstral coefficients of the training speech frames corresponding to the training speeches in the second training speech set and the annotation information of the training speech frames are used as second training data to perform network training on the second preset network model.
  • the Mel cepstrum coefficients of the L-frame training speech can be used as a training sequence and input into the second preset network model.
  • the output result corresponding to the frame training speech is a vector with a dimension of 2, where one dimension represents the probability that the current frame contains speech. The other dimension represents the probability that the current frame contains no speech.
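  • Following Figure 5, the noise confidence model can be sketched as below; layer widths are assumptions, and the three GRUs are realized here as one stacked 3-layer nn.GRU:

```python
import torch
from torch import nn

class NoiseConfidenceModel(nn.Module):
    """First FC layer -> three GRU layers -> second FC layer, per Figure 5."""
    def __init__(self, n_mfcc: int = 13, hidden: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(n_mfcc, hidden)                # first FC layer
        self.gru = nn.GRU(hidden, hidden, num_layers=3,
                          batch_first=True)                   # first/second/third GRU
        self.fc_out = nn.Linear(hidden, 2)                    # second FC layer

    def forward(self, mfcc_seq):                              # (batch, L, n_mfcc)
        h = torch.relu(self.fc_in(mfcc_seq))
        h, _ = self.gru(h)
        # per-frame 2-dim output: [..., 0] = p(t) speech, [..., 1] = 1 - p(t)
        return torch.softmax(self.fc_out(h), dim=-1)
```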
  • In application, the speech to be recognized can be divided into frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds to obtain the speech frames contained in the speech to be recognized; the Mel cepstral coefficient of each speech frame is determined and input into the noise confidence model. The output result is the probability p(t) that the speech frame contains speech, the probability that it contains no speech is 1-p(t), and the noise confidence of the speech frame is therefore determined to be 1-p(t).
  • the noise confidence of each speech frame contained in the speech to be recognized can be determined based on the noise confidence model.
  • The noise confidences of the multiple speech frames are used to determine the noise confidence of the speech to be recognized, which in turn can be used to determine the comprehensive confidence of the speech to be recognized, providing a data basis for determining the speech recognition result.
  • Step 460: Determine the noise confidence of the speech to be recognized based on the noise confidences of the multiple speech frames.
  • step 460 may include:
  • The noise confidence of the speech to be recognized is determined according to the maximum noise confidence, the minimum noise confidence, the noise confidence mean and the noise confidence variance among the noise confidences of the multiple speech frames contained in the speech to be recognized.
  • The noise confidences of the multiple speech frames contained in the speech to be recognized can be sorted and their mean and variance calculated; the maximum noise confidence, the minimum noise confidence, the noise confidence mean and the noise confidence variance are then determined as the noise confidence of the speech to be recognized.
  • the segment-level noise confidence of the speech to be recognized can be determined based on the frame-level noise confidence of multiple speech frames included in the speech to be recognized.
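  • The frame-to-segment aggregation can be sketched as follows (the ordering of the four statistics in the vector is an assumption):

```python
import numpy as np

def segment_noise_confidence(frame_noise_conf) -> np.ndarray:
    """Aggregate frame-level noise confidences 1 - p(t) into the segment-level
    noise confidence: maximum, minimum, mean and variance."""
    c = np.asarray(frame_noise_conf)
    return np.array([c.max(), c.min(), c.mean(), c.var()])
```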
  • Step 470: Determine the comprehensive confidence of the speech to be recognized based on the decoding confidence, noise confidence and decoding output score of the speech to be recognized.
  • step 470 may include:
  • the decoding confidence, noise confidence and decoding output score of the speech to be recognized are input into the pre-trained speech recognition model to obtain the comprehensive confidence of the speech to be recognized.
  • Before the decoding confidence, noise confidence and decoding output score of the speech to be recognized are input into the pre-trained speech recognition model, the method also includes:
  • A third preset network model is built based on logistic regression; the decoding confidence, noise confidence, decoding output score and annotation information of the training speech in the third training speech set, which is constructed from speech containing noise, are used as the third training data to perform network training on the third preset network model and calculate the third loss function; network optimization is performed based on the backpropagation algorithm until the third loss function converges, and the speech recognition model is obtained.
  • the decoding confidence, noise confidence and decoding output score of the speech to be recognized can be input into the speech recognition model, and the output result obtained is the comprehensive confidence of the speech to be recognized.
  • In this way, the segment-level decoding confidence of the speech to be recognized, the frame-level noise confidence and the decoding output score are integrated to determine the comprehensive confidence of the speech to be recognized.
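  • As a sketch of the third model, assuming a plain logistic regression whose weights `w` and bias `b` come from training on the third training speech set (the feature ordering is an assumption):

```python
import numpy as np

def comprehensive_confidence(decode_conf, noise_stats, decode_scores, w, b):
    """Logistic regression over the recognition features of the utterance."""
    x = np.concatenate([[decode_conf], noise_stats, decode_scores])
    p_speech = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # P(contains speech)
    return p_speech, 1.0 - p_speech                        # P(no speech) = 1 - p
```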
  • Step 480: Determine the speech recognition result according to the comprehensive confidence.
  • the comprehensive confidence includes the probability that the speech to be recognized contains speech and the probability that the speech to be recognized does not contain speech.
  • step 480 may include:
  • If the probability that the speech to be recognized contains speech is greater than or equal to the first preset threshold, the speech recognition result is determined to be that the speech to be recognized contains speech; if the probability that the speech to be recognized contains speech is greater than or equal to the second preset threshold and smaller than the first preset threshold, the speech recognition result is determined to be that the speech to be recognized does not contain speech; if the probability that the speech to be recognized contains speech is less than the second preset threshold, a speech recognition error is determined, or the speech to be recognized is optimized to obtain optimized speech and speech recognition is performed again based on the optimized speech.
  • the first preset threshold is greater than the second preset threshold, and both the first preset threshold and the second preset threshold are less than 1.
  • The probability that the speech to be recognized contains speech can first be compared with the second preset threshold.
  • If it is greater than or equal to the second preset threshold, the speech recognition result can be determined based on the probability that the speech to be recognized contains speech, that is, the decoding result of the speech recognition module can be used. The probability is then compared with the first preset threshold: if the probability that the speech to be recognized contains speech is greater than or equal to the first preset threshold, the speech recognition result is determined to be that the speech to be recognized contains speech; if that probability is less than the first preset threshold, the speech recognition result is determined to be that the speech to be recognized does not contain speech.
  • If the probability that the speech to be recognized contains speech is less than the second preset threshold, the speech recognition result cannot be determined based on that probability, and the decoding result of the speech recognition module may not be used; a speech recognition error can then be determined, or the speech to be recognized can be optimized to obtain the optimized speech and speech recognition performed again based on the optimized speech.
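  • The two-threshold decision of step 480 can be sketched as follows (the threshold values are assumptions; the application only requires that the second threshold be smaller than the first and that both be less than 1):

```python
def decide(p_speech: float, thr_first: float = 0.8, thr_second: float = 0.4) -> str:
    """Map the probability that the speech contains speech to a recognition outcome."""
    if p_speech >= thr_first:
        return "contains speech"          # accept the decoding result
    if p_speech >= thr_second:
        return "does not contain speech"  # treat the segment as noise only
    return "recognition error"            # or optimize the speech and re-recognize
```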
  • optimizing the speech to be recognized to obtain optimized speech includes:
  • the optimized speech is obtained by muting the speech frames whose noise confidence level is greater than the preset confidence level in the speech to be recognized.
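  • The optimization step can be sketched as muting noise-dominated frames (the preset confidence threshold value is an assumption):

```python
import numpy as np

def mute_noisy_frames(frames, frame_noise_conf, conf_threshold: float = 0.9):
    """Silence frames whose noise confidence exceeds the preset confidence."""
    frames = np.array(frames, copy=True)              # (num_frames, samples_per_frame)
    noisy = np.asarray(frame_noise_conf) > conf_threshold
    frames[noisy] = 0.0                               # mute noise-dominated frames
    return frames                                     # recognition is then re-run
```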
  • In this way, whether the speech to be recognized contains speech is determined according to the comprehensive confidence. It is also possible to determine a speech recognition error without using the decoding result of the speech recognition module, or, after the optimized speech is obtained by denoising the speech to be recognized, to continue decoding the optimized speech based on the speech recognition module to obtain the speech recognition result.
  • The speech recognition method provided by the embodiment of the present application includes: determining the decoding output score of the speech to be recognized according to a plurality of first candidate words obtained by decoding the speech to be recognized; determining a plurality of second candidate words corresponding to the speech to be recognized according to the plurality of first candidate words; determining the decoding features of each second candidate word, and determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized based on the noise confidences of the multiple speech frames; and determining the comprehensive confidence of the speech to be recognized according to its decoding confidence, noise confidence and decoding output score, and determining the speech recognition result based on the comprehensive confidence.
  • The above technical solution first determines the decoding output score of the speech to be recognized based on the plurality of first candidate words obtained by decoding, providing a data basis for determining the comprehensive confidence, and determines the plurality of second candidate words corresponding to the speech to be recognized based on the first candidate words. After the decoding features of each second candidate word are determined, a more accurate decoding confidence of the speech to be recognized is determined from those features; likewise, a more accurate segment-level noise confidence of the speech to be recognized is determined from the noise confidences of the multiple speech frames it contains.
  • The comprehensive confidence of the speech to be recognized can then be determined based on the decoding confidence, the noise confidence and the decoding output score, combining the segment-level decoding confidence, the frame-level noise confidence and the previously determined decoding output score into a more accurate comprehensive confidence.
  • The above process does not require specific optimization or retraining of the speech recognition model used to decode the speech to be recognized, yet yields a more accurate comprehensive confidence, so the determined speech recognition result is also more accurate; the accuracy of speech recognition is improved without increasing costs.
  • In addition, when the recognition result is determined to be unreliable, a speech recognition error can be determined, or the speech to be recognized can be denoised to obtain the optimized speech, after which the speech recognition module continues to decode the optimized speech to obtain the speech recognition result.
  • Figure 6 is a schematic diagram of a speech recognition system provided by an embodiment of the present application.
  • The speech recognition system may include a speech recognition module 100, a decoding confidence module 200, a noise confidence module 300, a result determination module 400 and a processing module 500. The speech recognition module 100 is configured to decode the speech to be recognized once, determine a plurality of first candidate words of the speech to be recognized, and generate a first word graph containing the plurality of first candidate words. The decoding confidence module 200 is configured to determine the decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized based on the plurality of first candidate words and, after determining the decoding features of each second candidate word, determine the decoding confidence of the speech to be recognized based on the decoding features of the plurality of second candidate words. The noise confidence module 300 is configured to determine the noise confidence of each speech frame included in the speech to be recognized and determine the noise confidence of the speech to be recognized based on the noise confidences of the multiple speech frames. The result determination module 400 is configured to determine the comprehensive confidence of the speech to be recognized based on the decoding confidence, the noise confidence and the decoding output score.
  • The processing module 500 is configured to determine the speech recognition result according to the comprehensive confidence. For example, when the probability that the speech to be recognized contains speech is greater than or equal to the first preset threshold, the speech recognition result is determined to be that the speech to be recognized contains speech; when that probability is greater than or equal to the second preset threshold and less than the first preset threshold, the speech recognition result is determined to be that the speech to be recognized does not contain speech; when that probability is less than the second preset threshold, a speech recognition error is determined, or the speech to be recognized is optimized to obtain the optimized speech and speech recognition is performed again based on the optimized speech.
  • the speech recognition system provided by the embodiments of this application can execute the speech recognition method provided by any embodiment of this application, and has corresponding functional modules and effects for executing the speech recognition method.
  • Figure 7 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application. This device and the speech recognition method in the above embodiments belong to the same application concept. For details not described in the embodiment of the speech recognition device, please refer to the embodiments of the speech recognition method above.
  • the structure of the speech recognition device is shown in Figure 7, including:
  • a decoding module 710 configured to determine the decoding output score of the speech to be recognized and the plurality of second candidate words corresponding to the speech to be recognized based on the plurality of first candidate words obtained by decoding the speech to be recognized; a decoding confidence determination module 720 configured to determine the decoding features of each second candidate word and determine the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; a noise confidence determination module 730 configured to determine the noise confidence of each speech frame contained in the speech to be recognized and determine the noise confidence of the speech to be recognized based on the noise confidences of the multiple speech frames; and an execution module 740 configured to determine the comprehensive confidence of the speech to be recognized based on the decoding confidence, noise confidence and decoding output score of the speech to be recognized, and determine the speech recognition result based on the comprehensive confidence.
  • the decoding module 710 is configured as:
  • The speech to be recognized is decoded once to obtain a first word graph containing the plurality of first candidate words; the language score and acoustic score of each first candidate word are determined, and the decoding output score of the speech to be recognized is determined according to the language scores and acoustic scores of the plurality of first candidate words; on the first word graph, using edit distance as the criterion, secondary decoding is performed based on the minimum Bayesian risk to obtain the plurality of second candidate words corresponding to the speech to be recognized and the posterior probability of each second candidate word.
  • The decoding features include the confidence score, word category, probability distribution, word length and word graph depth of the second candidate word.
  • the decoding confidence determination module 720 is set as:
  • Normalize the posterior probabilities of the plurality of second candidate words to obtain the confidence score of each second candidate word; determine the word category of each second candidate word according to its category information; determine the probability distribution of each second candidate word according to the number of its occurrences among all second candidate words corresponding to the speech to be recognized; determine the word length of each second candidate word according to the number of phonemes it contains; in the second word graph obtained by secondary decoding of the first candidate words, determine the word graph depth of each second candidate word according to the number of outgoing edges of all nodes in its corresponding time period and the length of that time period; input the decoding features of the multiple second candidate words into the pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word; and determine the decoding confidence of the speech to be recognized according to the decoding confidences of the second candidate words.
  • the noise confidence determination module 730 is configured to determine the noise confidence of the speech to be recognized according to the maximum noise confidence, the minimum noise confidence, the noise confidence mean and the noise confidence variance among the noise confidences of the multiple speech frames contained in the speech to be recognized.
  • the execution module 740 is configured as:
  • the decoding confidence, noise confidence and decoding output score of the speech to be recognized are input into the pre-trained speech recognition model to obtain the comprehensive confidence of the speech to be recognized; the speech recognition result is determined based on the comprehensive confidence.
  • the comprehensive confidence includes the probability that the speech to be recognized contains speech.
  • determining the speech recognition result based on the comprehensive confidence includes:
  • If the probability that the speech to be recognized contains speech is greater than or equal to the first preset threshold, the speech recognition result is determined to be that the speech to be recognized contains speech; if the probability that the speech to be recognized contains speech is greater than or equal to the second preset threshold and smaller than the first preset threshold, the speech recognition result is determined to be that the speech to be recognized does not contain speech; if the probability that the speech to be recognized contains speech is less than the second preset threshold, a speech recognition error is determined, or the speech to be recognized is optimized to obtain the optimized speech and speech recognition is performed again based on the optimized speech.
  • Optimizing the speech to be recognized to obtain optimized speech includes:
  • the optimized speech is obtained by muting the speech frames whose noise confidence level is greater than the preset confidence level in the speech to be recognized.
  • the speech recognition device provided by the embodiments of this application can execute the speech recognition method provided by any embodiment of this application, and has corresponding functional modules and effects for executing the speech recognition method.
  • The multiple units and modules included are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the names of the multiple functional units are only used for ease of mutual distinction and are not used to limit the protection scope of the present application.
  • Figure 8 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Figure 8 illustrates a block diagram of an exemplary computer device 8 suitable for implementing embodiments of the present application.
  • the computer device 8 shown in FIG. 8 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present application.
  • the computer device 8 is embodied in the form of a general computing computer device.
  • the components of the computer device 8 may include, but are not limited to: one or more processors or processing units 16, memory 28, and a bus 18 connecting various system components, including the memory 28 and the processing unit 16.
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
  • These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MAC) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Computer device 8 includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 8, including volatile and nonvolatile media, removable and non-removable media.
• Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
• Computer device 8 may include other removable/non-removable, volatile/non-volatile computer system storage media.
• storage system 34 may be configured to read and write non-removable, non-volatile magnetic media (not shown in FIG. 8, commonly referred to as a "hard disk drive").
• a disk drive configured to read and write removable non-volatile magnetic disks (e.g., "floppy disks") may be provided, as well as an optical disc drive configured to read and write removable non-volatile optical discs (e.g., CD-ROM, DVD-ROM or other optical media).
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • the memory 28 may include at least one program product having a set of (eg, at least one) program modules configured to perform the functions of embodiments of the present application.
• a program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory 28; such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules and program data, and each or some combination of these examples may include an implementation of a network environment.
  • Program modules 42 generally perform functions and/or methods in the embodiments described herein.
• Computer device 8 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24), may also communicate with one or more devices that enable a user to interact with computer device 8, and/or with any device (e.g., a network card or a modem) that enables computer device 8 to communicate with one or more other computing devices. This communication may occur through an input/output (I/O) interface 22.
• the computer device 8 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through the network adapter 20, as shown in FIG. 8.
  • network adapter 20 communicates with other modules of computer device 8 via bus 18 .
• It should be understood that, although not shown in FIG. 8, other hardware and/or software modules may be used in conjunction with the computer device 8, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives and data backup storage systems.
  • the processing unit 16 executes a variety of functional applications and page displays by running programs stored in the memory 28, for example, implementing the speech recognition method provided by the embodiment of the present application.
• the method includes: determining, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized; determining the decoding feature of each second candidate word, and determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames; and determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result according to the comprehensive confidence.
• those skilled in the art can understand that the processor can also implement the technical solution of the speech recognition method provided by any embodiment of the present application.
  • Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored.
• when the program is executed by a processor, the speech recognition method provided by the embodiments of the present application is implemented.
• the method includes: determining, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized; determining the decoding feature of each second candidate word, and determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames; and determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result according to the comprehensive confidence.
  • the computer storage medium in the embodiment of the present application may be any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
• the computer-readable storage medium may be, for example, but not limited to: an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. Examples of computer-readable storage media (a non-exhaustive list) include: electrical connections having one or more conductors, portable computer disks, hard disks, RAM, ROM, Erasable Programmable Read-Only Memory (EPROM or flash memory), optical fibers, CD-ROM, optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
• a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium can be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical cable, radio frequency (Radio Frequency, RF), etc., or any suitable combination of the above.
• Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof; the programming languages include object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
• the program code can be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition method and apparatus, a computer device (8) and a storage medium. The speech recognition method includes: determining, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized (210); determining a decoding feature of each second candidate word, and determining a decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words (220); determining a noise confidence of each speech frame contained in the speech to be recognized, and determining a noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames (230); and determining a comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determining a speech recognition result according to the comprehensive confidence (240).

Description

Speech recognition method, apparatus, device and storage medium
This application claims priority to Chinese patent application No. 202210753629.8, filed with the China National Intellectual Property Administration on June 28, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech processing technologies, and for example to a speech recognition method, apparatus, device and storage medium.
Background
Speech recognition has been widely applied in fields such as intelligent customer service, smart homes and in-vehicle assistants. Speech recognition systems are often affected by noise interference from the environment or the telephone channel, which easily leads to speech recognition errors. For example, when noise does not overlap a speech segment in time, speech recognition insertion errors occur; when a speech segment is corrupted by noise, deletion or substitution errors occur. Speech recognition errors pose great challenges for subsequent voice interaction.
In the related art, speech to be recognized can be processed by a front-end noise reduction module to reduce the influence of noise on the features of the speech to be recognized, and the processed speech to be recognized is then recognized by a speech recognition module to determine a speech recognition result.
The related art has at least the following technical problem:
the front-end noise reduction module and the speech recognition module need to be adapted to each other, which increases the cost of speech recognition.
Summary
This application provides a speech recognition method, apparatus, device and storage medium, so as to reduce the cost of speech recognition.
In a first aspect, this application provides a speech recognition method, including:
determining, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized;
determining a decoding feature of each second candidate word, and determining a decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words;
determining a noise confidence of each speech frame contained in the speech to be recognized, and determining a noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames;
determining a comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determining a speech recognition result according to the comprehensive confidence.
In a second aspect, this application further provides a speech recognition apparatus, including:
a decoding module configured to determine, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized;
a decoding confidence determination module configured to determine a decoding feature of each second candidate word, and determine a decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words;
a noise confidence determination module configured to determine a noise confidence of each speech frame contained in the speech to be recognized, and determine a noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames;
an execution module configured to determine a comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determine a speech recognition result according to the comprehensive confidence.
In a third aspect, this application further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the speech recognition method described in the first aspect.
In a fourth aspect, this application further provides a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, are used to perform the speech recognition method described in the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a speech recognition module provided by an embodiment of this application;
FIG. 2 is a flowchart of a speech recognition method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a first word lattice, containing first candidate words, obtained by decoding speech to be recognized in a speech recognition method provided by an embodiment of this application;
FIG. 4 is a flowchart of another speech recognition method provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a second preset network model in a speech recognition method provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a speech recognition system provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Detailed Description
This application is described below with reference to the drawings and embodiments. The specific embodiments described here are merely intended to explain this application. For ease of description, only the parts related to this application are shown in the drawings.
Before the exemplary embodiments are discussed, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes multiple operations (or steps) as sequential processing, many of the operations may be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be rearranged. The processing may be terminated when its operations are completed, but may also have additional steps not included in the drawings. The processing may correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.
FIG. 1 is a schematic diagram of a speech recognition module provided by an embodiment of this application. As shown in FIG. 1, the speech recognition module 100 may include a language model 110 and an acoustic model 120. During recognition, the speech recognition module 100 may use a decoding algorithm with Viterbi search to obtain the optimal sequence and generate the decoding output corresponding to the speech, that is, the word lattice corresponding to the speech. Noise contamination of speech easily leads to speech recognition errors. Therefore, an embodiment of this application proposes a speech recognition method that improves the accuracy of speech recognition and reduces the speech recognition error rate without increasing cost.
The speech recognition method proposed by the embodiments of this application is described below with reference to the speech recognition module shown in FIG. 1 and the embodiments.
FIG. 2 is a flowchart of a speech recognition method provided by an embodiment of this application. This embodiment is applicable to situations where the accuracy of speech recognition needs to be improved without increasing cost. The method may be performed by a speech recognition apparatus, which may be implemented in software and/or hardware. As shown in FIG. 2, the method includes the following steps:
Step 210: determine, according to a plurality of first candidate words obtained by decoding the speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized.
The speech recognition module 100 containing the language model 110 and the acoustic model 120, as shown in FIG. 1, may be used to decode the speech to be recognized and generate the first candidate words corresponding to the speech. Therefore, the speech to be recognized may be input into the speech recognition module 100 shown in FIG. 1, so that the speech recognition module 100 decodes the speech to be recognized to obtain the first candidate words corresponding to the speech to be recognized. Due to noise interference, the first candidate words obtained by decoding the speech to be recognized with the speech recognition module 100 have a high error rate, so it is necessary to determine whether the speech to be recognized contains speech.
The first candidate words corresponding to the speech to be recognized may be represented by a first word lattice. The first word lattice is a compressed representation of information such as each first candidate word and its timing produced while decoding the speech to be recognized: a node represents a state, and the value in a node's parentheses represents the time corresponding to that state. There are different candidate paths from the initial state to the final state in the first word lattice, and the value on a candidate path represents the score, that is, the posterior probability, of a first candidate word. The score of each candidate path and the time information of each first candidate word on each candidate path can be determined from the first word lattice. FIG. 3 is a schematic diagram of a first word lattice, containing first candidate words, obtained by decoding speech to be recognized in a speech recognition method provided by an embodiment of this application. As shown in FIG. 3, for example, from the initial state 0 to the final state 4, "北京" (Beijing), "背景" (background), "动员" (mobilize), "运动" (sports), "冬奥会" (Winter Olympics) and "会" (meeting) are first candidate words, with posterior probabilities of 0.5, 0.5, 0.5, 0.4, 0.2 and 0.4 respectively; the two endpoint values of the edge on which "冬奥会" lies are 6 and 20, indicating that the speech content corresponding to the time period from 6 s to 20 s is "冬奥会".
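For illustration only (this sketch is not part of the patent), the FIG. 3 lattice can be encoded as a list of edges; the intermediate state, the 13 s boundary and the path-scoring rule below are assumptions made for the example:

```python
# Hypothetical encoding of the FIG. 3 lattice: each edge is
# (start_state, end_state, word, posterior, start_time_s, end_time_s).
first_word_lattice = [
    (0, 1, "北京", 0.5, 0, 6),     # Beijing
    (0, 1, "背景", 0.5, 0, 6),     # background
    (1, 4, "冬奥会", 0.2, 6, 20),  # Winter Olympics, edge endpoints 6 and 20
    (1, 2, "动员", 0.5, 6, 13),    # mobilize
    (1, 2, "运动", 0.4, 6, 13),    # sports
    (2, 4, "会", 0.4, 13, 20),     # meeting
]

def path_score(path):
    """Score a candidate path; multiplying edge posteriors is an assumed
    rule, since the text only says each candidate path has a score."""
    score = 1.0
    for (_, _, _, posterior, _, _) in path:
        score *= posterior
    return score

# e.g. the path 北京 -> 冬奥会
print(path_score([first_word_lattice[0], first_word_lattice[2]]))  # 0.1
```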
After the first candidate words are obtained by decoding the speech to be recognized with the speech recognition module, on the one hand, the decoding output score of each first candidate word may first be determined and the decoding output scores of the plurality of first candidate words may be sorted; then the three largest decoding output scores may be normalized, and the processing result may be determined as the decoding output score of the speech to be recognized. On the other hand, the plurality of second candidate words corresponding to the speech to be recognized may also be determined: second-pass decoding based on minimum Bayes risk, with edit distance as the criterion, may be performed on the first word lattice to obtain the plurality of second candidate words corresponding to the speech to be recognized.
In this embodiment of this application, after the first candidate words are obtained by decoding the speech to be recognized with the speech recognition module, the decoding output score of the speech to be recognized may be determined according to the first candidate words, and the first word lattice containing the first candidate words may be decoded a second time to obtain the second candidate words. The second candidate words may be used to determine the decoding confidence of the speech to be recognized, providing a data basis for determining that confidence.
Step 220: determine the decoding feature of each second candidate word, and determine the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words.
The decoding features include the confidence score, word category, probability distribution, word length and lattice depth of the second candidate word.
After the posterior probabilities of the plurality of second candidate words are determined, they are normalized to obtain the confidence score of each second candidate word; the confidence score of a second candidate word is a one-dimensional feature of that word. After the domain of the speech to be recognized is determined, the classification of words in the domain is determined, with N+1 classes in total, and the second candidate word is mapped to one of the N+1 classes, so that the word category of the second candidate word is represented by an (N+1)-dimensional feature. The probability distribution of a second candidate word is determined from the number of occurrences of the second candidate word among all second candidate words corresponding to the speech to be recognized and the total number of second candidate words corresponding to the speech to be recognized; the probability distribution is a one-dimensional feature of the second candidate word. The word length of a second candidate word may be determined from the number of phonemes it contains; the word length is a one-dimensional feature of the second candidate word. In the second word lattice containing the second candidate words, the lattice depth of a second candidate word is determined from the number of outgoing edges of all nodes within the time period corresponding to that word and the length of that time period; the lattice depth is a one-dimensional feature of the second candidate word.
Therefore, the decoding feature of a second candidate word can be determined as an (N+5)-dimensional feature. The (N+5)-dimensional decoding features of the plurality of second candidate words are input into a pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word, and the arithmetic mean of the decoding confidences of the plurality of second candidate words is then determined and taken as the decoding confidence of the speech to be recognized.
The decoding confidence of the speech to be recognized reflects the reliability of the speech recognition result. Generally, the decoding confidence of the speech to be recognized ranges from 0 to 1; the closer it is to 1, the more reliable the speech recognition result.
In this embodiment of this application, after the second word lattice containing the second candidate words is obtained by second-pass decoding of the first word lattice containing the first candidate words, the decoding features of the second candidate words may be determined according to the second word lattice, the decoding confidence of each second candidate word may be determined according to the decoding features, and the decoding confidence of the speech to be recognized may then be determined according to the decoding confidences of the second candidate words. The decoding confidence of the speech to be recognized indicates the reliability of the speech recognition result and provides a data basis for determining it.
Step 230: determine the noise confidence of each speech frame contained in the speech to be recognized, and determine the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames.
Before the noise confidence of the speech to be recognized is determined, the speech to be recognized may be divided into frames to obtain speech frames, where each speech frame has a frame length of 25 milliseconds and a frame shift of 10 milliseconds.
First, the Mel-frequency cepstral coefficients (MFCC) of each speech frame may be extracted, and the MFCC of a speech frame may be input into a pre-trained noise confidence model to obtain the probability p(t) that the speech frame contains speech and the probability 1-p(t) that it does not contain speech; the noise confidence of the speech frame is then determined as 1-p(t). After the noise confidences of the plurality of speech frames contained in the speech to be recognized are determined, the maximum noise confidence, the minimum noise confidence, the mean of the noise confidences and the variance of the noise confidences are determined, and these statistics are determined as the noise confidence of the speech to be recognized.
In this embodiment of this application, the segment-level noise confidence of the speech to be recognized can be determined based on the frame-level noise confidences of the plurality of speech frames contained in the speech to be recognized.
Step 240: determine the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determine the speech recognition result according to the comprehensive confidence.
The decoding confidence, the noise confidence and the decoding output score of the speech to be recognized are input as recognition features into a pre-trained recognition model, and the output result is the comprehensive confidence of the speech to be recognized. The comprehensive confidence fuses the segment-level decoding confidence of the speech to be recognized, the segment-level noise confidence determined from the frame-level noise confidences, and the decoding output score of the speech to be recognized. Based on the comprehensive confidence, it can be determined whether the recognition result is valid and whether the speech to be recognized contains speech.
In this embodiment of this application, the probability that the speech to be recognized contains speech and the probability that it does not contain speech can be determined according to the comprehensive confidence, and the speech recognition result is then determined according to the probability that the speech to be recognized contains speech. This effectively solves the problem of speech recognition insertion errors caused by noise, without specifically re-optimizing or retraining the speech recognition module, so the method can be adapted to different speech recognition modules.
The speech recognition method provided by the embodiments of this application includes: determining, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized; determining the decoding feature of each second candidate word, and determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames; and determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result according to the comprehensive confidence. In the above technical solution, the decoding output score of the speech to be recognized can first be determined according to the plurality of first candidate words obtained by decoding, providing a data basis for determining the comprehensive confidence; the plurality of second candidate words corresponding to the speech to be recognized are determined according to the plurality of first candidate words, and after the decoding features of the plurality of second candidate words are determined, a more accurate decoding confidence of the speech to be recognized is determined from those features; a more accurate frame-level noise confidence is then determined according to the noise confidence of each speech frame contained in the speech to be recognized; and the comprehensive confidence of the speech to be recognized is determined according to the decoding confidence, the noise confidence and the decoding output score, combining the segment-level decoding confidence, the frame-level noise confidence and the previously determined decoding output score. This process does not require specific optimization or retraining of the speech recognition model used to decode the speech to be recognized, and yields a more accurate comprehensive confidence of the speech to be recognized, so the determined speech recognition result is also more accurate; the accuracy of speech recognition is improved without increasing cost.
FIG. 4 is a flowchart of another speech recognition method provided by an embodiment of this application. This embodiment is applicable to situations where the accuracy of speech recognition needs to be improved without increasing cost. Explanations of terms that are the same as or correspond to those in the above embodiment are not repeated here. Referring to FIG. 4, the speech recognition method provided by this embodiment includes:
Step 410: determine the decoding output score of the speech to be recognized according to a plurality of first candidate words obtained by decoding the speech to be recognized.
In an implementation, step 410 may include:
performing first-pass decoding on the speech to be recognized based on a speech recognition module composed of a language model and an acoustic model, to obtain the plurality of first candidate words; and determining the language score and the acoustic score of each first candidate word, and determining the decoding output score of the speech to be recognized according to the language scores and the acoustic scores of the plurality of first candidate words.
The speech recognition module containing the language model and the acoustic model may be used to perform first-pass decoding on the speech to be recognized, which yields the first candidate words corresponding to the speech to be recognized; the language score and the acoustic score of each first candidate word may then be determined and fused to obtain the decoding output score of each first candidate word. After the decoding output score of each first candidate word is determined, the decoding output scores of the plurality of first candidate words may be sorted, the three largest decoding output scores may be normalized, and the processing result is determined as the decoding output score of the speech to be recognized.
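For illustration only, a minimal sketch of this scoring step; the equal-weight fusion and the softmax used for normalizing the three largest scores are assumptions, since the text only states that the two scores are fused and the top three results are normalized:

```python
import numpy as np

def utterance_decoding_score(language_scores, acoustic_scores, lm_weight=0.5):
    """Fuse per-word language/acoustic scores, then normalize the top three."""
    fused = (lm_weight * np.asarray(language_scores, dtype=float)
             + (1.0 - lm_weight) * np.asarray(acoustic_scores, dtype=float))
    top3 = np.sort(fused)[-3:]          # three largest decoding output scores
    z = np.exp(top3 - top3.max())       # softmax normalization (assumed)
    return z / z.sum()

print(utterance_decoding_score([1.2, 0.4, 0.9, 0.1], [0.8, 0.6, 1.1, 0.3]))
```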
In this embodiment of this application, after the first candidate words are obtained by decoding the speech to be recognized with the speech recognition module, the decoding output score of the speech to be recognized may be determined according to the first candidate words, providing a data basis for determining the comprehensive confidence of the speech to be recognized.
Step 420: determine the plurality of second candidate words corresponding to the speech to be recognized according to the plurality of first candidate words obtained by decoding the speech to be recognized.
In an implementation, step 420 may include:
performing first-pass decoding on the speech to be recognized based on a speech recognition module composed of a language model and an acoustic model, to obtain a first word lattice containing the plurality of first candidate words; and performing, on the first word lattice, second-pass decoding based on minimum Bayes risk with edit distance as the criterion, to obtain the plurality of second candidate words corresponding to the speech to be recognized and the posterior probability of each second candidate word.
After the first candidate words corresponding to the speech to be recognized are obtained by first-pass decoding with the speech recognition module containing the language model and the acoustic model, the first word lattice may be determined according to the first candidate words, and second-pass decoding based on minimum Bayes risk with edit distance as the criterion may be performed on the first word lattice to obtain the second candidate words corresponding to the speech to be recognized.
The process of second-pass decoding on the first word lattice is as follows: 1) select any candidate path from the initial state to the final state in the first word lattice; 2) taking this candidate path as the reference, compute the edit distance between this candidate path and the entire first word lattice (the edit distance may be the minimum number of insertions, deletions and substitutions required to turn one text into another), and obtain, through the edit distance, the posterior probabilities of all first candidate words in the time period corresponding to each first candidate word on this candidate path; 3) select the word with the highest probability at each time, yielding a new word sequence, that is, the second candidate words; 4) if the second candidate words differ from the first candidate words of the candidate path in 2), return to 2); if they are the same, the second-pass decoding ends, and the word sequence containing the plurality of second candidate words is determined as the second-pass decoding result. In addition, for each second candidate word in the word sequence, the time period in which it lies carries the posterior probabilities of all second candidate words in that time period.
In this embodiment of this application, after the first candidate words are obtained by decoding the speech to be recognized with the speech recognition module, the second candidate words of the speech to be recognized may be determined according to the first candidate words, providing a data basis for determining the decoding confidence of the speech to be recognized.
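For illustration, a minimal sketch of the edit-distance criterion used by the second-pass decoding above (the dynamic program for the minimum number of insertions, deletions and substitutions); the lattice-level minimum-Bayes-risk computation built on top of it is not shown:

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    turning the word sequence `ref` into `hyp`."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                          # delete all of ref[:i]
    for j in range(n + 1):
        d[0][j] = j                          # insert all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

print(edit_distance(["北京", "冬奥会"], ["背景", "动员", "会"]))  # 3
```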
Step 430: determine the decoding feature of each second candidate word.
The decoding features include the confidence score, word category, probability distribution, word length and lattice depth of the second candidate word.
In an implementation, step 430 may include:
normalizing the posterior probabilities of the plurality of second candidate words to obtain the confidence score of each second candidate word; determining the word category of each second candidate word according to the category information of each second candidate word; determining the probability distribution of each second candidate word according to the number of occurrences of each second candidate word among all second candidate words corresponding to the speech to be recognized; determining the word length of each second candidate word according to the number of phonemes contained in each second candidate word; and determining, in the second word lattice obtained by second-pass decoding of the plurality of first candidate words, the lattice depth of each second candidate word according to the number of outgoing edges of all nodes within the time period corresponding to each second candidate word and the length of that time period.
When the second candidate words are obtained by second-pass decoding on the first word lattice, a second word lattice containing the plurality of second candidate words can be generated, which likewise contains the posterior probability of each second candidate word. Therefore, after the posterior probabilities of the plurality of second candidate words are normalized, the confidence score of each second candidate word is obtained; the confidence score may be determined as a one-dimensional feature of the second candidate word. The word category indicates the category information of a second candidate word. First, the domain of the speech to be recognized and the classification of words in that domain may be determined; the word classes in the domain are sorted by the word frequency of each class, each of the top N classes is kept as its own class, N classes in total, and the words of all other classes are grouped into one special class, so the words in the domain are divided into N+1 classes. Each second candidate word can be mapped to one of the N+1 classes, so the word category can be represented by an (N+1)-dimensional feature. For example, when the words in the domain of the speech to be recognized are divided into N+1=3+1=4 classes, the word category of a second candidate word may be (1,0,0,0), (0,1,0,0), (0,0,1,0) or (0,0,0,1). The probability distribution indicates the number of occurrences of a second candidate word among all second candidate words corresponding to the speech to be recognized, so the probability distribution, that is, the unigram probability of the second candidate word, can be determined from that number of occurrences and the total number of second candidate words corresponding to the speech to be recognized; the probability distribution may be determined as a one-dimensional feature of the second candidate word. The word length indicates the number of phonemes contained in a second candidate word and may be determined as a one-dimensional feature of the second candidate word. The lattice depth indicates the average lattice depth of the time period corresponding to a second candidate word, so, in the second word lattice containing the second candidate words, the lattice depth, that is, the average lattice depth, can be determined from the number of outgoing edges of all nodes within the time period corresponding to the second candidate word and the length of that time period; it may be determined as a one-dimensional feature of the second candidate word.
Therefore, the decoding feature of a second candidate word can be determined as an (N+5)-dimensional feature; as described above, when N+1=3+1=4, the decoding feature of a second candidate word is an eight-dimensional feature.
In this embodiment of this application, the (N+5)-dimensional decoding feature of each second candidate word obtained by second-pass decoding can be determined; the decoding features of the second candidate words are used to determine the decoding confidence of the speech to be recognized, providing a data basis for that confidence.
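For illustration, a sketch of assembling the (N+5)-dimensional decoding feature described above; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

def decoding_feature(conf_score, class_index, unigram_prob, num_phonemes,
                     avg_lattice_depth, num_classes):
    """Assemble the (N+5)-dim decoding feature of one second candidate word:
    1 (confidence score) + N+1 (one-hot word category) + 1 (probability
    distribution) + 1 (word length) + 1 (lattice depth)."""
    one_hot = np.zeros(num_classes + 1)       # the N+1 word-category dimensions
    one_hot[class_index] = 1.0
    return np.concatenate(([conf_score], one_hot,
                           [unigram_prob, float(num_phonemes), avg_lattice_depth]))

# With N+1 = 4 classes the feature is 8-dimensional, matching the example above.
feat = decoding_feature(0.7, 2, 0.05, 3, 4.5, num_classes=3)
assert feat.shape == (8,)
```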
Step 440: determine the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words.
In an implementation, step 440 may include:
inputting the decoding features of the plurality of second candidate words into a pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word; and determining the decoding confidence of the speech to be recognized according to the decoding confidences of the plurality of second candidate words.
Before the decoding features of the plurality of second candidate words are input into the pre-trained decoding confidence model, the method further includes:
building a first preset network model based on a deep neural network and a cross-entropy function; taking, as first training data, the (N+5)-dimensional decoding features of the second candidate words corresponding to annotated speech data containing noise and valid speech, together with the annotation information of the speech data; training the first preset network model and computing a first loss function; and performing network optimization based on the back-propagation algorithm until the first loss function converges, to obtain the decoding confidence model.
The first loss function may be a cross-entropy function.
After the first preset network model is built based on a deep neural network and a cross-entropy function, the speech data containing noise and valid speech is annotated, with noise annotated as 0 and valid speech as 1, and the annotated speech data is determined as a first training speech set. First-pass decoding of the training speech in the first training speech set with the speech recognition module yields the first training candidate words, second-pass decoding then yields the second training candidate words, and the (N+5)-dimensional decoding features of the second training candidate words are determined in the manner of step 430 above. The (N+5)-dimensional decoding features of the second candidate words corresponding to the training speech, together with the annotation information of the training speech, are taken as training data; the first preset network model is trained, and the cross-entropy function is computed from the sigmoid activation value output by the model, which represents the confidence score of a second candidate word, and the annotation information of the training speech. Network optimization is performed based on the back-propagation algorithm until the cross-entropy function converges, yielding the decoding confidence model.
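A minimal PyTorch sketch of such a decoding confidence model is shown below; the network depth, layer widths and the Adam optimizer are assumptions, since the text only specifies a deep neural network with a sigmoid confidence output trained with cross entropy and back-propagation:

```python
import torch
from torch import nn

N = 3  # top word-frequency classes, so the feature is (N + 5)-dimensional

model = nn.Sequential(                   # assumed depth and widths
    nn.Linear(N + 5, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),      # confidence score in (0, 1)
)
loss_fn = nn.BCELoss()                   # binary cross entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features, labels):
    """features: (batch, N+5) decoding features;
    labels: float tensor, 1. = valid speech, 0. = noise."""
    optimizer.zero_grad()
    loss = loss_fn(model(features).squeeze(1), labels)
    loss.backward()                      # back-propagation
    optimizer.step()
    return loss.item()
```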
The decoding features of the plurality of second candidate words can then be input into the decoding confidence model, whose output is the decoding confidence of each second candidate word; the arithmetic mean of the decoding confidences of the plurality of second candidate words corresponding to the speech to be recognized is determined and taken as the decoding confidence of the speech to be recognized.
In this embodiment of this application, the decoding confidence of each second candidate word corresponding to the speech to be recognized can be determined based on the decoding confidence model, and the decoding confidence of the speech to be recognized can then be determined according to the decoding confidences of the plurality of second candidate words. The decoding confidence of the speech to be recognized indicates the reliability of the speech recognition result and provides a data basis for determining it.
Step 450: determine the noise confidence of each speech frame contained in the speech to be recognized.
In an implementation, step 450 may include:
dividing the speech to be recognized into frames to obtain the plurality of speech frames contained in the speech to be recognized; and determining the Mel-frequency cepstral coefficients of each speech frame, and inputting the Mel-frequency cepstral coefficients of the plurality of speech frames into a pre-trained noise confidence model to obtain the noise confidence of each speech frame.
Before the Mel-frequency cepstral coefficients of the plurality of speech frames are input into the pre-trained noise confidence model, the method further includes:
building a second preset network model based on gated recurrent units (GRU); taking, as second training data, the Mel-frequency cepstral coefficients of the frame-level training speech corresponding to the training speech in a second training speech set composed of training speech containing pure noise and pure speech, together with the annotation information of the frame-level training speech; training the second preset network model and computing a second loss function; and iterating the weights of the second preset network model based on stochastic gradient descent until the second loss function converges, to obtain the noise confidence model.
The second loss function may also be a cross-entropy function.
FIG. 5 is a schematic diagram of the second preset network model in a speech recognition method provided by an embodiment of this application. As shown in FIG. 5, the second preset network model includes a first fully connected (FC) layer, a first GRU, a second GRU, a third GRU and a second FC layer.
After the second preset network model is built based on GRUs, pure noise and pure speech are collected, the pure noise is randomly added to the pure speech at a preset signal-to-noise ratio to obtain training speech, and a preset amount of training speech is determined as the second training speech set. The training speech in the second training speech set is divided into frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds to obtain the corresponding frame-level training speech; a frame is annotated as 1 when its phoneme is non-silence and as 0 otherwise. Then, the Mel-frequency cepstral coefficients of the frame-level training speech and its annotation information are taken as the second training data to train the second preset network model. The Mel-frequency cepstral coefficients of L frames of training speech may be input into the second preset network model as a training sequence; the output corresponding to each frame is a two-dimensional vector, where one dimension represents the probability that the current frame contains speech and the other the probability that it does not. With the annotation information of the L frames as the target sequence, the second preset network model is trained and the cross-entropy function is computed; the weights of the second preset network model are iterated based on stochastic gradient descent until the second loss function converges, yielding the noise confidence model.
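A minimal PyTorch sketch of the FIG. 5 topology (first FC layer, three GRUs, second FC layer); the layer widths and the number of MFCC coefficients are assumptions:

```python
import torch
from torch import nn

class NoiseConfidenceModel(nn.Module):
    """FC -> three stacked GRUs -> FC, per FIG. 5; widths are assumptions."""
    def __init__(self, n_mfcc=13, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(n_mfcc, hidden)           # first FC layer
        self.gru = nn.GRU(hidden, hidden, num_layers=3, batch_first=True)
        self.fc_out = nn.Linear(hidden, 2)               # [p(speech), p(no speech)]

    def forward(self, mfcc_seq):                         # (batch, L, n_mfcc)
        h, _ = self.gru(torch.relu(self.fc_in(mfcc_seq)))
        return torch.softmax(self.fc_out(h), dim=-1)     # per-frame probabilities
```

For training, one would typically omit the final softmax, pair the raw logits with nn.CrossEntropyLoss, and iterate the weights with torch.optim.SGD, matching the stochastic gradient descent described above.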
The speech to be recognized can then be divided into frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds to obtain the speech frames contained in the speech to be recognized; the Mel-frequency cepstral coefficients of each speech frame are determined and input into the noise confidence model, whose output is the probability p(t) that the speech frame contains speech, with 1-p(t) being the probability that it does not; the noise confidence of the speech frame is then determined as 1-p(t).
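Continuing the sketch, the frame-level noise confidences of an utterance might be computed as follows, using the NoiseConfidenceModel class sketched above; the use of librosa for MFCC extraction, the 13 coefficients and the file name are assumptions:

```python
import librosa
import torch

y, sr = librosa.load("utterance.wav", sr=16000)         # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),       # 25 ms frame length
                            hop_length=int(0.010 * sr))  # 10 ms frame shift
frames = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)  # (1, L, 13)
with torch.no_grad():
    p_speech = NoiseConfidenceModel()(frames)[0, :, 0]   # p(t); index 0 = "speech"
noise_confidence = 1.0 - p_speech                        # frame noise confidence 1 - p(t)
```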
In this embodiment of this application, the noise confidence of each speech frame contained in the speech to be recognized can be determined based on the noise confidence model; the noise confidences of the plurality of speech frames are used to determine the noise confidence of the speech to be recognized, which in turn can be used to determine the comprehensive confidence of the speech to be recognized, providing a data basis for determining the speech recognition result.
Step 460: determine the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames.
In an implementation, step 460 may include:
determining the noise confidence of the speech to be recognized according to the maximum noise confidence, the minimum noise confidence, the mean of the noise confidences and the variance of the noise confidences among the noise confidences of the plurality of speech frames contained in the speech to be recognized.
After the noise confidence of each speech frame contained in the speech to be recognized is determined, the noise confidences of the plurality of speech frames may be sorted and their mean and variance computed; the maximum noise confidence, the minimum noise confidence, the mean and the variance of those noise confidences are determined as the noise confidence of the speech to be recognized.
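A sketch of this aggregation step (names are illustrative):

```python
import numpy as np

def utterance_noise_confidence(frame_noise_conf):
    """Aggregate frame-level noise confidences into the segment-level
    noise confidence: [max, min, mean, variance]."""
    c = np.asarray(frame_noise_conf, dtype=float)
    return np.array([c.max(), c.min(), c.mean(), c.var()])
```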
In this embodiment of this application, the segment-level noise confidence of the speech to be recognized can be determined based on the frame-level noise confidences of the plurality of speech frames contained in the speech to be recognized.
Step 470: determine the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized.
In an implementation, step 470 may include:
inputting the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized into a pre-trained speech recognition model to obtain the comprehensive confidence of the speech to be recognized.
Before the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized are input into the pre-trained speech recognition model, the method further includes:
building a third preset network model based on a logistic regressor; taking, as third training data, the decoding confidence, noise confidence and decoding output score corresponding to the training speech in a third training speech set constructed from speech containing noise, together with the annotation information of the training speech; training the third preset network model and computing a third loss function; and performing network optimization based on the back-propagation algorithm until the third loss function converges, to obtain the speech recognition model.
The decoding confidence, the noise confidence and the decoding output score of the speech to be recognized can then be input into the speech recognition model, whose output is the comprehensive confidence of the speech to be recognized.
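For illustration, scikit-learn's LogisticRegression can stand in for the logistic regressor; the per-utterance feature layout (1 decoding confidence + 4 noise-confidence statistics + 3 normalized decoding output scores), the synthetic data, and sklearn's default solver in place of the gradient-based training described above are all assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 8))             # synthetic stand-in fusion features
y = rng.integers(0, 2, 200)          # 1 = contains speech, 0 = does not

fusion = LogisticRegression().fit(X, y)              # "third preset network model"
comprehensive_confidence = fusion.predict_proba(X[:1])[0, 1]
print(comprehensive_confidence)      # probability the utterance contains speech
```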
In this embodiment of this application, the segment-level decoding confidence, the frame-level noise confidence and the decoding output score of the speech to be recognized are fused to determine the comprehensive confidence of the speech to be recognized.
Step 480: determine the speech recognition result according to the comprehensive confidence.
The comprehensive confidence includes the probability that the speech to be recognized contains speech and the probability that it does not.
In an implementation, step 480 may include:
if the probability that the speech to be recognized contains speech is greater than or equal to a first preset threshold, determining that the speech recognition result is that the speech to be recognized contains speech; if the probability is greater than or equal to a second preset threshold and less than the first preset threshold, determining that the speech recognition result is that the speech to be recognized does not contain speech; and if the probability is less than the second preset threshold, determining that the speech recognition is erroneous, or optimizing the speech to be recognized to obtain optimized speech and performing speech recognition again based on the optimized speech.
The first preset threshold is greater than the second preset threshold, and both are less than 1.
After the comprehensive confidence of the speech to be recognized is determined based on the speech recognition model, the probability that the speech to be recognized contains speech is first compared with the second preset threshold.
On the one hand, if the probability that the speech to be recognized contains speech is greater than or equal to the second preset threshold, the speech recognition result can be determined from this probability, that is, the decoding result of the speech recognition module can be adopted. The probability is then further compared with the first preset threshold: if it is greater than or equal to the first preset threshold, the speech recognition result is determined to be that the speech to be recognized contains speech; if it is less than the first preset threshold, the speech recognition result is determined to be that the speech to be recognized does not contain speech.
On the other hand, if the probability that the speech to be recognized contains speech is less than the second preset threshold, the speech recognition result cannot be determined from this probability, and the decoding result of the speech recognition module may not be adopted; it can then be determined that the speech recognition is erroneous, or the speech to be recognized can be optimized to obtain optimized speech and speech recognition performed again based on the optimized speech.
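A sketch of this two-threshold decision; the threshold values 0.8 and 0.3 are made up for the example:

```python
def decide(p_speech, t1=0.8, t2=0.3):
    """Two-threshold decision (t1 > t2, both < 1)."""
    if p_speech >= t1:
        return "contains speech"            # adopt the decoding result
    if p_speech >= t2:
        return "does not contain speech"    # adopt the decoding result
    return "recognition error, or re-recognize after optimization"
```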
In an implementation, optimizing the speech to be recognized to obtain the optimized speech includes:
muting the speech frames in the speech to be recognized whose noise confidence is greater than a preset confidence, to obtain the optimized speech.
The noise confidence of each speech frame contained in the speech to be recognized is compared with the preset confidence; if the noise confidence of any speech frame is greater than the preset confidence, that speech frame is set to silence, thereby optimizing the speech to be recognized and obtaining the optimized speech. Muting the speech frames whose noise confidence is greater than the preset confidence denoises the speech to be recognized, and continuing speech recognition based on the denoised optimized speech can improve the accuracy of speech recognition.
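A sketch of the muting step on raw samples; the preset confidence value is made up, and the 25 ms / 10 ms framing follows the text above:

```python
import numpy as np

def mute_noisy_frames(y, frame_noise_conf, sr=16000, preset_conf=0.5,
                      frame_len=0.025, frame_shift=0.010):
    """Zero out (mute) the samples of frames whose noise confidence
    exceeds the preset confidence."""
    y = np.array(y, dtype=float, copy=True)
    hop, win = int(frame_shift * sr), int(frame_len * sr)
    for t, conf in enumerate(frame_noise_conf):
        if conf > preset_conf:
            y[t * hop : t * hop + win] = 0.0   # set the frame to silence
    return y
```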
In this embodiment of this application, it can be determined according to the comprehensive confidence that the speech to be recognized contains speech or does not contain speech; alternatively, the decoding result of the speech recognition module for the speech to be recognized is not adopted, and it is then determined that the speech recognition is erroneous, or the speech to be recognized is denoised and optimized to obtain the optimized speech, and the speech recognition module continues to decode the optimized speech to obtain the speech recognition result.
The speech recognition method provided by this embodiment of this application includes: determining the decoding output score of the speech to be recognized according to a plurality of first candidate words obtained by decoding the speech to be recognized; determining, according to the plurality of first candidate words, a plurality of second candidate words corresponding to the speech to be recognized; determining the decoding feature of each second candidate word, and determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames; and determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result according to the comprehensive confidence. In the above technical solution, the decoding output score can first be determined from the plurality of first candidate words obtained by decoding, providing a data basis for determining the comprehensive confidence; the plurality of second candidate words are determined from the plurality of first candidate words, and after the decoding feature of each second candidate word is determined, a more accurate decoding confidence of the speech to be recognized is determined from those features; a more accurate frame-level noise confidence is then determined from the noise confidences of the plurality of speech frames; and the comprehensive confidence of the speech to be recognized is determined from the decoding confidence, the noise confidence and the decoding output score, combining the segment-level decoding confidence, the frame-level noise confidence and the previously determined decoding output score. This process does not require specific optimization or retraining of the speech recognition model used to decode the speech to be recognized, and yields a more accurate comprehensive confidence, so the determined speech recognition result is also more accurate; the accuracy of speech recognition is improved without increasing cost.
In addition, after it is determined according to the comprehensive confidence that the decoding result of the speech recognition module for the speech to be recognized is not adopted, it can be determined that the speech recognition is erroneous; alternatively, the speech to be recognized can be denoised and optimized to obtain optimized speech, and the speech recognition module continues to decode the optimized speech to obtain the speech recognition result.
FIG. 6 is a schematic diagram of a speech recognition system provided by an embodiment of this application. As shown in FIG. 6, the speech recognition system may include a speech recognition module 100, a decoding confidence module 200, a noise confidence module 300, a result determination module 400 and a processing module 500. The speech recognition module 100 is configured to perform first-pass decoding on the speech to be recognized to determine the plurality of first candidate words of the speech to be recognized and the first word lattice containing the plurality of first candidate words. The decoding confidence module 200 is configured to determine, according to the plurality of first candidate words, the decoding output score of the speech to be recognized and the plurality of second candidate words corresponding to the speech to be recognized, and, after determining the decoding feature of each second candidate word, determine the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words. The noise confidence module 300 is configured to determine the noise confidence of each speech frame contained in the speech to be recognized, and determine the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames. The result determination module 400 is configured to determine the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized. The processing module 500 is configured to determine the speech recognition result according to the comprehensive confidence, for example: when the probability that the speech to be recognized contains speech is determined to be greater than or equal to the first preset threshold, determining that the speech recognition result is that the speech to be recognized contains speech; when the probability is greater than or equal to the second preset threshold and less than the first preset threshold, determining that the speech recognition result is that the speech to be recognized does not contain speech; and when the probability is less than the second preset threshold, determining that the speech recognition is erroneous, or optimizing the speech to be recognized to obtain optimized speech and performing speech recognition again based on the optimized speech.
The speech recognition system provided by the embodiments of this application can perform the speech recognition method provided by any embodiment of this application, and has functional modules and effects corresponding to the method.
FIG. 7 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of this application. The apparatus belongs to the same application concept as the speech recognition method of the above embodiments; for details not exhaustively described in the embodiment of the speech recognition apparatus, reference may be made to the embodiments of the speech recognition method above.
As shown in FIG. 7, the speech recognition apparatus includes:
a decoding module 710 configured to determine, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized; a decoding confidence determination module 720 configured to determine the decoding feature of each second candidate word and determine the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; a noise confidence determination module 730 configured to determine the noise confidence of each speech frame contained in the speech to be recognized and determine the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames; and an execution module 740 configured to determine the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determine the speech recognition result according to the comprehensive confidence.
On the basis of the above embodiments, the decoding module 710 is configured to:
perform first-pass decoding on the speech to be recognized based on a speech recognition module composed of a language model and an acoustic model, to obtain a first word lattice containing the plurality of first candidate words; determine the language score and the acoustic score of each first candidate word, and determine the decoding output score of the speech to be recognized according to the language scores and the acoustic scores of the plurality of first candidate words; and perform, on the first word lattice, second-pass decoding based on minimum Bayes risk with edit distance as the criterion, to obtain the plurality of second candidate words corresponding to the speech to be recognized and the posterior probability of each second candidate word.
On the basis of the above embodiments, the decoding features include the confidence score, word category, probability distribution, word length and lattice depth of the second candidate word; correspondingly, the decoding confidence determination module 720 is configured to:
normalize the posterior probabilities of the plurality of second candidate words to obtain the confidence score of each second candidate word; determine the word category of each second candidate word according to the category information of each second candidate word; determine the probability distribution of each second candidate word according to the number of occurrences of each second candidate word among all second candidate words corresponding to the speech to be recognized; determine the word length of each second candidate word according to the number of phonemes contained in each second candidate word; determine, in the second word lattice obtained by second-pass decoding of the first candidate words, the lattice depth of each second candidate word according to the number of outgoing edges of all nodes within the time period corresponding to each second candidate word and the length of that time period; input the decoding features of the plurality of second candidate words into a pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word; and determine the decoding confidence of the speech to be recognized according to the decoding confidence of each second candidate word.
On the basis of the above embodiments, the noise confidence determination module 730 is configured to:
divide the speech to be recognized into frames to obtain the plurality of speech frames contained in the speech to be recognized; determine the Mel-frequency cepstral coefficients of each speech frame, and input the Mel-frequency cepstral coefficients of the plurality of speech frames into a pre-trained noise confidence model to obtain the noise confidence of each speech frame; and determine the noise confidence of the speech to be recognized according to the maximum noise confidence, the minimum noise confidence, the mean of the noise confidences and the variance of the noise confidences among the noise confidences of the plurality of speech frames.
On the basis of the above embodiments, the execution module 740 is configured to:
input the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized into a pre-trained speech recognition model to obtain the comprehensive confidence of the speech to be recognized; and determine the speech recognition result according to the comprehensive confidence.
In an implementation, the comprehensive confidence includes the probability that the speech to be recognized contains speech; correspondingly, determining the speech recognition result according to the comprehensive confidence includes:
if the probability that the speech to be recognized contains speech is greater than or equal to a first preset threshold, determining that the speech recognition result is that the speech to be recognized contains speech; if the probability is greater than or equal to a second preset threshold and less than the first preset threshold, determining that the speech recognition result is that the speech to be recognized does not contain speech; and if the probability is less than the second preset threshold, determining that the speech recognition is erroneous, or optimizing the speech to be recognized to obtain optimized speech and performing speech recognition again based on the optimized speech.
Optimizing the speech to be recognized to obtain the optimized speech includes:
muting the speech frames in the speech to be recognized whose noise confidence is greater than a preset confidence, to obtain the optimized speech.
The speech recognition apparatus provided by the embodiments of this application can perform the speech recognition method provided by any embodiment of this application, and has functional modules and effects corresponding to the method.
In the above embodiments of the speech recognition apparatus, the units and modules included are only divided according to functional logic, but the division is not limited to the above, as long as the corresponding functions can be realized; in addition, the names of the functional units are only intended to distinguish them from one another and are not intended to limit the scope of protection of this application.
FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of this application. FIG. 8 shows a block diagram of an exemplary computer device 8 suitable for implementing the embodiments of this application. The computer device 8 shown in FIG. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of this application.
As shown in FIG. 8, the computer device 8 is embodied in the form of a general-purpose computing device. The components of the computer device 8 may include, but are not limited to: one or more processors or processing units 16, a memory 28, and a bus 18 connecting different system components (including the memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus structures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The computer device 8 includes a variety of computer system readable media. These media may be any available media that can be accessed by the computer device 8, including volatile and non-volatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer device 8 may include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be configured to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 8 and commonly referred to as a "hard disk drive"). Although not shown in FIG. 8, a disk drive configured to read and write a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive configured to read and write a removable non-volatile optical disc (such as a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc-Read Only Memory (DVD-ROM) or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of this application.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each or some combination of these examples may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods in the embodiments described in this application.
The computer device 8 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device and a display 24), may also communicate with one or more devices that enable a user to interact with the computer device 8, and/or with any device (such as a network card or a modem) that enables the computer device 8 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Moreover, the computer device 8 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 20. As shown in FIG. 8, the network adapter 20 communicates with the other modules of the computer device 8 through the bus 18. It should be understood that, although not shown in FIG. 8, other hardware and/or software modules may be used in conjunction with the computer device 8, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes a variety of functional applications and page displays by running programs stored in the memory 28, for example, implementing the speech recognition method provided by the embodiments of this application, the method including:
determining, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized; determining the decoding feature of each second candidate word, and determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames; and determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result according to the comprehensive confidence.
Those skilled in the art can understand that the processor can also implement the technical solution of the speech recognition method provided by any embodiment of this application.
An embodiment of this application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the speech recognition method provided by the embodiments of this application is implemented, the method including:
determining, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized; determining the decoding feature of each second candidate word, and determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words; determining the noise confidence of each speech frame contained in the speech to be recognized, and determining the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames; and determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result according to the comprehensive confidence.
The computer storage medium in the embodiments of this application may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. Examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more conductors, a portable computer disk, a hard disk, a RAM, a ROM, an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted by any appropriate medium, including, but not limited to: wireless, wire, optical cable, radio frequency (RF), and the like, or any suitable combination of the above.
Computer program code for performing the operations of this application may be written in one or more programming languages or a combination thereof; the programming languages include object-oriented programming languages such as Java, Smalltalk and C++, and also conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or a WAN, or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Those of ordinary skill in the art should understand that the modules or steps of this application described above may be implemented by a general-purpose computing apparatus; they may be centralized on a single computing apparatus or distributed over a network composed of multiple computing apparatuses; optionally, they may be implemented by program code executable by a computing apparatus, so that they can be stored in a storage apparatus and executed by the computing apparatus, or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. In this way, this application is not limited to any specific combination of hardware and software.
In addition, the acquisition, storage, use and processing of data in the technical solution of this application comply with the relevant provisions of national laws and regulations.

Claims (13)

1. A speech recognition method, comprising:
    determining, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized;
    determining a decoding feature of each second candidate word, and determining a decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words;
    determining a noise confidence of each speech frame contained in the speech to be recognized, and determining a noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames;
    determining a comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determining a speech recognition result according to the comprehensive confidence.
2. The speech recognition method according to claim 1, wherein determining the decoding output score of the speech to be recognized according to the plurality of first candidate words obtained by decoding the speech to be recognized comprises:
    performing first-pass decoding on the speech to be recognized based on a speech recognition module composed of a language model and an acoustic model, to obtain the plurality of first candidate words;
    determining a language score and an acoustic score of each first candidate word, and determining the decoding output score of the speech to be recognized according to the language scores and the acoustic scores of the plurality of first candidate words.
3. The speech recognition method according to claim 1, wherein determining the plurality of second candidate words corresponding to the speech to be recognized according to the plurality of first candidate words obtained by decoding the speech to be recognized comprises:
    performing first-pass decoding on the speech to be recognized based on a speech recognition module composed of a language model and an acoustic model, to obtain a first word lattice containing the plurality of first candidate words;
    performing, on the first word lattice, second-pass decoding based on minimum Bayes risk with edit distance as the criterion, to obtain the plurality of second candidate words corresponding to the speech to be recognized and a posterior probability of each second candidate word.
4. The speech recognition method according to claim 3, wherein the decoding features comprise a confidence score, a word category, a probability distribution, a word length and a lattice depth of the second candidate word, and determining the decoding feature of each second candidate word comprises:
    normalizing the posterior probabilities of the plurality of second candidate words to obtain the confidence score of each second candidate word;
    determining the word category of each second candidate word according to category information of each second candidate word;
    determining the probability distribution of each second candidate word according to the number of occurrences of each second candidate word among all second candidate words corresponding to the speech to be recognized;
    determining the word length of each second candidate word according to the number of phonemes contained in each second candidate word;
    determining, in a second word lattice obtained by second-pass decoding of the plurality of first candidate words, the lattice depth of each second candidate word according to the number of outgoing edges of all nodes within a time period corresponding to each second candidate word and the length of the time period.
5. The speech recognition method according to claim 1, wherein determining the decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words comprises:
    inputting the decoding features of the plurality of second candidate words into a pre-trained decoding confidence model to obtain a decoding confidence of each second candidate word;
    determining the decoding confidence of the speech to be recognized according to the decoding confidences of the plurality of second candidate words.
6. The speech recognition method according to claim 1, wherein determining the noise confidence of each speech frame contained in the speech to be recognized comprises:
    dividing the speech to be recognized into frames to obtain the plurality of speech frames contained in the speech to be recognized;
    determining Mel-frequency cepstral coefficients of each speech frame, and inputting the Mel-frequency cepstral coefficients of the plurality of speech frames into a pre-trained noise confidence model to obtain the noise confidence of each speech frame.
7. The speech recognition method according to claim 1, wherein determining the noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames comprises:
    determining the noise confidence of the speech to be recognized according to a maximum noise confidence, a minimum noise confidence, a mean of the noise confidences and a variance of the noise confidences among the noise confidences of the plurality of speech frames contained in the speech to be recognized.
8. The speech recognition method according to claim 1, wherein determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized comprises:
    inputting the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized into a pre-trained speech recognition model to obtain the comprehensive confidence of the speech to be recognized.
9. The speech recognition method according to claim 1, wherein the comprehensive confidence comprises a probability that the speech to be recognized contains speech, and determining the speech recognition result according to the comprehensive confidence comprises:
    in a case where the probability that the speech to be recognized contains speech is greater than or equal to a first preset threshold, determining that the speech recognition result is that the speech to be recognized contains speech;
    in a case where the probability that the speech to be recognized contains speech is greater than or equal to a second preset threshold and less than the first preset threshold, determining that the speech recognition result is that the speech to be recognized does not contain speech;
    in a case where the probability that the speech to be recognized contains speech is less than the second preset threshold, determining that the speech recognition is erroneous, or optimizing the speech to be recognized to obtain optimized speech and performing speech recognition again based on the optimized speech.
10. The speech recognition method according to claim 9, wherein optimizing the speech to be recognized to obtain the optimized speech comprises:
    muting speech frames in the speech to be recognized whose noise confidence is greater than a preset confidence, to obtain the optimized speech.
11. A speech recognition apparatus, comprising:
    a decoding module configured to determine, according to a plurality of first candidate words obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a plurality of second candidate words corresponding to the speech to be recognized;
    a decoding confidence determination module configured to determine a decoding feature of each second candidate word, and determine a decoding confidence of the speech to be recognized according to the decoding features of the plurality of second candidate words;
    a noise confidence determination module configured to determine a noise confidence of each speech frame contained in the speech to be recognized, and determine a noise confidence of the speech to be recognized according to the noise confidences of the plurality of speech frames;
    an execution module configured to determine a comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determine a speech recognition result according to the comprehensive confidence.
12. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the speech recognition method according to any one of claims 1-10.
13. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to perform the speech recognition method according to any one of claims 1-10.
PCT/CN2023/097748 2022-06-28 2023-06-01 Speech recognition method and apparatus, device and storage medium WO2024001662A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210753629.8A CN115294974A (zh) 2022-06-28 2022-06-28 Speech recognition method and apparatus, device and storage medium
CN202210753629.8 2022-06-28

Publications (1)

Publication Number Publication Date
WO2024001662A1 (zh)

Family

ID=83820283

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/097748 2022-06-28 2023-06-01 Speech recognition method and apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN115294974A (zh)
WO (1) WO2024001662A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294974A (zh) * 2022-06-28 2022-11-04 京东科技信息技术有限公司 一种语音识别方法、装置、设备和存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064315A1 (en) * 2002-09-30 2004-04-01 Deisher Michael E. Acoustic confidence driven front-end preprocessing for speech recognition in adverse environments
CN111341305A (zh) * 2020-03-05 2020-06-26 苏宁云计算有限公司 Audio data annotation method, apparatus and system
CN111883109A (zh) * 2020-07-01 2020-11-03 北京猎户星空科技有限公司 Speech information processing and verification model training method, apparatus, device and medium
CN112599128A (zh) * 2020-12-31 2021-04-02 百果园技术(新加坡)有限公司 Speech recognition method, apparatus, device and storage medium
CN112951219A (zh) * 2021-02-01 2021-06-11 思必驰科技股份有限公司 Noise rejection method and apparatus
CN114255754A (zh) * 2021-12-27 2022-03-29 贝壳找房网(北京)信息技术有限公司 Speech recognition method, electronic device, program product and storage medium
CN115294974A (zh) * 2022-06-28 2022-11-04 京东科技信息技术有限公司 Speech recognition method and apparatus, device and storage medium

Also Published As

Publication number Publication date
CN115294974A (zh) 2022-11-04

Similar Documents

Publication Publication Date Title
US11922932B2 (en) Minimum word error rate training for attention-based sequence-to-sequence models
WO2021051544A1 (zh) Speech recognition method and apparatus therefor
Henderson et al. Discriminative spoken language understanding using word confusion networks
US8793132B2 (en) Method for segmenting utterances by using partner's response
US20050187768A1 (en) Dynamic N-best algorithm to reduce recognition errors
JP2002140089A (ja) Pattern recognition training method and apparatus performing noise reduction after using inserted noise
CN116888662A (zh) Learning word-level confidence for subword end-to-end automatic speech recognition
EP3739583B1 (en) Dialog device, dialog method, and dialog computer program
US11120802B2 (en) Diarization driven by the ASR based segmentation
WO2022166218A1 (zh) Method for adding punctuation marks in speech recognition, and speech recognition apparatus
US20080215325A1 (en) Technique for accurately detecting system failure
CN110503956B (zh) Speech recognition method, apparatus, medium and electronic device
CN114360557B (zh) Voice timbre conversion method, model training method, apparatus, device and medium
WO2024001662A1 (zh) Speech recognition method and apparatus, device and storage medium
CN117099157A (zh) Multi-task learning for end-to-end automatic speech recognition confidence and deletion estimation
WO2023109129A1 (zh) Speech data processing method and apparatus
CN113782029B (zh) Training method, apparatus and device for speech recognition model, and storage medium
CN111400463B (zh) Dialog response method, apparatus, device and medium
KR20240065125A (ko) Large language model data selection for rare-word speech recognition
US10468031B2 (en) Diarization driven by meta-information identified in discussion content
CN113793599A (zh) Training method for speech recognition model, and speech recognition method and apparatus
Rose et al. Integration of utterance verification with statistical language modeling and spoken language understanding
CN114399992B (zh) Voice instruction response method, apparatus and storage medium
TWI818427B (zh) Speaker diarization correction method and system using text-based speaker change detection
CN113327596B (zh) Training method for speech recognition model, and speech recognition method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829847

Country of ref document: EP

Kind code of ref document: A1