CN114255754A - Speech recognition method, electronic device, program product, and storage medium - Google Patents


Info

Publication number
CN114255754A
Authority
CN
China
Prior art keywords: word, voice, probability, noise, word sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111611631.3A
Other languages
Chinese (zh)
Inventor
颜瑞 (Yan Rui)
徐延广 (Xu Yanguang)
解传栋 (Xie Chuandong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shell Housing Network Beijing Information Technology Co Ltd
Original Assignee
Shell Housing Network Beijing Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shell Housing Network Beijing Information Technology Co Ltd
Priority to CN202111611631.3A
Publication of CN114255754A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 - Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/197 - Probabilistic grammars, e.g. word n-grams
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 2015/088 - Word spotting
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure disclose a speech recognition method, an electronic device, a program product, and a storage medium. A speech to be recognized is decoded to obtain a first decoding result comprising a plurality of first word sequences together with the start time, end time, acoustic probability, and language probability of each word. The confidence of each word in the first word sequences is calculated from its acoustic probability and language probability. A second decoding result is then obtained from the words whose confidence exceeds a first preset threshold, and comprises a plurality of second word sequences together with the start time, end time, acoustic probability, and language probability of each word. A speech recognition result is finally obtained from the second word sequences and these per-word quantities. The method improves the accuracy of the speech recognition result and thereby the effect of applications based on speech recognition.

Description

Speech recognition method, electronic device, program product, and storage medium
Technical Field
The present disclosure relates to a speech recognition method and apparatus, an electronic device, a program product, and a storage medium.
Background
Speech recognition is a technology that takes speech as its object of study and, through speech signal processing and pattern recognition, enables a machine to automatically recognize and understand human spoken language and convert it into text. With the development of the mobile internet, speech recognition has become increasingly important: it is the foundation on which many other applications are built. For example, speech recognition technology enables applications such as voice dialing and voice navigation. The more accurate the speech recognition result, the better a speech-recognition-based application will work.
Existing speech recognition systems return a confidence for the whole recognized text sentence along with the sentence itself; the receiving end judges whether the sentence is reliable by checking whether this sentence-level confidence exceeds a preset threshold, and decides on the next action accordingly.
In the course of implementing the present invention, the inventors found through research that, because existing speech recognition systems return a confidence for the whole text sentence, the sentence may still contain incorrectly recognized words. When the receiving end judges the sentence to be reliable based on its overall confidence and performs the next action, those incorrectly recognized words can trigger an erroneous action, which degrades the effect of speech-recognition-based applications.
Disclosure of Invention
Embodiments of the present disclosure provide a speech recognition method and apparatus, an electronic device, a program product, and a storage medium that improve the accuracy of the speech recognition result, mitigate, at least to some extent, erroneous actions caused by incorrectly recognized words in the whole text sentence, and improve the effect of speech-recognition-based applications.
According to an aspect of an embodiment of the present disclosure, there is provided a speech recognition method including:
decoding a speech to be recognized to obtain a first decoding result, where the first decoding result comprises: a plurality of first word sequences and the start time, end time, acoustic probability, and language probability of each word in the plurality of first word sequences;
calculating the confidence of each word in the plurality of first word sequences based on its acoustic probability and language probability;
obtaining a second decoding result based on the words in the plurality of first word sequences whose confidence is greater than a first preset threshold, where the second decoding result comprises: a plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word in the plurality of second word sequences;
obtaining a speech recognition result based on the plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word in them, where the speech recognition result comprises: a first sentence and the confidence of the first sentence, for performing a corresponding action based on the speech recognition result.
Optionally, in any embodiment of the present disclosure, calculating the confidence of each word in the plurality of first word sequences based on its acoustic probability and language probability includes:
taking each word in each of the plurality of first word sequences in turn as the current word, and calculating the forward probability and backward probability of the current word based on its acoustic probability and language probability;
calculating the posterior probability of the current word in its first word sequence based on the forward probability, backward probability, and language probability of the current word;
and superimposing the posterior probabilities of the current word across the plurality of first word sequences to obtain the confidence of the current word.
Optionally, in any embodiment of the present disclosure, obtaining the speech recognition result based on the plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word includes:
determining, among the plurality of second word sequences, the second word sequence with the highest comprehensive score, based on the start time, end time, acoustic probability, and language probability of each word;
obtaining a first sentence based on the second word sequence with the highest comprehensive score;
and obtaining the confidence of the first sentence based on the confidence of each word in the first sentence.
Optionally, in any embodiment of the present disclosure, before decoding the speech to be recognized, the method further includes:
performing voice endpoint detection on the audio signal acquired by an audio acquisition module using an endpoint detection model, to obtain the start point and end point of at least one voice activity segment;
and intercepting the at least one voice activity segment from the audio signal based on its start point and end point, so that each voice activity segment is taken in turn as the speech to be recognized and the operation of decoding the speech to be recognized to obtain a first decoding result is performed on it.
Optionally, in any embodiment of the present disclosure, after calculating the confidence of each word in the plurality of first word sequences based on the acoustic probability and the language probability of each word in the plurality of first word sequences, the method further includes:
in response to the confidence of every word in the plurality of first word sequences being greater than a second preset threshold and smaller than a third preset threshold, adding the speech to be recognized to a noise set as a noise sample, to be used for training the endpoint detection model and/or as foreground noise for training an acoustic model; where the noise set comprises at least one noise sample, and the second preset threshold is smaller than the third preset threshold.
Optionally, in any embodiment of the present disclosure, the method further includes:
training the endpoint detection model using the noise samples in the noise set.
Optionally, in any embodiment of the present disclosure, after calculating the confidence of each word in the plurality of first word sequences based on the acoustic probability and the language probability of each word in the plurality of first word sequences, the method further includes:
in response to the confidence of every word in the plurality of first word sequences being greater than the third preset threshold and smaller than the first preset threshold, adding the speech to be recognized to a background voice set as a background voice sample, to be used as background human voice noise for training an acoustic model; where the background voice set comprises at least one background voice sample, and the third preset threshold is smaller than the first preset threshold.
Optionally, in any embodiment of the present disclosure, after calculating the confidence of each word in the plurality of first word sequences based on the acoustic probability and the language probability of each word in the plurality of first word sequences, the method further includes:
in response to the confidence of every word in the plurality of first word sequences being greater than a fourth preset threshold, obtaining a second sentence based on the plurality of first word sequences and the confidence of each word in them; where the fourth preset threshold is not smaller than the first preset threshold;
taking the second sentence as the labeling information of the speech to be recognized, and adding the speech to be recognized together with its labeling information to a speech set as a speech sample, to be used for training an acoustic model and a language model; where the speech set comprises at least one speech sample.
Optionally, in any embodiment of the present disclosure, adding the speech to be recognized and its labeling information to a speech set as a speech sample includes:
determining attribute information of the second sentence, where the attribute information comprises any one or more of: domain, application scenario, geographic area;
and adding the speech to be recognized and its labeling information, as a speech sample, to the speech set corresponding to the attribute information, so as to train an acoustic model and a language model corresponding to that attribute information.
Optionally, in any embodiment of the present disclosure, the method further includes:
constructing a training data set based on the noise samples in the noise set, the background human voice samples in the background voice set, and the speech samples in the speech set, where the training data set comprises at least one noisy signal, and a noisy signal is any one or more of the following: a speech sample mixed with a noise sample serving as foreground noise; a speech sample mixed with a background human voice sample serving as background human voice noise; and a speech sample mixed with both a noise sample serving as foreground noise and a background human voice sample serving as background human voice noise. Each noisy signal carries the labeling information of the speech sample from which it was generated;
and training an acoustic model and/or a language model for decoding the speech to be recognized by utilizing the training data set.
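To make the construction of such a training data set concrete, the following is a minimal sketch of additive mixing at a target signal-to-noise ratio. The patent does not prescribe a mixing procedure or SNR handling; the function names and the `audio`/`label` attributes below are illustrative assumptions.

```python
import numpy as np

def mix(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix `noise` into `speech` at a target SNR in dB (a sketch)."""
    noise = np.resize(noise, speech.shape)        # loop or trim noise to length
    p_speech = float(np.mean(speech ** 2))
    p_noise = float(np.mean(noise ** 2)) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def build_noisy_set(speech_samples, noise_set, background_set, snr_db=10.0):
    """Create the three kinds of noisy signals described above; every noisy
    signal keeps the labeling information of the clean sample it came from."""
    noisy = []
    for s in speech_samples:                        # s has .audio and .label
        for n in noise_set:                         # speech + foreground noise
            noisy.append((mix(s.audio, n, snr_db), s.label))
        for b in background_set:                    # speech + background voice
            noisy.append((mix(s.audio, b, snr_db), s.label))
        for n, b in zip(noise_set, background_set):  # speech + both noise types
            noisy.append((mix(mix(s.audio, n, snr_db), b, snr_db), s.label))
    return noisy
```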
According to another aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
a speech recognition model, configured to decode the speech to be recognized to obtain a first decoding result, where the first decoding result comprises: a plurality of first word sequences and the start time, end time, acoustic probability, and language probability of each word in the plurality of first word sequences;
a confidence calculation module, configured to calculate the confidence of each word in the plurality of first word sequences based on its acoustic probability and language probability;
an obtaining module, configured to obtain a second decoding result based on the words in the plurality of first word sequences whose confidence is greater than a first preset threshold, where the second decoding result comprises: a plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word in the plurality of second word sequences;
a result determining module, configured to obtain a speech recognition result based on the plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word in them, where the speech recognition result comprises: a first sentence and the confidence of the first sentence, for performing a corresponding action based on the speech recognition result.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including:
a memory for storing a computer program;
a processor, configured to execute the computer program stored in the memory and, when the computer program is executed, to implement the speech recognition method according to any of the above embodiments of the present disclosure.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the speech recognition method according to any of the above embodiments of the present disclosure.
According to yet another aspect of an embodiment of the present disclosure, there is provided a computer program product including computer programs/instructions, which when executed by a processor, implement the speech recognition method according to any one of the above embodiments of the present disclosure.
With the speech recognition method and apparatus, electronic device, program product, and storage medium provided by the embodiments of the present disclosure, after a first decoding result is obtained by decoding the speech to be recognized, the confidence of each word is calculated from the acoustic probability and language probability of that word in the plurality of first word sequences; the first decoding result is filtered with a first preset threshold to obtain a second decoding result; and a speech recognition result is then obtained from the plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word, so that a corresponding action can be performed based on that result. The embodiments of the present disclosure thus eliminate low-confidence words from the decoding result and determine the speech recognition result only from high-confidence words. This improves the accuracy of the speech recognition result, avoids erroneous actions caused by inaccurately recognized words in the resulting sentence when corresponding actions are performed, and helps improve the effect of applications based on speech recognition.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of one embodiment of the speech recognition method of the present disclosure.
Fig. 2 is a flowchart of another embodiment of the speech recognition method of the present disclosure.
Fig. 3 is a flowchart of yet another embodiment of the speech recognition method of the present disclosure.
Fig. 4 is a flowchart of yet another embodiment of the speech recognition method of the present disclosure.
Fig. 5 is a flowchart of a further embodiment of the speech recognition method of the present disclosure.
Fig. 6 is a schematic structural diagram of an embodiment of a speech recognition apparatus according to the present disclosure.
Fig. 7 is a schematic structural diagram of another embodiment of the speech recognition apparatus of the present disclosure.
Fig. 8 is a schematic structural diagram of an embodiment of an application of the electronic device of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those skilled in the art that terms such as "first" and "second" in the embodiments of the present disclosure are used merely to distinguish one element from another and imply neither any particular technical meaning nor any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" in the present disclosure generally indicates an "or" relationship between the preceding and following objects.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Speech recognition is a technology that takes speech as its object of study and, through speech signal processing and pattern recognition, enables a machine to automatically recognize and understand human spoken language, converting the speech into text: for a given waveform sequence, a corresponding word or character sequence is obtained.
Fig. 1 is a flowchart of one embodiment of the speech recognition method of the present disclosure. As shown in Fig. 1, the speech recognition method of this embodiment includes:
and 102, decoding the voice to be recognized to obtain a first decoding result.
The first decoding result is a word graph (lattice) comprising a plurality of first word sequences; each first word sequence comprises at least one word, together with the start time, end time, acoustic probability, and language probability of each of those words.
Alternatively, the operation 102 may be implemented by a speech recognition model (including an acoustic model and a language model).
The acoustic probability characterizes the probability that a segment of the speech to be recognized corresponds to a given phoneme, and can be obtained from an acoustic model.
The acoustic model outputs an acoustic recognition result comprising a plurality of paths; each path comprises at least one phoneme and the acoustic probability of each of those phonemes, and the phonemes are assembled into the words of a first word sequence according to the temporal order in which the speech to be recognized was acquired. The acoustic model may include, for example and without limitation: a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), a Recurrent Neural Network (RNN), a Feedforward Sequential Memory Network (FSMN), and the like; the embodiments of the present disclosure do not limit this.
After the acoustic recognition result is obtained, it may be input into a language model, which yields the language probability of mapping each phoneme to a word (or character) in the acoustic recognition result.
The language model may include, for example and without limitation: a rule-based language model, a statistical language model, or a Neural Network Language Model (NNLM); the embodiments of the present disclosure do not limit this.
The speech to be recognized in the embodiments of the present disclosure may be an original audio signal acquired by an audio acquisition module (e.g., a microphone), or an audio signal obtained by applying front-end signal processing to the original audio signal. It may be an audio signal acquired in real time by an application (APP) with a voice interaction function, or a historical audio signal stored by the APP; the embodiments of the present disclosure do not limit this. Front-end signal processing may include, but is not limited to: Voice Activity Detection (VAD), noise reduction, Acoustic Echo Cancellation (AEC), dereverberation, sound source localization, Beam Forming (BF), and the like.
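As a concrete picture of the first decoding result described above, the following Python sketch models one entry of the word graph; the class and field names are illustrative assumptions rather than terminology from the patent, and scores are kept in the log domain as is conventional in decoders.

```python
from dataclasses import dataclass

@dataclass
class WordArc:
    """One word hypothesis in the word graph (lattice) of a decoding result."""
    word: str
    start_time: float     # start of the word within the utterance, in seconds
    end_time: float       # end of the word, in seconds
    acoustic_prob: float  # log-domain score from the acoustic model
    language_prob: float  # log-domain score from the language model

# A first word sequence is one path through the lattice, i.e. a list of arcs;
# the first decoding result holds several such paths.
FirstWordSequence = list[WordArc]
```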
104, calculating the confidence of each word in the plurality of first word sequences based on its acoustic probability and language probability.
The confidence of a word characterizes its credibility and reliability as part of the speech recognition result. Confidence values lie in the range [0, 1]; the larger the value, the more credible and reliable the word is as part of the speech recognition result.
106, obtaining a second decoding result based on the words in the plurality of first word sequences whose confidence is greater than a first preset threshold.
The second decoding result is the word graph obtained by filtering out, from the first word graph, the words whose confidence is not greater than the first preset threshold, together with their start times, end times, acoustic probabilities, and language probabilities. It comprises: a plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word in them.
The specific value of the first preset threshold may be set according to factors such as the domain, the application scenario, the geographic area, and the required accuracy of the speech recognition result, and may be updated as needed. For example, in a voice dialing application the first preset threshold may be set to 0.5, while in a voice navigation application it may be set to 0.4, and so on.
108, obtaining a speech recognition result based on the plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word in them, so as to perform a corresponding action based on the speech recognition result.
The speech recognition result comprises: a first sentence and the confidence of the first sentence.
Performing a corresponding action based on the speech recognition result means executing, in an application scenario based on speech recognition (for example, voice dialing, voice navigation, voice song request, or voice wake-up), the operation corresponding to the speech recognition result. For example, in a voice navigation scenario, the navigation action of navigating to Building A is executed based on the speech recognition result "navigate to Building A".
According to the speech recognition method provided by the embodiments of the present disclosure, after a first decoding result is obtained by decoding the speech to be recognized, the confidence of each word is calculated from the acoustic probability and language probability of that word in the plurality of first word sequences; the first decoding result is filtered with a first preset threshold to obtain a second decoding result; and a speech recognition result is obtained from the plurality of second word sequences and the start time, end time, acoustic probability, and language probability of each word, so that a corresponding action can be performed based on that result. The embodiments of the present disclosure thus eliminate low-confidence words from the decoding result and determine the speech recognition result only from high-confidence words, which improves the accuracy of the speech recognition result, avoids erroneous actions caused by inaccurately recognized words in the resulting sentence, and helps improve the effect of applications based on speech recognition.
Fig. 2 is a flowchart of another embodiment of the speech recognition method of the present disclosure. As shown in Fig. 2, on the basis of the embodiment shown in Fig. 1, operation 104 in this embodiment may include:
1042, taking each word in each of the plurality of first word sequences in turn as the current word, and calculating the forward probability and backward probability of the current word based on its acoustic probability and language probability, using a preset calculation method such as the forward-backward algorithm.
1044, calculating the posterior probability of the current word in its first word sequence based on the forward probability, backward probability, and language probability of the current word, using a preset calculation method.
1046, superimposing the posterior probabilities of the current word across the plurality of first word sequences to obtain the confidence of the current word, i.e., its confidence over the whole word graph.
Based on this embodiment, the confidence of the current word over the whole word graph can be determined objectively and accurately from the acoustic probability and language probability of each word. This makes it possible to objectively and accurately judge the credibility and reliability of a word as part of the speech recognition result, decide whether to remove the word from the word graph, and thus improve the accuracy of the speech recognition result.
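The following sketch illustrates operations 1042 to 1046 in simplified form, reusing the WordArc sketch above: it treats the word graph as an explicit list of paths (an n-best list), normalizes whole-path scores into path posteriors, and superimposes the posteriors of every path containing the same word occurrence. A production decoder would run the forward-backward algorithm on the lattice graph itself; the names and the score weighting here are assumptions.

```python
import math
from collections import defaultdict

def word_confidences(sequences, am_weight=1.0, lm_weight=1.0):
    """Approximate per-word posterior confidence over an n-best list.

    sequences: list of paths, each a list of WordArc with log-domain scores.
    Returns {(word, start_time, end_time): confidence in [0, 1]}.
    """
    # Combined acoustic + language log score of each whole path.
    path_scores = [
        sum(am_weight * a.acoustic_prob + lm_weight * a.language_prob for a in path)
        for path in sequences
    ]
    # Log-sum-exp normalizer so that path posteriors sum to one.
    m = max(path_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in path_scores))
    conf = defaultdict(float)
    for path, score in zip(sequences, path_scores):
        posterior = math.exp(score - log_z)
        for a in path:
            # Superimpose posteriors of the same word occurrence across paths.
            conf[(a.word, a.start_time, a.end_time)] += posterior
    return dict(conf)
```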
Optionally, in some implementations of any embodiment of the present disclosure, in operation 108 the second word sequence with the highest comprehensive score may be determined among the plurality of second word sequences based on the start time, end time, acoustic probability, and language probability of each word, where the comprehensive score is determined from the acoustic probability and language probability. Specifically, a score may be determined for each word from its acoustic probability and language probability; for each pair of identical start and end times, the highest-scoring word is selected across the plurality of second word sequences; and these highest-scoring words, ordered by their start and end times, form the second word sequence with the highest comprehensive score, i.e., the second word sequence corresponding to the optimal path through the word graph. A first sentence is then obtained from this highest-scoring second word sequence, and the confidence of the first sentence is obtained from the confidence of each word in it.
In a specific implementation, the second word sequence with the highest comprehensive score may directly form the first sentence, and the confidence of the first sentence is obtained from the confidence of each word in it.
For example, in some examples, the confidence of the first sentence may be taken as the average of the confidences of its words, i.e., the sum of the word confidences divided by the number of words in the sentence. In other examples, the median may be used instead: the word confidences are sorted in descending (or ascending) order to form a confidence sequence, and the central value of that sequence (or the average of its two central values, when the count is even) is taken as the confidence of the first sentence. The embodiments of the present disclosure do not limit the specific manner of obtaining the confidence of the first sentence.
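A minimal sketch of the two aggregation options just described, assuming the per-word confidences of the first sentence have already been computed:

```python
from statistics import mean, median

def sentence_confidence(word_confs: list[float], method: str = "mean") -> float:
    """Aggregate per-word confidences into a sentence-level confidence."""
    if method == "mean":
        # Sum of the word confidences divided by the number of words.
        return mean(word_confs)
    # Central value of the sorted confidences; statistics.median averages
    # the two central values when the word count is even.
    return median(word_confs)
```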
Based on this embodiment, the confidence of the first sentence can be determined from the confidence of each word in it, which improves the accuracy and precision of the sentence confidence used to judge the reliability of the speech recognition result.
Generally, a speech recognition system mainly comprises a front-end signal processing module, an acoustic model, and a language model. The front-end processing module mainly performs operations such as voice endpoint detection, noise reduction, feature extraction, acoustic echo cancellation, dereverberation, sound source localization, and beam forming. The acoustic model and the language model belong to back-end processing: the acoustic model mainly builds the probabilistic mapping between the input speech and the output acoustic units (phonemes), while the language model mainly builds the probabilistic collocation relations between acoustic units (phonemes) and words and between different words, so that the recognized sentences read more naturally.
Optionally, before 102 of any of the above embodiments of the present disclosure, the method may further include:
performing voice endpoint detection (VAD) on the audio signal acquired by the audio acquisition module using an endpoint detection model, to obtain the start point and end point of at least one voice activity segment;
intercepting the at least one voice activity segment from the audio signal based on its start point and end point, so that each voice activity segment is taken in turn as the speech to be recognized and the operations of this embodiment are performed on it.
VAD, also called voice activity detection or voice boundary detection, refers to detecting the presence of speech in an audio signal in a noisy environment and accurately locating the start of each speech segment. It is commonly used in speech processing systems such as speech coding and speech enhancement, where it reduces the speech coding rate, saves communication bandwidth, lowers the energy consumption of mobile devices, and improves the recognition rate. The start point of a VAD segment is a transition from silence to speech, and the end point a transition from speech to silence; deciding the end point requires a period of silence. The speech obtained by front-end signal processing of the original audio signal spans from the VAD start point to the VAD end point, so the speech to be recognized in the embodiments of the present disclosure may include a stretch of silence after the speech segment.
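As a rough illustration of the interception step, assuming the endpoint detection model returns (start, end) pairs in seconds:

```python
def cut_voice_segments(audio, sample_rate: int, endpoints):
    """Intercept voice activity segments from an audio signal (a sketch).

    audio: one-dimensional sequence of samples;
    endpoints: (start_seconds, end_seconds) pairs from the endpoint
    detection model. Each returned segment is then decoded separately
    as one speech to be recognized.
    """
    segments = []
    for start_s, end_s in endpoints:
        begin = int(start_s * sample_rate)
        end = int(end_s * sample_rate)
        segments.append(audio[begin:end])
    return segments
```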
Based on this embodiment, valid voice activity segments can be accurately intercepted from the audio signal and recognized separately. This helps avoid treating non-speech frames acquired by the audio acquisition module as speech frames within a voice activity segment, and thus helps improve the accuracy of the speech recognition result and the overall recognition effect.
In addition, in another embodiment of the speech recognition method of the present disclosure, after operation 104 the method may further include: in response to the confidence of every word in the plurality of first word sequences being greater than a second preset threshold and smaller than a third preset threshold, regarding the speech to be recognized as pure noise containing no human voice, and adding it to a noise set as a noise sample for training the endpoint detection model and/or as foreground noise for training an acoustic model. The noise set comprises at least one noise sample, and the second preset threshold is smaller than the third preset threshold.
The second preset threshold may be set to a low value and the third preset threshold to a value above the second and below the first preset threshold; for example, the second preset threshold may be 0 or 0.05 and the third 0.25 or 0.3, so as to select pure noise containing neither foreground nor background human voice. In practice, the specific values of the first, second, and third preset thresholds may be determined by factors such as the specific application and environment, and adjusted as required.
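The confidence bands used in this and the following embodiments can be summarized in a single routing function. This is a sketch: the defaults are the example threshold values mentioned in the text (second = 0.05, third = 0.3, first = 0.5, and the fourth threshold of the later embodiment = 0.9), and the set names are illustrative.

```python
def route_utterance(word_confs: list[float],
                    t2: float = 0.05, t3: float = 0.3,
                    t1: float = 0.5, t4: float = 0.9) -> str:
    """Route an utterance by its word confidences (t2 < t3 < t1 <= t4)."""
    if word_confs and all(t2 < c < t3 for c in word_confs):
        return "noise_set"             # pure noise without human voice
    if word_confs and all(t3 < c < t1 for c in word_confs):
        return "background_voice_set"  # noise containing background human voice
    if word_confs and all(c > t4 for c in word_confs):
        return "speech_set"            # high-confidence pseudo-labeled speech
    return "unused"                    # not collected by these embodiments
```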
Fig. 3 is a flowchart of yet another embodiment of the speech recognition method of the present disclosure. As shown in Fig. 3, in some implementations the noise sample may be obtained as follows:
202, obtaining the words in the plurality of first word sequences whose confidence is greater than the second preset threshold, yielding a plurality of third word sequences and the decoding information of each word in them, where the decoding information comprises the start time, end time, acoustic probability, and language probability of the word.
Equivalently, the words whose confidence is not greater than the second preset threshold may be removed, together with their decoding information, from the first decoding result to obtain the plurality of third word sequences and the decoding information of each word in them.
204, obtaining the words in the plurality of third word sequences whose confidence is not greater than the third preset threshold, yielding a plurality of fourth word sequences and the decoding information of each word in them.
Equivalently, the words whose confidence is greater than the third preset threshold may be removed, together with their decoding information, from the plurality of third word sequences to obtain the plurality of fourth word sequences and the decoding information of each word in them.
206, determining, among the plurality of fourth word sequences, the fourth word sequence with the highest comprehensive score, based on the start time, end time, acoustic probability, and language probability of each word.
208, obtaining a third sentence based on the fourth word sequence with the highest comprehensive score.
210, identifying whether the third sentence contains any words.
If the third sentence contains no words, operation 212 is performed. If the third sentence does contain words, the remainder of this flow is not executed and the third sentence may simply be discarded.
212, adding the speech to be recognized to the noise set as a noise sample, for training the endpoint detection model and/or as foreground noise for training an acoustic model.
The noise set comprises at least one noise sample.
Alternatively, in the embodiment shown in Fig. 3, operation 204 may be performed first, removing from the plurality of first word sequences the words whose confidence is greater than the third preset threshold, and operation 202 then performed to remove the words whose confidence is not greater than the second preset threshold, yielding the plurality of fourth word sequences and the decoding information of each word in them, after which operations 206 to 212 are performed.
Alternatively, in other implementations, the noise sample may be obtained as follows: after removing from the plurality of first word sequences both the words whose confidence is greater than the third preset threshold and the words whose confidence is not greater than the second preset threshold, check whether any words remain in the resulting word sequences; if none remain, add the speech to be recognized to the noise set as a noise sample for training the endpoint detection model and/or as foreground noise for training an acoustic model.
Alternatively, in the embodiments of the present disclosure, whenever the speech to be recognized is determined, in any other manner based on the third and second preset thresholds, to be pure noise containing no human voice, it is added to the noise set as a noise sample.
Based on this embodiment, real and valid online audio noise data can be collected: the online speech recognition flow itself gathers the noise samples. Using online noise samples to train the endpoint detection model improves its performance in online speech recognition scenarios and thereby helps improve the accuracy of the final speech recognition result. Collecting noise samples through the online recognition flow also solves the difficulty and high cost of acquiring and expanding noise sets; and when the noise samples in the noise set are used as foreground noise for training an acoustic model, the acoustic model's performance improves, which again helps improve the accuracy of the final speech recognition result.
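The filtering steps of this flow (operations 202/204, and operations 302/304 of a later embodiment with different bounds) amount to keeping only the words whose confidence falls inside a band, as in the following sketch; it assumes a `confidence` attribute was attached to each arc in operation 104.

```python
def band_filter(sequences, low: float, high: float):
    """Keep only arcs whose confidence c satisfies low < c <= high.

    Use (t2, t3] for the noise flow of Fig. 3 and (t3, t1] for the
    background voice flow of Fig. 4.
    """
    kept = [[a for a in path if low < a.confidence <= high] for path in sequences]
    return [path for path in kept if path]  # drop paths with no surviving words
```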
Optionally, in any of the above embodiments, after the noise set has been obtained (e.g., after operation 212 in the embodiment shown in Fig. 3), the method may further include: training the endpoint detection model using the noise samples in the noise set.
For example, in some examples, the endpoint detection model may be trained with the noise samples in the noise set as follows:
performing voice endpoint detection on at least one noise sample in the noise set using the endpoint detection model, to obtain the detection information of the voice activity segments in each of those noise samples, where the detection information of a voice activity segment comprises its start point and end point; and training the endpoint detection model based on the difference between this detection information and the labeling information of the noise samples.
Training the endpoint detection model with the noise samples in the noise set is an iterative operation: the operation is executed repeatedly, and the parameters of the endpoint detection model are adjusted each time, until a preset training completion condition is met and the trained endpoint detection model is obtained. The preset training completion condition may include, without limitation, any one or more of the following: the number of training iterations (i.e., the number of times the above operation has been executed) reaches a preset count (e.g., 200); the difference between the detection information of the voice activity segments in the noise samples and the labeling information of those samples is less than or equal to a preset threshold; and so on. The embodiments of the present disclosure do not limit this.
Based on this embodiment, training or iterative training of the endpoint detection model with the noise samples in the noise set is realized, which helps improve the model's performance in online speech recognition scenarios and thus the accuracy of the final speech recognition result.
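The iterative training just described can be outlined as follows. The `detect`/`fit_step` interface, the `audio`/`labels` attributes, and the `segment_difference` helper are assumptions for illustration, not an API defined by the patent.

```python
def segment_difference(predicted, labeled) -> float:
    """Toy distance: absolute boundary error summed over aligned segments."""
    return sum(abs(p0 - l0) + abs(p1 - l1)
               for (p0, p1), (l0, l1) in zip(predicted, labeled))

def train_endpoint_detector(model, noise_set, max_iters: int = 200, tol: float = 0.01):
    """Iteratively train the endpoint detection model on the noise set."""
    for _ in range(max_iters):                  # completion condition 1: iteration count
        total_diff = 0.0
        for sample in noise_set:
            predicted = model.detect(sample.audio)          # detected activity segments
            total_diff += segment_difference(predicted, sample.labels)
            model.fit_step(sample.audio, sample.labels)     # adjust model parameters
        if total_diff / len(noise_set) <= tol:  # completion condition 2: small difference
            break
    return model
```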
In addition, in still another embodiment of the speech recognition method of the present disclosure, after operation 104 the method may further include: in response to the confidence of every word in the plurality of first word sequences being greater than the third preset threshold and smaller than the first preset threshold, regarding the speech to be recognized as noise containing background human voice, and adding it to a background voice set as a background voice sample, to be used as background human voice noise for training an acoustic model. The background voice set comprises at least one background voice sample, and the third preset threshold is smaller than the first preset threshold.
The third preset threshold may be set to a value above the second preset threshold and below the first; for example, the third preset threshold may be 0.25 or 0.3 and the first 0.4 or 0.5, so as to select noise that contains background human voice but no foreground voice. In practice, the specific values of the first and third preset thresholds may be determined by factors such as the specific application and environment, and adjusted as required.
Fig. 4 is a flowchart of yet another embodiment of the speech recognition method of the present disclosure. As shown in Fig. 4, in some implementations the background voice sample may be obtained as follows:
302, obtaining the words in the plurality of first word sequences whose confidence is greater than the third preset threshold, yielding a plurality of fifth word sequences and the decoding information of each word in them, where the decoding information comprises the start time, end time, acoustic probability, and language probability of the word.
Equivalently, the words whose confidence is not greater than the third preset threshold may be removed from the first decoding result to obtain the plurality of fifth word sequences and the decoding information of each word in them.
304, obtaining the words in the plurality of fifth word sequences whose confidence is not greater than the first preset threshold, yielding a plurality of sixth word sequences and the decoding information of each word in them.
Equivalently, the words whose confidence is greater than the first preset threshold may be removed, together with their decoding information, from the plurality of fifth word sequences to obtain the plurality of sixth word sequences and the decoding information of each word in them.
306, determining, among the plurality of sixth word sequences, the sixth word sequence with the highest comprehensive score, based on the start time, end time, acoustic probability, and language probability of each word.
308, obtaining a fourth sentence based on the sixth word sequence with the highest comprehensive score.
310, identifying whether the fourth sentence contains any words.
If the fourth sentence contains no words, operation 312 is performed. If the fourth sentence does contain words, the remainder of this flow is not executed and the fourth sentence may simply be discarded.
312, adding the speech to be recognized to the background voice set as a background voice sample, to be used as background human voice noise for training the acoustic model.
The background voice set comprises at least one background voice sample.
Alternatively, in the embodiment shown in Fig. 4, operation 304 may be performed first, removing from the plurality of first word sequences the words whose confidence is greater than the first preset threshold, and operation 302 then performed to remove the words whose confidence is not greater than the third preset threshold, yielding the plurality of sixth word sequences and the decoding information of each word in them, after which operations 306 to 312 are performed.
Alternatively, in other implementations, the background voice sample may be obtained as follows: after removing from the plurality of first word sequences both the words whose confidence is not greater than the third preset threshold and the words whose confidence is greater than the first preset threshold, check whether any words remain in the resulting word sequences; if none remain, add the speech to be recognized to the background voice set as a background voice sample, to be used as background human voice noise for training an acoustic model.
Alternatively, in the embodiments of the present disclosure, whenever the speech to be recognized is determined, in any other manner based on the third and first preset thresholds, to be noise containing background human voice, it is added to the background voice set as a background voice sample.
Based on this embodiment, background voice samples are collected through the online speech recognition flow, gathering real and valid online background human voice data. Acquiring background voice samples online solves the difficulty of building a background voice set and the excessive cost of labeled data; when such samples are used as background human voice noise for training an acoustic model, the model's performance improves, which helps improve the accuracy of the final speech recognition result.
Fig. 5 is a flowchart of a further embodiment of the speech recognition method of the present disclosure. As shown in Fig. 5, on the basis of the embodiments shown in Figs. 1 to 4, after operation 104 this embodiment may further include:
402, in response to the confidence of every word in the plurality of first word sequences being greater than a fourth preset threshold, obtaining a second sentence based on the plurality of first word sequences and the confidence of each word in them.
The fourth preset threshold is not smaller than the first preset threshold. For example, it may be set to 0.9 or 0.95, depending on the performance required of the acoustic and language models to be trained. The higher the fourth preset threshold, the greater the performance improvement for the acoustic and language models trained on the collected samples.
For example, in some implementations, the first word sequence with the highest comprehensive score may be determined among the plurality of first word sequences based on the start time, end time, acoustic probability, and language probability of each word, where the comprehensive score is determined from the acoustic probability and language probability. Specifically, a score may be determined for each word from its acoustic probability and language probability; for each pair of identical start and end times, the highest-scoring word is selected across the plurality of first word sequences; and these highest-scoring words form the first word sequence with the highest comprehensive score, i.e., the first word sequence corresponding to the optimal path through the word graph. The second sentence is then obtained from this highest-scoring first word sequence.
If the confidences of the words in the second sentence are not all greater than the fourth preset threshold, operation 406 is performed: the second sentence may be discarded without performing the subsequent operations of this embodiment.
And 404, taking the second sentence as the labeling information of the speech to be recognized, and adding the speech to be recognized and the labeling information of the speech to be recognized as a speech sample into the speech set for training the acoustic model and/or the language model.
Wherein the speech set comprises at least one speech sample.
In the embodiment of the present disclosure, when the acoustic model and/or the language model are trained based on the speech samples in the speech set, the acoustic model and the language model may be trained separately, or they may be trained jointly end to end.
Based on the embodiment of the present disclosure, high-quality speech samples can be collected during online speech recognition to enrich the speech samples in the speech set, so that a large amount of high-quality speech and corresponding labeling information can be obtained at low cost. When these samples are used to train the acoustic model and/or the language model, the recognition performance of the acoustic model and/or the language model can be effectively improved, thereby solving the problems that the speech set contains too few samples, that obtaining labeled data is too costly, and that the recognition performance of the acoustic model and/or the language model is consequently difficult to improve substantially.
Optionally, in some implementations, in the embodiment shown in fig. 5, attribute information of the second sentence may also be determined. The attribute information may include, for example and without limitation, any one or more of the following: domain, application scenario, and geographic region. Domains include, for example, music, poetry, science and technology, and the like; application scenarios include, for example, voice dialing, voice navigation, voice song requesting, voice wakeup, and the like; the geographic region is the geographic region to which the speech to be recognized belongs, such as Guangzhou, Henan, or the Northeast. Then, in operation 404, the speech to be recognized and its labeling information may be added, as one speech sample, into the speech set corresponding to the attribute information, so as to train the acoustic model and the language model corresponding to that attribute information. For example, in some alternative examples, the domain of the second sentence may be determined based on the preset domain words it contains: when the second sentence contains words of the music domain such as "song", a song name, or a singer name, the domain of the second sentence is determined to be the music domain; when the second sentence contains words of the poetry domain such as "poem", a poem title, or a poet's name, the domain of the second sentence is determined to be the poetry domain.
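For illustration only, the keyword-based domain decision may be sketched as follows; the keyword lists are hypothetical, since the present disclosure does not enumerate the preset words of each domain:

# Hypothetical preset word lists per domain; a real system would use
# curated dictionaries of song names, singer names, poem titles, etc.
DOMAIN_KEYWORDS = {
    "music": {"song", "singer", "album"},
    "poetry": {"poem", "poet", "verse"},
}

def determine_domain(second_sentence_words, default="general"):
    """Return the first domain whose preset words appear in the second
    sentence; fall back to a default domain when nothing matches."""
    words = set(second_sentence_words)
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if keywords & words:
            return domain
    return default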
Alternatively, in another alternative example, the domain to which the second sentence belongs may be classified by using a first classification model trained in advance, and determined based on the classification result. The first classification model may be implemented by a neural network, which can be trained with corpora of different domains and the corresponding domain labeling information to obtain the first classification model.
For example, in some optional examples, a specific application (APP) may invoke a speech acquisition module to acquire an audio signal, obtain the speech to be recognized, and provide it to the speech recognition system for recognition. When providing the speech to be recognized, the APP may carry its identification information, such as the APP's name or application domain; the application scenario corresponding to the identification information can then be determined based on a correspondence between APP identification information and application scenarios, and used as the application scenario of the second sentence.
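For illustration only, the correspondence between APP identification information and application scenarios may be sketched as a simple lookup; the identification strings and scenario names below are invented for this sketch:

# Hypothetical correspondence between APP identification information
# and application scenarios.
APP_SCENARIOS = {
    "com.example.navigation": "voice navigation",
    "com.example.music": "voice song requesting",
    "com.example.dialer": "voice dialing",
}

def scenario_for_app(app_id, default="general"):
    """Map the APP's identification information to its application
    scenario, used as the application scenario of the second sentence."""
    return APP_SCENARIOS.get(app_id, default)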
Alternatively, in some alternative examples, the application scenario of the second sentence may be classified by using a second classification model trained in advance, and determined based on the classification result. The second classification model may be implemented by a neural network, which can be trained with corpora of different application scenarios and the corresponding application scenario labeling information to obtain the second classification model.
For example, in some optional examples, a specific application (APP) may invoke a speech acquisition module to acquire an audio signal, obtain the speech to be recognized, and provide it to the speech recognition system for recognition. The user may set the geographic region corresponding to his or her speech when registering for or using the APP; when the user sends speech to be recognized to the APP, the APP can obtain the geographic region set by the user and carry it when providing the speech to the speech recognition system, and that carried geographic region can be used directly as the geographic region of the second sentence. If the user has not set a geographic region for his or her speech, the APP may obtain the current geographic region through a positioning module (for example, GPS) on the terminal where the APP is located and use it as the geographic region of the second sentence.
Based on the embodiment of the present disclosure, high-quality speech samples can be collected by domain, application scenario, and geographic region during online speech recognition to enrich the speech samples in the speech set, so that a large amount of speech and corresponding labeling information can be obtained at low cost along these different dimensions. Through experimental research, the inventors of the present disclosure found that, other conditions being equal, when the fourth preset threshold is set to 95%, the accuracy of the speech recognition results of the trained acoustic model and/or language model can reach more than 98%.
Optionally, in some implementations, the method may further include: constructing a training data set based on the noise samples in the noise set, the background human voice samples in the background human voice set, and the speech samples in the speech set, where the training data set includes at least one noisy signal, and a noisy signal is any one or more of the following: a noisy signal generated by mixing a speech sample with a noise sample serving as foreground noise; a noisy signal generated by mixing a speech sample with a background human voice sample serving as background human voice noise; and a noisy signal generated by mixing a speech sample with both a noise sample serving as foreground noise and a background human voice sample serving as background human voice noise. Each noisy signal is labeled with the labeling information of the speech sample used to generate it. Then, the training data set is used to train the acoustic model and/or the language model used for decoding the speech to be recognized, that is, to adjust the network parameters of the acoustic model and/or the language model.
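For illustration only, one common way to generate such noisy signals is additive mixing at a chosen signal-to-noise ratio; the present disclosure does not specify the mixing procedure, so the SNR-based scaling, the equal-sample-rate assumption, and all names below are assumptions of this sketch:

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix `noise` into `speech` at the target SNR (in dB),
    assuming both are 1-D float arrays at the same sample rate."""
    if len(noise) < len(speech):
        # tile the noise so it covers the whole speech sample
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# A noisy signal keeps the labeling information of its clean speech sample,
# e.g. foreground noise at 10 dB plus background human voice at 5 dB:
# noisy = mix_at_snr(mix_at_snr(clean, foreground_noise, 10.0), background_voice, 5.0)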
When the acoustic model is trained by using the noisy signal, a first difference between a phoneme output by the acoustic model and a phoneme corresponding to the labeling information of the noisy signal may be compared, and the acoustic model may be iteratively trained based on the first difference until a first preset training completion condition is met, for example, the first difference is smaller than a preset first difference threshold, and/or the number of times of iterative training performed on the acoustic model reaches a first preset number of times, and so on.
When the language model is trained by using the noisy signal, a second difference between a word sequence corresponding to a word output by the language model and the label information of the noisy signal may be compared, and the language model may be iteratively trained based on the second difference until a second preset training completion condition is satisfied, for example, the second difference is smaller than a preset second difference threshold, and/or the number of times of iterative training performed on the language model reaches a second preset number of times, and so on.
When the acoustic model and the language model are trained simultaneously with a noisy signal, a first difference between the phonemes output by the acoustic model and the phonemes corresponding to the labeling information of the noisy signal, and a second difference between the word sequence corresponding to the words output by the language model and the labeling information of the noisy signal, may be computed, and the acoustic model and the language model are iteratively trained based on the first difference and the second difference until a third preset training-completion condition is met. The third preset training-completion condition may include, but is not limited to, any one or more of the following: the first difference is smaller than a preset first difference threshold; the second difference is smaller than a preset second difference threshold; the mean of the first difference and the second difference is smaller than a preset third difference threshold; or the number of training iterations of the acoustic model and the language model reaches a third preset number.
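For illustration only, the training-completion conditions above may be expressed as a single predicate; all threshold values and names below are placeholders rather than values taken from the present disclosure:

def training_complete(first_diff=None, second_diff=None, iterations=0,
                      first_threshold=0.05, second_threshold=0.05,
                      mean_threshold=0.05, max_iterations=100000):
    """Return True when any of the preset completion conditions holds:
    the first difference (acoustic) is below its threshold, the second
    difference (language) is below its threshold, their mean is below a
    third threshold, or the iteration budget is exhausted."""
    checks = []
    if first_diff is not None:
        checks.append(first_diff < first_threshold)
    if second_diff is not None:
        checks.append(second_diff < second_threshold)
    if first_diff is not None and second_diff is not None:
        checks.append((first_diff + second_diff) / 2.0 < mean_threshold)
    checks.append(iterations >= max_iterations)
    return any(checks)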
The speech recognition system as a whole comprises two stages: training and recognition. Training refers to training the acoustic model and/or the language model and is generally performed offline. Recognition refers to the process of recognizing the user's speech as text and is typically performed online. Training the acoustic model and the language model requires a large amount of labeled audio data as training samples; such labeled audio data are mainly obtained manually, which is time-consuming, labor-intensive, and costly.
Based on the embodiment of the present disclosure, during online speech recognition, noise samples are collected to expand the noise set, background human voice samples are collected to expand the background human voice set, and speech samples are collected to expand the speech set. The noise samples are then used as foreground noise and the background human voice samples as background human voice noise, and a training data set is constructed from the speech samples together with the foreground noise and/or the background human voice noise. In this way, a real, rich, and reliable training data set can be obtained for training the acoustic model and/or the language model, effectively improving the recognition performance of the trained models in various environments and thereby improving the effect of applications based on speech recognition.
When the speech set is divided based on attribute information, the training data can be used to train the acoustic model and/or the language model corresponding to each piece of attribute information, thereby improving the recognition performance of the model corresponding to each piece of attribute information in various environments and improving the overall performance of the trained acoustic model and/or language model.
Any of the speech recognition methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, any of the speech recognition methods provided by the embodiments of the present disclosure may be executed by a processor, such as the processor executing any of the speech recognition methods mentioned by the embodiments of the present disclosure by calling corresponding instructions stored in a memory. And will not be described in detail below.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 6 is a schematic structural diagram of an embodiment of a speech recognition apparatus according to the present disclosure. The speech recognition device of this embodiment can be used to implement the speech recognition method embodiments of the present disclosure. As shown in fig. 6, the speech recognition apparatus of this embodiment includes: a speech recognition model 502, a confidence calculation module 504, an acquisition module 506, and a result determination module 508. Wherein:
the speech recognition model 502 is configured to decode a speech to be recognized to obtain a first decoding result, where the first decoding result includes: a plurality of first word sequences, and a start time and an end time, an acoustic probability, and a language probability for each word in the plurality of first word sequences. The speech recognition model 502 may include an acoustic model and a language model.
A confidence calculation module 504, configured to calculate a confidence of each word in the plurality of first word sequences based on the acoustic probability and the language probability of each word in the plurality of first word sequences, respectively.
An obtaining module 506, configured to obtain a second decoding result based on the words in the plurality of first word sequences whose confidence levels are greater than a first preset threshold, where the second decoding result includes: a plurality of second word sequences, and a start time and an end time, an acoustic probability, and a language probability of each word in the plurality of second word sequences.
A result determining module 508, configured to obtain a speech recognition result based on the plurality of second word sequences and the start time and the end time, the acoustic probability, and the language probability of each word in the plurality of second word sequences, where the speech recognition result includes: a first sentence and a confidence level of the first sentence so as to perform a corresponding action based on the speech recognition result.
According to the speech recognition apparatus provided by the above embodiment of the present disclosure, after the speech to be recognized is decoded to obtain a first decoding result, the confidence of each word is calculated based on the acoustic probability and the language probability of the words in the plurality of first word sequences, the first decoding result is filtered based on the first preset threshold to obtain a second decoding result, and the speech recognition result is then obtained based on the plurality of second word sequences in the second decoding result together with the start time and end time, acoustic probability, and language probability of each word, so that a corresponding action can be performed based on the speech recognition result. The embodiment of the present disclosure can thus eliminate low-confidence words from the decoding result obtained for the speech to be recognized and determine the speech recognition result based only on high-confidence words, which improves the accuracy of the speech recognition result, avoids erroneous actions caused by inaccurately recognized words in the resulting sentence when corresponding actions are performed, and helps improve the effect of applications based on speech recognition.
Optionally, in some implementations, the confidence calculation module 504 may include: the first calculation unit is used for respectively taking each word in each first word sequence in the plurality of first word sequences as a current word and respectively calculating the forward probability and the backward probability of the current word based on the acoustic probability and the language probability of the current word; the second calculation unit is used for calculating the posterior probability of the current word in the first word sequence based on the forward probability, the backward probability and the language probability of the current word; and the superposition unit is used for superposing the posterior probability of the current word in the plurality of first word sequences to obtain the confidence coefficient of the current word.
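For illustration only, the computation performed by these units may be sketched under a simplification: when each first word sequence is treated as a flat path, the forward probability of a word multiplied by its backward probability equals the path probability, so the posterior of the word within the path is the normalized path probability, and the confidence is the superposition of these posteriors over all paths containing the word. The sketch below implements this simplification, not the full lattice algorithm:

import math
from collections import defaultdict, namedtuple

Word = namedtuple("Word", "text start end acoustic_prob language_prob")

def word_confidences(word_sequences):
    """Confidence of a word = sum of its posterior probabilities over
    all first word sequences that contain it (keyed by text and time)."""
    path_prob = [math.prod(w.acoustic_prob * w.language_prob for w in seq)
                 for seq in word_sequences]
    total = sum(path_prob) + 1e-12  # normalizer over all candidate paths
    confidence = defaultdict(float)
    for seq, p in zip(word_sequences, path_prob):
        posterior = p / total       # forward * backward / total, flattened
        for w in seq:
            confidence[(w.text, w.start, w.end)] += posterior
    return dict(confidence)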
Optionally, in some implementations, the result determining module 508 may include: a first determining unit, configured to determine, based on the start time and the end time of each word in the plurality of second word sequences, the acoustic probability and the language probability, a second word sequence with a highest comprehensive score in the plurality of second word sequences; the second determining unit is used for obtaining a first sentence based on the second word sequence with the highest comprehensive score; and the third determining unit is used for obtaining the confidence coefficient of the first sentence based on the confidence coefficient of each word in the first sentence.
Fig. 7 is a schematic structural diagram of another embodiment of the speech recognition apparatus of the present disclosure. As shown in fig. 7, on the basis of the foregoing embodiment, the speech recognition apparatus of this embodiment may further include: an endpoint detection module 510, configured to perform voice endpoint detection on the audio signal acquired by the audio acquisition module to obtain the start point and end point of at least one voice activity segment, and to intercept the at least one voice activity segment from the audio signal based on its start point and end point, so that each voice activity segment is taken as a speech to be recognized and decoded by the speech recognition model 502.
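For illustration only, the interception step may be sketched as follows, assuming 16 kHz mono audio and endpoints given in seconds (both assumptions, since the present disclosure fixes neither):

import numpy as np

def cut_voice_segments(audio: np.ndarray, endpoints, sample_rate: int = 16000):
    """Cut each detected voice activity segment out of the audio signal;
    every returned segment is one speech to be recognized."""
    return [audio[int(start * sample_rate):int(end * sample_rate)]
            for start, end in endpoints]

# Example: segments = cut_voice_segments(signal, [(0.40, 1.85), (2.10, 3.02)])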
Optionally, referring to fig. 7 again, on the basis of the foregoing embodiment, the speech recognition apparatus of this embodiment may further include: a first collecting module 512, configured to add, in response to that the confidence of each word in the plurality of first word sequences is greater than a second preset threshold and smaller than a third preset threshold, a speech to be recognized as a noise sample to a noise set for training the endpoint detection model, and/or as foreground noise for training an acoustic model; wherein the noise set comprises at least one noise sample, and the second predetermined threshold is less than the third predetermined threshold.
Optionally, referring to fig. 7 again, on the basis of the foregoing embodiment, the speech recognition apparatus of this embodiment may further include: a first training module 514, configured to train, using the noise samples in the noise set, the endpoint detection model used by the endpoint detection module 510.
Optionally, referring to fig. 7 again, on the basis of the foregoing embodiment, the speech recognition apparatus of this embodiment may further include: a second collecting module 516, configured to, in response to that the confidence of each word in the plurality of first word sequences is greater than a third preset threshold and smaller than the first preset threshold, add the speech to be recognized as a background speech sample to the background speech set, so as to serve as background speech noise for training the acoustic model; the background voice set comprises at least one background voice sample, and the third preset threshold is smaller than the first preset threshold.
Optionally, referring to fig. 7 again, on the basis of the foregoing embodiment, the speech recognition apparatus of this embodiment may further include: a sentence determination module 518 and a third collecting module 520. The sentence determination module 518 is configured to, in response to the confidences of the words in the plurality of first word sequences all being greater than a fourth preset threshold, obtain a second sentence based on the plurality of first word sequences and the confidences of the words in them, where the fourth preset threshold is not less than the first preset threshold.
And a third collecting module 520, configured to use the second sentence as labeling information of the speech to be recognized, and add the speech to be recognized and the labeling information of the speech to be recognized as a speech sample into a speech set for training an acoustic model and a language model, where the speech set includes at least one speech sample.
Optionally, in some implementations, the third collecting module 520 is specifically configured to: determine attribute information of the second sentence, the attribute information including any one or more of the following: domain, application scenario, and geographic region; and add the speech to be recognized and its labeling information, as one speech sample, into the speech set corresponding to the attribute information, so as to train the acoustic model and the language model corresponding to that attribute information.
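For illustration only, the routing performed by the first, second, and third collecting modules may be sketched together as follows; the concrete threshold values and container names are assumptions, with only the ordering (second < third < first <= fourth preset threshold) taken from the text:

def route_utterance(speech, confidences, second_sentence,
                    noise_set, background_voice_set, speech_set,
                    t_second=0.3, t_third=0.5, t_first=0.8, t_fourth=0.95):
    """Route an utterance into the noise set, the background human voice
    set, or the labeled speech set, depending on where all of its word
    confidences fall (thresholds satisfy t_second < t_third < t_first <= t_fourth)."""
    if all(t_second < c < t_third for c in confidences):
        noise_set.append(speech)                      # foreground noise sample
    elif all(t_third < c < t_first for c in confidences):
        background_voice_set.append(speech)           # background human voice sample
    elif all(c > t_fourth for c in confidences):
        speech_set.append((speech, second_sentence))  # labeled speech sample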
Optionally, referring to fig. 7 again, on the basis of the foregoing embodiment, the speech recognition apparatus of this embodiment may further include: a construction module 524 and a second training module 526.
Wherein: the construction module 524 is configured to construct a training data set based on the noise samples in the noise set, the background human voice samples in the background human voice set, and the speech samples in the speech set, where the training data set includes at least one noisy signal, and a noisy signal is any one or more of the following: a noisy signal generated by mixing a speech sample with a noise sample serving as foreground noise; a noisy signal generated by mixing a speech sample with a background human voice sample serving as background human voice noise; and a noisy signal generated by mixing a speech sample with both a noise sample serving as foreground noise and a background human voice sample serving as background human voice noise. Each noisy signal is labeled with the labeling information of the speech sample used to generate it.
The second training module 526 is configured to train, using the training data set, the acoustic model and/or the language model used for decoding the speech to be recognized.
In addition, an embodiment of the present disclosure also provides an electronic device, including:
a memory for storing a computer program;
a processor, configured to execute the computer program stored in the memory, and when the computer program is executed, implement the speech recognition method according to any of the above embodiments of the present disclosure.
Fig. 8 is a schematic structural diagram of an embodiment of an application of the electronic device of the present disclosure. Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 8. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
As shown in fig. 8, the electronic device includes one or more processors and memory.
The processor may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
The memory may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by a processor to implement the speech recognition methods of the various embodiments of the present disclosure described above and/or other desired functions.
In one example, the electronic device may further include: an input device and an output device, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device may include, for example, a keyboard, a mouse, and the like.
The output device may output various information including the determined distance information, direction information, and the like to the outside. The output devices may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 8, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device may include any other suitable components, depending on the particular application.
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the speech recognition methods according to the various embodiments of the present disclosure described in the above-mentioned part of the specification.
The computer program product may write program code for carrying out operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a speech recognition method according to various embodiments of the present disclosure described in the above section of the present specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including", "comprising", and "having" are open-ended words that mean "including, but not limited to" and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or", unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (13)

1. A speech recognition method, comprising:
decoding a speech to be recognized to obtain a first decoding result, wherein the first decoding result comprises: a plurality of first word sequences and start and end times, acoustic probabilities and language probabilities of words in the plurality of first word sequences;
calculating the confidence of each word in the plurality of first word sequences respectively based on the acoustic probability and the language probability of each word in the plurality of first word sequences;
obtaining a second decoding result based on the words with the confidence degrees larger than a first preset threshold value in the plurality of first word sequences, wherein the second decoding result comprises: a plurality of second word sequences and start and end times, acoustic probabilities and language probabilities of words in the plurality of second word sequences;
obtaining a speech recognition result based on the plurality of second word sequences and the start time and the end time, the acoustic probability and the language probability of each word in the plurality of second word sequences, wherein the speech recognition result comprises: a first sentence and a confidence level of the first sentence for performing a corresponding action based on the speech recognition result.
2. The method of claim 1, wherein calculating the confidence level for each word in the first word sequences based on the acoustic probability and the linguistic probability for each word in the first word sequences, respectively, comprises:
respectively taking each word in each first word sequence in the plurality of first word sequences as a current word, and respectively calculating the forward probability and the backward probability of the current word based on the acoustic probability and the language probability of the current word;
calculating the posterior probability of the current word in the first word sequence based on the forward probability, the backward probability and the language probability of the current word;
and overlapping the posterior probabilities of the current words in the plurality of first word sequences to obtain the confidence of the current words.
3. The method according to claim 1 or 2, wherein obtaining the speech recognition result based on the plurality of second word sequences and the start time and the end time, the acoustic probability and the language probability of each word in the plurality of second word sequences comprises:
determining a second word sequence with the highest comprehensive score in the second word sequences respectively based on the starting time and the ending time of each word in the second word sequences, the acoustic probability and the language probability;
obtaining a first sentence based on the second word sequence with the highest comprehensive score;
and obtaining the confidence coefficient of the first sentence based on the confidence coefficient of each word in the first sentence.
4. The method according to any of claims 1-3, wherein before decoding the speech to be recognized, further comprising:
performing voice endpoint detection on the audio signal acquired by the audio acquisition module by using an endpoint detection model to obtain a starting point and an end point of at least one voice activity section;
and intercepting the at least one voice activity section from the audio signal based on the starting point and the end point of the at least one voice activity section so as to respectively take each voice activity section in the at least one voice activity section as the voice to be recognized and execute the operation of decoding the voice to be recognized to obtain a first decoding result.
5. The method of claim 4, wherein after calculating the confidence level for each word in the first word sequences based on the acoustic probability and the linguistic probability for each word in the first word sequences, respectively, further comprising:
responding to the fact that the confidence degree of each word in the first word sequences is larger than a second preset threshold and smaller than a third preset threshold, and adding the voice to be recognized into a noise set as a noise sample to be used for training the endpoint detection model and/or used as foreground noise to be used for training an acoustic model; wherein the noise set comprises at least one noise sample, and the second preset threshold is smaller than the third preset threshold.
6. The method of claim 5, further comprising:
training the endpoint detection model using the noise samples in the noise set.
7. The method according to claim 5 or 6, wherein after calculating the confidence of each word in the first word sequences based on the acoustic probability and the language probability of each word in the first word sequences, respectively, further comprises:
responding to the fact that the confidence degree of each word in the first word sequences is larger than the third preset threshold value and smaller than the first preset threshold value, adding the voice to be recognized into a background voice set as a background voice sample, and using the voice to be recognized as background voice noise for training an acoustic model; the background voice set comprises at least one background voice sample, and the third preset threshold is smaller than the first preset threshold.
8. The method according to any of claims 4-7, wherein after calculating the confidence level of each word in the first word sequences based on the acoustic probability and the linguistic probability of each word in the first word sequences, respectively, further comprising:
responding to the fact that the confidence degrees of all words in the first word sequences are larger than a fourth preset threshold value, and obtaining a second sentence based on the first word sequences and the confidence degrees of all words in the first word sequences; wherein the fourth preset threshold is not less than the first preset threshold;
taking the second sentence as the labeling information of the voice to be recognized, and taking the voice to be recognized and the labeling information of the voice to be recognized as a voice sample to be added into a voice set so as to be used for training an acoustic model and a language model; wherein the speech set comprises at least one speech sample.
9. The method according to claim 8, wherein the adding the speech to be recognized and the label information of the speech to be recognized as one speech sample into a speech set comprises:
determining attribute information of the second sentence, wherein the attribute information comprises any one or more of the following items: domain, application scenario, geographic area;
and taking the voice to be recognized and the marking information of the voice to be recognized as a voice sample, and adding the voice sample into a voice set corresponding to the attribute information so as to train an acoustic model and a language model corresponding to the attribute information.
10. The method according to any one of claims 4-9, further comprising:
constructing a training data set based on a noise sample in a noise set, a background human voice sample in the background human voice set and a voice sample in a voice set, wherein the training data set comprises at least one noisy signal, and the noisy signal comprises any one or more of the following items: a noise signal generated by mixing the voice sample and the noise sample as foreground noise, a noise signal generated by mixing the voice sample and the background human voice sample as background human voice noise, and a noise signal generated by mixing the voice sample, the noise sample as foreground noise and the background human voice sample as background human voice noise, wherein the noise signal is marked with marking information of the voice sample for generating the noise signal;
and training an acoustic model and/or a language model for decoding the speech to be recognized by utilizing the training data set.
11. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing a computer program stored in the memory, and when executed, implementing the method of any of the preceding claims 1-10.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 10.
13. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the method of any of the preceding claims 1-10.
CN202111611631.3A 2021-12-27 2021-12-27 Speech recognition method, electronic device, program product, and storage medium Pending CN114255754A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111611631.3A CN114255754A (en) 2021-12-27 2021-12-27 Speech recognition method, electronic device, program product, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111611631.3A CN114255754A (en) 2021-12-27 2021-12-27 Speech recognition method, electronic device, program product, and storage medium

Publications (1)

Publication Number Publication Date
CN114255754A true CN114255754A (en) 2022-03-29

Family

ID=80798107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111611631.3A Pending CN114255754A (en) 2021-12-27 2021-12-27 Speech recognition method, electronic device, program product, and storage medium

Country Status (1)

Country Link
CN (1) CN114255754A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024001662A1 (en) * 2022-06-28 2024-01-04 京东科技信息技术有限公司 Speech recognition method and apparatus, device, and storage medium
CN115376491A (en) * 2022-07-06 2022-11-22 北京数美时代科技有限公司 Voice confidence calculation method, system, electronic equipment and medium
CN115376491B (en) * 2022-07-06 2023-08-18 北京数美时代科技有限公司 Voice confidence calculation method, system, electronic equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination