CN114005434A - End-to-end voice confidence calculation method, device, server and medium


Info

Publication number
CN114005434A
Authority
CN
China
Prior art keywords
confidence
recognition
recognition result
model
confidence coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111403940.1A
Other languages
Chinese (zh)
Inventor
王文超
余骁捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaopeng Automobile Co Ltd
Original Assignee
Beijing Xiaopeng Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaopeng Automobile Co Ltd filed Critical Beijing Xiaopeng Automobile Co Ltd
Priority to CN202111403940.1A priority Critical patent/CN114005434A/en
Publication of CN114005434A publication Critical patent/CN114005434A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/01 - Assessment or evaluation of speech recognition systems
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, an apparatus, a server and a medium for calculating the confidence of end-to-end speech in speech recognition. The method comprises the following steps: extracting acoustic features from each frame of data of the input audio; inputting the acoustic features into a speech recognition decoder to obtain a corresponding recognition result; extracting a confidence feature for each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model; and taking the recognition result and the extracted confidence features as the input of a confidence calculation model, and predicting the confidence of each word and of the sentence in the recognition result. This method computes the confidence of each word and sentence directly from the acoustic features and the recognition result, so the confidence calculation scheme neither needs to be adapted to nor depends on the concrete implementation of the speech recognition decoder; it can be optimized independently, is efficient, reduces error accumulation, and has high practical value in real service scenarios.

Description

End-to-end voice confidence calculation method, device, server and medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a server, and a medium for computing confidence of end-to-end speech in speech recognition.
Background
In the related art, the confidence module assigns a degree of confidence to the recognition result output by the speech recognition decoder. Recognition results combined with confidence scores are used by downstream tasks such as dialogue systems, natural language understanding and keyword retrieval. Confidence is therefore important for improving the accuracy of human-computer interaction.
In conventional speech recognition systems the confidence module is usually computed from the decoded lattice, without training any additional model or parameters. In recent years, confidence algorithms for end-to-end speech recognition systems have also been developed: a downstream model-based confidence module is trained mainly on the recognition sequence produced by the decoder and on abstract features taken from the end-to-end acoustic model, and this scheme achieves better results than the traditional lattice-based approach. However, both solutions suffer from the following two problems:
1) The confidence module depends strongly on, and is tightly coupled to, the speech recognition decoder. In particular, for model-based confidence schemes, switching to a different speech recognition decoder requires retraining a different confidence module to match it.
2) When a traditional speech recognition system trains a confidence model, a large number of decoding results and acoustic features must be stored, consuming substantial storage resources; this is impractical, and the storage consumption grows even larger when training on large-scale data.
Disclosure of Invention
The invention provides a method, an apparatus, a server and a medium for calculating the confidence of end-to-end speech in speech recognition.
The invention discloses a method for calculating the confidence of end-to-end speech in speech recognition, which comprises the following steps:
extracting acoustic features from each frame of data of the input audio;
inputting the acoustic features into a speech recognition decoder and obtaining a corresponding recognition result;
extracting a confidence feature for each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model;
and taking the recognition result and the extracted confidence features as the input of a confidence calculation model, and predicting the confidence of each word and of the sentence in the recognition result.
This confidence calculation method for end-to-end speech in speech recognition computes the confidence of each word and sentence directly from the acoustic features and the recognition result. The confidence calculation scheme neither needs to be adapted to nor depends on the concrete implementation of the speech recognition decoder; it can be optimized independently, is efficient, reduces error accumulation, and therefore has high practical value in real service scenarios.
Extracting the confidence feature of each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model comprises:
presetting a feature extraction model that adopts an encoder-decoder model structure;
training the feature extraction model;
inputting the acoustic features into the encoder of the trained feature extraction model to abstract original features;
inputting the original features into the decoder of the trained feature extraction model to abstract encoder features;
and inputting the original features and the recognition result into the decoder of the trained feature extraction model to abstract decoder features.
In this manner, a confidence feature for each word can be obtained.
Inputting the original features into the decoder of the trained feature extraction model to abstract the encoder features comprises:
abstracting the encoder features from the original features in the decoder of the trained feature extraction model through a multi-head attention mechanism.
In this way, the encoder features can be abstracted, inside the decoder of the trained feature extraction model, from the original features output by the encoder.
Taking the recognition result and the extracted confidence features as the input of the confidence calculation model and predicting the confidence of each word and of the sentence in the recognition result comprises:
taking the recognition result and the confidence features as input and, after feature concatenation and positional encoding, feeding them into a multi-layer Transformer Block module, where one head generates the confidence of each word through a Sigmoid, and the other head performs sentence-level abstraction through hierarchical attention and then generates the confidence of the sentence through a Sigmoid.
In this way, the confidences of the words and of the sentence can be computed.
The confidence calculation method comprises a training phase of the confidence calculation model.
The training phase comprises:
training the whole confidence calculation model through a back-propagation algorithm, using the recognition result and the confidence features as input.
In this way, the confidence calculation model can be trained.
Training the whole confidence calculation model through a back-propagation algorithm, using the recognition result and the confidence features as input, comprises:
taking the recognition result and the confidence features as input, feeding them, after feature concatenation and positional encoding, into the confidence calculation model, and outputting the word correct probabilities and the sentence correct probability through a final Sigmoid layer;
computing the minimum edit distance between the correct transcription and the recognition result to obtain the word labels and the sentence label for the model;
and modeling a logistic-regression loss from the word and sentence correct probabilities and the word and sentence labels, and training the whole confidence calculation model through a back-propagation algorithm.
In this way, a concrete training procedure is realized.
The confidence calculation method comprises a prediction phase of the confidence calculation model.
The prediction phase comprises:
taking the recognition result and the confidence features as input, feeding them, after feature concatenation and positional encoding, into the trained confidence calculation model, outputting the correct probability of each word of the recognition result through one head, and outputting the correct probability of the sentence through the other head for use by downstream tasks.
In this manner, the correct probabilities (confidences) of the words and of the sentence can be computed.
The invention also discloses an apparatus for calculating the confidence of end-to-end speech in speech recognition, which comprises:
an acoustic feature extraction module for extracting the acoustic features of each frame of data of the input audio;
a recognition module for inputting the acoustic features into a speech recognition decoder and obtaining a corresponding recognition result;
a confidence feature extraction module for extracting the confidence feature of each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model; and
a confidence calculation module for taking the recognition result and the extracted confidence features as the input of the confidence calculation model and predicting the confidence of each word and of the sentence in the recognition result.
The server of the invention comprises the above apparatus for calculating the confidence of end-to-end speech in speech recognition.
The present invention also provides a non-transitory computer-readable storage medium storing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the above method for calculating the confidence of end-to-end speech in speech recognition.
The confidence calculation apparatus, server and storage medium for end-to-end speech in speech recognition compute the confidence of each word and sentence directly from the acoustic features and the recognition result. The confidence calculation scheme neither needs to be adapted to nor depends on the specific implementation of the speech recognition decoder; it can be optimized independently, is efficient, reduces error accumulation, and has high practical value in real service scenarios.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method for computing confidence of end-to-end speech in speech recognition according to an embodiment of the present invention;
FIG. 2 is a block diagram of an apparatus for computing confidence of end-to-end speech in speech recognition according to an embodiment of the present invention;
FIG. 3 is a block diagram of a confidence feature extraction module according to an embodiment of the invention;
FIG. 4 is a block diagram of a confidence computation module according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for the purpose of illustrating the embodiments of the present invention and are not to be construed as limiting the embodiments of the present invention.
The following disclosure provides many different embodiments or examples for implementing different configurations of embodiments of the invention. In order to simplify the disclosure of embodiments of the invention, the components and arrangements of specific examples are described below. Of course, they are merely examples and are not intended to limit the present invention. Embodiments of the invention may repeat reference numerals and/or letters in the various examples for simplicity and clarity and do not in themselves dictate a relationship between the various embodiments and/or arrangements discussed.
Referring to FIG. 1, a method for calculating the confidence of end-to-end speech in speech recognition according to an embodiment of the present invention includes:
step 01, extracting acoustic features from each frame of data of the input audio;
step 03, inputting the acoustic features into a speech recognition decoder and obtaining a corresponding recognition result;
step 05, extracting the confidence feature of each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model;
and step 07, taking the recognition result and the extracted confidence features as the input of a confidence calculation model, and predicting the confidence of each word and of the sentence in the recognition result.
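As an illustration of how steps 01 to 07 fit together, the following sketch wires the four stages into one pipeline. The function names (extract_fbank, asr_decoder, feature_abstractor, confidence_model) are hypothetical placeholders standing in for the modules of this embodiment, not names defined by the disclosure.
```python
# Hypothetical end-to-end confidence pipeline; every component name is a
# placeholder for one of the modules described in steps 01-07.
def compute_confidence(waveform, sample_rate,
                       extract_fbank, asr_decoder,
                       feature_abstractor, confidence_model):
    # Step 01: frame-level acoustic features, shape (num_frames, feat_dim).
    acoustic_feats = extract_fbank(waveform, sample_rate)

    # Step 03: any speech recognition decoder may be used here; only its
    # text output is consumed downstream, not its internal features.
    recognition_result = asr_decoder(acoustic_feats)

    # Step 05: abstract per-word confidence features from the acoustic
    # features and the recognition result, independently of the decoder.
    confidence_feats = feature_abstractor(acoustic_feats, recognition_result)

    # Step 07: predict per-word and per-sentence confidences.
    word_conf, sent_conf = confidence_model(recognition_result, confidence_feats)
    return word_conf, sent_conf
```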
Referring to FIG. 2, the method for calculating the confidence of end-to-end speech in speech recognition of the above embodiment can be implemented by the apparatus 100 for calculating the confidence of end-to-end speech in speech recognition of the embodiment of the present invention. Specifically, the apparatus 100 includes an acoustic feature extraction module 11, a recognition module 13, a confidence feature extraction module 15 and a confidence calculation module 17. The acoustic feature extraction module 11 is configured to extract the acoustic features of each frame of data of the input audio. The recognition module 13 is configured to input the acoustic features into the speech recognition decoder and obtain the corresponding recognition result. The confidence feature extraction module 15 is configured to extract the confidence feature of each word in the recognition result according to the acoustic features, the recognition result and the preset feature abstraction model. The confidence calculation module 17 is configured to take the recognition result and the extracted confidence features as the input of the confidence calculation model and predict the confidence of each word and of the sentence in the recognition result.
The above confidence calculation method and apparatus 100 for end-to-end speech in speech recognition compute the confidence of each word and sentence directly from the acoustic features and the recognition result. The confidence calculation scheme neither needs to be adapted to nor depends on the specific implementation of the speech recognition decoder; it can be optimized independently, is efficient, reduces error accumulation, and has high practical value in real service scenarios.
Specifically, the embodiment of the present invention analyzes two problems in the related art, namely that confidence calculation schemes are tightly coupled to the recognition decoder and that deep-learning confidence modules are hard to adapt to conventional decoders, and proposes the above end-to-end speech confidence strategy that is independent of the speech recognition decoder. The confidence of each word and the confidence of the sentence can then be used in downstream tasks such as dialogue systems, natural language understanding and keyword retrieval.
The input audio may be acquired by a first terminal to which the calculation method is applied, or may be acquired by a second terminal communicating with the first terminal and then transmitted to the first terminal. The user may input speech through the first terminal or the second terminal to generate input audio. The first terminal and the second terminal include, but are not limited to, a mobile phone, a tablet computer, a car terminal, a server, and the like.
Extracting the acoustic features of each frame of data of the input audio produces a sequence of acoustic feature frames. The specific feature extraction method can follow common practice in the speech processing field and is not described in detail here. In this embodiment, the acoustic features are, on the one hand, fed into the speech recognition decoder to obtain the recognition result and, on the other hand, used as the input for confidence extraction, so that the confidence features can be extracted and the correct probabilities computed independently of the speech recognition decoder.
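The disclosure leaves the exact choice of acoustic features to common practice. As one hedged example, log-mel filterbank features extracted with torchaudio are sketched below; the 80-dimensional, 25 ms frame / 10 ms shift configuration is a typical assumption rather than a requirement of this embodiment.
```python
import torchaudio

def extract_fbank(wav_path, num_mel_bins=80):
    """Frame-level log-mel filterbank features. The 80-dim, 25 ms / 10 ms
    framing is a common assumption, not mandated by the disclosure."""
    waveform, sample_rate = torchaudio.load(wav_path)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=num_mel_bins,
        frame_length=25.0,            # frame length in milliseconds
        frame_shift=10.0,             # frame shift in milliseconds
        sample_frequency=sample_rate,
    )
    return feats                      # shape: (num_frames, num_mel_bins)
```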
In some embodiments, the speech recognition decoder may be a decoder based on an HMM speech recognition system or a decoder based on an end-to-end speech recognition system. Thus, the relevant speech recognition system can be chosen flexibly to obtain the recognition result.
Specifically, in one embodiment, an HMM (Hidden Markov Model) speech recognition system may consist of a DNN-HMM acoustic model plus an N-gram language model.
In one embodiment, the end-to-end speech recognition system may be, for example, a Conformer-LSTM RNN-T model. In this embodiment, the recognition result is obtained by sending the acoustic features into the decoder.
It is to be understood that, in other embodiments, other types of speech recognition decoders may be employed to obtain the recognition result; the decoder is not limited to decoders of HMM-based or end-to-end speech recognition systems.
In certain embodiments, step 05 comprises:
presetting a feature extraction model that adopts an encoder-decoder model structure;
training the feature extraction model;
inputting the acoustic features into the encoder of the trained feature extraction model to abstract original features;
inputting the original features into the decoder of the trained feature extraction model to abstract encoder features;
and inputting the original features and the recognition result into the decoder of the trained feature extraction model to abstract decoder features.
Referring to FIG. 2, the method for calculating the confidence of end-to-end speech in speech recognition of the above embodiment can be implemented by the apparatus 100 of the embodiment of the present invention. Specifically, the confidence feature extraction module 15 is configured to: preset a feature extraction model that adopts an encoder-decoder model structure; train the feature extraction model; input the acoustic features into the encoder of the trained feature extraction model to abstract original features; input the original features into the decoder of the trained feature extraction model to abstract encoder features; and input the original features and the recognition result into the decoder of the trained feature extraction model to abstract decoder features.
In this manner, a confidence feature for each word can be obtained.
Specifically, to solve the strong dependence and tight coupling between the confidence calculation model and the speech recognition decoder in the related art, the method provided by the embodiment of the invention does not acquire any feature information from the decoder, but extracts the necessary confidence features directly from the audio acoustic feature frames.
In one embodiment, referring to FIG. 3, the confidence feature extraction module 15 may adopt an encoder-decoder model structure as a whole. The acoustic features are sent to the encoder of the trained feature extraction model to abstract original features (e.g., high-dimensional features); the original features output by the encoder are then processed by the decoder of the trained feature extraction model to abstract the encoder features; and the recognition result is sent to the decoder of the trained feature extraction model to abstract the decoder features (e.g., high-dimensional features).
In this embodiment, the confidence feature extraction module 15 may be divided into two stages, namely training and prediction:
Training stage: the main problem at this stage is to design a loss function that favors abstracting confidence features.
The cross-entropy loss of a speech recognition classification task, with the transcribed text as labels, can be employed. In this case the decoder outputs a softmax classification.
Alternatively, the minimum mean-square-error loss of a speech restoration pre-training task, with the masked acoustic features as labels, may be employed. In this case the encoder outputs a mask regression and the decoder outputs a softmax classification.
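A minimal sketch of the two loss options described above is given below, assuming standard PyTorch losses; the relative weighting of the two terms and the exact target construction are design choices not fixed by this embodiment.
```python
import torch.nn.functional as F

def feature_model_loss(decoder_logits, token_targets,
                       masked_recon=None, masked_targets=None,
                       recon_weight=1.0):
    """Loss sketch for the feature-abstraction model.

    - Cross-entropy over the decoder's softmax classification of the
      transcribed text (decoder_logits: (batch, seq_len, vocab),
      token_targets: (batch, seq_len)).
    - Optionally, mean-squared error between the encoder's reconstruction of
      masked acoustic frames and the original unmasked frames.
    """
    ce = F.cross_entropy(decoder_logits.transpose(1, 2), token_targets)
    if masked_recon is None:
        return ce
    mse = F.mse_loss(masked_recon, masked_targets)
    return ce + recon_weight * mse
```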
Prediction stage: the encoder features finally output by the trained confidence feature extraction module 15 are obtained by the decoder of the trained feature extraction model applying multi-head attention to the original features output by its encoder, while the decoder features are output directly by the decoder of the trained feature extraction model.
In some embodiments, the encoder of the preset feature abstraction model is composed of a convolutional layer followed by multiple Conformer Blocks;
the decoder of the preset feature abstraction model is composed of multiple Transformer Decoder Blocks. In this way, an encoder-decoder model structure can be realized.
Specifically, a Conformer Block is composed of a normalization layer, a feedforward layer, a multi-head attention layer, a convolution layer and another feedforward layer. A Transformer Decoder Block is formed by cascading a multi-head attention layer, a feedforward layer, another multi-head attention layer, another feedforward layer and a normalization layer.
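The following sketch assembles such a feature abstraction model from off-the-shelf building blocks: torchaudio's Conformer implementation stands in for the convolution-plus-Conformer-Block encoder, and nn.TransformerDecoder stands in for the Transformer Decoder Blocks. All dimensions and layer counts are illustrative assumptions, and, unlike a full implementation, this sketch returns the encoder output (original features) and the decoder output (decoder features) without separately exposing the decoder's cross-attention output, which in this embodiment plays the role of the encoder features.
```python
import torch
import torch.nn as nn
import torchaudio

class FeatureAbstractor(nn.Module):
    """Encoder-decoder feature abstraction sketch: conv front-end plus
    Conformer blocks as the encoder, Transformer decoder blocks as the
    decoder. Sizes are illustrative, not values fixed by the disclosure."""

    def __init__(self, feat_dim=80, d_model=256, vocab_size=5000,
                 enc_layers=6, dec_layers=4, num_heads=4):
        super().__init__()
        self.subsample = nn.Conv1d(feat_dim, d_model, kernel_size=3, stride=2)
        self.encoder = torchaudio.models.Conformer(
            input_dim=d_model, num_heads=num_heads, ffn_dim=4 * d_model,
            num_layers=enc_layers, depthwise_conv_kernel_size=31)
        self.embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)

    def forward(self, feats, feat_lengths, token_ids):
        # Encoder: abstract the "original features" from the acoustic frames.
        x = self.subsample(feats.transpose(1, 2)).transpose(1, 2)
        lengths = torch.clamp((feat_lengths - 3) // 2 + 1, min=1)
        original_feats, _ = self.encoder(x, lengths)
        # Decoder: token embeddings of the recognition result attend over the
        # original features via multi-head (cross) attention; the decoder
        # output serves as the per-word "decoder features".
        decoder_feats = self.decoder(self.embed(token_ids), original_feats)
        return original_feats, decoder_feats
```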
In some embodiments, inputting the original features into the decoder of the trained feature extraction model to abstract the encoder features includes:
abstracting the encoder features from the original features in the decoder of the trained feature extraction model through a multi-head attention mechanism.
Referring to FIG. 2, the method of the above embodiment can be implemented by the apparatus 100 of the embodiment of the present invention. Specifically, the confidence feature extraction module 15 is configured to abstract the encoder features from the original features in the decoder of the trained feature extraction model through a multi-head attention mechanism.
In this way, the encoder features can be abstracted, inside the decoder of the trained feature extraction model, from the original features output by the encoder.
Specifically, referring to FIG. 3, a Transformer Decoder Block of the trained feature extraction model contains a multi-head attention layer; the original features output by the encoder of the trained feature extraction model are sent to this multi-head attention layer for processing, which yields the encoder features.
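A minimal sketch of this cross-attention step, using PyTorch's nn.MultiheadAttention with illustrative tensor shapes (one 7-token hypothesis over 120 encoded frames), is given below.
```python
import torch
import torch.nn as nn

d_model, num_heads = 256, 4
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

word_queries = torch.randn(1, 7, d_model)      # per-word decoder states (7 tokens)
original_feats = torch.randn(1, 120, d_model)  # original features from the encoder

# Queries from the decoder attend over the encoder's original features;
# the attended output is taken as the per-word "encoder features".
encoder_feats, attn_weights = cross_attn(
    query=word_queries, key=original_feats, value=original_feats)
print(encoder_feats.shape)  # torch.Size([1, 7, 256]): one vector per word
```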
In certain embodiments, step 07 comprises:
taking the recognition result and the confidence features as input and, after feature concatenation and positional encoding, feeding them into a multi-layer Transformer Block module, where one head generates the confidence of each word through a Sigmoid, and the other head performs sentence-level abstraction through hierarchical attention and then generates the confidence of the sentence through a Sigmoid.
Referring to FIG. 2, the method of the above embodiment can be implemented by the apparatus 100 of the embodiment of the present invention. Specifically, the confidence calculation module 17 is configured to take the recognition result and the confidence features as input and, after feature concatenation and positional encoding, feed them into the multi-layer Transformer Block module, where one head generates the confidence of each word through a Sigmoid, and the other head performs sentence-level abstraction through hierarchical attention and then generates the confidence of the sentence through a Sigmoid.
In this way, the confidences of the words and of the sentence can be computed.
Specifically, the confidence (correct probability) of each word and of the sentence is calculated by the confidence calculation module 17, taking the per-word confidence features of the recognition result obtained as described above as input.
In this embodiment, referring to FIG. 4, the confidence calculation module 17 can adopt the structure of a Transformer encoder. The recognition result of the speech recognition decoder and the encoder features and decoder features extracted by the confidence feature extraction module are used as input and, after feature splicing (concatenation) and positional encoding, are sent into the multi-layer Transformer Block module. One head then produces the word correct probabilities through a Sigmoid, while the other head performs sentence-level abstraction through hierarchical attention and then passes the result through a Sigmoid to produce the sentence correct probability.
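A compact sketch of such a confidence calculation model is shown below. The word embeddings of the recognition result are concatenated with the per-word confidence features (assumed here to be the concatenated encoder and decoder features), positionally encoded, and passed through Transformer blocks; one head applies a Sigmoid per word, and the other pools the sequence with a single attention layer, used here as a simplified stand-in for the hierarchical attention, before a sentence-level Sigmoid. All sizes are illustrative assumptions.
```python
import math
import torch
import torch.nn as nn

class ConfidenceModel(nn.Module):
    """Sketch of the confidence calculation model (illustrative sizes)."""

    def __init__(self, vocab_size=5000, word_dim=128, feat_dim=512,
                 d_model=256, num_layers=4, num_heads=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.proj = nn.Linear(word_dim + feat_dim, d_model)
        # Fixed sinusoidal positional encoding.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.word_head = nn.Linear(d_model, 1)   # per-word confidence head
        self.sent_attn = nn.Linear(d_model, 1)   # attention pooling scores
        self.sent_head = nn.Linear(d_model, 1)   # sentence confidence head

    def forward(self, token_ids, conf_feats):
        # Feature splicing: word embedding concatenated with confidence features.
        x = torch.cat([self.embed(token_ids), conf_feats], dim=-1)
        x = self.proj(x) + self.pe[: x.size(1)]
        h = self.blocks(x)
        word_conf = torch.sigmoid(self.word_head(h)).squeeze(-1)
        # Sentence-level abstraction via attention pooling, then Sigmoid.
        alpha = torch.softmax(self.sent_attn(h), dim=1)
        sent_vec = (alpha * h).sum(dim=1)
        sent_conf = torch.sigmoid(self.sent_head(sent_vec)).squeeze(-1)
        return word_conf, sent_conf
```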
In some embodiments, the confidence calculation method includes a training phase of the confidence calculation model.
The training phase comprises:
training the whole confidence calculation model through a back-propagation algorithm, using the recognition result and the confidence features as input.
Referring to FIG. 2, the method of the above embodiment can be implemented by the apparatus 100 of the embodiment of the present invention. In particular, the confidence calculation module 17 may have a training phase.
In this way, the confidence calculation model can be trained.
Specifically, in one embodiment, training the entire confidence calculation model through a back-propagation algorithm, with the recognition result and the confidence features as input, includes:
taking the recognition result and the confidence features as input, feeding them, after feature concatenation and positional encoding, into the confidence calculation model, and outputting the word correct probabilities and the sentence correct probability through a final Sigmoid layer;
computing the minimum edit distance between the correct transcription and the recognition result to obtain the word labels and the sentence label for the model;
and modeling a logistic-regression loss from the word and sentence correct probabilities and the word and sentence labels, and training the whole confidence calculation model through a back-propagation algorithm. In this way, a concrete training procedure is realized.
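A sketch of one training step under these assumptions follows; the logistic-regression loss is realized as binary cross-entropy on the word and sentence probabilities, padding handling is omitted for brevity, and `model` refers to the hypothetical confidence model sketched above.
```python
import torch.nn.functional as F

def training_step(model, optimizer, token_ids, conf_feats,
                  word_labels, sent_labels, sent_weight=1.0):
    """One back-propagation step over the whole confidence calculation model."""
    word_prob, sent_prob = model(token_ids, conf_feats)
    loss = (F.binary_cross_entropy(word_prob, word_labels.float())
            + sent_weight * F.binary_cross_entropy(sent_prob, sent_labels.float()))
    optimizer.zero_grad()
    loss.backward()      # back-propagation through the entire model
    optimizer.step()
    return loss.item()
```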
Specifically, the word labels and the sentence label of the model are obtained as follows:
for word correctness judgment, the correct transcription is aligned to each word of the recognition result by the minimum editing distance, so as to obtain the 0-1 label of each word, and then the training is carried out through logistic regression. In one example, this alignment may be represented by the following table:
Recognition result:    倾 / 销 / 百 / 分 / 之 / 五 / +
Correct transcription: 取 / 消 / 百 / 分 / 之 / - / +
0-1 label:             0 (substitution) / 0 (substitution) / 1 / 1 / 1 / 0 (insertion) / 1
For sentence correctness, the CER (character error rate) between the correct transcription and the recognition result is computed; when the CER is 0 the label is 1 (1 denotes a correct sentence), otherwise the label is 0 (0 denotes an incorrect sentence), and training is then carried out through logistic regression.
It is understood that the label may also be represented by other numbers or symbols, and is not limited to 0 and 1.
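A plain dynamic-programming sketch of this label construction is shown below: hypothesis words are aligned to the reference by minimum edit distance and labeled 1 when correct and 0 for substitutions and insertions, and the sentence label is 1 exactly when the recognition matches the correct transcription (i.e. the CER is 0).
```python
def edit_distance_alignment(hyp, ref):
    """0/1 label per hypothesis word (1 = correct, 0 = substitution or
    insertion), from a minimum-edit-distance alignment to the reference."""
    n, m = len(hyp), len(ref)
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # dp[i][j]: distance hyp[:i] vs ref[:j]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # insertion in hypothesis
                           dp[i][j - 1] + 1,         # deletion from reference
                           dp[i - 1][j - 1] + cost)  # match / substitution
    labels, i, j = [0] * n, n, m                     # trace back to label each word
    while i > 0:
        if j > 0 and dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            labels[i - 1] = 1 if hyp[i - 1] == ref[j - 1] else 0
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            labels[i - 1] = 0                        # inserted hypothesis word
            i -= 1
        else:
            j -= 1                                   # deleted reference word
    return labels

def sentence_label(hyp, ref):
    """Sentence label: 1 when the recognition equals the correct transcription
    (CER = 0), otherwise 0."""
    return 1 if list(hyp) == list(ref) else 0

if __name__ == "__main__":
    hyp = "turn off the radio now".split()
    ref = "turn on the radio".split()
    print(edit_distance_alignment(hyp, ref))  # [1, 0, 1, 1, 0]
    print(sentence_label(hyp, ref))           # 0
```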
In some embodiments, the confidence calculation method includes a prediction phase of the confidence calculation model.
The prediction phase comprises:
taking the recognition result and the confidence features as input, feeding them, after feature concatenation and positional encoding, into the trained confidence calculation model, outputting the correct probability of each word of the recognition result through one head, and outputting the correct probability of the sentence through the other head for use by downstream tasks. In this manner, the correct probabilities (confidences) of the words and of the sentence can be computed.
Specifically, downstream tasks include, but are not limited to, dialog systems, natural language understanding, keyword retrieval, and the like.
The server according to an embodiment of the present invention includes the apparatus 100 for calculating the confidence of end-to-end speech in speech recognition according to the above embodiment.
The server computes the confidence of each word and sentence directly from the acoustic features and the recognition result. The confidence calculation scheme neither needs to be adapted to nor depends on the specific implementation of the speech recognition decoder; it can be optimized independently, is efficient, reduces error accumulation, and has high practical value in real service scenarios.
Specifically, the input audio may be collected by a microphone of a vehicle communicating with the server and uploaded to the server by the vehicle, collected by the server itself, or provided directly by the user as an audio file; this is not particularly limited herein. Vehicles include, but are not limited to, fuel-powered vehicles, extended-range electric vehicles, hybrid vehicles, hydrogen-powered vehicles, and the like.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium storing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the method for calculating the confidence of end-to-end speech in speech recognition according to any of the above embodiments.
Specifically, in one embodiment, when the computer-executable instructions are executed by the processor, the following method for calculating the confidence of end-to-end speech in speech recognition is carried out:
step 01, extracting acoustic features from each frame of data of the input audio;
step 03, inputting the acoustic features into a speech recognition decoder and obtaining a corresponding recognition result;
step 05, extracting the confidence feature of each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model;
and step 07, taking the recognition result and the extracted confidence features as the input of a confidence calculation model, and predicting the confidence of each word and of the sentence in the recognition result.
It is understood that the above explanation of the embodiments and advantageous effects of the method for calculating confidence of end-to-end speech in speech recognition is also applicable to the computer readable storage medium of the embodiments of the present invention, and is not detailed herein to avoid redundancy.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A method for calculating the confidence of end-to-end speech in speech recognition, characterized by comprising the following steps:
extracting acoustic features from each frame of data of the input audio;
inputting the acoustic features into a speech recognition decoder and obtaining a corresponding recognition result;
extracting a confidence feature for each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model;
and taking the recognition result and the extracted confidence features as the input of a confidence calculation model, and predicting the confidence of each word and of the sentence in the recognition result.
2. The method for calculating the confidence of end-to-end speech in speech recognition according to claim 1, wherein extracting the confidence feature of each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model comprises:
presetting a feature extraction model that adopts an encoder-decoder model structure;
training the feature extraction model;
inputting the acoustic features into the encoder of the trained feature extraction model to abstract original features;
inputting the original features into the decoder of the trained feature extraction model to abstract encoder features;
and inputting the original features and the recognition result into the decoder of the trained feature extraction model to abstract decoder features.
3. The method of claim 2, wherein inputting the original features into the decoder of the trained feature extraction model to abstract the encoder features comprises:
abstracting the encoder features from the original features in the decoder of the trained feature extraction model through a multi-head attention mechanism.
4. The method of claim 1, wherein taking the recognition result and the extracted confidence features as the input of a confidence calculation model and predicting the confidence of each word and of the sentence in the recognition result comprises:
taking the recognition result and the confidence features as input and, after feature concatenation and positional encoding, feeding them into a multi-layer Transformer Block module, where one head generates the confidence of each word through a Sigmoid, and the other head performs sentence-level abstraction through hierarchical attention and then generates the confidence of the sentence through a Sigmoid.
5. The method for calculating the confidence of end-to-end speech in speech recognition according to claim 4, wherein the confidence calculation method comprises a training phase of the confidence calculation model,
the training phase comprising:
training the whole confidence calculation model through a back-propagation algorithm, using the recognition result and the confidence features as input.
6. The method of claim 5, wherein training the whole confidence calculation model through a back-propagation algorithm, using the recognition result and the confidence features as input, comprises:
taking the recognition result and the confidence features as input, feeding them, after feature concatenation and positional encoding, into the confidence calculation model, and outputting the word correct probabilities and the sentence correct probability through a final Sigmoid layer;
computing the minimum edit distance between the correct transcription and the recognition result to obtain the word labels and the sentence label for the model;
and modeling a logistic-regression loss from the word and sentence correct probabilities and the word and sentence labels, and training the whole confidence calculation model through a back-propagation algorithm.
7. The method for calculating the confidence of end-to-end speech in speech recognition according to claim 5, wherein the confidence calculation method comprises a prediction phase of the confidence calculation model,
the prediction phase comprising:
taking the recognition result and the confidence features as input, feeding them, after feature concatenation and positional encoding, into the trained confidence calculation model, outputting the correct probability of each word of the recognition result through one head, and outputting the correct probability of the sentence through the other head for use by downstream tasks.
8. An apparatus for calculating the confidence of end-to-end speech in speech recognition, comprising:
an acoustic feature extraction module for extracting the acoustic features of each frame of data of the input audio;
a recognition module for inputting the acoustic features into a speech recognition decoder and obtaining a corresponding recognition result;
a confidence feature extraction module for extracting the confidence feature of each word in the recognition result according to the acoustic features, the recognition result and a preset feature abstraction model; and
a confidence calculation module for taking the recognition result and the extracted confidence features as the input of the confidence calculation model and predicting the confidence of each word and of the sentence in the recognition result.
9. A server, characterized by comprising the apparatus for calculating the confidence of end-to-end speech in speech recognition according to claim 8.
10. A non-transitory computer-readable storage medium storing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the method for calculating the confidence of end-to-end speech in speech recognition according to any one of claims 1-7.
CN202111403940.1A 2021-11-24 2021-11-24 End-to-end voice confidence calculation method, device, server and medium Pending CN114005434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111403940.1A CN114005434A (en) 2021-11-24 2021-11-24 End-to-end voice confidence calculation method, device, server and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111403940.1A CN114005434A (en) 2021-11-24 2021-11-24 End-to-end voice confidence calculation method, device, server and medium

Publications (1)

Publication Number Publication Date
CN114005434A true CN114005434A (en) 2022-02-01

Family

ID=79930149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111403940.1A Pending CN114005434A (en) 2021-11-24 2021-11-24 End-to-end voice confidence calculation method, device, server and medium

Country Status (1)

Country Link
CN (1) CN114005434A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376491A (en) * 2022-07-06 2022-11-22 北京数美时代科技有限公司 Voice confidence calculation method, system, electronic equipment and medium
CN115376491B (en) * 2022-07-06 2023-08-18 北京数美时代科技有限公司 Voice confidence calculation method, system, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN111369996B (en) Speech recognition text error correction method in specific field
CN113811946A (en) End-to-end automatic speech recognition of digital sequences
KR20230147685A (en) Word-level reliability learning for subword end-to-end automatic speech recognition
US20230092440A1 (en) N-best softmax smoothing for minimum bayes risk training of attention based sequence-to-sequence models
CN113327595B (en) Pronunciation deviation detection method and device and storage medium
CN113609824A (en) Multi-turn dialog rewriting method and system based on text editing and grammar error correction
US20230104228A1 (en) Joint Unsupervised and Supervised Training for Multilingual ASR
Samarakoon et al. Domain adaptation of end-to-end speech recognition in low-resource settings
CN114781377B (en) Error correction model, training and error correction method for non-aligned text
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN116502628A (en) Multi-stage fusion text error correction method for government affair field based on knowledge graph
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
US20220310065A1 (en) Supervised and Unsupervised Training with Contrastive Loss Over Sequences
Collobert et al. Word-level speech recognition with a letter to word encoder
Sun et al. Tree-constrained pointer generator with graph neural network encodings for contextual speech recognition
CN114005434A (en) End-to-end voice confidence calculation method, device, server and medium
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
Wei et al. Leveraging acoustic contextual representation by audio-textual cross-modal learning for conversational asr
CN116227472B (en) Method for constructing accessory synonym library for BERT-FLAT entity recognition
CN116665675B (en) Voice transcription method, system, electronic equipment and storage medium
CN116090468A (en) Entity relationship joint extraction method and system based on stacked pointer network
Soltau et al. RNN Transducers for Nested Named Entity Recognition with constraints on alignment for long sequences
CN115223549A (en) Vietnamese speech recognition corpus construction method
CN114912441A (en) Text error correction model generation method, error correction method, system, device and medium
Shafran et al. Efficient determinization of tagged word lattices using categorial and lexicographic semirings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination