CN101609672B - Speech recognition semantic confidence feature extraction method and device


Info

Publication number
CN101609672B
Authority
CN
China
Prior art keywords
topic
recognition result
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100888676A
Other languages
Chinese (zh)
Other versions
CN101609672A (en)
Inventor
陈伟
刘刚
郭军
国玉晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN2009100888676A priority Critical patent/CN101609672B/en
Publication of CN101609672A publication Critical patent/CN101609672A/en
Application granted granted Critical
Publication of CN101609672B publication Critical patent/CN101609672B/en

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a speech recognition semantic confidence feature extraction method. The method comprises: performing inference on a speech recognition result through a topic model to obtain the topic structure of the recognition result; using the inference result to calculate the topic distribution of each word; selecting from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words; using the topic distributions of the anchor words to calculate a reference topic distribution for the whole recognition result; and comparing the similarity between the topic distribution of each word in the recognition result and the reference topic distribution, the similarity serving as the semantic confidence feature of the word. The invention further discloses a speech recognition semantic confidence feature extraction device. The invention provides semantic high-level information to guide confidence annotation, so that speech recognition results can be described and analyzed more accurately and the precision of confidence annotation is improved.

Description

Method and device for extracting semantic confidence features of speech recognition
Technical Field
The invention relates to the field of speech recognition, and in particular to a method and a device for extracting semantic confidence features.
Background
Speech recognition confidence features are the key to evaluating the reliability of a recognition result after speech recognition, and are mainly used to solve the problem of speech recognition confidence labeling.
Generally, confidence labeling classifies the labeling primitives in a recognition result as either correct or incorrect, based on individual confidence features or feature combinations, so as to evaluate the reliability of the recognition result. Words are usually used as the labeling primitives, though speech frames, phonemes, sentences and the like can also be used.
Currently, speech recognition confidence features are derived mainly from decoder information. However, Huang Zengyang, in his 1998 book on HNC (Hierarchical Network of Concepts) theory published by Tsinghua University Press, notes that hearing experiments show human auditory preprocessing alone picks up only about 70% of the syllables in a continuous speech stream; when pronunciation is fuzzy, listeners use grammatical, semantic and other knowledge to guide their understanding of speech. The quality of speech recognition therefore also depends on the disambiguation and error-correction capability of a post-processing system, so high-level information such as grammar and semantics is very important for speech recognition post-processing. Yet it remains difficult for a machine to extract grammatical and semantic confidence features efficiently during post-processing.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the speech confidence features extracted by existing methods all come from decoder information, so the source of feature information is narrow, and semantic-level confidence features cannot be extracted effectively from high-level information such as semantics to guide the evaluation of recognition results.
The present method is based on statistical topic models: given a recognition result, a topic model is used to extract the topic structure implicit in the result, a relatively stable latent semantic structure that people can understand. This provides a semantic-level description of the recognition result, from which semantic features can be extracted for words or other confidence labeling primitives in the result. Suitable topic models include Latent Dirichlet Allocation (LDA), Probabilistic Latent Semantic Analysis (PLSA), and the like.
Disclosure of Invention
In view of this, one or more embodiments of the present invention provide a method and an apparatus for semantic confidence feature extraction, so as to broaden the information sources of confidence features, describe and analyze speech recognition results more accurately through semantic and other knowledge, and improve the precision of confidence labeling.
The embodiment of the invention provides a method for extracting semantic confidence features of speech recognition, comprising the following steps:
performing inference on the speech recognition result through a topic model to obtain the topic structure of the recognition result;
calculating the topic distribution of each word using the inference result, selecting from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and calculating the reference topic distribution of the whole recognition result using the topic distributions of the anchor words;
and comparing the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result, the similarity serving as the semantic confidence feature of the word.
Also disclosed is a speech recognition semantic confidence feature extraction device, comprising:
a topic analysis device, configured to perform inference analysis on the recognition result using the topic model to obtain the topic structure of the recognition result;
a posterior probability generation device, configured to calculate the acoustic posterior probability of each word in the recognition result using the detailed decoding information recorded during speech recognition;
a word topic distribution generation device, configured to calculate the topic distribution of each word from the topic structure obtained by the topic analysis device;
a document reference topic distribution generation device, configured to determine anchor words: using the topic structure obtained by the topic analysis device and the acoustic posterior probabilities of the words obtained by the posterior probability generation device, it selects from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculates the reference topic distribution of the whole recognition result from the topic distributions of the anchor words;
and a semantic feature extraction device, configured to compare the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result as the semantic confidence feature of the word.
Compared with the prior art, the speech recognition semantic confidence features provided by the embodiments of the invention supply semantic high-level guidance for confidence annotation, so that speech recognition results can be described and analyzed more accurately and the precision of confidence annotation is improved.
Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a block diagram of an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the generation of the reference topic distribution of a recognition result according to an embodiment of the present invention;
FIG. 2-1 is a flowchart illustrating a method for finding anchor words according to an embodiment of the present invention;
FIG. 2-2 is a schematic diagram of how labeling precision varies with the anchor word search parameters, taking confidence labeling with the combination of the acoustic posterior probability and the semantic confidence feature of the present invention as an example;
fig. 3 is a block diagram of a semantic confidence feature extraction device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
The technical solutions for semantic confidence feature extraction provided by the embodiments of the present invention rest on a basic premise: correctly recognized words in a recognition result conform to semantic rules better than incorrectly recognized words. The inventors conceived the related embodiments of the present invention on this premise.
In the embodiment of the present invention, the semantic confidence feature extraction function may be divided as follows:
the first functional unit of the embodiment of the invention mainly uses a large number of document sets to train the theme model.
The second functional unit of the embodiment of the invention mainly performs voice recognition, outputs the final recognition result and records the whole decoding process in detail.
The third functional unit of the embodiment of the invention is mainly used for extracting semantic confidence characteristics of words in the recognition result under the guidance of the information generated by the first functional unit and the second functional unit. Reasoning and analyzing the voice recognition result by using a topic model generated by the first functional unit to obtain a topic structure in the recognition result; and calculating the acoustic posterior probability of each word in the recognition result by using the detailed decoding information recorded by the second functional unit. Under the guidance of the information, calculating to obtain the topic distribution of the words; selecting a certain number of words with acoustic posterior probability larger than a certain threshold and strong subject from the recognition result as anchor words, and calculating to obtain the reference subject distribution of the whole recognition result by utilizing the subject distribution of the anchor words; and comparing the similarity between the topic distribution of the words in the recognition result and the reference topic distribution of the recognition result as the semantic confidence characteristics of the words.
It should be noted that the division into the functional modules above is only relative and mainly serves to help those skilled in the art understand the principle of the invention as a whole; embodiments of the invention may use other functional modules and combinations thereof to achieve the same technical effect without departing from the scope of the invention.
As shown in fig. 1, it is a structural block diagram of an embodiment of the present invention, including:
the system comprises a first functional unit 101, a second functional unit 102 and a third functional unit 103, wherein the third functional unit is respectively connected with the first functional unit and the second functional unit, and the first functional unit 101 comprises a document set 1011, a topic model training module 1012 and a topic model 1013; the second functional unit 102 includes a speech data input module 1021, a speech recognition module 1022, a speech recognition result 1023, and speech recognition decoded information 1024, and the third functional unit includes a topic model analysis module 1031, a posterior probability generation module 1032, a word topic distribution generation module 1033, a document reference topic distribution generation module 1034, and a semantic feature extraction module 1035.
Next, taking LDA as an example, the topic model analysis module 1031 and the word topic distribution generation module 1033 are introduced.
The LDA model is a recently proposed unsupervised topic model that can extract the latent topics of a text; it is a generative probabilistic model with a three-layer structure of words, topics and documents. Suppose the document set used to train LDA contains M documents and V distinct words, and the number of LDA topics is K, i.e. z \in \{z_1, z_2, \ldots, z_K\}. The number of words in the current recognition result d is N_d, with corresponding word sequence \vec{w} = (w_1, w_2, \ldots, w_{N_d}).
The topic model analysis module 1031 obtains the topic structure of the current recognition result d through LDA inference, namely the probability of word w under a given topic j and the probability of topic j under the current recognition result d:

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d).
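For illustration, the quantities \Phi_j^{(w)} and \theta_j^{(d)} can be produced by any off-the-shelf LDA implementation. The following sketch uses the gensim library, which the patent does not mention, so the toolkit choice, the toy documents and all variable names are assumptions made here purely for illustration.

```python
# Hypothetical sketch: obtaining Phi[j, w] = P(w | z=j) and theta_d[j] = P(z=j | d)
# with gensim's LdaModel. Toolkit choice and toy data are assumptions.
from gensim import corpora, models

# Stand-ins for the training document set and a recognition result d.
training_docs = [
    ["stock", "market", "price", "trade"],
    ["game", "team", "score", "player"],
    ["market", "trade", "economy", "bank"],
]
recognized_words = ["stock", "trade", "bank"]

K = 2                                        # number of topics
dictionary = corpora.Dictionary(training_docs)
corpus = [dictionary.doc2bow(doc) for doc in training_docs]
lda = models.LdaModel(corpus, num_topics=K, id2word=dictionary,
                      passes=20, random_state=0)

Phi = lda.get_topics()                       # K x V array, Phi[j, w] = P(w | z=j)
bow_d = dictionary.doc2bow(recognized_words)
theta_d = dict(lda.get_document_topics(bow_d, minimum_probability=0.0))
print(Phi.shape)                             # (K, V)
print(theta_d)                               # {j: P(z=j | d)}
```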
the Topic distribution generating module 1034 calculates Topic distribution Topic _ dis (w) of words by using the information obtained by the Topic model analyzing module 1031i) Wherein w isiTo identify a word in the result d, Topic _ dis (w)i) Is a vector of K dimension, and is specifically shown in the following formula:
Topic_dis(w_i) = (H(w_i, z_1), H(w_i, z_2), \ldots, H(w_i, z_K));

where

H(w_i, z_j) = P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{P(w_i)} = \frac{\Phi_j^{(w_i)}\, P(z_j)}{P(w_i)};

P(z_j) = \sum_{i=1}^{M} P(z_j, d_i) = \sum_{i=1}^{M} P(z_j \mid d_i)\, P(d_i) = P(d) \sum_{i=1}^{M} \theta_j^{(d_i)};

(note: the prior probability of each document is taken to be uniform, i.e. P(d_i) = P(d), i = 1 \ldots M)

P(w_i) = \sum_{j=1}^{K} P(w_i, z_j) = \sum_{j=1}^{K} P(w_i \mid z_j)\, P(z_j) = \sum_{j=1}^{K} \Phi_j^{(w_i)}\, P(z_j).
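A minimal sketch of these three formulas, assuming Phi is the K x V topic-word matrix and Theta the M x K document-topic matrix of the trained model (the array names and layout are assumptions of this sketch; a Theta matrix could, for example, be assembled by running the trained model's inference over the M training documents):

```python
import numpy as np

def word_topic_distribution(w, Phi, Theta):
    """Topic_dis(w) = (H(w, z_1), ..., H(w, z_K)) for a word index w.

    Phi:   K x V array, Phi[j, w]   = P(w | z=j)
    Theta: M x K array, Theta[i, j] = P(z=j | d_i) over the M training documents
    """
    M = Theta.shape[0]
    p_z = Theta.sum(axis=0) / M          # P(z_j) under the uniform prior P(d_i) = 1/M
    p_w = float(np.dot(Phi[:, w], p_z))  # P(w) = sum_j Phi[j, w] * P(z_j)
    return Phi[:, w] * p_z / p_w         # H(w, z_j) = Phi[j, w] * P(z_j) / P(w)
```

The returned vector sums to one over the K topics, and its maximum entry is the max_prob(w_i) used in the anchor word search below.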
The method of the document reference topic distribution generation module 1034 in FIG. 1 is described below, taking LDA as an example, in conjunction with FIG. 2 through FIG. 2-2.
As shown in FIG. 2, which is a flowchart of the recognition result reference topic distribution generation in the embodiment of the present invention, the flow comprises:
201. Perform topic model inference on the current recognition result to obtain the topic structure of the recognition result;
202. Search for anchor words in the recognition result according to the inference result and the posterior probabilities. The words in the recognition result d are consistent with the topic expressed by the whole document, but the topic distribution of d is mainly determined by the strongly topical words in d. To calculate the reference topic distribution of the recognition result, those topic-determining words must be found; they are called anchor words. Because the recognition result contains misrecognized words, an anchor word must be very likely to be recognized correctly, i.e. its acoustic posterior probability must be large enough, and it must also be strongly topical. The specific search method is shown in FIG. 2-1, a flowchart of the anchor word search method according to an embodiment of the present invention:
2021. Calculate the acoustic posterior probability of each word in the recognition result from the detailed decoding information recorded during speech recognition;
2022. Set a posterior probability threshold, named PPThresh; when the posterior probability of a word is greater than the threshold, add the word to a credible class named CClass; otherwise, discard the word;
2023. Count the number of words in the credible class CClass, named C_num;
2024. Judge whether any word exists in the credible class CClass, i.e. whether C_num is 0;
2025. If no word exists in the credible class CClass, i.e. C_num = 0, change the posterior probability threshold PPThresh and reselect words for the credible class;
2026. If there are words in the credible class CClass, i.e. C_num ≠ 0, calculate Topic_dis(w_i) for each word in CClass and record the maximum of the corresponding H(w_i, z_j), i.e.

max\_prob(w_i) = \max_{j = 1 \ldots K} H(w_i, z_j);

this maximum reflects the strength of the word's topic;
2027. Set the selection proportion of anchor words, named Aratio; the number of anchor words is L = INT(C_num \times Aratio) + 1, where INT() is the integer (rounding) function; then select from the credible class CClass the L words with the largest max_prob(w_i), in descending order, as the anchor words of the current document (a code sketch of these steps follows).
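A compact sketch of steps 2021 through 2027, reusing word_topic_distribution() from above. PPThresh = 0.88 echoes the value discussed with FIG. 2-2, Aratio = 0.3 is purely illustrative, and the step-by-step lowering of the threshold is only one plausible reading of step 2025, which merely says the threshold is changed:

```python
def find_anchor_words(words, posterior, Phi, Theta,
                      pp_thresh=0.88, a_ratio=0.3):
    """Sketch of steps 2021-2027: select anchor words from recognition result d.

    words:     word indices appearing in the recognition result d
    posterior: dict mapping word index -> acoustic posterior probability (step 2021)
    """
    # 2022/2023: credible class CClass and its size C_num
    cclass = [w for w in words if posterior[w] > pp_thresh]
    # 2024/2025: if CClass is empty, change the threshold and reselect
    # (lowering it stepwise is an assumption; the patent only says "change")
    while not cclass and pp_thresh > 0.0:
        pp_thresh -= 0.1
        cclass = [w for w in words if posterior[w] > pp_thresh]
    # 2026: topic strength max_prob(w) = max_j H(w, z_j)
    max_prob = {w: word_topic_distribution(w, Phi, Theta).max() for w in cclass}
    # 2027: L = INT(C_num * Aratio) + 1 anchors, taken in descending max_prob order
    L = int(len(cclass) * a_ratio) + 1
    return sorted(cclass, key=max_prob.get, reverse=True)[:L]
```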
203. After the anchor words in the recognition result have been found in step 202, collect the topic distributions of the anchor words. Assume there are L anchor words, with corresponding sequence A_1, A_2, \ldots, A_L; the topic distribution of anchor word A_i is then Topic_dis(A_i), i = 1 \ldots L.
204. Calculate the reference topic distribution of the recognition result d, named Topic_dis(d), from the topic distributions of the anchor words; it is a K-dimensional vector, as shown in the following formula:

Topic_dis(d) = (L(d, z_1), L(d, z_2), \ldots, L(d, z_K))

where

L(d, z_j) = Com(H(A_1, z_j), H(A_2, z_j), \ldots, H(A_L, z_j));

and Com() is a function that combines the probability values of the anchor words under a given topic, e.g. the arithmetic mean:

L(d, z_j) = \frac{1}{L} \sum_{i=1}^{L} H(A_i, z_j)
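Continuing the sketch with the arithmetic-mean choice of Com(), the reference distribution is simply the mean of the anchor words' topic vectors (same numpy import and naming assumptions as above):

```python
def reference_topic_distribution(anchors, Phi, Theta):
    """Topic_dis(d): arithmetic-mean combination Com() over the anchor words."""
    rows = np.array([word_topic_distribution(a, Phi, Theta) for a in anchors])
    return rows.mean(axis=0)     # K-dimensional vector (L(d, z_1), ..., L(d, z_K))
```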
Thus, the semantic feature extraction module 1035 of FIG. 1 can compare the similarity between the word topic distribution Topic_dis(w_i) and the document reference topic distribution Topic_dis(d) as the semantic confidence feature of each word in the recognition result, i.e.

Sem(w_i) = Similarity(Topic_dis(w_i), Topic_dis(d))

where Sem(w_i) is the semantic confidence feature of word w_i. Many similarity measures Similarity() are possible, for example the symmetric K-L divergence:

Let M1 = Topic_dis(w_i) and M2 = Topic_dis(d).

The K-L divergence from M1 to M2, with M2 as the reference model, is defined as

D_{KL}(M1 \| M2) = \sum_{j=1}^{K} H(w_i, z_j) \log\left( \frac{H(w_i, z_j)}{L(d, z_j)} \right)

To avoid depending on the choice of reference model, the symmetric K-L divergence is used as the similarity measure, so the semantic confidence feature of the word is

Sem(w_i) = \frac{1}{2} \left\{ D_{KL}(M1 \| M2) + D_{KL}(M2 \| M1) \right\}
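A sketch of this final feature under the same assumptions; the small epsilon guarding against zero probabilities is a numerical detail added here, not something the patent specifies:

```python
def semantic_confidence(w, anchors, Phi, Theta, eps=1e-12):
    """Sem(w) = (1/2) * (D_KL(M1||M2) + D_KL(M2||M1)) for a word index w."""
    m1 = word_topic_distribution(w, Phi, Theta) + eps             # Topic_dis(w_i)
    m2 = reference_topic_distribution(anchors, Phi, Theta) + eps  # Topic_dis(d)
    d12 = float(np.sum(m1 * np.log(m1 / m2)))
    d21 = float(np.sum(m2 * np.log(m2 / m1)))
    return 0.5 * (d12 + d21)
```

A smaller Sem(w_i) means the word's topic distribution is closer to the reference distribution of the recognition result, which, under the premise stated earlier, indicates the word is more likely to be correctly recognized.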
FIG. 2-2 shows how labeling precision varies with the anchor word search parameters, taking confidence labeling with the combination of the acoustic posterior probability and the semantic confidence feature of the present invention as an example.
As can be seen from FIG. 2-2, using the anchor word search parameters without an acoustic posterior probability threshold (PPThresh = 0) performs worse than using a threshold (PPThresh = 0.88 in the figure). This confirms that anchor word selection must favor words that are very likely to be recognized correctly, i.e. words whose acoustic posterior probability exceeds the threshold. Moreover, even with the acoustic posterior probability threshold in place, the labeling performance varies considerably with the anchor word selection proportion Aratio, which shows that Aratio must also be chosen carefully. In short, anchor words must be highly likely to be correctly recognized, i.e. have a sufficiently large acoustic posterior probability, and at the same time be strongly topical; only then can high-performance semantic confidence features be extracted.
As shown in fig. 3, an embodiment of the present invention further provides a speech recognition semantic confidence feature extraction apparatus, including:
a topic analysis device 301, configured to perform inference analysis on the recognition result using the topic model to obtain the topic structure of the recognition result; that is, assuming the number of topics is K, i.e. z \in \{z_1, \ldots, z_K\}, it gives the probability of word w under topic j and the probability of topic j under the current recognition result d:

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d);
a posterior probability generating device 302, configured to calculate, by using detailed decoding information recorded in the speech recognition process, an acoustic posterior probability of each word in the recognition result;
a word topic distribution generation device 303, configured to calculate the topic distribution Topic_dis(w_i) of each word from the topic structure obtained by the topic analysis device 301, according to the formula

Topic_dis(w_i) = (H(w_i, z_1), H(w_i, z_2), \ldots, H(w_i, z_K));

where

H(w_i, z_j) = P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{P(w_i)} = \frac{\Phi_j^{(w_i)}\, P(z_j)}{P(w_i)};

P(z_j) = \sum_{i=1}^{M} P(z_j, d_i) = \sum_{i=1}^{M} P(z_j \mid d_i)\, P(d_i) = P(d) \sum_{i=1}^{M} \theta_j^{(d_i)};

(note: the prior probability of each document is taken to be uniform, i.e. P(d_i) = P(d), i = 1 \ldots M)

P(w_i) = \sum_{j=1}^{K} P(w_i, z_j) = \sum_{j=1}^{K} P(w_i \mid z_j)\, P(z_j) = \sum_{j=1}^{K} \Phi_j^{(w_i)}\, P(z_j);
a document reference topic distribution generation device 304, configured to determine anchor words: using the topic structure obtained by the topic analysis device 301 and the acoustic posterior probabilities of the words obtained by the posterior probability generation device 302, it selects from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculates the reference topic distribution of the whole recognition result from the topic distributions of the anchor words. Assuming there are L anchor words, with corresponding sequence A_1, \ldots, A_L, the reference topic distribution of the recognition result d, named Topic_dis(d), a K-dimensional vector, is calculated from the topic distributions of the anchor words according to the formula:

Topic_dis(d) = (L(d, z_1), L(d, z_2), \ldots, L(d, z_K));

where

L(d, z_j) = Com(H(A_1, z_j), H(A_2, z_j), \ldots, H(A_L, z_j));

and Com() is a function that combines the probability values of the anchor words under a given topic;
a semantic feature extraction device 305, configured to compare the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result as the semantic confidence feature of the word, specifically by the formula

Sem(w_i) = Similarity(Topic_dis(w_i), Topic_dis(d))

where Sem(w_i) is the semantic confidence feature of word w_i, and Similarity() is a similarity measure.
The device embodiment has the same technical effects as the method embodiment, which are not repeated here.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, though in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for extracting semantic confidence features of speech recognition, characterized by comprising the following steps:
performing inference on the speech recognition result through a topic model to obtain the topic structure of the recognition result;
calculating the topic distribution of each word using the inference result;
selecting from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculating the reference topic distribution of the recognition result using the topic distributions of the anchor words;
and comparing the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result, the similarity serving as the semantic confidence feature of the word.
2. The method of claim 1, wherein inferring the speech recognition result through a topic model to obtain a topic structure of the recognition result comprises:
assuming the number of topics is K, i.e. z \in \{z_1, \ldots, z_K\}, obtaining through topic model inference the topic structure of the current recognition result d, namely the probability of word w under a given topic j and the probability of topic j under the current recognition result d:

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d).
3. The method of claim 2, wherein calculating the topic distribution of words using the inference result comprises:
using

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d)

to calculate the topic distribution Topic_dis(w_i) of each word, where w_i is a word in the recognition result d and Topic_dis(w_i) is a K-dimensional vector, as shown in the following formula:

Topic_dis(w_i) = (H(w_i, z_1), H(w_i, z_2), \ldots, H(w_i, z_K));

where

H(w_i, z_j) = P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{P(w_i)} = \frac{\Phi_j^{(w_i)}\, P(z_j)}{P(w_i)};

P(z_j) = \sum_{i=1}^{M} P(z_j, d_i) = \sum_{i=1}^{M} P(z_j \mid d_i)\, P(d_i) = P(d) \sum_{i=1}^{M} \theta_j^{(d_i)};

where M is the number of training documents of the topic model and d_i is the i-th training document; the prior probability of each document is taken to be uniform, i.e. P(d_i) = P(d), i = 1 \ldots M; then

P(w_i) = \sum_{j=1}^{K} P(w_i, z_j) = \sum_{j=1}^{K} P(w_i \mid z_j)\, P(z_j) = \sum_{j=1}^{K} \Phi_j^{(w_i)}\, P(z_j).
4. The method of claim 3, wherein selecting from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculating the reference topic distribution of the recognition result using the topic distributions of the anchor words, comprises:
calculating the acoustic posterior probability of each word in the recognition result from the detailed decoding information recorded during speech recognition;
setting a posterior probability threshold, adding a word to the credible class when the posterior probability of the word in the recognition result is greater than the threshold, and discarding the word if its posterior probability is less than the threshold;
counting the number of words in the credible class, named C_num;
judging whether any word exists in the credible class; if no word exists in the credible class, changing the posterior probability threshold and reselecting words for the credible class;
if there are words in the credible class, calculating Topic_dis(w_i) for each word in the credible class and recording the maximum of the corresponding H(w_i, z_j), i.e.

max\_prob(w_i) = \max_{j = 1 \ldots K} H(w_i, z_j),

the maximum reflecting the strength of the word's topic;
setting the selection proportion Aratio of anchor words, the number of anchor words being L = INT(C_num \times Aratio) + 1, where INT() is the integer (rounding) function; and selecting from the credible class the L words with the largest max_prob(w_i), in descending order, as the anchor words of the current recognition result;
counting the topic distributions of the anchor words: assuming there are L anchor words, with corresponding sequence A_1, \ldots, A_L, the topic distribution of anchor word A_i is Topic_dis(A_i), i = 1 \ldots L;
calculating the reference topic distribution of the recognition result d, named Topic_dis(d), a K-dimensional vector, from the topic distributions of the anchor words, as shown in the following formula:

Topic_dis(d) = (L(d, z_1), L(d, z_2), \ldots, L(d, z_K));

where

L(d, z_j) = Com(H(A_1, z_j), H(A_2, z_j), \ldots, H(A_L, z_j));

and Com() is the arithmetic mean of the probability values of the anchor words under the j-th topic.
5. The method of claim 4, wherein using the topic distribution of the words in the recognition result to compare their similarity to a reference topic distribution of the recognition result as semantic confidence features for the words comprises:
comparing the similarity between the word topic distribution Topic_dis(w_i) and the reference topic distribution Topic_dis(d) of the recognition result as the semantic confidence feature of each word in the recognition result, i.e.

Sem(w_i) = Similarity(Topic_dis(w_i), Topic_dis(d))

where Sem(w_i) is the semantic confidence feature of word w_i, and Similarity() is a similarity measure function using the symmetric K-L divergence.
6. A speech recognition semantic confidence feature extraction device, comprising:
a topic analysis device, configured to perform inference analysis on the recognition result using the topic model to obtain the topic structure of the recognition result;
a posterior probability generation device, configured to calculate the acoustic posterior probability of each word in the recognition result using the detailed decoding information recorded during speech recognition;
a word topic distribution generation device, configured to calculate the topic distribution of each word from the topic structure obtained by the topic analysis device;
a document reference topic distribution generation device, configured to determine anchor words: using the topic structure obtained by the topic analysis device and the acoustic posterior probabilities of the words obtained by the posterior probability generation device, it selects from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculates the reference topic distribution of the recognition result from the topic distributions of the anchor words;
and a semantic feature extraction device, configured to compare the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result as the semantic confidence feature of the word.
7. The apparatus of claim 6, wherein the topic analysis device is configured to perform inference analysis on the recognition result using the topic model to obtain the topic structure of the recognition result; that is, assuming the number of topics is K, i.e. z \in \{z_1, \ldots, z_K\}, it gives the probability of word w under topic j and the probability of topic j under the current recognition result d:

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d).
8. The apparatus of claim 7, wherein the word topic distribution generation device is configured to use

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d)

to calculate the topic distribution Topic_dis(w_i) of each word according to the formula Topic_dis(w_i) = (H(w_i, z_1), H(w_i, z_2), \ldots, H(w_i, z_K)); where

H(w_i, z_j) = P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{P(w_i)} = \frac{\Phi_j^{(w_i)}\, P(z_j)}{P(w_i)};

P(z_j) = \sum_{i=1}^{M} P(z_j, d_i) = \sum_{i=1}^{M} P(z_j \mid d_i)\, P(d_i) = P(d) \sum_{i=1}^{M} \theta_j^{(d_i)};

where M is the number of training documents of the topic model and d_i is the i-th training document; the prior probability of each document is taken to be uniform, i.e. P(d_i) = P(d), i = 1 \ldots M; then

P(w_i) = \sum_{j=1}^{K} P(w_i, z_j) = \sum_{j=1}^{K} P(w_i \mid z_j)\, P(z_j) = \sum_{j=1}^{K} \Phi_j^{(w_i)}\, P(z_j).
9. The apparatus of claim 8, wherein the document reference topic distribution generation device is configured to select from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, using the topic structure obtained by the topic analysis device and the acoustic posterior probabilities of the words obtained by the posterior probability generation device, and then to calculate the reference topic distribution of the whole recognition result from the topic distributions of the anchor words; assuming there are L anchor words, with corresponding sequence A_1, \ldots, A_L, the reference topic distribution of the recognition result d, named Topic_dis(d), a K-dimensional vector, is calculated from the topic distributions of the anchor words according to the formula:

Topic_dis(d) = (L(d, z_1), L(d, z_2), \ldots, L(d, z_K));

where

L(d, z_j) = Com(H(A_1, z_j), H(A_2, z_j), \ldots, H(A_L, z_j));

and Com() is the arithmetic mean of the probability values of the anchor words under the j-th topic.
10. The apparatus of claim 9, wherein the semantic feature extraction device is configured to compare the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result as the semantic confidence feature of the word, using the formula

Sem(w_i) = Similarity(Topic_dis(w_i), Topic_dis(d))

where Sem(w_i) is the semantic confidence feature of word w_i, and Similarity() is a similarity measure that uses the symmetric K-L divergence.
CN2009100888676A 2009-07-21 2009-07-21 Speech recognition semantic confidence feature extraction method and device Expired - Fee Related CN101609672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100888676A CN101609672B (en) 2009-07-21 2009-07-21 Speech recognition semantic confidence feature extraction method and device


Publications (2)

Publication Number Publication Date
CN101609672A CN101609672A (en) 2009-12-23
CN101609672B true CN101609672B (en) 2011-09-07

Family

ID=41483397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100888676A Expired - Fee Related CN101609672B (en) 2009-07-21 2009-07-21 Speech recognition semantic confidence feature extraction method and device

Country Status (1)

Country Link
CN (1) CN101609672B (en)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894549A (en) * 2010-06-24 2010-11-24 中国科学院声学研究所 Method for fast calculating confidence level in speech recognition application field
CN103177721B (en) * 2011-12-26 2015-08-19 中国电信股份有限公司 Audio recognition method and system
CN103700368B (en) * 2014-01-13 2017-01-18 联想(北京)有限公司 Speech recognition method, speech recognition device and electronic equipment
CN105529028B (en) * 2015-12-09 2019-07-30 百度在线网络技术(北京)有限公司 Speech analysis method and apparatus
CN107195299A (en) * 2016-03-14 2017-09-22 株式会社东芝 Train the method and apparatus and audio recognition method and device of neutral net acoustic model
DE102017213946B4 (en) * 2017-08-10 2022-11-10 Audi Ag Method for processing a recognition result of an automatic online speech recognizer for a mobile terminal
CN112435656B (en) * 2020-12-11 2024-03-01 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
CN115376499B (en) * 2022-08-18 2023-07-28 东莞市乐移电子科技有限公司 Learning monitoring method of intelligent earphone applied to learning field


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1293428A (en) * 2000-11-10 2001-05-02 清华大学 Information check method based on speed recognition
CN1490786A (en) * 2002-10-17 2004-04-21 中国科学院声学研究所 Phonetic recognition confidence evaluating method, system and dictation device therewith
CN101013421A (en) * 2007-02-02 2007-08-08 清华大学 Rule-based automatic analysis method of Chinese basic block
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party

Title
Cox, S. J.; Dasmahapatra, S. High-level Approaches to Confidence Estimation in Speech Recognition. IEEE Transactions on Speech and Audio, 2002, 460-471. *
Inkpen, Diana; Desilets, Alain. Semantic Similarity for Detecting Recognition Errors in Automatic Speech Transcripts. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005, 49-56. *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106062868A (en) * 2014-07-25 2016-10-26 谷歌公司 Providing pre-computed hotword models
CN106062868B (en) * 2014-07-25 2019-10-29 谷歌有限责任公司 The hot word model precalculated is provided

Also Published As

Publication number Publication date
CN101609672A (en) 2009-12-23

Similar Documents

Publication Publication Date Title
Chung et al. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech
CN101609672B (en) Speech recognition semantic confidence feature extraction method and device
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
Ghosh et al. Fracking sarcasm using neural network
CN106328147B (en) Speech recognition method and device
Cummins et al. Multimodal bag-of-words for cross domains sentiment analysis
Levitan et al. Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection.
Bone et al. Intoxicated speech detection: A fusion framework with speaker-normalized hierarchical functionals and GMM supervectors
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
Houjeij et al. A novel approach for emotion classification based on fusion of text and speech
Wang et al. Automatic detection of speaker state: Lexical, prosodic, and phonetic approaches to level-of-interest and intoxication classification
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113901200A (en) Text summarization method and device based on topic model and storage medium
Wataraka Gamage et al. Speech-based continuous emotion prediction by learning perception responses related to salient events: A study based on vocal affect bursts and cross-cultural affect in AVEC 2018
Chou et al. Automatic deception detection using multiple speech and language communicative descriptors in dialogs
Verkholyak et al. A Bimodal Approach for Speech Emotion Recognition using Audio and Text.
Liyanage et al. Augmenting reddit posts to determine wellness dimensions impacting mental health
Ranjith et al. GTSO: Gradient tangent search optimization enabled voice transformer with speech intelligibility for aphasia
Gris et al. Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person
Yue English spoken stress recognition based on natural language processing and endpoint detection algorithm
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
Chen et al. Automatic emphatic information extraction from aligned acoustic data and its application on sentence compression
Bañeras-Roux et al. Hats: An open data set integrating human perception applied to the evaluation of automatic speech recognition metrics
Singhal et al. Estimation of Accuracy in Human Gender Identification and Recall Values Based on Voice Signals Using Different Classifiers
Tang et al. A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110907

Termination date: 20140721

EXPY Termination of patent right or utility model