CN101609672B - Speech recognition semantic confidence feature extraction method and device


Info

Publication number
CN101609672B
Authority
CN
China
Prior art keywords
topic
recognition result
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100888676A
Other languages
Chinese (zh)
Other versions
CN101609672A (en)
Inventor
陈伟
刘刚
郭军
国玉晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN2009100888676A priority Critical patent/CN101609672B/en
Publication of CN101609672A publication Critical patent/CN101609672A/en
Application granted granted Critical
Publication of CN101609672B publication Critical patent/CN101609672B/en

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a speech recognition semantic confidence feature extraction method. The method comprises: performing inference on a speech recognition result through a topic model to obtain the topic structure of the recognition result; using the inference result to calculate the topic distribution of each word; selecting from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words; using the topic distributions of the anchor words to calculate a reference topic distribution for the whole recognition result; and comparing the similarity between the topic distribution of each word in the recognition result and the reference topic distribution, the similarity serving as the semantic confidence feature of the word. The invention further discloses a speech recognition semantic confidence feature extraction device. The invention provides semantic high-level information to guide confidence annotation, so that speech recognition results can be described and analyzed more accurately and the precision of confidence annotation is improved.

Description

Method and device for extracting semantic confidence features of speech recognition
Technical Field
The invention relates to the field of speech recognition, and in particular to a method and a device for extracting semantic confidence features.
Background
Speech recognition confidence features are the key to evaluating the reliability of a recognition result after speech recognition, and are mainly used to solve the problem of speech recognition confidence labeling.
Generally, confidence labeling classifies the labeling primitives in a recognition result as either correct or incorrect, based on individual confidence features or feature combinations, so as to evaluate the reliability of the recognition result. Words are usually used as the labeling primitives, though speech frames, phonemes, sentences and the like can also be used.
Currently, speech recognition confidence features are derived mainly from decoder information. However, Huang Zengyang, in his 1998 book on HNC (Hierarchical Network of Concepts) theory published by Tsinghua University Press, notes that hearing experiments show human auditory preprocessing alone picks up only about 70% of the syllables in a continuous speech stream; when pronunciation is fuzzy, listeners use grammatical, semantic and other knowledge to guide their understanding of speech. The quality of speech recognition therefore also depends on the disambiguation and error-correction capability of a post-processing system, so high-level information such as grammar and semantics is very important for speech recognition post-processing. Yet it remains difficult for a machine to extract grammatical and semantic confidence features efficiently during post-processing.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the speech confidence features extracted by existing methods all come from decoder information, so the source of feature information is narrow, and semantic-level confidence features cannot be extracted effectively from high-level information such as semantics to guide the evaluation of recognition results.
The present method is based on statistical topic models: given a recognition result, a topic model is used to extract the topic structure implicit in the result, a relatively stable latent semantic structure that people can understand. This provides a semantic-level description of the recognition result, from which semantic features can be extracted for words or other confidence labeling primitives in the result. Suitable topic models include Latent Dirichlet Allocation (LDA), Probabilistic Latent Semantic Analysis (PLSA), and the like.
Disclosure of Invention
In view of this, one or more embodiments of the present invention provide a method and an apparatus for semantic confidence feature extraction, so as to broaden the information sources of confidence features, describe and analyze speech recognition results more accurately through semantic and other knowledge, and improve the precision of confidence labeling.
The embodiment of the invention provides a method for extracting semantic confidence features of speech recognition, comprising the following steps:
performing inference on the speech recognition result through a topic model to obtain the topic structure of the recognition result;
calculating the topic distribution of each word using the inference result, selecting from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and calculating the reference topic distribution of the whole recognition result using the topic distributions of the anchor words;
and comparing the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result, the similarity serving as the semantic confidence feature of the word.
Also disclosed is a speech recognition semantic confidence feature extraction device, comprising:
a topic analysis device, configured to perform inference analysis on the recognition result using the topic model to obtain the topic structure of the recognition result;
a posterior probability generation device, configured to calculate the acoustic posterior probability of each word in the recognition result using the detailed decoding information recorded during speech recognition;
a word topic distribution generation device, configured to calculate the topic distribution of each word from the topic structure obtained by the topic analysis device;
a document reference topic distribution generation device, configured to determine anchor words: using the topic structure obtained by the topic analysis device and the acoustic posterior probabilities of the words obtained by the posterior probability generation device, it selects from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculates the reference topic distribution of the whole recognition result from the topic distributions of the anchor words;
and a semantic feature extraction device, configured to compare the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result as the semantic confidence feature of the word.
Compared with the prior art, the speech recognition semantic confidence features provided by the embodiments of the invention supply semantic high-level guidance for confidence annotation, so that speech recognition results can be described and analyzed more accurately and the precision of confidence annotation is improved.
Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a block diagram of an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the generation of the reference topic distribution of a recognition result according to an embodiment of the present invention;
FIG. 2-1 is a flowchart illustrating a method for finding anchor words according to an embodiment of the present invention;
FIG. 2-2 is a schematic diagram of how labeling precision varies with the anchor word search parameters, taking confidence labeling with the combination of the acoustic posterior probability and the semantic confidence feature of the present invention as an example;
fig. 3 is a block diagram of a semantic confidence feature extraction device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
The technical solutions for semantic confidence feature extraction provided by the embodiments of the present invention rest on a basic premise: correctly recognized words in a recognition result conform to semantic rules better than incorrectly recognized words. The inventors conceived the related embodiments of the present invention on this premise.
In the embodiment of the present invention, the semantic confidence feature extraction function may be divided as follows:
the first functional unit of the embodiment of the invention mainly uses a large number of document sets to train the theme model.
The second functional unit of the embodiment of the invention mainly performs voice recognition, outputs the final recognition result and records the whole decoding process in detail.
The third functional unit of the embodiment of the invention is mainly used for extracting semantic confidence characteristics of words in the recognition result under the guidance of the information generated by the first functional unit and the second functional unit. Reasoning and analyzing the voice recognition result by using a topic model generated by the first functional unit to obtain a topic structure in the recognition result; and calculating the acoustic posterior probability of each word in the recognition result by using the detailed decoding information recorded by the second functional unit. Under the guidance of the information, calculating to obtain the topic distribution of the words; selecting a certain number of words with acoustic posterior probability larger than a certain threshold and strong subject from the recognition result as anchor words, and calculating to obtain the reference subject distribution of the whole recognition result by utilizing the subject distribution of the anchor words; and comparing the similarity between the topic distribution of the words in the recognition result and the reference topic distribution of the recognition result as the semantic confidence characteristics of the words.
It should be noted that the division into the functional modules above is only relative and mainly serves to help those skilled in the art understand the principle of the invention as a whole; embodiments of the invention may use other functional modules and combinations thereof to achieve the same technical effect without departing from the scope of the invention.
As shown in fig. 1, it is a structural block diagram of an embodiment of the present invention, including:
the system comprises a first functional unit 101, a second functional unit 102 and a third functional unit 103, wherein the third functional unit is respectively connected with the first functional unit and the second functional unit, and the first functional unit 101 comprises a document set 1011, a topic model training module 1012 and a topic model 1013; the second functional unit 102 includes a speech data input module 1021, a speech recognition module 1022, a speech recognition result 1023, and speech recognition decoded information 1024, and the third functional unit includes a topic model analysis module 1031, a posterior probability generation module 1032, a word topic distribution generation module 1033, a document reference topic distribution generation module 1034, and a semantic feature extraction module 1035.
Next, taking LDA as an example, the topic model analysis module 1031 and the word topic distribution generation module 1033 are introduced.
The LDA model is a recently proposed unsupervised topic model that can extract the latent topics of a text; it is a generative probabilistic model with a three-layer structure of words, topics and documents. Suppose the document set used to train LDA contains M documents and V distinct words, and the number of LDA topics is K, i.e. z \in \{z_1, z_2, \ldots, z_K\}. The number of words in the current recognition result d is N_d, with corresponding word sequence \vec{w} = (w_1, w_2, \ldots, w_{N_d}).
The topic model analysis module 1031 obtains the topic structure of the current recognition result d through LDA inference, namely the probability of word w under a given topic j and the probability of topic j under the current recognition result d:

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d).
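For illustration, the quantities \Phi_j^{(w)} and \theta_j^{(d)} can be produced by any off-the-shelf LDA implementation. The following sketch uses the gensim library, which the patent does not mention, so the toolkit choice, the toy documents and all variable names are assumptions made here purely for illustration.

```python
# Hypothetical sketch: obtaining Phi[j, w] = P(w | z=j) and theta_d[j] = P(z=j | d)
# with gensim's LdaModel. Toolkit choice and toy data are assumptions.
from gensim import corpora, models

# Stand-ins for the training document set and a recognition result d.
training_docs = [
    ["stock", "market", "price", "trade"],
    ["game", "team", "score", "player"],
    ["market", "trade", "economy", "bank"],
]
recognized_words = ["stock", "trade", "bank"]

K = 2                                        # number of topics
dictionary = corpora.Dictionary(training_docs)
corpus = [dictionary.doc2bow(doc) for doc in training_docs]
lda = models.LdaModel(corpus, num_topics=K, id2word=dictionary,
                      passes=20, random_state=0)

Phi = lda.get_topics()                       # K x V array, Phi[j, w] = P(w | z=j)
bow_d = dictionary.doc2bow(recognized_words)
theta_d = dict(lda.get_document_topics(bow_d, minimum_probability=0.0))
print(Phi.shape)                             # (K, V)
print(theta_d)                               # {j: P(z=j | d)}
```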
the Topic distribution generating module 1034 calculates Topic distribution Topic _ dis (w) of words by using the information obtained by the Topic model analyzing module 1031i) Wherein w isiTo identify a word in the result d, Topic _ dis (w)i) Is a vector of K dimension, and is specifically shown in the following formula:
Topic_dis(w_i) = (H(w_i, z_1), H(w_i, z_2), \ldots, H(w_i, z_K));

where

H(w_i, z_j) = P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{P(w_i)} = \frac{\Phi_j^{(w_i)}\, P(z_j)}{P(w_i)};

P(z_j) = \sum_{i=1}^{M} P(z_j, d_i) = \sum_{i=1}^{M} P(z_j \mid d_i)\, P(d_i) = P(d) \sum_{i=1}^{M} \theta_j^{(d_i)};

(note: the prior probability of each document is taken to be uniform, i.e. P(d_i) = P(d), i = 1 \ldots M)

P(w_i) = \sum_{j=1}^{K} P(w_i, z_j) = \sum_{j=1}^{K} P(w_i \mid z_j)\, P(z_j) = \sum_{j=1}^{K} \Phi_j^{(w_i)}\, P(z_j).
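A minimal sketch of these three formulas, assuming Phi is the K x V topic-word matrix and Theta the M x K document-topic matrix of the trained model (the array names and layout are assumptions of this sketch; a Theta matrix could, for example, be assembled by running the trained model's inference over the M training documents):

```python
import numpy as np

def word_topic_distribution(w, Phi, Theta):
    """Topic_dis(w) = (H(w, z_1), ..., H(w, z_K)) for a word index w.

    Phi:   K x V array, Phi[j, w]   = P(w | z=j)
    Theta: M x K array, Theta[i, j] = P(z=j | d_i) over the M training documents
    """
    M = Theta.shape[0]
    p_z = Theta.sum(axis=0) / M          # P(z_j) under the uniform prior P(d_i) = 1/M
    p_w = float(np.dot(Phi[:, w], p_z))  # P(w) = sum_j Phi[j, w] * P(z_j)
    return Phi[:, w] * p_z / p_w         # H(w, z_j) = Phi[j, w] * P(z_j) / P(w)
```

The returned vector sums to one over the K topics, and its maximum entry is the max_prob(w_i) used in the anchor word search below.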
The method of the document reference topic distribution generation module 1034 in FIG. 1 is described below, taking LDA as an example, in conjunction with FIG. 2 through FIG. 2-2.
As shown in FIG. 2, which is a flowchart of the recognition result reference topic distribution generation in the embodiment of the present invention, the flow comprises:
201. Perform topic model inference on the current recognition result to obtain the topic structure of the recognition result;
202. Search for anchor words in the recognition result according to the inference result and the posterior probabilities. The words in the recognition result d are consistent with the topic expressed by the whole document, but the topic distribution of d is mainly determined by the strongly topical words in d. To calculate the reference topic distribution of the recognition result, those topic-determining words must be found; they are called anchor words. Because the recognition result contains misrecognized words, an anchor word must be very likely to be recognized correctly, i.e. its acoustic posterior probability must be large enough, and it must also be strongly topical. The specific search method is shown in FIG. 2-1, a flowchart of the anchor word search method according to an embodiment of the present invention:
2021. Calculate the acoustic posterior probability of each word in the recognition result from the detailed decoding information recorded during speech recognition;
2022. Set a posterior probability threshold, named PPThresh; when the posterior probability of a word is greater than the threshold, add the word to a credible class named CClass; otherwise, discard the word;
2023. Count the number of words in the credible class CClass, named C_num;
2024. Judge whether any word exists in the credible class CClass, i.e. whether C_num is 0;
2025. If no word exists in the credible class CClass, i.e. C_num = 0, change the posterior probability threshold PPThresh and reselect words for the credible class;
2026. If there are words in the credible class CClass, i.e. C_num ≠ 0, calculate Topic_dis(w_i) for each word in CClass and record the maximum of the corresponding H(w_i, z_j), i.e.

max\_prob(w_i) = \max_{j = 1 \ldots K} H(w_i, z_j);

this maximum reflects the strength of the word's topic;
2027. Set the selection proportion of anchor words, named Aratio; the number of anchor words is L = INT(C_num \times Aratio) + 1, where INT() is the integer (rounding) function; then select from the credible class CClass the L words with the largest max_prob(w_i), in descending order, as the anchor words of the current document (a code sketch of these steps follows).
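A compact sketch of steps 2021 through 2027, reusing word_topic_distribution() from above. PPThresh = 0.88 echoes the value discussed with FIG. 2-2, Aratio = 0.3 is purely illustrative, and the step-by-step lowering of the threshold is only one plausible reading of step 2025, which merely says the threshold is changed:

```python
def find_anchor_words(words, posterior, Phi, Theta,
                      pp_thresh=0.88, a_ratio=0.3):
    """Sketch of steps 2021-2027: select anchor words from recognition result d.

    words:     word indices appearing in the recognition result d
    posterior: dict mapping word index -> acoustic posterior probability (step 2021)
    """
    # 2022/2023: credible class CClass and its size C_num
    cclass = [w for w in words if posterior[w] > pp_thresh]
    # 2024/2025: if CClass is empty, change the threshold and reselect
    # (lowering it stepwise is an assumption; the patent only says "change")
    while not cclass and pp_thresh > 0.0:
        pp_thresh -= 0.1
        cclass = [w for w in words if posterior[w] > pp_thresh]
    # 2026: topic strength max_prob(w) = max_j H(w, z_j)
    max_prob = {w: word_topic_distribution(w, Phi, Theta).max() for w in cclass}
    # 2027: L = INT(C_num * Aratio) + 1 anchors, taken in descending max_prob order
    L = int(len(cclass) * a_ratio) + 1
    return sorted(cclass, key=max_prob.get, reverse=True)[:L]
```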
203. After the anchor words in the recognition result have been found in step 202, collect the topic distributions of the anchor words. Assume there are L anchor words, with corresponding sequence A_1, A_2, \ldots, A_L; the topic distribution of anchor word A_i is then Topic_dis(A_i), i = 1 \ldots L.
204. Calculate the reference topic distribution of the recognition result d, named Topic_dis(d), from the topic distributions of the anchor words; it is a K-dimensional vector, as shown in the following formula:

Topic_dis(d) = (L(d, z_1), L(d, z_2), \ldots, L(d, z_K))

where

L(d, z_j) = Com(H(A_1, z_j), H(A_2, z_j), \ldots, H(A_L, z_j));

and Com() is a function that combines the probability values of the anchor words under a given topic, e.g. the arithmetic mean:

L(d, z_j) = \frac{1}{L} \sum_{i=1}^{L} H(A_i, z_j)
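Continuing the sketch with the arithmetic-mean choice of Com(), the reference distribution is simply the mean of the anchor words' topic vectors (same numpy import and naming assumptions as above):

```python
def reference_topic_distribution(anchors, Phi, Theta):
    """Topic_dis(d): arithmetic-mean combination Com() over the anchor words."""
    rows = np.array([word_topic_distribution(a, Phi, Theta) for a in anchors])
    return rows.mean(axis=0)     # K-dimensional vector (L(d, z_1), ..., L(d, z_K))
```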
Thus, the semantic feature extraction module 1035 of FIG. 1 can compare the similarity between the word topic distribution Topic_dis(w_i) and the document reference topic distribution Topic_dis(d) as the semantic confidence feature of each word in the recognition result, i.e.

Sem(w_i) = Similarity(Topic_dis(w_i), Topic_dis(d))

where Sem(w_i) is the semantic confidence feature of word w_i. Many similarity measures Similarity() are possible, for example the symmetric K-L divergence:

Let M1 = Topic_dis(w_i) and M2 = Topic_dis(d).

The K-L divergence from M1 to M2, with M2 as the reference model, is defined as

D_{KL}(M1 \| M2) = \sum_{j=1}^{K} H(w_i, z_j) \log\left( \frac{H(w_i, z_j)}{L(d, z_j)} \right)

To avoid depending on the choice of reference model, the symmetric K-L divergence is used as the similarity measure, so the semantic confidence feature of the word is

Sem(w_i) = \frac{1}{2} \left\{ D_{KL}(M1 \| M2) + D_{KL}(M2 \| M1) \right\}
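A sketch of this final feature under the same assumptions; the small epsilon guarding against zero probabilities is a numerical detail added here, not something the patent specifies:

```python
def semantic_confidence(w, anchors, Phi, Theta, eps=1e-12):
    """Sem(w) = (1/2) * (D_KL(M1||M2) + D_KL(M2||M1)) for a word index w."""
    m1 = word_topic_distribution(w, Phi, Theta) + eps             # Topic_dis(w_i)
    m2 = reference_topic_distribution(anchors, Phi, Theta) + eps  # Topic_dis(d)
    d12 = float(np.sum(m1 * np.log(m1 / m2)))
    d21 = float(np.sum(m2 * np.log(m2 / m1)))
    return 0.5 * (d12 + d21)
```

A smaller Sem(w_i) means the word's topic distribution is closer to the reference distribution of the recognition result, which, under the premise stated earlier, indicates the word is more likely to be correctly recognized.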
FIG. 2-2 shows how labeling precision varies with the anchor word search parameters, taking confidence labeling with the combination of the acoustic posterior probability and the semantic confidence feature of the present invention as an example.
As can be seen from FIG. 2-2, using the anchor word search parameters without an acoustic posterior probability threshold (PPThresh = 0) performs worse than using a threshold (PPThresh = 0.88 in the figure). This confirms that anchor word selection must favor words that are very likely to be recognized correctly, i.e. words whose acoustic posterior probability exceeds the threshold. Moreover, even with the acoustic posterior probability threshold in place, the labeling performance varies considerably with the anchor word selection proportion Aratio, which shows that Aratio must also be chosen carefully. In short, anchor words must be highly likely to be correctly recognized, i.e. have a sufficiently large acoustic posterior probability, and at the same time be strongly topical; only then can high-performance semantic confidence features be extracted.
As shown in fig. 3, an embodiment of the present invention further provides a speech recognition semantic confidence feature extraction apparatus, including:
a topic analysis device 301, configured to perform inference analysis on the recognition result using the topic model to obtain the topic structure of the recognition result; that is, assuming the number of topics is K, i.e. z \in \{z_1, \ldots, z_K\}, it gives the probability of word w under topic j and the probability of topic j under the current recognition result d:

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d);
a posterior probability generating device 302, configured to calculate, by using detailed decoding information recorded in the speech recognition process, an acoustic posterior probability of each word in the recognition result;
a word topic distribution generation device 303, configured to calculate the topic distribution Topic_dis(w_i) of each word from the topic structure obtained by the topic analysis device 301, according to the formula

Topic_dis(w_i) = (H(w_i, z_1), H(w_i, z_2), \ldots, H(w_i, z_K));

where

H(w_i, z_j) = P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{P(w_i)} = \frac{\Phi_j^{(w_i)}\, P(z_j)}{P(w_i)};

P(z_j) = \sum_{i=1}^{M} P(z_j, d_i) = \sum_{i=1}^{M} P(z_j \mid d_i)\, P(d_i) = P(d) \sum_{i=1}^{M} \theta_j^{(d_i)};

(note: the prior probability of each document is taken to be uniform, i.e. P(d_i) = P(d), i = 1 \ldots M)

P(w_i) = \sum_{j=1}^{K} P(w_i, z_j) = \sum_{j=1}^{K} P(w_i \mid z_j)\, P(z_j) = \sum_{j=1}^{K} \Phi_j^{(w_i)}\, P(z_j);
a document reference topic distribution generation device 304, configured to determine anchor words: using the topic structure obtained by the topic analysis device 301 and the acoustic posterior probabilities of the words obtained by the posterior probability generation device 302, it selects from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculates the reference topic distribution of the whole recognition result from the topic distributions of the anchor words. Assuming there are L anchor words, with corresponding sequence A_1, \ldots, A_L, the reference topic distribution of the recognition result d, named Topic_dis(d), a K-dimensional vector, is calculated from the topic distributions of the anchor words according to the formula:

Topic_dis(d) = (L(d, z_1), L(d, z_2), \ldots, L(d, z_K));

where

L(d, z_j) = Com(H(A_1, z_j), H(A_2, z_j), \ldots, H(A_L, z_j));

and Com() is a function that combines the probability values of the anchor words under a given topic;
a semantic feature extraction device 305, configured to compare the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result as the semantic confidence feature of the word, specifically by the formula

Sem(w_i) = Similarity(Topic_dis(w_i), Topic_dis(d))

where Sem(w_i) is the semantic confidence feature of word w_i, and Similarity() is a similarity measure.
The device embodiment has the same technical effects as the method embodiment, which are not repeated here.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, though in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for extracting semantic confidence features of speech recognition, characterized by comprising the following steps:
performing inference on the speech recognition result through a topic model to obtain the topic structure of the recognition result;
calculating the topic distribution of each word using the inference result;
selecting from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculating the reference topic distribution of the recognition result using the topic distributions of the anchor words;
and comparing the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result, the similarity serving as the semantic confidence feature of the word.
2. The method of claim 1, wherein inferring the speech recognition result through a topic model to obtain a topic structure of the recognition result comprises:
assuming the number of topics is K, i.e. z \in \{z_1, \ldots, z_K\}, obtaining through topic model inference the topic structure of the current recognition result d, namely the probability of word w under a given topic j and the probability of topic j under the current recognition result d:

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d).
3. The method of claim 2, wherein calculating the topic distribution of words using the inference result comprises:
using

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d)

to calculate the topic distribution Topic_dis(w_i) of each word, where w_i is a word in the recognition result d and Topic_dis(w_i) is a K-dimensional vector, as shown in the following formula:

Topic_dis(w_i) = (H(w_i, z_1), H(w_i, z_2), \ldots, H(w_i, z_K));

where

H(w_i, z_j) = P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{P(w_i)} = \frac{\Phi_j^{(w_i)}\, P(z_j)}{P(w_i)};

P(z_j) = \sum_{i=1}^{M} P(z_j, d_i) = \sum_{i=1}^{M} P(z_j \mid d_i)\, P(d_i) = P(d) \sum_{i=1}^{M} \theta_j^{(d_i)};

where M is the number of training documents of the topic model and d_i is the i-th training document; the prior probability of each document is taken to be uniform, i.e. P(d_i) = P(d), i = 1 \ldots M; then

P(w_i) = \sum_{j=1}^{K} P(w_i, z_j) = \sum_{j=1}^{K} P(w_i \mid z_j)\, P(z_j) = \sum_{j=1}^{K} \Phi_j^{(w_i)}\, P(z_j).
4. The method of claim 3, wherein selecting from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculating the reference topic distribution of the recognition result using the topic distributions of the anchor words, comprises:
calculating the acoustic posterior probability of each word in the recognition result from the detailed decoding information recorded during speech recognition;
setting a posterior probability threshold, adding a word to the credible class when the posterior probability of the word in the recognition result is greater than the threshold, and discarding the word if its posterior probability is less than the threshold;
counting the number of words in the credible class, named C_num;
judging whether any word exists in the credible class; if no word exists in the credible class, changing the posterior probability threshold and reselecting words for the credible class;
if there are words in the credible class, calculating Topic_dis(w_i) for each word in the credible class and recording the maximum of the corresponding H(w_i, z_j), i.e.

max\_prob(w_i) = \max_{j = 1 \ldots K} H(w_i, z_j),

the maximum reflecting the strength of the word's topic;
setting the selection proportion Aratio of anchor words, the number of anchor words being L = INT(C_num \times Aratio) + 1, where INT() is the integer (rounding) function; and selecting from the credible class the L words with the largest max_prob(w_i), in descending order, as the anchor words of the current recognition result;
counting the topic distributions of the anchor words: assuming there are L anchor words, with corresponding sequence A_1, \ldots, A_L, the topic distribution of anchor word A_i is Topic_dis(A_i), i = 1 \ldots L;
calculating the reference topic distribution of the recognition result d, named Topic_dis(d), a K-dimensional vector, from the topic distributions of the anchor words, as shown in the following formula:

Topic_dis(d) = (L(d, z_1), L(d, z_2), \ldots, L(d, z_K));

where

L(d, z_j) = Com(H(A_1, z_j), H(A_2, z_j), \ldots, H(A_L, z_j));

and Com() is the arithmetic mean of the probability values of the anchor words under the j-th topic.
5. The method of claim 4, wherein using the topic distribution of the words in the recognition result to compare their similarity to a reference topic distribution of the recognition result as semantic confidence features for the words comprises:
comparing the similarity between the word topic distribution Topic_dis(w_i) and the reference topic distribution Topic_dis(d) of the recognition result as the semantic confidence feature of each word in the recognition result, i.e.

Sem(w_i) = Similarity(Topic_dis(w_i), Topic_dis(d))

where Sem(w_i) is the semantic confidence feature of word w_i, and Similarity() is a similarity measure function using the symmetric K-L divergence.
6. A speech recognition semantic confidence feature extraction device, comprising:
a topic analysis device, configured to perform inference analysis on the recognition result using the topic model to obtain the topic structure of the recognition result;
a posterior probability generation device, configured to calculate the acoustic posterior probability of each word in the recognition result using the detailed decoding information recorded during speech recognition;
a word topic distribution generation device, configured to calculate the topic distribution of each word from the topic structure obtained by the topic analysis device;
a document reference topic distribution generation device, configured to determine anchor words: using the topic structure obtained by the topic analysis device and the acoustic posterior probabilities of the words obtained by the posterior probability generation device, it selects from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, and then calculates the reference topic distribution of the recognition result from the topic distributions of the anchor words;
and a semantic feature extraction device, configured to compare the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result as the semantic confidence feature of the word.
7. The apparatus of claim 6, wherein the topic analysis device is configured to perform inference analysis on the recognition result using the topic model to obtain the topic structure of the recognition result; that is, assuming the number of topics is K, i.e. z \in \{z_1, \ldots, z_K\}, it gives the probability of word w under topic j and the probability of topic j under the current recognition result d:

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d).
8. The apparatus of claim 7, wherein the word topic distribution generation device is configured to use

\Phi_j^{(w)} = P(w \mid z = j) \quad \text{and} \quad \theta_j^{(d)} = P(z = j \mid d)

to calculate the topic distribution Topic_dis(w_i) of each word according to the formula Topic_dis(w_i) = (H(w_i, z_1), H(w_i, z_2), \ldots, H(w_i, z_K)); where

H(w_i, z_j) = P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{P(w_i)} = \frac{\Phi_j^{(w_i)}\, P(z_j)}{P(w_i)};

P(z_j) = \sum_{i=1}^{M} P(z_j, d_i) = \sum_{i=1}^{M} P(z_j \mid d_i)\, P(d_i) = P(d) \sum_{i=1}^{M} \theta_j^{(d_i)};

where M is the number of training documents of the topic model and d_i is the i-th training document; the prior probability of each document is taken to be uniform, i.e. P(d_i) = P(d), i = 1 \ldots M; then

P(w_i) = \sum_{j=1}^{K} P(w_i, z_j) = \sum_{j=1}^{K} P(w_i \mid z_j)\, P(z_j) = \sum_{j=1}^{K} \Phi_j^{(w_i)}\, P(z_j).
9. The apparatus of claim 8, wherein the document reference topic distribution generation device is configured to select from the recognition result a certain number of words whose acoustic posterior probability is greater than a threshold and whose topics are strong as anchor words, using the topic structure obtained by the topic analysis device and the acoustic posterior probabilities of the words obtained by the posterior probability generation device, and then to calculate the reference topic distribution of the whole recognition result from the topic distributions of the anchor words; assuming there are L anchor words, with corresponding sequence A_1, \ldots, A_L, the reference topic distribution of the recognition result d, named Topic_dis(d), a K-dimensional vector, is calculated from the topic distributions of the anchor words according to the formula:

Topic_dis(d) = (L(d, z_1), L(d, z_2), \ldots, L(d, z_K));

where

L(d, z_j) = Com(H(A_1, z_j), H(A_2, z_j), \ldots, H(A_L, z_j));

and Com() is the arithmetic mean of the probability values of the anchor words under the j-th topic.
10. The apparatus of claim 9, wherein the semantic feature extraction device is configured to compare the similarity between the topic distribution of each word in the recognition result and the reference topic distribution of the recognition result as the semantic confidence feature of the word, using the formula

Sem(w_i) = Similarity(Topic_dis(w_i), Topic_dis(d))

where Sem(w_i) is the semantic confidence feature of word w_i, and Similarity() is a similarity measure that uses the symmetric K-L divergence.
CN2009100888676A 2009-07-21 2009-07-21 Speech recognition semantic confidence feature extraction method and device Expired - Fee Related CN101609672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100888676A CN101609672B (en) 2009-07-21 2009-07-21 Speech recognition semantic confidence feature extraction method and device


Publications (2)

Publication Number Publication Date
CN101609672A CN101609672A (en) 2009-12-23
CN101609672B true CN101609672B (en) 2011-09-07

Family

ID=41483397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100888676A Expired - Fee Related CN101609672B (en) 2009-07-21 2009-07-21 Speech recognition semantic confidence feature extraction method and device

Country Status (1)

Country Link
CN (1) CN101609672B (en)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894549A (en) * 2010-06-24 2010-11-24 中国科学院声学研究所 Method for fast calculating confidence level in speech recognition application field
CN103177721B (en) * 2011-12-26 2015-08-19 中国电信股份有限公司 Audio recognition method and system
CN103700368B (en) * 2014-01-13 2017-01-18 联想(北京)有限公司 Speech recognition method, speech recognition device and electronic equipment
CN105529028B (en) * 2015-12-09 2019-07-30 百度在线网络技术(北京)有限公司 Speech analysis method and apparatus
CN107195299A (en) * 2016-03-14 2017-09-22 株式会社东芝 Train the method and apparatus and audio recognition method and device of neutral net acoustic model
DE102017213946B4 (en) * 2017-08-10 2022-11-10 Audi Ag Method for processing a recognition result of an automatic online speech recognizer for a mobile terminal
CN112435656B (en) * 2020-12-11 2024-03-01 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
CN115376499B (en) * 2022-08-18 2023-07-28 东莞市乐移电子科技有限公司 Learning monitoring method of intelligent earphone applied to learning field


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1293428A (en) * 2000-11-10 2001-05-02 清华大学 Information check method based on speed recognition
CN1490786A (en) * 2002-10-17 2004-04-21 中国科学院声学研究所 Phonetic recognition confidence evaluating method, system and dictation device therewith
CN101013421A (en) * 2007-02-02 2007-08-08 清华大学 Rule-based automatic analysis method of Chinese basic block
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party

Title
Cox, S. J.; Dasmahapatra, S. High-level Approaches to Confidence Estimation in Speech Recognition. IEEE Transactions on Speech and Audio, 2002, 460-471. *
Inkpen, Diana; Desilets, Alain. Semantic Similarity for Detecting Recognition Errors in Automatic Speech Transcripts. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005, 49-56. *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106062868A (en) * 2014-07-25 2016-10-26 谷歌公司 Providing pre-computed hotword models
CN106062868B (en) * 2014-07-25 2019-10-29 谷歌有限责任公司 The hot word model precalculated is provided

Also Published As

Publication number Publication date
CN101609672A (en) 2009-12-23

Similar Documents

Publication Publication Date Title
Chung et al. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech
CN101609672B (en) Speech recognition semantic confidence feature extraction method and device
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
Ghosh et al. Fracking sarcasm using neural network
CN106328147B (en) Speech recognition method and device
Cummins et al. Multimodal bag-of-words for cross domains sentiment analysis
Levitan et al. Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection.
Bone et al. Intoxicated speech detection: A fusion framework with speaker-normalized hierarchical functionals and GMM supervectors
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
Houjeij et al. A novel approach for emotion classification based on fusion of text and speech
Wang et al. Automatic detection of speaker state: Lexical, prosodic, and phonetic approaches to level-of-interest and intoxication classification
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113901200A (en) Text summarization method and device based on topic model and storage medium
Wataraka Gamage et al. Speech-based continuous emotion prediction by learning perception responses related to salient events: A study based on vocal affect bursts and cross-cultural affect in AVEC 2018
Chou et al. Automatic deception detection using multiple speech and language communicative descriptors in dialogs
Verkholyak et al. A Bimodal Approach for Speech Emotion Recognition using Audio and Text.
Liyanage et al. Augmenting reddit posts to determine wellness dimensions impacting mental health
Ranjith et al. GTSO: Gradient tangent search optimization enabled voice transformer with speech intelligibility for aphasia
Gris et al. Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person
Yue English spoken stress recognition based on natural language processing and endpoint detection algorithm
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
Chen et al. Automatic emphatic information extraction from aligned acoustic data and its application on sentence compression
Bañeras-Roux et al. Hats: An open data set integrating human perception applied to the evaluation of automatic speech recognition metrics
Singhal et al. Estimation of Accuracy in Human Gender Identification and Recall Values Based on Voice Signals Using Different Classifiers
Tang et al. A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110907

Termination date: 20140721

EXPY Termination of patent right or utility model