CN1979638A

CN1979638A - Method for correcting error of voice identification result

Info

Publication number: CN1979638A
Application number: CNA2005101274476A
Authority: CN
Inventors: 王晓瑞; 江杰; 王士进; 丁鹏; 徐波
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2005-12-02
Filing date: 2005-12-02
Publication date: 2007-06-13

Abstract

The invention relates to the technical field of speech recognition, and in particular to a method for correcting speech recognition results by means of an error correction knowledge base. The basic features of the method are: 1. continuous language fragments from a corpus are used as error correction templates, and the corpus is used to build an error correction template library; 2. the error correction template library is indexed, and search techniques are used to find error correction templates quickly; 3. according to the error correction mode, confidence is used to cut the recognition result into short recognition fragments, and the reliable parts of each fragment are submitted to the error correction template search system for fast lookup, yielding error correction template candidates that are highly relevant to the fragment; and 4. an acoustic confusion matrix is used to select, from the candidates, the template closest to the acoustic characteristics of the recognition fragment, which then replaces the fragment.

Description

A method for correcting errors in speech recognition results

Technical field

The present invention relates to the field of speech recognition technology, and in particular to a method for correcting errors in speech recognition results.

Background technology

Most current speech recognition systems adopt an N-gram language model. This model rests on an imperfect independence assumption, namely that the current word depends only on the N-1 words preceding it. Its limitation is that it reasons only from those N-1 preceding words, so recognition results often contain meaningless sentences or fragments.
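For reference, the independence assumption behind the N-gram model is commonly written as the factorization below. This is the standard textbook formulation rather than a formula quoted from the patent; w_t denotes the t-th word and T the sentence length, both introduced here only for illustration.

```latex
P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1}, \dots, w_{t-1})
```

Any dependency that reaches further back than N-1 words is invisible to this product, which is the limitation the background section describes.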

Summary of the invention

The present invention proposes a method for correcting errors in speech recognition results. It uses variable-length error correction templates and corrects the recognition result according to confidence and acoustic confusability. The invention can be used in large-vocabulary continuous speech recognition systems. Its main features are as follows. First, continuous language fragments in a corpus serve as error correction templates, and the corpus is used to build an error correction template library. Second, the error correction template library is indexed, and fast search techniques are used to look up templates quickly. Third, according to the error correction mode, confidence is used to cut the recognition result into short recognition fragments, and the reliable part of each fragment is submitted to the error correction template system for fast lookup, yielding template candidates that are highly relevant to the fragment. Fourth, an acoustic confusion matrix is used to select, from the candidates, the template whose acoustic features are closest to those of the recognition fragment, and this template replaces the fragment.

Technical scheme

A method for correcting errors in speech recognition results comprises the following steps:

1) the recognition system performs recognition and confidence calculation on the input speech, obtaining a recognition result with confidence scores;

2) according to the error correction mode, the recognition result is cut into small recognition fragments based on the confidence levels;

3) each resulting recognition fragment is input to the error correction template search system, which returns a candidate list of error correction templates highly relevant to the fragment;

4) the acoustic confusability between the recognition fragment and each template in the candidate list is calculated, and the template with the highest acoustic similarity is selected; when the similarity between the recognition fragment and this template exceeds a reliability threshold, the template replaces the recognition fragment;

5) the corrected fragments are merged to obtain the corrected recognition result.

The method further comprises the step of performing confidence calculation while performing recognition on the input speech, obtaining a recognition result with confidence scores.

The method further comprises: when cutting the recognition result into small recognition fragments according to confidence, first setting a confidence threshold CM-threshold and the maximum character count max-var-length of the system's error correction templates. A recognition result whose confidence is higher than CM-threshold is considered reliable, and the number of reliable characters in a fragment after cutting must not exceed max-var-length.

The method further comprises the step of dividing the recognition result into blocks, where each run of consecutive units whose confidence is all above or all below CM-threshold forms one module. The recognition result is thus divided into one or more (A, x, B) patterns, where A and B are modules whose confidence is higher than CM-threshold and x is a module whose confidence is lower than CM-threshold; at most one of A and B may be an empty module.

The method further comprises the following step for every low-confidence module x in the recognition result: if the length of A is greater than or equal to max-var-length, the part of A adjacent to x with length max-var-length is taken as sub-A and forms the fragment (sub-A, x) together with x. sub-A is used to search the error correction template library, and the length of sub-A is fixed at max-var-length.

The method further comprises the following step for every low-confidence module x in the recognition result: if the length of A is less than max-var-length, the part sub-B of B adjacent to x is combined with A and x to form the fragment (A, x, sub-B). A and sub-B are used to search the error correction template library. At most one of A and sub-B may be an empty module, the sum of the lengths of A and sub-B must not exceed max-var-length, and the lengths of A and sub-B are fixed.

The method further comprises the step of, after the recognition result has been cut into fragments, quickly searching the error correction template library with the reliable part of each fragment, obtaining one or more error correction templates highly relevant to the recognition fragment.

In the method, the error correction template search system comprises two parts: the first is the construction of the error correction template index, and the second is the search for error correction templates.

In the method, the basic principle of the first part is as follows. All continuous language fragments of 6 to 12 characters in the corpus serve as error correction templates. All error correction templates are first extracted from the corpus, and an index over the error correction template library is then built using an inverted file as the index structure. To reduce the size of the inverted file, the inverted file needs to be compressed.

In the method, the basic principle of the second part is as follows. At query time, the reliable part is first converted into a Boolean query, which is searched quickly in the index. Because speech recognition results are sequential and local in nature, the conversion to a Boolean query must add an ordering requirement on the reliable part and a locality requirement between words.

The method further comprises the step of selecting the optimal template from all results returned by the template search, using the acoustic confusability between the error correction template and the recognition fragment. For the recognition fragment A and each template T_i in the candidate list, the confusability C(A, T_i) between A and T_i is calculated. When the maximum value max C(A, T_i) exceeds a reliability threshold, the corresponding error correction template replaces the recognition fragment; if max C(A, T_i) is below this threshold, the recognition fragment is kept.

In the method, the calculation of the acoustic confusability between an error correction template and a recognition fragment consists of three parts: the first is the collection of statistics on recognition confusions among Chinese pinyin initials and finals, the second is the calculation of the posterior probabilities of these confusions, and the third is the fuzzy whole-string matching of the recognition fragment against the error correction template.

In the method, the basic principle of the first part is as follows. A speech database is recognized, and the confusions among all initials and among all finals are obtained in the following way. Assume that initials and finals are never confused with each other. If the recognition result of a sample is the pinyin string C_1' V_1' C_2' V_2' ... C_m' V_m', this result is dynamically aligned as a whole with the correct string C_1 V_1 C_2 V_2 ... C_n V_n so that the number of matching pinyin units is maximized. This yields a large number of pinyin pairs, namely (C_1', C_1), (V_1', V_1), ..., (C_m', C_n), (V_m', V_n). Counting the occurrences of these pairs gives, for each initial, its total sample count and the number of times it is recognized as each other initial, and likewise for each final.

In the method, the basic principle of the second part is as follows. Based on the statistics from the first part, the probability that each initial is recognized as each other initial is calculated first. The degree to which C_i is confused as C_j is given by:

P(C_{j} | C_{i}) = \frac{\Sigma(C_{i}, C_{j})}{|C_{i}|}

where Σ(C_i, C_j) is the number of times C_i is recognized as C_j and |C_i| is the total sample count of C_i.

When the recognition result is C_i, the posterior probability that the correct result is C_j is:

\tilde{P}(C_{j} | C_{i}) = \frac{P(C_{i} | C_{j}) P(C_{j})}{\sum_{k} P(C_{i} | C_{k}) P(C_{k})}

where

P(C_{j}) = \frac{|C_{j}|}{\sum_{i} |C_{i}|}

and \sum_{i} |C_{i}| is the total sample count of all initials. Finals are handled in the same way as initials.

In the method, the basic principle of the third part is as follows. Let the pinyin string of the recognition fragment A be C_1' V_1' C_2' V_2' ... C_m' V_m', and let the pinyin string of the i-th template T_i in the candidate list be C_1 V_1 C_2 V_2 ... C_n V_n. The acoustic confusability C(A, T_i) of A and T_i is then defined as follows: find an alignment (1, i_1), (2, i_2), ..., (k, i_k), ..., (m, i_m) that maximizes

\tilde{P}(T_{i} | A) = \prod_{k=1}^{m} \tilde{P}(C_{k} | C_{i_{k}}) \tilde{P}(V_{k} | V_{i_{k}})

This maximum value is defined as the acoustic confusability of A and T_i.

In the method, in practical application the logarithm of the posterior probabilities is taken first, and the problem becomes that of maximizing

\log \tilde{P}(T_{i} | A) = \sum_{k=1}^{m} \log \tilde{P}(C_{k} | C_{i_{k}}) + \sum_{k=1}^{m} \log \tilde{P}(V_{k} | V_{i_{k}})

This maximum value is then used as the logarithmic acoustic confusability of A and T_i.

Embodiment

The present invention mainly comprises three modules: first, cutting the recognition result using confidence; second, obtaining the error correction template candidate list; third, calculating the acoustic confusability between the recognition fragment and the error correction templates. These are described in detail below.

Cutting the recognition result using confidence. First, the confidence threshold CM-threshold and the maximum character count max-var-length of the system's error correction templates are set. A recognition result whose confidence is higher than CM-threshold is considered reliable. The recognition result is then cut in the following steps:

1. The recognition result is divided into blocks, where each run of consecutive units whose confidence is all above or all below CM-threshold forms one module. The recognition result is thus divided into one or more (A, x, B) structures, where A and B are modules whose confidence is higher than CM-threshold and x is a module whose confidence is lower than CM-threshold; at most one of A and B may be an empty module.

2. For every low-confidence module x in the recognition result:

a) If the length of A is greater than or equal to max-var-length, the part of A adjacent to x with length max-var-length is taken as sub-A and forms the fragment (sub-A, x) together with x. sub-A is used to search the error correction template library, and the length of sub-A is fixed at max-var-length.

b) If the length of A is less than max-var-length, the part sub-B of B adjacent to x is combined with A and x to form the fragment (A, x, sub-B). A and sub-B are used to search the error correction template library, and at most one of A and sub-B may be an empty module. The sum of the lengths of A and sub-B must not exceed max-var-length, and the lengths of A and sub-B are fixed.
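The cutting procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: lengths are counted in tokens, sub-B is taken to fill whatever budget remains after A, and the names `split_modules` and `build_fragments` are hypothetical.

```python
from itertools import groupby


def split_modules(tokens, confidences, cm_threshold):
    """Step 1: group consecutive tokens into (is_reliable, tokens) modules."""
    flags = [c > cm_threshold for c in confidences]
    modules, pos = [], 0
    for reliable, run in groupby(flags):
        n = len(list(run))
        modules.append((reliable, tokens[pos:pos + n]))
        pos += n
    return modules


def build_fragments(modules, max_var_length):
    """Step 2: for each low-confidence module x, return (left, x, right).

    Assumes max_var_length >= 1. Because modules alternate between reliable
    and unreliable, the neighbours of x are always reliable (or missing).
    """
    fragments = []
    for i, (reliable, x) in enumerate(modules):
        if reliable:
            continue
        a = modules[i - 1][1] if i > 0 else []
        b = modules[i + 1][1] if i + 1 < len(modules) else []
        if len(a) >= max_var_length:
            # Case a): keep only the max_var_length tokens of A next to x.
            fragments.append((a[-max_var_length:], x, []))
        else:
            # Case b): keep all of A; take from B what the budget allows
            # (assumption: sub-B uses the whole remaining budget).
            budget = max_var_length - len(a)
            fragments.append((a, x, b[:budget]))
    return fragments
```

With tokens `a b c d e f g h`, where only `d` and `e` fall below the threshold, a `max_var_length` of 2 yields the fragment (`b c`, `d e`), while a value of 4 yields (`a b c`, `d e`, `f`).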

Obtaining the error correction template candidate list. After the recognition result has been cut into fragments, the reliable part of each fragment is submitted to the error correction template search system, which returns error correction template candidates highly relevant to the fragment. The search system comprises two parts: the first is the construction of the template index, and the second is the fast search of the template library.

The basic principle of the first part is as follows. All continuous language fragments of 6 to 12 characters in the corpus serve as error correction templates. All error correction templates are first extracted from the corpus, and an index over the error correction template library is then built using an inverted file as the index structure. To reduce the size of the inverted file, the inverted file needs to be compressed.
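A small Python sketch of this index-building step is given here for illustration. It assumes the corpus is a list of plain strings and that the index is keyed by single characters; the patent does not say how the inverted file is compressed, so the gap encoding shown is one common choice swapped in as an example. All function names are hypothetical.

```python
from collections import defaultdict


def extract_templates(sentences, min_len=6, max_len=12):
    """Collect every contiguous fragment of min_len..max_len characters."""
    templates = set()
    for s in sentences:
        for n in range(min_len, max_len + 1):
            for start in range(len(s) - n + 1):
                templates.add(s[start:start + n])
    return sorted(templates)


def build_inverted_index(templates):
    """Map each character to its postings: (template_id, position) pairs."""
    index = defaultdict(list)
    for tid, template in enumerate(templates):
        for pos, ch in enumerate(template):
            index[ch].append((tid, pos))
    return index


def gap_encode(sorted_ids):
    """Store differences between sorted ids (assumed compression step).

    Small gaps can then be packed with a variable-length integer code.
    """
    prev, gaps = 0, []
    for i in sorted_ids:
        gaps.append(i - prev)
        prev = i
    return gaps
```

Because template ids in a postings list are ascending, the gaps are small non-negative integers, which is what makes variable-length codes effective on inverted files.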

The basic principle of the second part is as follows. The reliable part of the fragment is first converted into a Boolean query, which is retrieved quickly in the index. Because speech recognition results are sequential and local in nature, the conversion to a Boolean query must add an ordering requirement on the reliable part of the fragment and a locality requirement between words.
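One way to read the ordering and locality requirements is sketched below in Python. The interpretation is an assumption: the query characters must all occur in a template (Boolean AND), in the same order, with consecutive matches at most `max_gap` positions apart. The names `matches`, `search` and `max_gap` are hypothetical, and the index is rebuilt inside `search` only to keep the example self-contained.

```python
def matches(template, query, max_gap):
    """True if the query characters occur in order in the template, with
    consecutive matched positions at most max_gap apart."""
    if not query:
        return False
    # Positions where the query matched so far can end.
    candidates = [i for i, ch in enumerate(template) if ch == query[0]]
    for q in query[1:]:
        candidates = [
            j for j, ch in enumerate(template)
            if ch == q and any(0 < j - i <= max_gap for i in candidates)
        ]
        if not candidates:
            return False
    return bool(candidates)


def search(templates, query, max_gap=3):
    """Boolean AND over a character index, then order/locality filtering."""
    index = {}
    for tid, template in enumerate(templates):
        for ch in set(template):
            index.setdefault(ch, set()).add(tid)
    ids = None
    for ch in query:
        postings = index.get(ch, set())
        ids = postings if ids is None else ids & postings
    return [templates[t] for t in sorted(ids or ())
            if matches(templates[t], query, max_gap)]
```

The set intersection prunes most templates cheaply, so the more expensive positional check only runs on templates that contain every reliable character.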

Calculating the acoustic confusability. The template search often returns one or more candidates, in which case the acoustic confusability between each error correction template and the recognition fragment is used to select the optimal template. For the recognition fragment A and each template T_i in the candidate list, the confusability C(A, T_i) between A and T_i is calculated. When the maximum value max C(A, T_i) exceeds a reliability threshold, the corresponding error correction template replaces the recognition fragment; if max C(A, T_i) is below this threshold, the recognition fragment is kept.
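The selection rule can be sketched in a few lines of Python. The scoring function is passed in as a parameter because its definition comes later in the text; `correct_fragment` and its parameter names are hypothetical, and the behaviour for an empty candidate list (keep the fragment) is an assumption.

```python
def correct_fragment(fragment, candidates, confusability, threshold):
    """Replace the fragment by the best-scoring template if it is reliable.

    confusability(fragment, template) returns a score where higher means
    acoustically closer; the fragment is kept when no candidate exceeds
    the threshold or when there are no candidates at all.
    """
    if not candidates:
        return fragment
    best = max(candidates, key=lambda t: confusability(fragment, t))
    if confusability(fragment, best) > threshold:
        return best
    return fragment
```

Keeping the original fragment as the fallback means a weak or empty candidate list can never make the recognition result worse than it already was.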

The confusability calculation consists of three parts: the first is the collection of statistics on recognition confusions among Chinese pinyin initials and finals, the second is the calculation of the posterior probabilities of these confusions, and the third is the fuzzy whole-string matching of the recognition fragment against the error correction template.

The basic principle of the first part is as follows. A speech database is recognized, and the confusions among all initials and among all finals are obtained in the following way. Assume that initials and finals are never confused with each other. If the recognition result of a sample is the pinyin string C_1' V_1' C_2' V_2' ... C_m' V_m', this result is dynamically aligned as a whole with the correct string C_1 V_1 C_2 V_2 ... C_n V_n so that the number of matching pinyin units is maximized. This yields a large number of pinyin pairs, namely (C_1', C_1), (V_1', V_1), ..., (C_m', C_n), (V_m', V_n). Counting the occurrences of these pairs gives, for each initial, its total sample count and the number of times it is recognized as each other initial, and likewise for each final.
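A possible Python sketch of this counting step follows. It assumes each syllable is an (initial, final) tuple and uses a standard dynamic-programming alignment that maximizes the number of matching units; the patent does not spell out the alignment details, so the scoring and the names `align` and `count_confusions` are illustrative assumptions. Counts are keyed as (correct unit, recognized unit).

```python
from collections import Counter


def _same(a, b):
    """Number of matching units (initial and final) between two syllables."""
    return (a[0] == b[0]) + (a[1] == b[1])


def align(rec, ref):
    """Return (rec_index, ref_index) pairs maximizing matching units."""
    m, n = len(rec), len(ref)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(score[i - 1][j - 1] + _same(rec[i - 1], ref[j - 1]),
                              score[i - 1][j], score[i][j - 1])
    pairs, i, j = [], m, n
    while i > 0 and j > 0:  # trace one optimal path back to the origin
        if score[i][j] == score[i - 1][j - 1] + _same(rec[i - 1], ref[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def count_confusions(samples):
    """samples: iterable of (recognized, correct) syllable sequences."""
    initials, finals = Counter(), Counter()
    for rec, ref in samples:
        for i, j in align(rec, ref):
            initials[(ref[j][0], rec[i][0])] += 1
            finals[(ref[j][1], rec[i][1])] += 1
    return initials, finals
```

Initials and finals are counted in separate tables, which reflects the stated assumption that an initial is never confused with a final.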

The basic principle of the second part is as follows. Based on the statistics from the first part, the probability that each initial is recognized as each other initial is calculated first. The degree to which C_i is confused as C_j is given by:

P(C_{j} | C_{i}) = \frac{\Sigma(C_{i}, C_{j})}{|C_{i}|}

where Σ(C_i, C_j) denotes the number of times C_i is recognized as C_j and |C_i| denotes the total sample count of C_i.

When the recognition result is C_i, the posterior probability that the correct result is C_j is:

\tilde{P}(C_{j} | C_{i}) = \frac{P(C_{i} | C_{j}) P(C_{j})}{\sum_{k} P(C_{i} | C_{k}) P(C_{k})}

where

P(C_{j}) = \frac{|C_{j}|}{\sum_{i} |C_{i}|}

and \sum_{i} |C_{i}| denotes the total sample count of all initials.
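The two formulas can be checked with a short Python sketch. It assumes the confusion counts are stored in a dictionary keyed by (correct unit, recognized unit), which is a layout chosen here for illustration; the function name `posteriors` is hypothetical.

```python
from collections import defaultdict


def posteriors(counts):
    """counts[(true, recognized)] -> posterior P~(true | recognized).

    Follows the text: likelihood P(recognized | true) from relative counts,
    prior P(true) from each unit's share of all samples, then Bayes' rule.
    """
    totals = defaultdict(int)
    for (true, _), n in counts.items():
        totals[true] += n
    grand_total = sum(totals.values())

    likelihood = {pair: n / totals[pair[0]] for pair, n in counts.items()}
    prior = {unit: n / grand_total for unit, n in totals.items()}

    evidence = defaultdict(float)  # denominator: sum over all true units k
    for (true, rec), p in likelihood.items():
        evidence[rec] += p * prior[true]

    return {(true, rec): p * prior[true] / evidence[rec]
            for (true, rec), p in likelihood.items()}
```

For example, if z is recognized as z 8 times and as zh 2 times, and zh is recognized as zh 9 times and as z once, then the posterior that the correct initial is z when zh was recognized is 2/11.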

The basic principle of the third part is as follows. Let the pinyin string of the recognition fragment A be C_1' V_1' C_2' V_2' ... C_m' V_m', and let the pinyin string of the i-th template T_i in the candidate list be C_1 V_1 C_2 V_2 ... C_n V_n. The acoustic confusability C(A, T_i) of A and T_i is then defined as follows: find an alignment (1, i_1), (2, i_2), ..., (k, i_k), ..., (m, i_m) that maximizes

\tilde{P}(T_{i} | A) = \prod_{k=1}^{m} \tilde{P}(C_{k} | C_{i_{k}}) \tilde{P}(V_{k} | V_{i_{k}})    (1)

In practical application, the logarithm of the posterior probabilities is taken first, and the problem becomes that of maximizing

\log \tilde{P}(T_{i} | A) = \sum_{k=1}^{m} \log \tilde{P}(C_{k} | C_{i_{k}}) + \sum_{k=1}^{m} \log \tilde{P}(V_{k} | V_{i_{k}})
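The maximization over alignments can be sketched as a dynamic program in Python. Several points are assumptions made for illustration and are not stated in the patent: the alignment is taken to be strictly increasing (each recognized syllable maps to a distinct template syllable, in order), unseen pairs receive a fixed floor log-probability, and the lookup tables are keyed by (template unit, recognized unit) with the log posterior that the template unit is correct given the recognized one. The function name and parameters are hypothetical.

```python
import math


def log_confusability(rec, tmpl, log_post_initial, log_post_final, floor=-20.0):
    """Best log score over strictly increasing alignments of rec into tmpl.

    rec and tmpl are lists of (initial, final) tuples. Returns -inf when
    the template has fewer syllables than the recognized fragment.
    """
    m, n = len(rec), len(tmpl)
    # f[k][j]: best score aligning the first k recognized syllables
    # within the first j template syllables.
    f = [[0.0] * (n + 1)] + [[-math.inf] * (n + 1) for _ in range(m)]
    for k in range(1, m + 1):
        for j in range(1, n + 1):
            local = (log_post_initial.get((tmpl[j - 1][0], rec[k - 1][0]), floor)
                     + log_post_final.get((tmpl[j - 1][1], rec[k - 1][1]), floor))
            # Either skip template syllable j, or align syllable k to it.
            f[k][j] = max(f[k][j - 1], f[k - 1][j - 1] + local)
    return f[m][n]
```

Working in the log domain turns the product into a sum, which avoids numerical underflow for long fragments and lets the alignment be found with an ordinary additive dynamic program.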

Claims

1. A method for correcting errors in speech recognition results, comprising the following steps:

1) the recognition system performs recognition and confidence calculation on the input speech, obtaining a recognition result with confidence scores;

2. The method for correcting errors in speech recognition results according to claim 1, characterized in that it further comprises the step of performing confidence calculation while performing recognition on the input speech, obtaining a recognition result with confidence scores.

3. The method for correcting errors in speech recognition results according to claim 1, characterized in that it further comprises: when cutting the recognition result into small recognition fragments according to confidence, first setting a confidence threshold CM-threshold and the maximum character count max-var-length of the system's error correction templates, wherein a recognition result whose confidence is higher than CM-threshold is considered reliable, and the number of reliable characters in a fragment after cutting must not exceed max-var-length.

4. The method for correcting errors in speech recognition results according to claim 1 or 3, characterized in that it further comprises the step of dividing the recognition result into blocks, where each run of consecutive units whose confidence is all above or all below CM-threshold forms one module, so that the recognition result is divided into one or more (A, x, B) patterns, wherein A and B are modules whose confidence is higher than CM-threshold, x is a module whose confidence is lower than CM-threshold, and at most one of A and B may be an empty module.

5. The method for correcting errors in speech recognition results according to claim 1, 3 or 4, characterized in that it further comprises the following step for every low-confidence module x in the recognition result: if the length of A is greater than or equal to max-var-length, the part of A adjacent to x with length max-var-length is taken as sub-A and forms the fragment (sub-A, x) together with x, wherein sub-A is used to search the error correction template library and the length of sub-A is fixed at max-var-length.

6. The method for correcting errors in speech recognition results according to claim 1, 3 or 4, characterized in that it further comprises the following step for every low-confidence module x in the recognition result: if the length of A is less than max-var-length, the part sub-B of B adjacent to x is combined with A and x to form the fragment (A, x, sub-B), wherein A and sub-B are used to search the error correction template library, at most one of A and sub-B may be an empty module, the sum of the lengths of A and sub-B must not exceed max-var-length, and the lengths of A and sub-B are fixed.

7. The method for correcting errors in speech recognition results according to claim 1, 5 or 6, characterized in that it further comprises the step of, after the recognition result has been cut into fragments, quickly searching the error correction template library with the reliable part of each fragment, obtaining one or more error correction templates highly relevant to the recognition fragment.

8. The method for correcting errors in speech recognition results according to claim 1 or 7, characterized in that the error correction template search system comprises two parts: the first is the construction of the error correction template index, and the second is the search for error correction templates.

9. The method for correcting errors in speech recognition results according to claim 1 or 8, characterized in that the basic principle of the first part is as follows: all continuous language fragments of 6 to 12 characters in the corpus serve as error correction templates; all error correction templates are first extracted from the corpus, and an index over the error correction template library is then built using an inverted file as the index structure; to reduce the size of the inverted file, the inverted file needs to be compressed.

10. The method for correcting errors in speech recognition results according to claim 1 or 8, characterized in that the basic principle of the second part is as follows: at query time, the reliable part is first converted into a Boolean query, which is searched quickly in the index; because speech recognition results are sequential and local in nature, the conversion to a Boolean query must add an ordering requirement on the reliable part and a locality requirement between words.

11. The method for correcting errors in speech recognition results according to claim 1 or 7, characterized in that it further comprises the step of selecting the optimal template from all results returned by the template search, using the acoustic confusability between the error correction template and the recognition fragment: for the recognition fragment A and each template T_i in the candidate list, the confusability C(A, T_i) between A and T_i is calculated; when the maximum value max C(A, T_i) exceeds a reliability threshold, the corresponding error correction template replaces the recognition fragment; if max C(A, T_i) is below this threshold, the recognition fragment is kept.

12. The method for correcting errors in speech recognition results according to claim 1 or 11, characterized in that the calculation of the acoustic confusability between an error correction template and a recognition fragment consists of three parts: the first is the collection of statistics on recognition confusions among Chinese pinyin initials and finals, the second is the calculation of the posterior probabilities of these confusions, and the third is the fuzzy whole-string matching of the recognition fragment against the error correction template.

13. The method for correcting errors in speech recognition results according to claim 1 or 12, characterized in that the basic principle of the first part is as follows: a speech database is recognized, and the confusions among all initials and among all finals are obtained in the following way: assume that initials and finals are never confused with each other; if the recognition result of a sample is the pinyin string C_1' V_1' C_2' V_2' ... C_m' V_m', this result is dynamically aligned as a whole with the correct string C_1 V_1 C_2 V_2 ... C_n V_n so that the number of matching pinyin units is maximized, which yields a large number of pinyin pairs, namely (C_1', C_1), (V_1', V_1), ..., (C_m', C_n), (V_m', V_n); counting the occurrences of these pairs gives, for each initial, its total sample count and the number of times it is recognized as each other initial, and likewise for each final.

14. The method for correcting errors in speech recognition results according to claim 1, 12 or 13, characterized in that the basic principle of the second part is as follows: based on the statistics from the first part, the probability that each initial is recognized as each other initial is calculated first, the degree to which C_i is confused as C_j being given by:

P(C_{j} | C_{i}) = \frac{\Sigma(C_{i}, C_{j})}{|C_{i}|}

where Σ(C_i, C_j) is the number of times C_i is recognized as C_j and |C_i| is the total sample count of C_i; when the recognition result is C_i, the posterior probability that the correct result is C_j is:

\tilde{P}(C_{j} | C_{i}) = \frac{P(C_{i} | C_{j}) P(C_{j})}{\sum_{k} P(C_{i} | C_{k}) P(C_{k})}

where

P(C_{j}) = \frac{|C_{j}|}{\sum_{i} |C_{i}|}

15. The method for correcting errors in speech recognition results according to claim 1 or 12, characterized in that the basic principle of the third part is as follows: let the pinyin string of the recognition fragment A be C_1' V_1' C_2' V_2' ... C_m' V_m', and let the pinyin string of the i-th template T_i in the candidate list be C_1 V_1 C_2 V_2 ... C_n V_n; the acoustic confusability C(A, T_i) of A and T_i is then defined as follows: find an alignment (1, i_1), (2, i_2), ..., (k, i_k), ..., (m, i_m) that maximizes

\tilde{P}(T_{i} | A) = \prod_{k=1}^{m} \tilde{P}(C_{k} | C_{i_{k}}) \tilde{P}(V_{k} | V_{i_{k}})

16. The method for correcting errors in speech recognition results according to claim 1, 12 or 15, characterized in that, in practical application, the logarithm of the posterior probabilities is taken first, and the problem becomes that of maximizing

\log \tilde{P}(T_{i} | A) = \sum_{k=1}^{m} \log \tilde{P}(C_{k} | C_{i_{k}}) + \sum_{k=1}^{m} \log \tilde{P}(V_{k} | V_{i_{k}})