WO2012001458A1 - Voice-tag method and apparatus based on confidence score - Google Patents
- Publication number
- WO2012001458A1 (PCT/IB2010/052954)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- pronunciation
- tag
- confidence score
- tags
- recognition
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
Definitions
- the present invention relates to information processing technology, specifically to a voice-tag method and apparatus based on confidence score.
- the voice-tag technology is an application of speech recognition technology, which is widely used especially in embedded speech recognition systems.
- the working process of a voice-tag technology based system is as follows: firstly, the voice registration process is performed, that is, the user inputs a registration speech and the system converts the registration speech into a tag which represents the pronunciation of the speech; then, the speech recognition process is performed, that is, when the user inputs a testing speech, the system performs recognition on the testing speech based on its recognition network consisting of voice tag items to determine the content of the testing speech.
- the recognition network of a voice-tag system consists of not only the voice tag items of registered speeches but also other items whose pronunciations are decided by a dictionary or a grapheme-to-phoneme (G2P) converting module, which can be called dictionary items.
- G2P grapheme-to-phoneme
- the original voice-tag technology is usually implemented based on a template-matching framework in which, in the registration process, one or more templates are extracted from a registration speech as the tags of the registration speech; in the recognition process, the Dynamic Time Warping (DTW) algorithm is applied between the testing speech and the template tags to perform matching.
- DTW Dynamic Time Warping
- HMM Hidden Markov Model
- phoneme sequences are obtained by performing phoneme recognition on the registration speeches.
- the advantages of phoneme sequence tags are as follows: firstly, a phoneme sequence tag occupies less memory space than a template tag; secondly, phoneme sequence tag items are easily combined with dictionary items to form new items. These advantages of phoneme sequence tags are very helpful in enlarging the number of items provided by a recognition network.
- phoneme sequence tags also have shortcomings: firstly, under the current phoneme recognition capability, phoneme recognition errors are unavoidable, with the result that a phoneme sequence tag may not correctly represent the pronunciation of a registration speech, thereby causing recognition errors; secondly, a mismatch between registration speech and testing speech may exist, which will also cause recognition errors.
- the voice recognition system may give an incorrect recognition result for the registration speech "(wang ming)", for example the Initial and Final sequence "w an m ing", thereby the incorrect sequence "w an m ing" will be added into the recognition network as the pronunciation tag of the registration speech
- when the testing speech is also "(wang ming)", if the system determines that the testing speech is nearest to the sequence "w an m ing" in the recognition network, then the recognition result will be correct; however, since the system may determine that the testing speech is nearest to another sequence in the recognition network, an incorrect recognition result may be obtained.
- a voice tag item corresponding to the registration speech is constituted by a plurality of pronunciation tags based on different phoneme sequences. Specifically, when performing phoneme recognition on the registration speech, the N best phoneme sequence recognition results or phoneme lattice recognition result are obtained as the pronunciation tags of the registration speech.
- the above sequences are combined into a voice tag item corresponding to the registration speech "(wang ming)" and added into the recognition network. Therefore, in the recognition process, as long as the recognition network determines that a testing speech is nearest to any one of the above three sequences, the match between the testing speech and the registration speech "(wang ming)" can be carried out. Thus, the recognition rate can be improved.
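As an illustrative, hedged sketch of the multi-pronunciation registration just described, the fragment below groups several N-best phoneme hypotheses into one voice tag item and treats a match against any of them as a match against the item. All names and sequences are invented for illustration, and an exact-match lookup stands in for the real recognition search.

```python
# Sketch only: exact matching stands in for acoustic search, and all
# item names and phoneme sequences are invented for illustration.
recognition_network = {}  # voice tag item name -> list of pronunciation sequences

def register(item_name, n_best_sequences):
    """Multi-pronunciation registration: add all N-best phoneme
    hypotheses of one registration speech as a single voice tag item."""
    recognition_network[item_name] = list(n_best_sequences)

def recognize(testing_sequence):
    """A testing speech matches a voice tag item if it is nearest to ANY
    of the item's pronunciations (here: an exact match)."""
    for name, sequences in recognition_network.items():
        if testing_sequence in sequences:
            return name
    return None

# Register one speech with three competing phoneme hypotheses.
register("wang ming", ["w ang m ing", "h uang m ing", "w ang m in"])
```

Because every hypothesis is kept, a testing speech that happens to match the second- or third-best sequence still resolves to the same item, which is exactly how the recognition rate is improved.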
- the multi-pronunciation registration also has drawbacks: whereas in the single-pronunciation registration one phoneme sequence is added into the recognition network for a registration speech, in the multi-pronunciation registration a plurality of phoneme sequences are added, so the multi-pronunciation registration will increase the scale of the recognition network. Further, constituting a voice tag item by using a plurality of pronunciation sequences will also increase the confusion of the recognition network, and will especially drop the recognition performance for dictionary items in the voice-tag system.
- the present invention is proposed to resolve the above problem in the prior art, the object of which is to provide a voice-tag method and apparatus based on confidence score, in order to optimize voice tags based on confidence score in the multi-pronunciation registration technology and thereby reduce the confusion of the recognition network consisting of voice tags.
- a voice-tag method based on confidence score comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
- a voice-tag method based on confidence score comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; determining a confidence score based weight for each of the plurality of pronunciation tags; creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
- a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags; a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
- a voice-tag apparatus based on confidence score comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags; a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags;
- a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
- Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention;
- Fig.2 depicts an example of a phoneme lattice of a registration speech;
- Fig.3 depicts a flowchart of the voice-tag method based on confidence score according to the second embodiment of the invention;
- Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention;
- Fig.5 depicts a block diagram of the voice-tag apparatus based on confidence score according to the fourth embodiment of the invention.
- Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention.
- the confidence score is used as the basis of selection of pronunciation tags for a registration speech.
- the method performs phoneme recognition on a registration speech input by a user, to obtain a plurality of pronunciation tags of the registration speech.
- the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
- a phoneme lattice is a multi-pronunciation representation generated by merging the identical parts of the plurality of phoneme sequences that represent the pronunciations of the speech.
- a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, is employed to perform phoneme recognition to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
- any phoneme recognition system or method presently known or knowable in the future may be employed, not limited to the above commonly used phoneme recognition system in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm; there is no special limitation on this in the present invention.
- a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech.
- a confidence score is calculated for a single phoneme on each of arcs in the phoneme lattice.
- any method presently known or future knowable for calculating a confidence score for a phoneme sequence or a single phoneme, for example the posterior-probability based confidence score calculating method or the anti-model based confidence score calculating method, may be adopted.
- At step 115, at least one best pronunciation tag is selected from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
- the pronunciation tag with the highest confidence score is selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
- the phoneme sequence with the highest confidence score is selected from the plurality of best phoneme sequences as the best pronunciation tag.
- in the case that the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the path whose arcs carry the phonemes with the highest confidence scores is reserved while the other arcs are removed, thereby constructing the best pronunciation tag of the registration speech from the reserved path.
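The path-reservation step described above can be sketched as a simple dynamic program over the lattice arcs. This is an assumption-laden illustration (nodes numbered in topological order, summed confidence as the path score), not the patent's actual implementation; the lattice data is invented.

```python
# Hedged sketch: reserve the highest-confidence path in a phoneme lattice.
# Assumption: the lattice is a DAG whose nodes are numbered in topological
# order; each arc is (from_node, to_node, phoneme, confidence).

def best_lattice_path(arcs, start, end):
    """For each node, keep the incoming path with the highest summed
    confidence; read the winning phoneme sequence off at `end`."""
    best = {start: (0.0, [])}  # node -> (best score so far, phoneme path)
    for u, v, phoneme, conf in sorted(arcs):  # topological by node number
        if u in best:
            score = best[u][0] + conf
            if v not in best or score > best[v][0]:
                best[v] = (score, best[u][1] + [phoneme])
    return best[end][1]

# Illustrative lattice with competing first and last phonemes.
arcs = [
    (0, 1, "w", 0.9), (0, 1, "h", 0.4),
    (1, 2, "ang", 0.8),
    (2, 3, "m", 0.95),
    (3, 4, "ing", 0.7), (3, 4, "in", 0.6),
]
```

Reserving only this path collapses the lattice back to a single pronunciation tag, which is what keeps the recognition network small.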
- the pronunciation tags whose confidence scores are higher than a preset confidence threshold are selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
- the phoneme sequences whose confidence scores are higher than the preset confidence threshold are selected from the plurality of best phoneme sequences. For example, in the case of the above three sequences 1 ~ 3 of the registration speech "(wang ming)", if the confidence threshold is set to 65, then the sequences 1 and 3, whose confidence scores are higher than the confidence threshold, will be selected from the three sequences as the best pronunciation tags of the registration speech "(wang ming)".
- in the case that the plurality of pronunciation tags are the phoneme lattice of the registration speech, the arcs whose phonemes have confidence scores lower than the preset confidence threshold are removed from the phoneme lattice, thereby constructing the best pronunciation tags of the registration speech from the reserved arcs.
- the above confidence threshold may be decided according to the experience of developers. Specifically, for example, firstly, a large amount of testing data is prepared, then the phoneme recognition system used at step 105 is applied to perform phoneme recognition on the testing data, and further confidence scores are calculated for the phoneme recognition results, and then a suitable confidence threshold may be set with reference to the confidence scores of high quality recognition results in order to ensure that the high quality recognition results can be selected with the confidence threshold.
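The threshold-based selection of steps 110-115 might look like the following sketch. The sequences, scores, the fallback to the single best tag, and the threshold of 65 (echoing the "wang ming" example above) are illustrative assumptions, not values from the patent.

```python
# Sketch of threshold-based pronunciation tag selection. The fallback to
# the single highest-scoring tag when none pass is an added assumption,
# so that every registration speech keeps at least one pronunciation.

def select_best_tags(tags, threshold):
    """Keep (sequence, confidence) pairs whose confidence exceeds the
    threshold; otherwise keep only the single best-scoring tag."""
    kept = [(seq, score) for seq, score in tags if score > threshold]
    if not kept:
        kept = [max(tags, key=lambda t: t[1])]
    return kept

# Three N-best hypotheses for one registration speech (made-up scores).
tags = [("w ang m ing", 70.0), ("h uang m ing", 60.0), ("w ang m in", 68.0)]
best = select_best_tags(tags, threshold=65.0)  # sequences 1 and 3 survive
```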
- a voice tag item corresponding to the registration speech is created based on the at least one best pronunciation tag to add into a recognition network.
- the recognition can be performed on the testing speech by using the recognition network. Since the creation and addition of a voice tag item are existing knowledge in the art, the detailed description thereof is omitted.
- the above is a description of the voice-tag method based on confidence score according to the first embodiment of the present invention.
- the voice tags can be optimized and the negative effects of multi-pronunciation registration on application of voice tags can be reduced.
- the scale of the recognition network consisting of voice tags can be decreased, the confusion of the recognition network can be reduced, and the recognition performance of voice tags, especially of dictionary items, can be improved.
- since the method of the present embodiment still keeps the advantages of the multi-pronunciation registration to some degree, it can reduce the negative effect due to phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
- the voice-tag method based on confidence score according to the second embodiment of the present invention will be described in combination with Fig.3.
- the confidence score is used to combine a plurality of pronunciation tags of a registration speech.
- at step 305, the method performs phoneme recognition on a registration speech inputted by a user, to obtain a plurality of pronunciation tags of the registration speech. Since this step is the same as the above step 105 in Fig.1, the detailed description thereof is omitted.
- at step 310, a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech. Since this step is the same as the above step 110 in Fig.1, the detailed description thereof is omitted.
- a confidence score based weight is determined for each of the pronunciation tags of the registration speech.
- the confidence score based weight is calculated for each of the plurality of pronunciation tags in accordance with the following equation (1):
- weight_i = confidence score_i / (confidence score_1 + confidence score_2 + ... + confidence score_n)    (1)
- the weight_i denotes the confidence score based weight of the i-th pronunciation tag
- the confidence score_1 denotes the confidence score of the first pronunciation tag
- the confidence score_2 denotes the confidence score of the second pronunciation tag, and so on
- the confidence score_n denotes the confidence score of the n-th pronunciation tag
- n denotes the number of the plurality of the pronunciation tags.
- the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of this pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags.
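Equation (1) can be expressed directly in code; the confidence scores below are illustrative.

```python
# Equation (1): each pronunciation tag's weight is its confidence score
# divided by the sum of the confidence scores of all n tags, so the
# weights of one registration speech's tags sum to 1.

def confidence_weights(scores):
    total = sum(scores)
    return [score / total for score in scores]

weights = confidence_weights([70.0, 60.0, 68.0])  # made-up scores
```

By construction, the tag with the highest confidence score receives the largest weight, matching the statement that higher-confidence tags are weighted more heavily.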
- each of the plurality of pronunciation tags of the registration speech is defined as a component of the voice tag of the registration speech by using the confidence score based weight.
- a voice tag item corresponding to the registration speech is created based on the plurality of pronunciation tags of the registration speech to add into a recognition network and meanwhile the confidence score based weight of each of the plurality of pronunciation tags is recorded.
- the voice tag item corresponding to the registration speech may be created directly based on the plurality of pronunciation tags obtained for the registration speech at step 305, or be created based on at least one best pronunciation tag which is selected from the plurality of pronunciation tags on the basis of the confidence score of each of the plurality of pronunciation tags like step 115 in the first embodiment.
- the foregoing detailed description about step 115 may be referred to, and the detailed description of this step is accordingly omitted.
- at step 325, when a user inputs a testing speech, recognition is performed on the testing speech by using the recognition network to obtain a plurality of best recognition result candidates of the testing speech.
- the recognition network obtains the nearest pronunciation sequence "w u m ing" and a similar sequence, as well as the three sequences in the voice tag item corresponding to the registration speech "(wang ming)", and finally outputs the following recognition results arrayed in the descending order of acoustic score for the testing speech:
- the plurality of recognition result candidates belonging to a same voice tag item are combined with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
- the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates of the testing speech are combined into one recognition result candidate, and a weighted sum of the acoustic scores of the plurality of recognition result candidates belonging to a same voice tag item is calculated on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
- the recognition result candidates 1, 4 and 5 are combined into one recognition result candidate, which corresponds to the voice tag item of the registration speech "(wang ming)"
- since the recognition result candidates 1, 4 and 5 belong to one voice tag item and correspond to the registration speech "(wang ming)" before combination, even after they are combined, the obtained combined recognition result candidate still corresponds to the registration speech "(wang ming)".
- the recognition result candidate with the highest acoustic score is selected from the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
- if the recognition result candidates 2 and 3 of the testing speech "(wu ming)" also belong to a same voice tag item, then the recognition result candidates 2 and 3 will also be combined on the basis of the confidence score based weights. Further, if the combined recognition result of the candidates 2 and 3 still has the highest acoustic score, then it will be selected, whereby the voice tag item to which the candidates 2 and 3 belong will become the one matching the testing speech "(wu ming)"; thus the correct content of the testing speech "(wu ming)" can be recognized based on this voice tag item.
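A hedged sketch of the combination of steps 325-330: candidates belonging to the same voice tag item are merged, their acoustic scores summed with the confidence score based weights, and the item with the highest combined score wins. The item names, weights, and acoustic scores below are invented, arranged so that the "wu ming" item wins after combination as in the example above.

```python
# Sketch only: all weights and acoustic scores are illustrative.

def combine_candidates(candidates, tag_weights):
    """Merge recognition result candidates belonging to the same voice
    tag item; the merged acoustic score is the weighted sum of the member
    candidates' scores, weighted by the confidence score based weight of
    the pronunciation tag each candidate matched. Returns the winning
    (item, combined score) pair."""
    combined = {}
    for item, tag_index, acoustic_score in candidates:
        weight = tag_weights[item][tag_index]
        combined[item] = combined.get(item, 0.0) + weight * acoustic_score
    return max(combined.items(), key=lambda kv: kv[1])

# Confidence score based weights recorded per voice tag item.
tag_weights = {"wang ming": [0.35, 0.31, 0.34], "wu ming": [0.6, 0.4]}
# Best recognition result candidates as (item, tag index, acoustic score).
candidates = [
    ("wang ming", 0, 80.0),  # candidate 1
    ("wu ming", 0, 78.0),    # candidate 2
    ("wu ming", 1, 77.0),    # candidate 3
    ("wang ming", 1, 60.0),  # candidate 4
    ("wang ming", 2, 55.0),  # candidate 5
]
```

Here the single highest-scoring raw candidate belongs to "wang ming", yet after weighted combination "wu ming" has the highest combined score, illustrating how combination can recover the correct item.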
- the above is a description of the voice-tag method based on confidence score according to the second embodiment of the present invention.
- the negative effects of multi-pronunciation registration on application of voice tags can be reduced.
- the confusion of the recognition network consisting of voice tags can be reduced, and the recognition performance of voice tags, especially of dictionary items, can be improved.
- since the method of the present embodiment still keeps the advantages of the multi-pronunciation registration, it can reduce the negative effect due to phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
- the present invention provides a voice-tag apparatus based on confidence score which will be described in detail below in conjunction with drawings.
- Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention.
- the voice-tag apparatus 40 based on confidence score of the present embodiment comprises: phoneme recognition unit 41, confidence score calculating unit 42, pronunciation tag selecting unit 43, voice tag creating unit 44, testing speech recognizing unit 45 and recognition network 46.
- the phoneme recognition unit 41 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
- the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
- the phoneme recognition unit 41 is implemented based on a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and the phoneme recognition unit 41 performs phoneme recognition on the registration speech inputted by the user to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
- the phoneme recognition unit 41 may be implemented with any phoneme recognition system or method presently known or future knowable, there is no limitation on this in the present invention.
- the confidence score calculating unit 42 is configured to calculate a confidence score for each of the plurality of pronunciation tags.
- the confidence score calculating unit 42 calculates a confidence score for each of the phoneme sequences. In addition, in the case that the plurality of pronunciation tags of the registration speech are phoneme lattice, the confidence score calculating unit 42 calculates a confidence score for a single phoneme on each of arcs in the phoneme lattice.
- the confidence score calculating unit 42 may be implemented based on any method presently known or future knowable for calculating a confidence score for a phoneme sequence or a single phoneme, for example, the post-probability based confidence score calculating method or the anti-model based confidence score calculating method.
- the pronunciation tag selecting unit 43 is configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
- the pronunciation tag selecting unit 43 selects the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
- the pronunciation tag selecting unit 43 selects the pronunciation tags whose confidence scores are higher than a preset confidence threshold from the plurality of pronunciation tags as the at least one best pronunciation tag.
- the confidence threshold may be decided based on testing data prepared in advance and according to experience of the developers.
- the voice tag creating unit 44 is configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into the recognition network 46.
- the testing speech recognizing unit 45 is configured to perform recognition on a testing speech by using the recognition network 46 to recognize the content of the testing speech when a user inputs the testing speech.
- the above is a description of the voice-tag apparatus based on confidence score of the embodiment.
- the voice-tag apparatus 40 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the first embodiment described above.
- although the recognition network 46 is included in the voice-tag apparatus 40 based on confidence score in this embodiment, it is not limited to this; the recognition network 46 may also reside outside the voice-tag apparatus 40 based on confidence score in other embodiments.
- the voice-tag apparatus 50 based on confidence score of the present embodiment comprises: phoneme recognition unit 51, confidence score calculating unit 52, confidence weight determining unit 53, voice tag creating unit 54, testing speech recognizing unit 55, recognition result combining unit 56 and recognition network 57.
- the phoneme recognition unit 51 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
- the confidence score calculating unit 52 is configured to calculate a confidence score for each of the plurality of pronunciation tags of the registration speech.
- the confidence weight determining unit 53 is configured to determine a confidence score based weight for each of the plurality of pronunciation tags. Herein, the higher the confidence score of a pronunciation tag is, the larger the weight determined for that pronunciation tag will be.
- the confidence weight determining unit 53 calculates the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags as the confidence score based weight of the pronunciation tag.
- the voice tag creating unit 54 is configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into the recognition network 57 and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags.
- the voice tag creating unit 54 selects at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags and creates the voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag.
- the testing speech recognizing unit 55 is configured to perform recognition on a testing speech by using the recognition network 57 to obtain a plurality of best recognition result candidates of the testing speech when a user inputs the testing speech.
- the recognition result combining unit 56 is configured to combine a plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates obtained by the testing speech recognizing unit 55 with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
- for the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates, the recognition result combining unit 56 performs the following process: it combines the plurality of recognition result candidates into one recognition result candidate, and calculates a weighted sum of the acoustic scores of the plurality of recognition result candidates on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
- the recognition result combining unit 56 selects the best recognition result candidate, namely the one with the highest acoustic score among the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
- the above is a description of the voice-tag apparatus based on confidence score of the embodiment.
- the voice-tag apparatus 50 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the second embodiment described above.
- although the recognition network 57 is included in the voice-tag apparatus 50 based on confidence score in this embodiment, it is not limited to this; the recognition network 57 may also reside outside the voice-tag apparatus 50 based on confidence score in other embodiments.
- the voice-tag apparatuses 40, 50 based on confidence score of the third and fourth embodiments as well as their components can be implemented with specifically designed circuits or chips or be implemented by a computing device (information processing device) executing corresponding programs.
Abstract
The invention provides a voice-tag method and apparatus based on confidence score. The voice-tag method based on confidence score comprises: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network. The present invention optimizes voice tags based on confidence score to reduce the confusion of recognition network consisting of voice tags in the multi-pronunciation registration based voice-tag technology.
Description
VOICE-TAG METHOD AND APPARATUS BASED ON CONFIDENCE SCORE
TECHNICAL FIELD
[0001] The present invention relates to information processing technology, specifically to a voice-tag method and apparatus based on confidence score.
TECHNICAL BACKGROUND
[0002] The voice-tag technology is an application of speech recognition technology, which is widely used especially in embedded speech recognition systems.
[0003] The working process of a voice-tag technology based system is as follows: firstly, the voice registration process is performed, that is, the user inputs a registration speech, and the system converts the registration speech into a tag which represents the pronunciation of the speech; then, the speech recognition process is performed, that is, when the user inputs a testing speech, the system performs recognition on the testing speech based on its recognition network consisting of voice tag items to determine the content of the testing speech. Usually, the recognition network of a voice-tag system consists of not only the voice tag items of registration speeches but also other items whose pronunciations are decided by a dictionary or a grapheme-to-phoneme (G2P) converting module, which can be called dictionary items.
[0004] The original voice-tag technology is usually implemented based on a template matching framework in which, in the registration process, one or more templates are extracted from a registration speech as the tags of the registration speech; in the recognition process, the Dynamic Time Warping (DTW) algorithm is applied between the testing speech and the template tags to do matching. Recently, along with the wide use of the phoneme based Hidden Markov Model (HMM) in the speech recognition field, phoneme sequences are more often used as the pronunciation tags of registration speeches in current mainstream voice-tag methods. It should be noted that, depending on the language, the phoneme, which is the unit of pronunciation, may be replaced with another voice unit; for example, for Chinese, the Initial and Final sequence may be used as the voice tag of a registration speech.
[0005] In the method which uses phoneme sequences as the pronunciation tags of registration speeches, the phoneme sequences are obtained by performing phoneme recognition on the registration speeches. The advantages of phoneme sequence tags are as follows: firstly, a phoneme sequence tag occupies less memory space than a template tag; secondly, phoneme sequence tag items are easily combined with dictionary items to form new items. These advantages of phoneme sequence tags are very helpful to enlarge the number of items provided by a recognition network.
[0006] However, phoneme sequence tags also have shortcomings: firstly, under the current phoneme recognition capability, phoneme recognition errors are unavoidable, with the result that a phoneme sequence tag may not correctly represent the pronunciation of a registration speech, thereby causing recognition errors; secondly, a mismatch between registration speech and testing speech may exist, which will also cause recognition errors.
[0007] Specifically, supposing that the registration speech is "wang ming", then the correct Initial and Final sequence corresponding to the registration speech should be "w ang m ing". However, due to the current recognition capability, the voice recognition system may give an incorrect recognition result, for example the Initial and Final sequence "w an m ing" for the registration speech, and thereby the incorrect sequence "w an m ing" will be added into the recognition network as the pronunciation tag of the registration speech. In this case, when the testing speech is also "wang ming", if the system determines that the testing speech is nearest to the sequence "w an m ing" in the recognition network, then the recognition result will be correct; however, since the system may determine that the testing speech is nearest to another sequence in the recognition network, an incorrect recognition result may be obtained.
[0008] Therefore, in the phoneme sequence tag based voice-tag technology, how to reduce the recognition errors caused by the above reasons has become a current research emphasis.
[0009] In order to overcome the shortcomings of the above phoneme sequence tag method, researchers proposed the following multi-pronunciation registration approach: for a registration speech, a voice tag item corresponding to the registration speech is constituted by a plurality of pronunciation tags based on different phoneme sequences. Specifically, when performing phoneme recognition on the registration speech, the N best phoneme sequence recognition results or the phoneme lattice recognition result are obtained as the pronunciation tags of the registration speech.
[0010] Specifically, by still taking the registration speech "wang ming" as an example, suppose that the voice recognition system gives the following three best Initial and Final sequences, arrayed in the descending order of acoustic score, after recognition of the registration speech:
1. "w an m ing";
2. "w an m in";
3. "w ang m ing";
then in the multi-pronunciation registration, the above sequences are combined into a voice tag item corresponding to the registration speech "wang ming" and added into the recognition network. Therefore, in the recognition process, as long as the recognition network determines that a testing speech is nearest to any one of the above three sequences, the match between the testing speech and the registration speech "wang ming" can be carried out. Thus, the recognition rate can be improved.
[0011] By using such a multi-pronunciation registration method, the negative effects on voice recognition due to phoneme recognition errors can be obviously reduced, and the recognition performance degradation due to the mismatch between registration speech and testing speech can be alleviated.
[0012] However, compared with the single-pronunciation registration, in which one phoneme sequence per registration speech is added into the recognition network, the multi-pronunciation registration adds a plurality of phoneme sequences, and will therefore increase the scale of the recognition network. Further, constituting a voice tag item from a plurality of pronunciation sequences will also increase the confusion of the recognition network, and in particular will degrade the recognition performance for dictionary items in the voice-tag system.
SUMMARY OF THE INVENTION
[0013] The present invention is proposed to resolve the above problem in the prior art, the object of which is to provide a voice-tag method and apparatus based on confidence score, in order to optimize voice tags based on confidence score in the multi-pronunciation registration technology, thereby reducing the confusion of the recognition network consisting of voice tags.
[0014] According to one aspect of the invention, there is provided a voice-tag method based on confidence score, comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; calculating a confidence score for each of the plurality of pronunciation tags; selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
[0015] According to another aspect of the invention, there is provided a voice-tag method based on confidence score, comprising: performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; determining a confidence score based weight for each of the plurality of pronunciation tags; creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0016] According to a further aspect of the invention, there is provided a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags; a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
[0017] According to yet another aspect of the invention, there is provided a voice-tag apparatus based on confidence score, comprising: a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech; a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags; a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags; and a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
BRIEF DESCRIPTION OF THE DRAWINGS
It is believed that the features, advantages and purposes of the present invention will be better understood from the following description of the detailed implementation of the present invention read in conjunction with the accompanying drawings, in which:
Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention;
Fig.2 depicts an example of the phoneme lattice of a registration speech;
Fig.3 depicts a flowchart of the voice-tag method based on confidence score according to the second embodiment of the invention;
Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention; and
Fig.5 depicts a block diagram of the voice-tag apparatus based on confidence score according to the fourth embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
Next, a detailed description of preferred embodiments of the present invention
will be given with reference to the drawings.
(First embodiment)
[0018] Firstly, the first embodiment of the present invention will be described in combination with Figs. 1~2. Fig.1 depicts a flowchart of the voice-tag method based on confidence score according to the first embodiment of the invention. In the present embodiment, the confidence score is used as the basis of selection of pronunciation tags for a registration speech.
[0019] Specifically, as shown in Fig.1, firstly, at step 105, the method performs phoneme recognition on a registration speech input by a user, to obtain a plurality of pronunciation tags of the registration speech. Specifically, the plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech. A so-called phoneme lattice is a multi-pronunciation representation generated by merging together the same parts of the plurality of phoneme sequences representing the pronunciations of the speech.
[0020] At this step, for the registration speech input by the user, a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, is employed to perform phoneme recognition to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
[0021] However, the person skilled in the art can appreciate that as long as a plurality of pronunciation tags can be obtained at this step, any phoneme recognition system or method presently known or future knowable may be employed but not limited to the above commonly used phoneme recognition system in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and there is no special limitation on this in the present invention.
[0022] At step 110, a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech.
[0023] Specifically, in the case that the plurality of pronunciation tags of the registration speech are a plurality of best phoneme sequences, a confidence score is calculated for each of the phoneme sequences.
Herein, by still taking the foregoing registration speech "wang ming" as an example, suppose that after the user inputs this registration speech "wang ming", the following three Initial and Final sequences, arrayed in the descending order of acoustic score, are obtained through recognition:
1. "w an m ing";
2. "w an m in";
3. "w ang m ing";
then at this step, a confidence score is calculated for each of the above three sequences, and it is supposed that the confidence scores are obtained as follows:
1. "w an m ing", confidence score: 70;
2. "w an m in", confidence score: 60;
3. "w ang m ing", confidence score: 75.
[0024] On the other hand, in the case that the plurality of pronunciation tags of the registration speech are a phoneme lattice, a confidence score is calculated for the single phoneme on each of the arcs in the phoneme lattice.
For example, suppose that after recognition on the registration speech "wang ming", another multi-pronunciation representation manner, namely the Initial and Final lattice as shown in Fig.2 corresponding to the above Initial and Final sequences 1~3, is obtained, which is generated by merging together the same parts of the above sequences 1~3. In this case, at this step, for the Initial and Final lattice, a confidence score is calculated for each element (Initial or Final) "w", "an", "ang", "m", "in", "ing" on the arcs.
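The merged lattice described above can be sketched as a simple data structure. The following is a minimal illustration, not part of the claimed apparatus: each slot holds the alternative Initials/Finals (arcs) for that position, and the per-arc confidence scores shown are hypothetical values chosen for illustration only.

```python
# Minimal sketch of the Initial and Final lattice of Fig.2.
# Each slot maps the competing units (arcs) at that position to a
# hypothetical per-arc confidence score (values are illustrative only).
lattice = [
    {"w": 80},              # single arc "w"
    {"an": 70, "ang": 75},  # competing arcs "an" / "ang"
    {"m": 85},              # single arc "m"
    {"in": 60, "ing": 72},  # competing arcs "in" / "ing"
]

def enumerate_paths(lattice):
    """Expand the lattice into the full set of pronunciation sequences."""
    paths = [[]]
    for slot in lattice:
        paths = [p + [unit] for p in paths for unit in slot]
    return [" ".join(p) for p in paths]

print(enumerate_paths(lattice))
```

Note that expanding such a lattice can license paths (here, "w ang m in") beyond the original three N-best sequences; this over-generation is a known property of the lattice representation.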
[0025] The person skilled in the art can appreciate that at this step, any method presently known or future knowable for calculating a confidence score for a phoneme sequence or a single phoneme, for example, the post-probability based confidence score calculating method or the anti-model based confidence score calculating method may be adopted.
[0026] Next, at step 115, at least one best pronunciation tag is selected from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
[0027] In an embodiment, at this step, the pronunciation tag with the highest confidence score is selected from the plurality of pronunciation tags as the at least one
best pronunciation tag.
[0028] In this case, when the plurality of pronunciation tags are a plurality of best phoneme sequences of the registration speech, on the basis of the confidence scores of the respective phoneme sequences, the phoneme sequence with the highest confidence score is selected from the plurality of best phoneme sequences as the best pronunciation tag. On the other hand, when the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the path whose arcs carry the phonemes with the highest confidence scores is reserved in the phoneme lattice, while the other arcs are removed, thereby constructing the best pronunciation tag of the registration speech from the reserved path.
[0029] In addition, in another embodiment, at this step, the pronunciation tags whose confidence scores are higher than a preset confidence threshold are selected from the plurality of pronunciation tags as the at least one best pronunciation tag.
[0030] In this case, when the plurality of pronunciation tags are a plurality of best phoneme sequences of the registration speech, on the basis of the confidence scores of the respective phoneme sequences, the phoneme sequences whose confidence scores are higher than the preset confidence threshold are selected from the plurality of best phoneme sequences. For example, in the case of the above three sequences 1~3 of the registration speech "wang ming", if the confidence threshold is set to 65, then the sequences 1 and 3, whose confidence scores are higher than the confidence threshold, will be selected from the three sequences 1~3 as the best pronunciation tags of the registration speech "wang ming".
[0031] On the other hand, when the plurality of pronunciation tags are the phoneme lattice of the registration speech, on the basis of the confidence scores of the phonemes on the respective arcs in the phoneme lattice, the arcs whose phonemes have lower confidence scores than the preset confidence threshold are removed from the phoneme lattice, thereby constructing the best pronunciation tags of the registration speech by using the reserved arcs.
[0032] Herein, the above confidence threshold may be decided according to the experience of developers. Specifically, for example, firstly, a large amount of testing data is prepared, then the phoneme recognition system used at step 105 is applied to
perform phoneme recognition on the testing data, and further confidence scores are calculated for the phoneme recognition results, and then a suitable confidence threshold may be set with reference to the confidence scores of high quality recognition results in order to ensure that the high quality recognition results can be selected with the confidence threshold.
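The two selection modes of step 115 described above (highest-score selection and threshold selection) can be sketched as follows for the case of N-best phoneme sequence tags. This is an illustrative sketch using the example scores from the text, not the patented implementation itself; the function name is ours.

```python
def select_best_tags(tags_with_scores, threshold=None):
    """Select at least one best pronunciation tag by confidence score.

    Without a threshold, keep only the single highest-scoring tag;
    with a threshold, keep every tag whose score exceeds it.
    """
    if threshold is None:
        best_tag, _ = max(tags_with_scores, key=lambda ts: ts[1])
        return [best_tag]
    return [tag for tag, score in tags_with_scores if score > threshold]

# Example confidence scores of the "wang ming" registration speech above.
tags = [("w an m ing", 70), ("w an m in", 60), ("w ang m ing", 75)]
print(select_best_tags(tags))                # highest-score mode
print(select_best_tags(tags, threshold=65))  # threshold mode: sequences 1 and 3
```

With the threshold set to 65, sequences 1 and 3 survive, matching the selection described in paragraph [0030].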
[0033] At step 120, a voice tag item corresponding to the registration speech is created based on the at least one best pronunciation tag to add into a recognition network. Thus, when a user inputs a testing speech, recognition can be performed on the testing speech by using the recognition network. Since the creation and addition of a voice tag item are existing knowledge in the art, the detailed description thereof is omitted.
[0034] The above is a description of the voice-tag method based on confidence score according to the first embodiment of the present invention. In the present embodiment, by selecting at least one best pronunciation tag from a plurality of pronunciation tags of a registration speech based on confidence scores to create a voice tag item corresponding to the registration speech, the voice tags can be optimized and the negative effects of multi-pronunciation registration on the application of voice tags can be reduced. Specifically, the scale of the recognition network consisting of voice tags can be decreased, the confusion of the recognition network can be reduced, and the recognition performance of voice tags, especially the dictionary items, can be improved. At the same time, since the method of the present embodiment still keeps the advantages of the multi-pronunciation registration to some degree, it can reduce the negative effect due to phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
(Second embodiment)
[0035] Next, the voice-tag method based on confidence score according to the second embodiment of the present invention will be described in combination with Fig.3. In the present embodiment, the confidence score is used to combine a plurality of pronunciation tags of a registration speech.
[0036] Specifically, as shown in Fig.3, firstly, at step 305, the method performs phoneme recognition on a registration speech inputted by a user, to obtain a plurality of pronunciation tags of the registration speech. Since the step is the same as the
above step 105 in Fig.1, the detailed description thereof is omitted.
[0037] At step 310, a confidence score is calculated for each of the plurality of pronunciation tags of the registration speech. Since the step is the same as the above step 110 in Fig.1, the detailed description thereof is omitted.
[0038] Next, at step 315, a confidence score based weight is determined for each of the pronunciation tags of the registration speech. Herein, the higher the confidence score of a pronunciation tag, the larger the weight determined for the pronunciation tag.
[0039] In an embodiment, at this step, the confidence score based weight is calculated for each of the plurality of pronunciation tags in accordance with the following equation (1):
weight i = confidence score i / (confidence score 1 + confidence score 2 + ... + confidence score n) (1)
wherein weight i denotes the confidence score based weight of the ith pronunciation tag; confidence score 1 denotes the confidence score of the first pronunciation tag, confidence score 2 denotes the confidence score of the second pronunciation tag, ..., and confidence score n denotes the confidence score of the nth pronunciation tag; and n denotes the number of the plurality of pronunciation tags. In other words, in accordance with the above equation (1), the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of this pronunciation tag to the sum of the confidence scores of all the plurality of pronunciation tags.
[0040] Next, the description will be given in combination with a specific example. By still taking the foregoing registration speech "wang ming" as an example, suppose that the recognition results and confidence score calculation results are the same as those of the first embodiment, namely:
1. "w an m ing", confidence score: 70;
2. "w an m in", confidence score: 60;
3. "w ang m ing", confidence score: 75;
then in this case, at this step, the confidence score based weights are calculated in accordance with the above equation (1) as follows:
1. "w an m ing", confidence score: 70, weight = 70 / (70+60+75) = 0.34;
2. "w an m in", confidence score: 60, weight = 60 / (70+60+75) = 0.29;
3. "w ang m ing", confidence score: 75, weight = 75 / (70+60+75) = 0.37.
That is, in the present embodiment, each of the plurality of pronunciation tags of the registration speech is defined as a component of the voice tag of the registration speech by using the confidence score based weight.
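Equation (1) can be written out directly in code. The following sketch reproduces the weight computation with the example confidence scores above; the function name is ours, not the patent's.

```python
def confidence_weights(scores):
    """Equation (1): weight_i = confidence score_i / sum of all scores."""
    total = sum(scores)
    return [score / total for score in scores]

# Confidence scores of the three pronunciation tags of "wang ming".
weights = confidence_weights([70, 60, 75])
print([round(w, 2) for w in weights])
```

Because the weights always sum to 1, the weighted acoustic score produced at step 330 stays on the same scale as the acoustic score of a single recognition result candidate.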
[0041] Next, at step 320, a voice tag item corresponding to the registration speech is created based on the plurality of pronunciation tags of the registration speech to add into a recognition network and meanwhile the confidence score based weight of each of the plurality of pronunciation tags is recorded.
[0042] At this step, the voice tag item corresponding to the registration speech may be created directly based on the plurality of pronunciation tags obtained for the registration speech at step 305, or be created based on at least one best pronunciation tag which is selected from the plurality of pronunciation tags on the basis of the confidence score of each of the plurality of pronunciation tags like step 115 in the first embodiment. As to this step, the foregoing detailed description about step 115 may be referred to, and the detailed description of this step is accordingly omitted.
[0043] Next, at step 325, when a user inputs a testing speech, the recognition is performed on the testing speech by using the recognition network to obtain a plurality of best recognition result candidates of the testing speech.
[0044] Specifically, at this step, when performing recognition on the testing speech by using the recognition network, all pronunciation sequences (namely, pronunciation tags) near to the testing speech are obtained from the recognition network by matching, as the plurality of best recognition result candidates of the testing speech.
[0045] For example, in the case that the user inputs the testing speech "wu ming", suppose that by obtaining all sequences near to the testing speech, the recognition network obtains the nearest pronunciation sequence "w u m ing" and a similar sequence, as well as the three sequences in the voice tag item corresponding to the registration speech "wang ming", and finally outputs the following recognition results, arrayed in the descending order of acoustic score, for the testing speech:
1. w an m in, acoustic score: 90;
2. w u m ing, acoustic score: 89;
3. w u n ing, acoustic score: 87;
4. w an m ing, acoustic score: 80;
5. w ang m ing, acoustic score: 70.
[0046] At step 330, among the plurality of best recognition result candidates of the testing speech, the plurality of recognition result candidates belonging to a same voice tag item are combined with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0047] Specifically, at this step, the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates of the testing speech are combined into one recognition result candidate, and a weighted sum of the acoustic scores of the plurality of recognition result candidates belonging to a same voice tag item is calculated on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
[0048] Next, the description will be given in combination with a specific example. By still taking the foregoing testing speech "wu ming" and the recognition result candidates 1~5 thereof as an example, suppose that, according to the recognition network, the recognition result candidates 1, 4 and 5 belong to a same voice tag item, namely the voice tag item corresponding to the registration speech "wang ming", while the recognition result candidates 2 and 3 belong to different voice tag items; then at this step, the recognition result candidates 1, 4 and 5 will be combined into one recognition result candidate, and a weighted sum of the acoustic scores of the recognition result candidates 1, 4 and 5 will be calculated on the basis of the confidence score based weights of the respective pronunciation tags corresponding to the recognition result candidates 1, 4 and 5, as the acoustic score of the combined recognition result candidate. Thereby, through combination, the recognition result candidates will become:
1, 4, 5. w an m in (w an m ing, w ang m ing), acoustic score after combination: 90*0.29+80*0.34+70*0.37=79.2;
2. w u m ing, acoustic score: 89;
3. w u n ing, acoustic score: 87.
[0049] Thus, the recognition result candidates 1, 4 and 5 are combined into one recognition result candidate, which corresponds to the voice tag item of the registration speech "wang ming".
[0050] Herein, it should be noted that since the recognition result candidates 1, 4 and 5 belong to one voice tag item and correspond to the registration speech "wang ming" before combination, even if they are combined, the obtained combined recognition result candidate can still correspond to the registration speech "wang ming".
[0052] At step 335, the recognition result candidate with the highest acoustic score is selected from the recognition result candidates formed by the plurality of best recognition result candidates after combination as the final recognition result.
[0054] Therefore, in the above example, since the recognition result candidate 2 ("w u m ing") becomes the one with the highest acoustic score after the weight based combination of the recognition result candidates 1~5, it will be selected as the final recognition result; thus the correct recognition result can be obtained.
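Steps 330 and 335 can be sketched end-to-end as follows. This is an illustrative reconstruction using the example data above; the mapping names (`tag_item_of`, `weight_of`) are ours, and candidates outside any voice tag item (such as dictionary items) are assumed here to keep their acoustic scores unchanged (weight 1.0).

```python
def combine_candidates(candidates, tag_item_of, weight_of):
    """Combine recognition result candidates that belong to the same voice
    tag item into one candidate whose acoustic score is the weighted sum of
    their scores (step 330), then pick the best candidate (step 335).

    candidates:  list of (pronunciation, acoustic_score) pairs
    tag_item_of: maps a pronunciation to its voice tag item id
    weight_of:   maps a pronunciation to its confidence score based weight
    """
    combined = {}
    for pron, score in candidates:
        item = tag_item_of.get(pron, pron)  # lone candidates stand alone
        entry = combined.setdefault(item, {"prons": [], "score": 0.0})
        entry["prons"].append(pron)
        entry["score"] += score * weight_of.get(pron, 1.0)
    return max(combined.values(), key=lambda e: e["score"])

# Recognition result candidates 1~5 of the testing speech "wu ming".
candidates = [
    ("w an m in", 90), ("w u m ing", 89), ("w u n ing", 87),
    ("w an m ing", 80), ("w ang m ing", 70),
]
tag_item_of = {p: "wang ming" for p in ("w an m in", "w an m ing", "w ang m ing")}
weight_of = {"w an m in": 0.29, "w an m ing": 0.34, "w ang m ing": 0.37}

best = combine_candidates(candidates, tag_item_of, weight_of)
print(best["prons"], round(best["score"], 1))
```

The three candidates of the "wang ming" voice tag item combine to 90*0.29+80*0.34+70*0.37 = 79.2, which falls below the 89 of "w u m ing", so the correct result wins, as described in paragraph [0054].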
[0055] In addition, if it is supposed that the recognition result candidates 2 and 3 of the testing speech "wu ming" also belong to a same voice tag item, then the recognition result candidates 2 and 3 will also be combined on the basis of the confidence score based weights. Further, if the combined recognition result of the recognition result candidates 2 and 3 still has the highest acoustic score, then it will be selected, whereby the voice tag item to which the recognition result candidates 2 and 3 belong will become the one matching the testing speech "wu ming"; thus the correct content of the testing speech "wu ming" can be recognized based on this voice tag item.
[0056] The above is a description of the voice-tag method based on confidence score according to the second embodiment of the present invention. In the present embodiment, by combining recognition result candidates belonging to a same voice tag item with confidence score based weights, the negative effects of multi-pronunciation registration on the application of voice tags can be reduced. Specifically, the confusion of the recognition network consisting of voice tags can be reduced, and the recognition performance of voice tags, especially the dictionary items, can be improved. At the same time, since the method of the present embodiment still keeps the advantages of the multi-pronunciation registration, it can reduce the negative effect due to phoneme recognition errors, as well as reduce recognition errors due to the mismatch between registration speech and testing speech.
(Third embodiment)
[0057] Under the same inventive conception, the present invention provides a voice-tag apparatus based on confidence score which will be described in detail below in conjunction with drawings.
[0058] Fig.4 depicts a block diagram of the voice-tag apparatus based on confidence score according to the third embodiment of the invention. As shown in Fig.4, the voice-tag apparatus 40 based on confidence score of the present embodiment comprises: a phoneme recognition unit 41, a confidence score calculating unit 42, a pronunciation tag selecting unit 43, a voice tag creating unit 44, a testing speech recognizing unit 45 and a recognition network 46.
[0059] Specifically, the phoneme recognition unit 41 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech. The plurality of pronunciation tags may be a plurality of best phoneme sequences or the phoneme lattice of the registration speech.
[0060] In an embodiment, the phoneme recognition unit 41 is implemented based on a phoneme recognition system commonly used in the art which adopts the Hidden Markov Model as the acoustic model and performs decoding by using the Viterbi searching algorithm, and the phoneme recognition unit 41 performs phoneme recognition on the registration speech inputted by the user to obtain a plurality of best phoneme sequences of the registration speech which are arrayed in the descending order of acoustic score or the phoneme lattice of the registration speech.
[0061] Of course, it is not limited to this; the phoneme recognition unit 41 may be implemented with any phoneme recognition system or method presently known or future knowable, and there is no limitation on this in the present invention.
[0062] The confidence score calculating unit 42 is configured to calculate a confidence score for each of the plurality of pronunciation tags.
[0063] Specifically, in the case that the plurality of pronunciation tags of the registration speech are a plurality of best phoneme sequences, the confidence score calculating unit 42 calculates a confidence score for each of the phoneme sequences. In addition, in the case that the plurality of pronunciation tags of the registration speech are a phoneme lattice, the confidence score calculating unit 42 calculates a confidence score for a single phoneme on each of the arcs in the phoneme lattice.
[0064] The confidence score calculating unit 42 may be implemented based on any method, presently known or knowable in the future, for calculating a confidence score for a phoneme sequence or a single phoneme, for example, the posterior-probability based confidence score calculating method or the anti-model based confidence score calculating method.
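As an illustrative sketch only (the unit itself may use any known method, as stated above), a posterior-probability based confidence score for a phoneme sequence can be taken as the geometric mean of the per-phoneme posterior probabilities. The posteriors here are assumed inputs; in a real system they would be derived from the decoder's lattice.

```python
import math

def sequence_confidence(posteriors):
    """Confidence of a phoneme sequence, taken here as the geometric
    mean of the per-phoneme posterior probabilities (each in (0, 1])."""
    if not posteriors:
        return 0.0
    return math.exp(sum(math.log(p) for p in posteriors) / len(posteriors))

# A sequence with uniformly high phoneme posteriors scores higher
# than one containing poorly matched phonemes.
```

For example, sequence_confidence([0.9, 0.8, 0.95]) is roughly 0.88, while a sequence with a badly recognized phoneme, say posteriors [0.9, 0.1, 0.95], scores far lower, which is what makes the score usable for ranking pronunciation tags.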
[0065] The pronunciation tag selecting unit 43 is configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags.
[0066] In an embodiment, the pronunciation tag selecting unit 43 selects the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
[0067] In addition, in another embodiment, the pronunciation tag selecting unit 43 selects, from the plurality of pronunciation tags, the pronunciation tags whose confidence scores are higher than a preset confidence threshold as the at least one best pronunciation tag. As mentioned above, the confidence threshold may be decided based on testing data prepared in advance and according to the experience of the developers.
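The two selection strategies of paragraphs [0066] and [0067] can be sketched as follows. The pronunciation strings and scores are invented for the example, and the fallback to the single best tag when no tag clears the threshold is an assumption of this sketch, not stated in the text.

```python
def select_best_tags(tags, threshold=None):
    """tags: list of (pronunciation_tag, confidence_score) pairs.
    With no threshold, return the single tag with the highest score;
    with a threshold, return all tags scoring above it."""
    best = max(tags, key=lambda t: t[1])
    if threshold is None:
        return [best]
    above = [t for t in tags if t[1] > threshold]
    return above or [best]  # assumption: fall back to the single best tag

# Invented pronunciation tags with confidence scores.
tags = [("t ah m", 0.92), ("t oh m", 0.75), ("d ah m", 0.40)]
```

Here select_best_tags(tags) keeps only ("t ah m", 0.92), while select_best_tags(tags, threshold=0.7) keeps the first two tags, both of which would then be registered for the voice tag item.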
[0068] The voice tag creating unit 44 is configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into the recognition network 46.
[0069] The testing speech recognizing unit 45 is configured to perform recognition on a testing speech by using the recognition network 46, so as to recognize the content of the testing speech when a user inputs the testing speech.
[0070] The above is a description of the voice-tag apparatus based on confidence score of the embodiment. The voice-tag apparatus 40 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the first embodiment described above.
It should be noted that although the recognition network 46 is included in the voice-tag apparatus 40 based on confidence score in this embodiment, it is not limited to this. The recognition network 46 may also reside outside the voice-tag apparatus 40
based on confidence score in other embodiments.
(Fourth embodiment)
[0071] Next, the voice-tag apparatus based on confidence score according to the fourth embodiment of the present invention will be described in combination with Fig.5.
[0072] As shown in Fig.5, the voice-tag apparatus 50 based on confidence score of the present embodiment comprises: phoneme recognition unit 51, confidence score calculating unit 52, confidence weight determining unit 53, voice tag creating unit 54, testing speech recognizing unit 55, recognition result combining unit 56 and recognition network 57.
[0073] Specifically, the phoneme recognition unit 51 is configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech.
[0074] The confidence score calculating unit 52 is configured to calculate a confidence score for each of the plurality of pronunciation tags of the registration speech.
[0075] The confidence weight determining unit 53 is configured to determine a confidence score based weight for each of the plurality of pronunciation tags. Herein, the higher the confidence score of a pronunciation tag is, the larger the weight determined for that pronunciation tag will be.
[0076] In an embodiment, the confidence weight determining unit 53, for each of the plurality of pronunciation tags, calculates the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags as the confidence score based weight of the pronunciation tag.
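The ratio computation of paragraph [0076] reduces to a simple normalization, sketched below with invented confidence scores:

```python
def confidence_weights(scores):
    """Confidence score based weight of each pronunciation tag: its
    confidence score divided by the sum of the confidence scores of
    all the tags, so that a tag with a higher confidence score
    receives a larger weight and the weights sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

# Three pronunciation tags with confidence scores 0.8, 0.6 and 0.6
# receive weights 0.4, 0.3 and 0.3 respectively.
```

Because the weights are normalized to sum to 1, the weighted sum of acoustic scores computed later by the recognition result combining unit 56 stays on the same scale as the scores of uncombined candidates.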
[0077] The voice tag creating unit 54 is configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into the recognition network 57 and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags.
[0078] In an embodiment, the voice tag creating unit 54 selects at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags and creates the voice tag item corresponding to the registration speech based on the selected at least one best
pronunciation tag.
[0079] The testing speech recognizing unit 55 is configured to perform recognition on a testing speech by using the recognition network 57, so as to obtain a plurality of best recognition result candidates of the testing speech when a user inputs the testing speech.
[0080] The recognition result combining unit 56 is configured to combine a plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates obtained by the testing speech recognizing unit 55 with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
[0081] In an embodiment, the recognition result combining unit 56 for the plurality of recognition result candidates belonging to a same voice tag item among the plurality of best recognition result candidates performs the following process: combines the plurality of recognition result candidates into one recognition result candidate, and calculates a weighted sum of the acoustic scores of the plurality of recognition result candidates on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
[0082] In addition, the recognition result combining unit 56 selects, as the final recognition result, the best recognition result candidate, namely the one with the highest acoustic score, from among the recognition result candidates remaining after the combination of the plurality of best recognition result candidates.
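For illustration, the combination performed in paragraphs [0081] and [0082] can be sketched as follows; the candidate tuples, voice tag names and weight table are all invented for the example:

```python
def combine_candidates(candidates, weights):
    """candidates: (voice_tag_id, pronunciation_tag, acoustic_score) tuples;
    weights: (voice_tag_id, pronunciation_tag) -> confidence score based weight.
    Candidates sharing a voice tag item are merged into one candidate whose
    acoustic score is the weighted sum of their acoustic scores; the merged
    candidate with the highest score is returned as the final result."""
    merged = {}
    for tag_id, pron, score in candidates:
        merged[tag_id] = merged.get(tag_id, 0.0) + weights[(tag_id, pron)] * score
    return max(merged.items(), key=lambda kv: kv[1])

candidates = [("Tom", "t ah m", 0.9),   # two candidates belonging to the
              ("Tom", "t oh m", 0.7),   # same voice tag item "Tom"
              ("Mary", "m eh r i", 0.8)]
weights = {("Tom", "t ah m"): 0.6, ("Tom", "t oh m"): 0.4,
           ("Mary", "m eh r i"): 1.0}

best, score = combine_candidates(candidates, weights)
# "Tom" wins with 0.6*0.9 + 0.4*0.7 = 0.82 > 0.8
```

Note that without the combination, "Tom" would be split into two weaker candidates (0.9 and 0.7 discounted by their weights) and could lose to "Mary"; merging them is what recovers the advantage of multi-pronunciation registration.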
[0083] The above is a description of the voice-tag apparatus based on confidence score of the embodiment. The voice-tag apparatus 50 based on confidence score of the embodiment can operationally implement the voice-tag method based on confidence score in the second embodiment described above.
It should be noted that although the recognition network 57 is included in the voice-tag apparatus 50 based on confidence score in this embodiment, it is not limited to this. The recognition network 57 may also reside outside the voice-tag apparatus 50 based on confidence score in other embodiments.
[0084] It can be appreciated by the person skilled in the art that the voice-tag apparatuses 40, 50 based on confidence score of the third and fourth embodiments as
well as their components can be implemented with specifically designed circuits or chips or be implemented by a computing device (information processing device) executing corresponding programs.
[0085] While the voice-tag method and apparatus based on confidence score of the present invention have been described in detail with some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various variations and modifications within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments; its scope is defined only by the appended claims.
Claims
1. A voice-tag method based on confidence score, comprising:
performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
calculating a confidence score for each of the plurality of pronunciation tags;
selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and
creating a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
2. A voice-tag method based on confidence score, comprising:
performing phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
determining a confidence score based weight for each of the plurality of pronunciation tags;
creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly recording the confidence score based weight of each of the plurality of pronunciation tags; and
when recognition on a testing speech is performed by using the recognition network, combining a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
3. The method according to claim 2, wherein the step of determining a confidence score based weight for each of the plurality of pronunciation tags further comprises: calculating a confidence score for each of the plurality of pronunciation tags; and determining the confidence score based weight for each of the plurality of pronunciation tags, wherein the higher the confidence score of the pronunciation tag is, the larger the weight determined for the pronunciation tag will be.
4. The method according to claim 2, wherein:
the confidence score based weight of each of the plurality of pronunciation tags is the ratio of the confidence score of the pronunciation tag to the sum of confidence scores of all the plurality of pronunciation tags.
5. The method according to claim 2, wherein the step of creating a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags further comprises:
selecting at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and creating the voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag.
6. The method according to claim 1 or 5, wherein the step of selecting at least one best pronunciation tag further comprises:
selecting the pronunciation tag with the highest confidence score from the plurality of pronunciation tags as the at least one best pronunciation tag.
7. The method according to claim 1 or 5, wherein the step of selecting at least one best pronunciation tag further comprises:
selecting the pronunciation tags whose confidence scores are higher than a preset confidence threshold from the plurality of pronunciation tags as the at least one best pronunciation tag.
8. The method according to claim 2, wherein the step of combining further comprises:
for the plurality of recognition result candidates belonging to a same voice tag item among the recognition result candidates:
combining the plurality of recognition result candidates into one recognition result candidate; and
calculating a weighted sum of the acoustic scores of the plurality of recognition result candidates on the basis of the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates, as the acoustic score of the combined recognition result candidate.
9. A voice-tag apparatus based on confidence score, comprising:
a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
a confidence score calculating unit configured to calculate a confidence score for each of the plurality of pronunciation tags;
a pronunciation tag selecting unit configured to select at least one best pronunciation tag from the plurality of pronunciation tags based on the confidence score of each of the plurality of pronunciation tags; and
a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the selected at least one best pronunciation tag to add into a recognition network.
10. A voice-tag apparatus based on confidence score, comprising:
a phoneme recognition unit configured to perform phoneme recognition on a registration speech to obtain a plurality of pronunciation tags of the registration speech;
a confidence weight determining unit configured to determine a confidence score based weight for each of the plurality of pronunciation tags;
a voice tag creating unit configured to create a voice tag item corresponding to the registration speech based on the plurality of pronunciation tags to add into a recognition network and correspondingly record the confidence score based weight of each of the plurality of pronunciation tags; and
a recognition result combining unit configured to, when recognition on a testing speech is performed by using the recognition network, combine a plurality of recognition result candidates belonging to a same voice tag item among recognition result candidates with the confidence score based weights of the pronunciation tags corresponding to the plurality of recognition result candidates.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2010/052954 WO2012001458A1 (en) | 2010-06-29 | 2010-06-29 | Voice-tag method and apparatus based on confidence score |
CN2010800015191A CN102439660A (en) | 2010-06-29 | 2010-06-29 | Voice-tag method and apparatus based on confidence score |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2010/052954 WO2012001458A1 (en) | 2010-06-29 | 2010-06-29 | Voice-tag method and apparatus based on confidence score |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012001458A1 true WO2012001458A1 (en) | 2012-01-05 |
Family
ID=45401457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2010/052954 WO2012001458A1 (en) | 2010-06-29 | 2010-06-29 | Voice-tag method and apparatus based on confidence score |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN102439660A (en) |
WO (1) | WO2012001458A1 (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104078050A (en) | 2013-03-26 | 2014-10-01 | 杜比实验室特许公司 | Device and method for audio classification and audio processing |
US9715878B2 (en) | 2013-07-12 | 2017-07-25 | GM Global Technology Operations LLC | Systems and methods for result arbitration in spoken dialog systems |
DE102014109122A1 (en) * | 2013-07-12 | 2015-01-15 | Gm Global Technology Operations, Llc | Systems and methods for result-based arbitration in speech dialogue systems |
CN103500579B (en) * | 2013-10-10 | 2015-12-23 | 中国联合网络通信集团有限公司 | Audio recognition method, Apparatus and system |
CN103559881B (en) * | 2013-11-08 | 2016-08-31 | 科大讯飞股份有限公司 | Keyword recognition method that languages are unrelated and system |
CN106157969B (en) * | 2015-03-24 | 2020-04-03 | 阿里巴巴集团控股有限公司 | Method and device for screening voice recognition results |
CN107808662B (en) * | 2016-09-07 | 2021-06-22 | 斑马智行网络(香港)有限公司 | Method and device for updating grammar rule base for speech recognition |
CN106340297A (en) * | 2016-09-21 | 2017-01-18 | 广东工业大学 | Speech recognition method and system based on cloud computing and confidence calculation |
TWI697890B (en) * | 2018-10-12 | 2020-07-01 | 廣達電腦股份有限公司 | Speech correction system and speech correction method |
CN110070854A (en) * | 2019-04-17 | 2019-07-30 | 北京爱数智慧科技有限公司 | Voice annotation quality determination method, device, equipment and computer-readable medium |
CN112447173A (en) * | 2019-08-16 | 2021-03-05 | 阿里巴巴集团控股有限公司 | Voice interaction method and device and computer storage medium |
CN110364146B (en) * | 2019-08-23 | 2021-07-27 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1165590A (en) * | 1997-08-25 | 1999-03-09 | Nec Corp | Voice recognition dialing device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003052737A1 (en) * | 2001-12-17 | 2003-06-26 | Asahi Kasei Kabushiki Kaisha | Speech recognition method, remote controller, information terminal, telephone communication terminal and speech recognizer |
US7313527B2 (en) * | 2003-01-23 | 2007-12-25 | Intel Corporation | Registering an utterance and an associated destination anchor with a speech recognition engine |
CN1753083B (en) * | 2004-09-24 | 2010-05-05 | 中国科学院声学研究所 | Speech sound marking method, system and speech sound discrimination method and system based on speech sound mark |
Non-Patent Citations (2)
Title |
---|
ANNE-MARIE DEROUAULT ET AL.: "Improving Speech Recognition Accuracy with Contextual Phonemes and MMI Traning", ICASSP-89 1989 INTERNATIONAL CONFERENCE ON, vol. 1, May 1989 (1989-05-01), pages 116 - 119 * |
YAN MING CHENG ET AL.: "VOICE-TO-PHONEME Conversion Algorithms for SPEAKER-INDEPENDENT VOICE-TAG Applications in Embedded Platforms", AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, 2005 IEEE WORKSHOP ON, 27 November 2005 (2005-11-27), pages 403 - 408 * |
Also Published As
Publication number | Publication date |
---|---|
CN102439660A (en) | 2012-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2012001458A1 (en) | Voice-tag method and apparatus based on confidence score | |
CN106683677B (en) | Voice recognition method and device | |
JP4410265B2 (en) | Speech recognition apparatus and method | |
US8271282B2 (en) | Voice recognition apparatus, voice recognition method and recording medium | |
JP6284462B2 (en) | Speech recognition method and speech recognition apparatus | |
US20220262352A1 (en) | Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation | |
US7921014B2 (en) | System and method for supporting text-to-speech | |
JP2008216756A (en) | Technique for acquiring character string or the like to be newly recognized as phrase | |
CN105654940B (en) | Speech synthesis method and device | |
JP5752060B2 (en) | Information processing apparatus, large vocabulary continuous speech recognition method and program | |
JPWO2009081861A1 (en) | Word category estimation device, word category estimation method, speech recognition device, speech recognition method, program, and recording medium | |
CN112242144A (en) | Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium | |
JP6690484B2 (en) | Computer program for voice recognition, voice recognition device and voice recognition method | |
CN112750445B (en) | Voice conversion method, device and system and storage medium | |
CN111599339B (en) | Speech splicing synthesis method, system, equipment and medium with high naturalness | |
KR101483947B1 (en) | Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof | |
JP6027754B2 (en) | Adaptation device, speech recognition device, and program thereof | |
JP5738216B2 (en) | Feature amount correction parameter estimation device, speech recognition system, feature amount correction parameter estimation method, speech recognition method, and program | |
KR102299269B1 (en) | Method and apparatus for building voice database by aligning voice and script | |
KR101066472B1 (en) | Apparatus and method speech recognition based initial sound | |
JP2010230913A (en) | Voice processing apparatus, voice processing method, and voice processing program | |
JP5104732B2 (en) | Extended recognition dictionary learning device, speech recognition system using the same, method and program thereof | |
JP5772219B2 (en) | Acoustic model generation apparatus, acoustic model generation method, and computer program for acoustic model generation | |
JP2003271185A (en) | Device and method for preparing information for voice recognition, device and method for recognizing voice, information preparation program for voice recognition, recording medium recorded with the program, voice recognition program and recording medium recorded with the program | |
US20120130715A1 (en) | Method and apparatus for generating a voice-tag |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201080001519.1 Country of ref document: CN |
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10854018 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 10854018 Country of ref document: EP Kind code of ref document: A1 |