CN102013253A - Speech recognition method based on speed difference of voice unit and system thereof - Google Patents

Speech recognition method based on speed difference of voice unit and system thereof Download PDF

Info

Publication number
CN102013253A
CN102013253A CN2009101728759A CN200910172875A
Authority
CN
China
Prior art keywords
recognition result
voice
voice unit
word speed
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009101728759A
Other languages
Chinese (zh)
Other versions
CN102013253B (en)
Inventor
赵蕤
鄢翔
何磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Priority to CN2009101728759A priority Critical patent/CN102013253B/en
Publication of CN102013253A publication Critical patent/CN102013253A/en
Application granted granted Critical
Publication of CN102013253B publication Critical patent/CN102013253B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to a speech recognition method based on the speed difference of voice units, comprising: preprocessing input speech; extracting acoustic features of the speech; decoding the speech according to a pre-trained acoustic model and the extracted acoustic features to obtain a plurality of candidate recognition results, wherein each candidate recognition result has an acoustic score and the segment lengths (durations) of the voice units it contains; calculating, for each candidate recognition result, a voice-unit speed difference based on the durations of the contained voice units; calculating a comprehensive score for the candidate recognition result based on the speed difference and the acoustic score; and selecting, from the plurality of candidate recognition results, the candidate with the highest comprehensive score as the final recognition result of the speech. The present invention also provides a corresponding speech recognition system.

Description

Speech recognition method and speech recognition system based on differences in voice-unit speech rate
Technical field
The present invention relates to speech recognition technology, and in particular to a method of performing speech recognition according to differences in the speech rate of voice units, and to a corresponding speech recognition system.
Background technology
Typically, speech recognition comprises pre-processing of the speech signal, extraction of acoustic features, and search/decoding. The input speech signal is first pre-processed, which includes pre-filtering, sampling and quantization, windowing and framing, endpoint detection, pre-emphasis, and so on. Feature extraction is then performed on the pre-processed signal to obtain acoustic features such as linear prediction coefficients (LPC), cepstral coefficients (CEP), Mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP) features. Based on the extracted features and a pre-trained acoustic model, a search strategy such as the Viterbi algorithm decodes the signal to produce the corresponding recognition result.
During speech recognition, duration (segment-length) information is not affected by noise or channel distortion and is therefore very important for robustness. Existing methods that exploit duration information commonly model the duration of a voice unit (e.g. a state, phoneme, or word) explicitly with a statistical distribution (e.g. a normal distribution, a gamma distribution, or a Gaussian mixture model, GMM) and then combine the duration score with the acoustic score during decoding. Such methods can improve recognition performance to some extent.
For example, the article by David Burshtein, "Robust Parametric Modeling of Durations in Hidden Markov Models" (International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1995), describes in detail a scheme that models state durations with a gamma distribution. The article by D. Povey, "Phone Duration Modeling for LVCSR" (ICASSP, 2004), describes in detail a scheme that models phoneme durations with discrete distributions.
However, duration information is itself easily affected by speech rate, so adding speech-rate information to the duration model can further improve recognition performance. How to account for both duration and speech rate in speech recognition without increasing time and memory consumption has therefore become a focus of research.
The basic idea of existing methods that add speech-rate information to a duration model is to remove the negative effect of speech rate on the model.
One common method normalizes durations by the speech rate, where the speech rate is defined as the average duration of all voice units within a sentence. However, because this rate can only be computed once the whole sentence is available, duration normalization cannot be performed in real time during recognition. This rate-based duration-normalization method is described in detail in the article by V. R. R. Gadde, "Modeling Word Duration for Better Speech Recognition" (Proc. of the Speech Transcription Workshop, 2000).
Another method builds separate duration models for different speech rates, for example one model each for fast, medium, and slow speech, and selects the highest-scoring model during recognition. However, the accuracy of these models is not high, and because the probabilities of three models must be computed separately, the amount of computation and the computing time increase significantly. This approach is described in detail in the article by Yun Tang, Wenju Liu, and Bo Xu, "Trigram Duration Modeling in Speech Recognition" (International Symposium on Chinese Spoken Language Processing, 2004), and in the article by Wern-Jun Wang and Chun-Jen Lee, "Duration Modeling for Mandarin Speech Recognition Using Prosodic Information" (Speech Prosody, 2004).
Yet another duration-normalization method uses the duration of the preceding voice unit to normalize that of the current voice unit. However, this method requires pre-computing and storing normalized duration models for all possible two-unit contexts, so its memory consumption is large. It is described in detail in the U.S. patent by Masahide Arui, Shinichi Tanaka, and Takashi Masuko, "Apparatus, Method and Computer Program Product for Speech Recognition".
Summary of the invention
The present invention has been made in view of the above technical problems. Its object is to provide a speech recognition method and a speech recognition system based on differences in voice-unit speech rate that take into account the influence of speech rate on duration and can improve recognition performance, yet require no duration modeling and consume very little memory and computing time.
According to one aspect of the present invention, there is provided a speech recognition method based on differences in voice-unit speech rate, comprising: pre-processing input speech; extracting acoustic features of the speech; decoding the speech based on a pre-trained acoustic model and the extracted acoustic features to obtain a plurality of candidate recognition results of the speech, wherein each of the plurality of candidates has an acoustic score and the durations of the voice units it contains; for each of the plurality of candidates, calculating the candidate's voice-unit speech-rate difference value based on the durations of the contained voice units, and calculating the candidate's combined score based on the calculated speech-rate difference value and the acoustic score; and selecting, from the plurality of candidates, the candidate with the highest combined score as the final recognition result of the speech.
According to another aspect of the present invention, there is provided a speech recognition system based on differences in voice-unit speech rate, comprising: a pre-processing module for pre-processing input speech; a feature extraction module for extracting acoustic features of the speech; a decoding module for decoding the speech based on a pre-trained acoustic model and the extracted acoustic features to obtain a plurality of candidate recognition results of the speech, wherein each of the plurality of candidates has an acoustic score and the durations of the voice units it contains; a voice-unit speech-rate difference calculation module for calculating, for each of the plurality of candidates, the candidate's speech-rate difference value based on the durations of the contained voice units; a combined-score calculation module for calculating, for each of the plurality of candidates, the candidate's combined score based on the calculated speech-rate difference value and the acoustic score; and a selection module for selecting, from the plurality of candidates, the candidate with the highest combined score as the final recognition result of the speech.
Description of drawings
Fig. 1 is a flowchart of a speech recognition method based on differences in voice-unit speech rate according to an embodiment of the present invention;
Fig. 2 is a schematic block diagram of a speech recognition system based on differences in voice-unit speech rate according to a first embodiment of the present invention;
Fig. 3 is a schematic block diagram of a speech recognition system based on differences in voice-unit speech rate according to a second embodiment of the present invention;
Fig. 4 is a schematic block diagram of a speech recognition system based on differences in voice-unit speech rate according to a third embodiment of the present invention;
Fig. 5 is a schematic block diagram of a speech recognition system based on differences in voice-unit speech rate according to a fourth embodiment of the present invention.
Embodiment
The above and other objects, technical features, and advantages of the present invention will become more apparent from the following detailed description of specific embodiments in conjunction with the accompanying drawings.
Fig. 1 shows a flowchart of a speech recognition method based on differences in voice-unit speech rate according to an embodiment of the present invention. This embodiment is described in detail below with reference to the drawing.
This embodiment assumes that the speech rate within a sentence is stable, i.e. that each voice unit in the sentence is spoken at substantially the same rate. Therefore, among candidate recognition results with similar acoustic scores, a candidate whose voice units differ little in speech rate is more likely to be the correct result than one whose voice units differ greatly. Based on this observation, the present embodiment uses the speech-rate difference among voice units, combined with the acoustic score, to select the best recognition result.
As shown in Fig. 1, at step S101 the input speech is pre-processed and its acoustic features are then extracted. Speech pre-processing and feature extraction are well known to those of ordinary skill in the art, so their detailed explanation is omitted here. Step S101 yields acoustic features of the speech such as linear prediction coefficients (LPC), cepstral coefficients (CEP), Mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP) features.
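As a rough illustration of the pre-processing mentioned in step S101, the following NumPy sketch applies pre-emphasis and splits the signal into overlapping windowed frames. The filter coefficient, frame length, and hop size are conventional illustrative values, not parameters specified by the patent.

```python
import numpy as np

def preemphasize(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split the signal into overlapping Hamming-windowed frames."""
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

# Example: 1 s of a 100 Hz tone at 8 kHz, 25 ms frames with a 10 ms hop
sr = 8000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 100 * t)
frames = frame_signal(preemphasize(speech), frame_len=200, hop=80)
print(frames.shape)  # (98, 200)
```

Feature vectors such as MFCCs would then be computed per frame; that step is omitted here since the patent treats it as standard prior art.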
Next, at step S105, the speech is decoded based on the pre-trained acoustic model and the extracted acoustic features to obtain a plurality of candidate recognition results of the speech. Decoding searches for the word sequence of the input speech according to a search strategy such as the Viterbi algorithm, N-best search, or multi-pass search; it is well known to those of ordinary skill in the art, so its detailed description is omitted here. In this embodiment, the Viterbi algorithm may be adopted as the search strategy. Each candidate obtained by decoding has a corresponding acoustic score and the durations of the voice units it contains.
Then, at step S110, for each of the plurality of candidates obtained in step S105, the candidate's voice-unit speech-rate difference value is calculated based on the durations of the voice units it contains.
In this embodiment, a voice unit may be any one of a state, a phoneme, a syllable, a word, or a phrase. The speech rate of a voice unit is defined as the ratio of the actual duration obtained in step S105 to the average duration of the corresponding voice unit in the speech corpus, i.e.
r_u = d_u / m_u    (1)
where r_u denotes the speech rate of the u-th voice unit, d_u denotes the duration of the u-th voice unit, and m_u denotes the average duration of the voice unit in the speech corpus corresponding to the u-th voice unit.
In step S110, the speech rate of each voice unit in the candidate is first calculated according to formula (1), and the candidate's voice-unit speech-rate difference value is then calculated.
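Formula (1) can be sketched in a few lines; the unit labels and average durations below are made-up illustrations, not data from the patent.

```python
# Speech rate of each voice unit: r_u = d_u / m_u  (formula (1)).
# avg_duration maps a unit label to its average duration m_u in the corpus;
# the labels and numbers are hypothetical, chosen only for illustration.
avg_duration = {"zh": 8.0, "ong1": 12.0, "g": 6.0, "uo2": 14.0}  # in frames

def speech_rates(units):
    """units: list of (label, observed duration d_u) from the decoder."""
    return [d / avg_duration[label] for label, d in units]

candidate = [("zh", 10), ("ong1", 12), ("g", 9), ("uo2", 7)]
print(speech_rates(candidate))  # [1.25, 1.0, 1.5, 0.5]
```

A rate above 1 means the unit was spoken more slowly than average in this notation, since r_u is the observed duration relative to the corpus average.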
In one embodiment, the voice-unit speech-rate difference value is defined as the difference between the maximum and minimum speech rates of all the voice units in a candidate, i.e. the range of the speech rates. Assuming the candidate contains N voice units, the difference value can be calculated according to the following formula:
s_d = max(r_1, r_2, ..., r_N) - min(r_1, r_2, ..., r_N),
where s_d denotes the voice-unit speech-rate difference value. In this case, the maximum and minimum are selected from the calculated speech rates of all the voice units, and their difference is computed.
In another embodiment, the voice-unit speech-rate difference value is defined as the variance of the speech rates of all the voice units in the candidate, i.e.
s_d = var(r_1, r_2, ..., r_N).
In this case, the variance of all the speech rates is calculated according to the variance formula.
In another embodiment, the voice-unit speech-rate difference value is defined as the standard deviation of the speech rates of all the voice units in the candidate, i.e.
s_d = stdv(r_1, r_2, ..., r_N).
In this case, the standard deviation of all the speech rates is calculated according to the standard-deviation formula.
In yet another embodiment, the voice-unit speech-rate difference value is defined as the coefficient of variation of the speech rates of all the voice units in the candidate, i.e. the ratio of the standard deviation of the speech rates to their mean, as shown in the following formula:
s_d = stdv(r_1, r_2, ..., r_N) / mean(r_1, r_2, ..., r_N)
In this case, the standard deviation and the mean of all the voice-unit speech rates are calculated separately, and their ratio is computed.
Although several methods of calculating the voice-unit speech-rate difference value have been described above, those of ordinary skill in the art should understand that other methods may also be used, as long as they capture the overall spread of the voice-unit speech rates.
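The four difference measures above (range, variance, standard deviation, coefficient of variation) can be sketched with the standard library as follows. The patent does not specify population versus sample statistics; population formulas are assumed here.

```python
import statistics

def rate_range(rates):      # max - min (first embodiment)
    return max(rates) - min(rates)

def rate_variance(rates):   # population variance (second embodiment)
    return statistics.pvariance(rates)

def rate_stdev(rates):      # population standard deviation (third embodiment)
    return statistics.pstdev(rates)

def rate_cv(rates):         # coefficient of variation (fourth embodiment)
    return statistics.pstdev(rates) / statistics.mean(rates)

rates = [1.25, 1.0, 1.5, 0.5]  # illustrative per-unit speech rates
print(rate_range(rates))  # 1.0
```

Any of the four can serve as s_d; all are monotone indicators of how unevenly the candidate's voice units were spoken.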
Thus, step S110 yields each candidate's voice-unit speech-rate difference value. Then, at step S115, each candidate's combined score is calculated from the candidate's calculated speech-rate difference value and acoustic score.
For the combined score, note that the best recognition result should have an acoustic score that is as high as possible and a speech-rate difference value that is as low as possible. Therefore, when the combined score is calculated from the speech-rate difference value and the acoustic score, the difference value is usually inverted before being combined with the acoustic score. Several embodiments of the combined-score calculation are given below. Of course, those of ordinary skill in the art should understand that methods other than those described below may also be used to calculate the combined score.
In one embodiment, for each candidate, the reciprocal of the voice-unit speech-rate difference value is first calculated, the reciprocal is then weighted by a predetermined weight coefficient, and the weighted reciprocal is added to the acoustic score to obtain the candidate's combined score.
In another embodiment, the negative of the voice-unit speech-rate difference value is first calculated, the negative is then weighted by a predetermined weight coefficient, and the weighted negative is added to the acoustic score to obtain the candidate's combined score.
In yet another embodiment, the reciprocal of the voice-unit speech-rate difference value is first calculated, the reciprocal is then weighted by a predetermined weight coefficient, and the weighted reciprocal is multiplied by the acoustic score to obtain the candidate's combined score.
In the above embodiments of the combined-score calculation, the weight coefficient can be adjusted according to the recognition task.
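The three score-combination variants above can be sketched as follows; the score values and the weight are illustrative, and the patent leaves the weight coefficient to per-task tuning.

```python
def combined_additive_reciprocal(acoustic, s_d, w=1.0):
    """Combined score = acoustic + w * (1 / s_d)  (first variant)."""
    return acoustic + w * (1.0 / s_d)

def combined_additive_negative(acoustic, s_d, w=1.0):
    """Combined score = acoustic + w * (-s_d)  (second variant)."""
    return acoustic - w * s_d

def combined_multiplicative(acoustic, s_d, w=1.0):
    """Combined score = acoustic * (w / s_d)  (third variant)."""
    return acoustic * (w / s_d)

# Illustrative: acoustic score 10.0, speech-rate difference 0.5, weight 2.0
print(combined_additive_negative(10.0, 0.5, w=2.0))  # 9.0
```

All three reward a high acoustic score and penalize a large speech-rate spread; the reciprocal variants additionally assume s_d is nonzero.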
Finally, at step S120, the candidate with the highest combined score is selected, according to the candidates' combined scores, as the final recognition result of the input speech.
As can be seen from the above description, the speech recognition method of this embodiment takes into account the influence of speech rate on duration and can therefore improve recognition performance, while avoiding duration modeling. Moreover, the method only needs to store the average duration of each voice unit in advance, so its memory consumption is small, and the calculation of the speech-rate difference value is simple, so its computing time is short. The method is applicable to any speech recognition system, in particular small-vocabulary systems.
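Putting steps S110 through S120 together, rescoring an N-best list can be sketched as follows. The unit inventory, average durations, candidate contents, and weight are all illustrative; the range measure and the additive-negative combination are one of the several variants the embodiments allow.

```python
# Illustrative average durations m_u (in frames) for a toy unit inventory.
AVG = {"a": 8.0, "b": 8.0, "c": 16.0}

def rescore(candidates, w=1.0):
    """candidates: list of (acoustic_score, [(unit, duration), ...]).
    Uses the range of per-unit speech rates as s_d and the additive-negative
    combination acoustic - w * s_d; returns the best candidate and its score."""
    best, best_score = None, float("-inf")
    for acoustic, units in candidates:
        rates = [d / AVG[u] for u, d in units]   # formula (1) per unit
        s_d = max(rates) - min(rates)            # step S110 (range variant)
        score = acoustic - w * s_d               # step S115
        if score > best_score:                   # step S120
            best, best_score = (acoustic, units), score
    return best, best_score

nbest = [
    (12.0, [("a", 16), ("b", 4), ("c", 16)]),  # rates 2.0, 0.5, 1.0 -> s_d 1.5
    (11.5, [("a", 10), ("b", 8), ("c", 16)]),  # rates 1.25, 1.0, 1.0 -> s_d 0.25
]
best, score = rescore(nbest)
print(score)  # 11.25
```

Note how the second candidate wins despite its lower acoustic score, because its voice units are spoken at a more uniform rate — exactly the within-sentence stability assumption the embodiment relies on.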
Under the same inventive concept, Fig. 2 shows a schematic block diagram of a speech recognition system 200 based on differences in voice-unit speech rate according to a first embodiment of the present invention. The embodiment is described in detail below with reference to the drawing; explanation of parts identical to the preceding embodiment is omitted as appropriate.
As shown in Fig. 2, the speech recognition system 200 of this embodiment comprises: a pre-processing module 201, which pre-processes the input speech; a feature extraction module 202, which extracts the acoustic features of the speech; a decoding module 203, which decodes the speech based on a pre-trained acoustic model and the extracted acoustic features to obtain a plurality of candidate recognition results of the speech; a voice-unit speech-rate difference calculation module 204, which calculates, for each of the plurality of candidates, the candidate's speech-rate difference value based on the durations of the voice units it contains; a combined-score calculation module 205, which calculates, for each of the plurality of candidates, the candidate's combined score based on the calculated speech-rate difference value and the acoustic score; and a selection module 206, which selects, from the plurality of candidates, the candidate with the highest combined score as the final recognition result of the input speech.
In this embodiment, input speech is first pre-processed by the pre-processing module 201, and the feature extraction module 202 then extracts its acoustic features. The extracted features, together with the pre-trained acoustic model, are supplied to the decoding module 203, which decodes the speech according to the search strategy to obtain a plurality of candidates, each having an acoustic score and the durations of the voice units it contains. As mentioned above, a voice unit may be any one of a state, a phoneme, a syllable, a word, or a phrase.
After the decoding module 203 outputs the plurality of candidates, the speech-rate difference calculation module 204 calculates, for each candidate, the speech-rate difference value based on the durations of the contained voice units.
In this embodiment, within the speech-rate difference calculation module 204, a speech-rate calculation unit 2041 first calculates the speech rate of each voice unit in each candidate. As mentioned above, the speech rate is defined as the ratio of the voice unit's duration (i.e. the actual duration obtained by the decoding module 203) to the average duration of the corresponding voice unit in the speech corpus. A range calculation unit 2042 then calculates the difference between the maximum and minimum of the speech rates of all the voice units as the candidate's voice-unit speech-rate difference value.
Then, in the combined-score calculation module 205, each candidate's combined score is calculated from the candidate's speech-rate difference value and acoustic score. In this embodiment, a reciprocal calculation unit 2051 first calculates the reciprocal of the candidate's speech-rate difference value; a weighting unit 2052 then weights the calculated reciprocal by a predetermined weight coefficient; finally, a summing unit 2053 adds the weighted reciprocal to the acoustic score to form the candidate's combined score.
Alternatively, the negative may be used instead of the reciprocal when calculating a candidate's combined score. That is, in the combined-score calculation module 205, a negation unit first calculates the negative of the candidate's speech-rate difference value, a weighting unit then weights the calculated negative by a predetermined weight coefficient, and a summing unit adds the weighted negative to the acoustic score to form the candidate's combined score.
Further alternatively, the combined-score calculation module 205 may comprise: a reciprocal calculation unit, which calculates the reciprocal of the candidate's speech-rate difference value; a weighting unit, which weights the calculated reciprocal by a predetermined weight coefficient; and a multiplication unit, which multiplies the weighted reciprocal by the acoustic score to form the candidate's combined score.
In the above combined-score calculation module 205, the weight coefficient can be adjusted according to the recognition task.
Finally, all the candidates and their combined scores are supplied to the selection module 206, which selects, according to the combined scores, the candidate with the highest combined score from the plurality of candidates as the final recognition result of the speech.
Fig. 3 shows a schematic block diagram of a speech recognition system 300 based on differences in voice-unit speech rate according to a second embodiment of the present invention, in which parts identical to the preceding embodiment are given the same reference numerals and their explanation is omitted as appropriate. The embodiment is described in detail below with reference to the drawing.
The structure of the speech recognition system 300 of this embodiment is basically the same as that of the speech recognition system 200 shown in Fig. 2; the difference lies in the structure of the voice-unit speech-rate difference calculation module 304.
In the speech-rate difference calculation module 304 of this embodiment, a speech-rate calculation unit 3041 first calculates the speech rate of each voice unit in each candidate. A variance calculation unit 3042 then calculates the variance of the speech rates of all the voice units of each candidate as the candidate's voice-unit speech-rate difference value.
Similarly, the speech recognition system 400 according to a third embodiment of the present invention, illustrated in Fig. 4, differs from the speech recognition systems 200 and 300 shown in Figs. 2 and 3 only in the structure of the voice-unit speech-rate difference calculation module 404.
In the speech-rate difference calculation module 404 of this embodiment, a speech-rate calculation unit 4041 first calculates the speech rate of each voice unit in each candidate. A standard-deviation calculation unit 4042 then calculates the standard deviation of the speech rates of all the voice units of each candidate as the candidate's voice-unit speech-rate difference value.
Similarly, the speech recognition system 500 according to a fourth embodiment of the present invention, illustrated in Fig. 5, differs from the speech recognition systems 200, 300, and 400 shown in Figs. 2, 3, and 4 only in the structure of the voice-unit speech-rate difference calculation module 504.
In the speech-rate difference calculation module 504 of this embodiment, a speech-rate calculation unit 5041 first calculates the speech rate of each voice unit in each candidate. A standard-deviation calculation unit 5042 and a mean calculation unit 5043 then calculate, respectively, the standard deviation and the mean of the speech rates of all the voice units of each candidate, and a ratio calculation unit 5044 calculates the ratio of the standard deviation to the mean as the candidate's voice-unit speech-rate difference value.
It should be understood that the speech recognition systems 200, 300, 400, and 500 of the above embodiments and their components may be implemented with dedicated circuits or chips, or by a computer (processor) executing corresponding programs. In operation, the speech recognition systems of the above embodiments can carry out the speech recognition method based on differences in voice-unit speech rate shown in Fig. 1.
Although the speech recognition method and speech recognition system based on differences in voice-unit speech rate of the embodiments of the present invention have been described in detail above through several exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various changes and modifications within the spirit and scope of the present invention. The present invention is therefore not limited to these embodiments; its scope is defined solely by the appended claims.

Claims (10)

1. A speech recognition method based on differences in voice-unit speech rate, comprising:
pre-processing input speech;
extracting acoustic features of the speech;
decoding the speech based on a pre-trained acoustic model and the extracted acoustic features to obtain a plurality of candidate recognition results of the speech, wherein each of the plurality of candidates has an acoustic score and the durations of the voice units it contains;
for each of the plurality of candidates,
calculating the candidate's voice-unit speech-rate difference value based on the durations of the contained voice units; and
calculating the candidate's combined score based on the calculated speech-rate difference value and the acoustic score; and
selecting, from the plurality of candidates, the candidate with the highest combined score as the final recognition result of the speech.
2. The speech recognition method according to claim 1, wherein the step of calculating the candidate's voice-unit speech-rate difference value comprises:
for each voice unit in the candidate, calculating the speech rate of the voice unit, wherein the speech rate is the ratio of the duration of the voice unit to the average duration of the corresponding voice unit in the speech corpus; and
calculating the difference between the maximum and minimum of the speech rates of all the voice units as the candidate's voice-unit speech-rate difference value.
3. The speech recognition method according to claim 1, wherein the step of calculating the voice unit speed difference value of the candidate recognition result comprises:
for each voice unit in the candidate recognition result, calculating the speech rate of the voice unit, wherein the speech rate is the ratio of the segment length of the voice unit to the average segment length of the corresponding voice unit in a speech corpus; and
calculating the variance of the speech rates of all the voice units as the voice unit speed difference value of the candidate recognition result.
4. The speech recognition method according to claim 1, wherein the step of calculating the voice unit speed difference value of the candidate recognition result comprises:
for each voice unit in the candidate recognition result, calculating the speech rate of the voice unit, wherein the speech rate is the ratio of the segment length of the voice unit to the average segment length of the corresponding voice unit in a speech corpus; and
calculating the standard deviation of the speech rates of all the voice units as the voice unit speed difference value of the candidate recognition result.
5. The speech recognition method according to claim 1, wherein the step of calculating the voice unit speed difference value of the candidate recognition result comprises:
for each voice unit in the candidate recognition result, calculating the speech rate of the voice unit, wherein the speech rate is the ratio of the segment length of the voice unit to the average segment length of the corresponding voice unit in a speech corpus;
calculating the standard deviation and the mean of the speech rates of all the voice units; and
calculating the ratio of the standard deviation to the mean as the voice unit speed difference value of the candidate recognition result.
6. The speech recognition method according to claim 1, wherein the step of calculating the comprehensive score of the candidate recognition result comprises:
calculating the reciprocal of the voice unit speed difference value of the candidate recognition result;
weighting the reciprocal; and
adding the weighted reciprocal to the acoustic score to obtain the comprehensive score of the candidate recognition result.
7. The speech recognition method according to claim 1, wherein the step of calculating the comprehensive score of the candidate recognition result comprises:
calculating the negative of the voice unit speed difference value of the candidate recognition result;
weighting the negative; and
adding the weighted negative to the acoustic score to obtain the comprehensive score of the candidate recognition result.
8. The speech recognition method according to claim 1, wherein the step of calculating the comprehensive score of the candidate recognition result comprises:
calculating the reciprocal of the voice unit speed difference value of the candidate recognition result;
weighting the reciprocal; and
multiplying the weighted reciprocal by the acoustic score to obtain the comprehensive score of the candidate recognition result.
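Claims 6 through 8 give three ways of combining the difference value with the acoustic score. A sketch, with an illustrative default weight of 0.5 (the claims leave the weighting unspecified, so the value is an assumption):

```python
def score_reciprocal_add(acoustic_score, diff, weight=0.5):
    # Claim 6: weighted reciprocal of the difference value added to
    # the acoustic score.
    return acoustic_score + weight * (1.0 / diff)

def score_negative_add(acoustic_score, diff, weight=0.5):
    # Claim 7: weighted negative of the difference value added to the
    # acoustic score.
    return acoustic_score + weight * (-diff)

def score_reciprocal_mul(acoustic_score, diff, weight=0.5):
    # Claim 8: acoustic score multiplied by the weighted reciprocal of
    # the difference value.
    return acoustic_score * weight * (1.0 / diff)
```

In all three variants a smaller difference value (more uniform speech rate) yields a higher comprehensive score, so the final selection step of claim 1 is unchanged.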
9. The speech recognition method according to claim 1, wherein the voice unit is any one of a state, a phoneme, a syllable, a word, or a phrase.
10. A speech recognition system based on the speed difference of voice units, comprising:
a preprocessing module for performing preprocessing on input speech;
a feature extraction module for extracting acoustic features of the speech;
a decoding module for decoding the speech based on a pre-trained acoustic model and the extracted acoustic features to obtain a plurality of candidate recognition results of the speech, wherein each of the plurality of candidate recognition results has an acoustic score and the segment lengths of the voice units it contains;
a voice unit speed difference value calculation module for calculating, for each of the plurality of candidate recognition results, the voice unit speed difference value of the candidate recognition result based on the segment lengths of the voice units it contains;
a comprehensive score calculation module for calculating, for each of the plurality of candidate recognition results, a comprehensive score of the candidate recognition result based on the calculated voice unit speed difference value and the acoustic score; and
a selection module for selecting, from the plurality of candidate recognition results, the candidate recognition result with the highest comprehensive score as the final recognition result of the speech.
CN2009101728759A 2009-09-07 2009-09-07 Speech recognition method based on speed difference of voice unit and system thereof Expired - Fee Related CN102013253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009101728759A CN102013253B (en) 2009-09-07 2009-09-07 Speech recognition method based on speed difference of voice unit and system thereof


Publications (2)

Publication Number Publication Date
CN102013253A true CN102013253A (en) 2011-04-13
CN102013253B CN102013253B (en) 2012-06-06

Family

ID=43843398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101728759A Expired - Fee Related CN102013253B (en) 2009-09-07 2009-09-07 Speech recognition method based on speed difference of voice unit and system thereof

Country Status (1)

Country Link
CN (1) CN102013253B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103137126A (en) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 Intelligent electronic device based on voice control and voice control method
CN103137127A (en) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 Intelligent electronic device based on voice control and voice control method
CN103137125A (en) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 Intelligent electronic device based on voice control and voice control method
CN104021786A (en) * 2014-05-15 2014-09-03 北京中科汇联信息技术有限公司 Speech recognition method and speech recognition device
CN104424290A (en) * 2013-09-02 2015-03-18 佳能株式会社 Voice based question-answering system and method for interactive voice system
CN104751847A (en) * 2015-03-31 2015-07-01 刘畅 Data acquisition method and system based on overprint recognition
CN104823235A (en) * 2013-11-29 2015-08-05 三菱电机株式会社 Speech recognition device
CN105989839A (en) * 2015-06-03 2016-10-05 乐视致新电子科技(天津)有限公司 Speech recognition method and speech recognition device
WO2018014537A1 (en) * 2016-07-22 2018-01-25 百度在线网络技术(北京)有限公司 Voice recognition method and apparatus
CN108428446A (en) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Audio recognition method and device
CN109065051A (en) * 2018-09-30 2018-12-21 珠海格力电器股份有限公司 A kind of voice recognition processing method and device
CN109102810A (en) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 Method for recognizing sound-groove and device
WO2021134546A1 (en) * 2019-12-31 2021-07-08 李庆远 Input method for increasing speech recognition rate
WO2021134549A1 (en) * 2019-12-31 2021-07-08 李庆远 Human merging and training of multiple artificial intelligence outputs
CN113782014A (en) * 2021-09-26 2021-12-10 联想(北京)有限公司 Voice recognition method and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI627626B (en) * 2017-04-27 2018-06-21 醫療財團法人徐元智先生醫藥基金會亞東紀念醫院 Voice rehabilitation and therapy system and method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1221937C (en) * 2002-12-31 2005-10-05 北京天朗语音科技有限公司 Voice identification system of voice speed adaption
CN1835076B (en) * 2006-04-07 2010-05-12 安徽中科大讯飞信息科技有限公司 Speech evaluating method of integrally operating speech identification, phonetics knowledge and Chinese dialect analysis

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103137127A (en) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 Intelligent electronic device based on voice control and voice control method
CN103137125A (en) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 Intelligent electronic device based on voice control and voice control method
CN103137126A (en) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 Intelligent electronic device based on voice control and voice control method
CN104424290A (en) * 2013-09-02 2015-03-18 佳能株式会社 Voice based question-answering system and method for interactive voice system
CN104823235A (en) * 2013-11-29 2015-08-05 三菱电机株式会社 Speech recognition device
CN104823235B (en) * 2013-11-29 2017-07-14 三菱电机株式会社 Voice recognition device
CN104021786A (en) * 2014-05-15 2014-09-03 北京中科汇联信息技术有限公司 Speech recognition method and speech recognition device
CN104021786B (en) * 2014-05-15 2017-05-24 北京中科汇联信息技术有限公司 Speech recognition method and speech recognition device
CN104751847A (en) * 2015-03-31 2015-07-01 刘畅 Data acquisition method and system based on overprint recognition
CN105989839B (en) * 2015-06-03 2019-12-13 乐融致新电子科技(天津)有限公司 Speech recognition method and device
CN105989839A (en) * 2015-06-03 2016-10-05 乐视致新电子科技(天津)有限公司 Speech recognition method and speech recognition device
WO2018014537A1 (en) * 2016-07-22 2018-01-25 百度在线网络技术(北京)有限公司 Voice recognition method and apparatus
CN109102810A (en) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 Method for recognizing sound-groove and device
CN109102810B (en) * 2017-06-21 2021-10-15 北京搜狗科技发展有限公司 Voiceprint recognition method and device
CN108428446A (en) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Audio recognition method and device
US10978047B2 (en) 2018-03-06 2021-04-13 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recognizing speech
CN109065051A (en) * 2018-09-30 2018-12-21 珠海格力电器股份有限公司 A kind of voice recognition processing method and device
CN109065051B (en) * 2018-09-30 2021-04-09 珠海格力电器股份有限公司 Voice recognition processing method and device
WO2021134546A1 (en) * 2019-12-31 2021-07-08 李庆远 Input method for increasing speech recognition rate
WO2021134549A1 (en) * 2019-12-31 2021-07-08 李庆远 Human merging and training of multiple artificial intelligence outputs
CN113782014A (en) * 2021-09-26 2021-12-10 联想(北京)有限公司 Voice recognition method and device
CN113782014B (en) * 2021-09-26 2024-03-26 联想(北京)有限公司 Speech recognition method and device

Also Published As

Publication number Publication date
CN102013253B (en) 2012-06-06

Similar Documents

Publication Publication Date Title
CN102013253B (en) Speech recognition method based on speed difference of voice unit and system thereof
CN110706690B (en) Speech recognition method and device thereof
US11996097B2 (en) Multilingual wakeword detection
CN109545243B (en) Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium
Chang et al. Large vocabulary Mandarin speech recognition with different approaches in modeling tones.
Wester Pronunciation modeling for ASR–knowledge-based and data-derived methods
WO2020029404A1 (en) Speech processing method and device, computer device and readable storage medium
US20110077943A1 (en) System for generating language model, method of generating language model, and program for language model generation
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
CN112750446B (en) Voice conversion method, device and system and storage medium
Mouaz et al. Speech recognition of moroccan dialect using hidden Markov models
CN107093422B (en) Voice recognition method and voice recognition system
CN112750445B (en) Voice conversion method, device and system and storage medium
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
CN111968622A (en) Attention mechanism-based voice recognition method, system and device
JP3660512B2 (en) Voice recognition method, apparatus and program recording medium
Sinha et al. Empirical analysis of linguistic and paralinguistic information for automatic dialect classification
Yousfi et al. Holy Qur'an speech recognition system Imaalah checking rule for warsh recitation
Singhal et al. Automatic speech recognition for connected words using DTW/HMM for English/Hindi languages
Sinha et al. Continuous density hidden markov model for hindi speech recognition
Ma et al. Language identification with deep bottleneck features
CN111785302A (en) Speaker separation method and device and electronic equipment
Jalalvand et al. A classifier combination approach for Farsi accents recognition
Tripathi et al. Robust vowel region detection method for multimode speech
Nouza Strategies for developing a real-time continuous speech recognition system for czech language

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120606

Termination date: 20160907
