CN101334998A

CN101334998A - Chinese speech recognition system based on heterogeneous model differentiated fusion

Info

Publication number: CN101334998A
Application number: CNA2008100414660A
Authority: CN
Inventors: 朱杰; 黄浩
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2008-08-07
Filing date: 2008-08-07
Publication date: 2008-12-31

Abstract

The invention relates to a Chinese speech recognition system based on discriminative fusion of heterogeneous models, in the field of speech recognition technology. The system comprises a model probability weight allocation module, a discriminative model probability weight training module, a model probability weight smoothing module and a discriminative fusion speech recognition module. The weight allocation module generates context-dependent model probability weight sets for the linguistic context of every arc of a lattice and initializes them; the discriminative training module uses the minimum phone error criterion to train discriminatively on the outputs of the heterogeneous models, obtains the minimum phone error accumulators, and derives the discriminative model probability weight sets from them; the smoothing module smooths between the input context-dependent model probability weight sets; the discriminative fusion speech recognition module performs speech recognition using the smoothed weight sets. The system can reduce the recognition error rate of speech recognition.

Description

Chinese speech recognition system based on heterogeneous model differentiated fusion

Technical field

The present invention relates to a system in the field of speech recognition technology, specifically a Chinese speech recognition system based on discriminative ("differentiated") fusion of heterogeneous models.

Background technology

Large-vocabulary continuous speech recognition systems are increasingly moving toward multi-modal, multi-information fusion, and using several heterogeneous models to reduce the confusability of a speech recognition system is an important means of improving recognition performance. Chinese speech recognition is a special case of using multiple heterogeneous models. One major difference between Chinese and English speech recognition is that Chinese is a tonal language. The national standard lists 6763 commonly used Chinese characters. The syllable is the natural unit of Chinese speech, and each Chinese character represents one syllable. Standard Chinese has 1282 tonal syllables and 412 toneless syllables (syllables sharing the same initial-final combination, hereinafter called base syllables). Every syllable in Chinese therefore carries a tone. Standard Chinese has five tones in total: the high level tone, the rising tone, the falling-rising tone, the falling tone and the neutral tone. For syllables built from the same initial and final, a different tone usually corresponds to a different character, so tone plays an important role in forming words and distinguishing meaning. In other words, the tone model provides an effective means of distinguishing homophonous characters and words. In natural spoken language in particular, utterances that are ungrammatical, disfluent or syntactically ambiguous occur frequently, and there the tone model can effectively reduce the perplexity of spontaneous speech recognition.

In a Chinese large-vocabulary continuous speech recognition system, tone information can be used to improve performance. One approach first models continuous speech with hidden Markov models on spectral features, giving the spectral feature model, and builds a tone model from tone features. During recognition, the spectral feature model is first used to recognize the speech and produce a lattice; for every arc in the lattice, the start and end times of the voiced segment are obtained by Viterbi alignment, and a tone score is computed for each voiced segment. On the basis of the lattice structure, the models (spectral feature model, tone model) are fused, and the error rate is reduced in a second decoding pass.

A search of the prior art literature finds Lei Xin et al., "Improved Tone Modeling for Mandarin Broadcast News Speech Recognition", International Conference on Speech and Language Processing, pp. 1277-1280, Sep. 2006, and Wang Huanliang et al., "Improved Mandarin Speech Recognition by Lattice Rescoring with Enhanced Tone Models", The 5th International Symposium on Chinese Spoken Language Processing, pp. 445-443, 2006. Both use heuristic methods: the global weights of the spectral feature model and the tone model are chosen empirically or by search, and the heterogeneous models are fused with these weights. This approach usually does not achieve the best continuous speech recognition results, because the spectral feature model and the tone model are trained independently and do not match well during continuous speech recognition. In addition, global model weights cannot model specific phonetic/semantic contexts. As the number of heterogeneous models increases, the search space grows exponentially, which makes manual selection even harder.

Summary of the invention

The object of the invention is to address the deficiencies of existing systems by providing a Chinese speech recognition system based on discriminative fusion of heterogeneous models, so that in a speech recognition system where multiple models act together, the models of each type match each other better and the optimal recognition result is reached.

The invention is realized through the following technical solution. The invention comprises: a model probability weight allocation module, a discriminative model probability weight training module, a model probability weight smoothing module and a discriminative fusion speech recognition module, wherein:

The model probability weight allocation module generates context-dependent model probability weight sets for the linguistic context of every arc of the lattice and initializes them;

The discriminative model probability weight training module receives the initialized model probability weight sets, produces the forward-backward data, trains discriminatively on the outputs of the heterogeneous models using the minimum phone error (MPE) criterion to obtain the MPE accumulators, and obtains the discriminative model probability weight sets from these accumulators;

The model probability weight smoothing module smooths between the input context-dependent model probability weight sets to obtain smoothed model probability weight sets;

The discriminative fusion speech recognition module uses the smoothed weight sets to perform speech recognition and output the result.

The model probability weight allocation module produces weight sets according to the phonetic/semantic context of the lattice. The context comprises the tonal syllable type of the current syllable, the initial model, the final model and the word context, and the module produces four kinds of weight sets in total:

Tonal-syllable-dependent weight set: each tonal syllable is given a pair of model probability weights;

Final-model-dependent weight set: each distinct final triphone model is given a group of model probability weights;

Model-combination-dependent weight set: each initial-final triphone model combination is given a pair of model probability weights;

Word-dependent weight set: each tonal syllable of each character in each whole Chinese word is given a pair of model probability weights.

The discriminative model probability weight training module comprises: a forward-backward data computation submodule, an MPE accumulator computation submodule and a model probability weight update submodule, wherein:

The forward-backward data computation submodule takes the initial weight sets as input and performs the forward-backward computation over the lattice. For every arc q this comprises: the forward probability P_{α}(q) of all paths from the start node to the head node of the arc; the backward probability P_{β}(q) of all paths from the end node to the tail node of the arc; the average forward accuracy A_{α}(q) of all paths from the start node to the head node of the arc; and the average backward accuracy A_{β}(q) of all paths from the end node to the tail node of the arc;

The MPE accumulator computation submodule uses the outputs P_{α}(q) and P_{β}(q) of the forward-backward submodule to obtain the posterior probability γ_{q} of passing through each arc, uses A_{α}(q) and A_{β}(q) to obtain the average correctness c(q) of all paths passing through each arc, and from these data obtains the MPE arc accumulator γ_{q}^{MPE}:

γ_{q}^{MPE} = γ_{q} (c (q) - c_{avg}),

where c_{avg} is the average correctness of all paths in the lattice;

The model probability weight update submodule uses the output γ_{q}^{MPE} of the MPE accumulator computation submodule to update the model probability weights iteratively, as follows:

η_{m, i}^{'} = \frac{κ γ_{q}^{MPE} η_{m, i} \log (O_{i} | ξ_{i}) |_{η} + C η_{m, i}}{\underset{i}{Σ} (κ γ_{q}^{MPE} η_{m, i} \log (O_{i} | ξ_{i}) |_{η} + C η_{m, i})} \qquad (1)

Wherein: η_{m, i}^{'} is the updated model probability weight, η_{m, i} is the model probability weight of the previous iteration, i denotes the i-th heterogeneous model belonging to arc q, and m denotes the m-th group of model probability weights to which arc q belongs. The weights satisfy the following conditions: η_{m, i}^{'} > 0, η_{m, i} > 0,

\underset{i}{Σ} η_{m, i} = 1 .

κ is a scaling constant that reduces the dynamic range of the probabilities; \log (O_{i} | ξ_{i}) is the log probability of model ξ_{i} (spectral feature model or tone model), O_{i} is the observation of model ξ_{i} (spectral feature or tone feature), and C is an empirically chosen smoothing control constant. The discriminative model probability weight training module repeats the three submodule steps above, iterating the update until the objective function converges, and outputs the final η_{m, i}^{'}.
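The update in formula (1) can be illustrated with a minimal sketch. It applies the formula literally to one weight group m and one arc q; how the statistics of several arcs sharing the same group are accumulated is not shown, and the function name `update_weights` and all numeric values are assumptions chosen only for illustration.

```python
def update_weights(eta, log_probs, gamma_mpe, kappa, C):
    """Apply update (1) once for one weight group (hypothetical helper).

    eta       -- current weights, one per heterogeneous model, summing to 1
    log_probs -- log probability of each model on the arc
    gamma_mpe -- MPE arc accumulator of the arc
    kappa     -- scaling constant
    C         -- smoothing control constant
    """
    # Unnormalized numerator of formula (1) for each model i.
    numerators = [kappa * gamma_mpe * e * lp + C * e
                  for e, lp in zip(eta, log_probs)]
    total = sum(numerators)
    # Dividing by the sum keeps the updated weights normalized to 1.
    return [n / total for n in numerators]


# Illustrative values only: spectral feature model first, tone model second.
eta = [0.5, 0.5]
log_probs = [-40.0, -10.0]
new_eta = update_weights(eta, log_probs, gamma_mpe=0.02, kappa=15.0, C=100.0)
```

With a positive accumulator, the model with the higher log probability on this arc gains weight, and a larger C keeps the new weights closer to the previous ones.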

The model probability weight smoothing module smooths between the four kinds of context-dependent model probability weight sets output by the discriminative model probability weight training module, in order to overcome the tendency of weight training to overfit. Specifically, as the number of parameters in the context-dependent weight sets grows, the recognition rate on the training set improves while the recognition rate on the test set falls. The smoothing module therefore interpolates between two or more of the four weight sets obtained by the discriminative training module to produce smoothed model weights:

η_{smooth} = ρ η_{FMD} + (1 - ρ) η_{MCD} \qquad (2)

where η_{smooth} is the smoothed weight, ρ is the smoothing factor, η_{FMD} is the final-model-dependent model weight, and η_{MCD} is the model-combination-dependent model weight.

The discriminative fusion speech recognition module uses the spectral feature model trained on spectral features to recognize the input data and generate a lattice; computes the spectral feature model and tone model probabilities for every arc in the lattice; selects, according to the acoustic/semantic context of each arc, the weights from the discriminative weight sets produced by the discriminative model probability weight training module, these weights having been smoothed by the model probability weight smoothing module; weights the spectral feature model and tone model scores to obtain the total acoustic score; and finally uses the Viterbi method to find the highest-probability path in the lattice as the output result.

In the discriminative fusion speech recognition module, the spectral feature model and tone model scores are weighted to obtain the total acoustic score as follows:

\log p (q) = η_{m, 1}^{'} [α \log p (O_{q}^{S} | θ_{q}^{S})] + η_{m, 2}^{'} [β \log p (O_{q}^{T} | θ_{q}^{T})] + \log p_{LM} + WP \qquad (3)

Wherein: \log p (q) is the combined score of the q-th arc in the lattice; \log p (O_{q}^{S} | θ_{q}^{S}) is the spectral feature log probability of the arc, O_{q}^{S} is the spectral feature observation sequence of the arc, and θ_{q}^{S} is the spectral feature model of the arc; \log p (O_{q}^{T} | θ_{q}^{T}) is the log probability that the tone model θ_{q}^{T} of the q-th arc produces the tone feature (sequence) O_{q}^{T}; α and β are the preset global spectral feature model and tone model weights; \log p_{LM} is the language model log probability; WP is the word penalty; α, β and WP are chosen empirically; η_{m, 1}^{'} and η_{m, 2}^{'} are the m-th group of model probability weights η_{m, i}^{'} in the smoothed weight set, with i = 1, 2.

Compared with the prior art, the invention has the following beneficial effects. It exploits the discriminative information among the heterogeneous models in Chinese speech recognition, training discriminative weights to obtain the best match between the models, and it uses context-dependent model probability weights to capture the phonetic and semantic contexts encountered during recognition. A significant improvement in recognition rate is obtained when integrating the Chinese tone model: results for tonal syllable output and Chinese character output show that the discriminative model weights give relative error rate reductions of 9.5% and 4.7%, respectively, over global model weights. Results on the large-vocabulary speech recognition system show that weight smoothing overcomes the overfitting caused by the increase in trainable weights and thus further improves recognition performance. The invention provides a key technology for bringing Chinese large-vocabulary continuous speech recognition into practical use.

Description of drawings

Fig. 1 is a system architecture diagram of the present invention.

Embodiment

The embodiment of the invention is described in detail below. This embodiment is implemented on the basis of the technical solution of the invention and gives a detailed implementation and a concrete operating procedure, but the scope of protection of the invention is not limited to the following embodiment.

This embodiment is described using a Chinese large-vocabulary speaker-independent speech recognition system with a 28,000-word vocabulary, under a tonal syllable output task and a Chinese character output task.

As shown in Fig. 1, this embodiment comprises: a model probability weight allocation module, a discriminative model probability weight training module, a model probability weight smoothing module and a discriminative fusion speech recognition module, wherein:

The model probability weight allocation module generates and initializes context-dependent model probability weight sets for the linguistic context of every arc of the lattice. In this embodiment the context comprises the tonal syllable type of the current syllable, the final model, the model combination and the word context;

Table 1 shows the context-dependent triphone modeling of the four-character Chinese word 星星点点 ("fragmentarily"), assuming silence before and after the word. The pronunciation of each Chinese character is one tonal syllable, and each tonal syllable is divided into two parts, an initial and a final. Depending on its context, each part is represented by a context-dependent triphone model. The four model-dependent weight allocation methods of the model probability weight allocation module are illustrated as follows:

For the character 星 ("star"), the tonal syllable is [xing1]. Under the tonal-syllable-dependent weight strategy, the first 星 and the second 星 belong to the same tonal syllable and are given the same pair of model weights.

The final models of the first and second 星 are [x-ing1+x] and [x-ing1+d], respectively. Under the final-model-dependent weight strategy, although the two characters are pronounced identically, they are modeled with different context-dependent models and are therefore given different model weights. This strategy models the initial and final types of the current syllable and the initial type of the following syllable at the same time;

The model combinations of the first and second 星 are [sil-x+ing1 x-ing1+x] and [ing1-x+ing1 x-ing1+d], respectively. These are two different triphone model combinations, and each is given its own pair of model weights under the model-combination-dependent strategy;

Under the word-dependent strategy, each tonal syllable within the word 星星点点 is given a pair of spectral feature model and tone model weights, so as to model the variation caused by tone coarticulation within a whole Chinese word.
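The four allocation strategies can be pictured as four different lookup keys for the same syllable. The sketch below is an assumption about one possible data layout, not the patent's implementation: the names `Syllable` and `weight_keys` are hypothetical, the word is identified by its pinyin string, the word-dependent key is assumed to include the position of the syllable in the word, and the initial triphone of the first syllable is written `sil-x+ing1` on the assumption that the preceding context is silence.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    tonal_syllable: str    # e.g. "xing1"
    initial_triphone: str  # e.g. "sil-x+ing1"
    final_triphone: str    # e.g. "x-ing1+x"
    word: str              # identifier of the whole word
    index_in_word: int     # position of the syllable inside the word


def weight_keys(s: Syllable) -> dict:
    """Return the key under which each of the four weight sets is indexed."""
    return {
        "tonal_syllable": s.tonal_syllable,
        "final_model": s.final_triphone,
        "model_combination": (s.initial_triphone, s.final_triphone),
        "word": (s.word, s.index_in_word, s.tonal_syllable),
    }


word = "xing1 xing1 dian3 dian3"
first = Syllable("xing1", "sil-x+ing1", "x-ing1+x", word, 0)
second = Syllable("xing1", "ing1-x+ing1", "x-ing1+d", word, 1)
k1, k2 = weight_keys(first), weight_keys(second)

# Same tonal syllable, so the two syllables share one weight pair there,
# while the three finer-grained strategies give them separate weights.
shared = k1["tonal_syllable"] == k2["tonal_syllable"]
```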

Table 1 context dependent model weight allocation example

The discriminative model probability weight training module uses the MPE criterion to obtain the model probability weights used when the heterogeneous models are fused during recognition. It comprises a forward-backward computation submodule, an MPE accumulator computation submodule and a model probability weight update submodule, wherein:

The forward-backward data computation submodule computes and outputs the forward-backward data for every arc q. The forward probability of each predecessor node is multiplied by the probability of the arc linking that predecessor node to the head node of q, and the products are summed, giving the forward probability P_{α}(q) of all paths from the lattice start node to the head node of the arc. The backward probability of each successor node is multiplied by the probability of the arc linking that successor node to the tail node of q, and the products are summed, giving the backward probability P_{β}(q) from the end node to the arc. The correctness of each predecessor node plus the correctness of the arc linking it to the current node, averaged with the arc posterior probabilities as weights, gives the forward correctness A_{α}(q) of all paths from the start node to the head node of the arc. The correctness of each successor node plus the correctness of the arc linking it to the current node, averaged with the arc posterior probabilities as weights, gives the backward correctness A_{β}(q) of all paths from the end node to the tail node of the arc.
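The probability part of this forward-backward pass can be sketched on a toy lattice. The sketch makes several assumptions that are not in the source: nodes are integers numbered in topological order, probabilities are kept in the linear domain rather than as scaled log scores, the arc's own probability is multiplied in explicitly when forming the arc posterior, and the correctness recursions A_{α}, A_{β} are omitted. The function name `forward_backward` is hypothetical.

```python
from collections import defaultdict


def forward_backward(arcs, start, end):
    """arcs: list of (src, dst, prob) with src < dst for every arc."""
    alpha = defaultdict(float)  # forward probability of each node
    beta = defaultdict(float)   # backward probability of each node
    alpha[start] = 1.0
    beta[end] = 1.0
    # Forward pass: arcs leaving low-numbered nodes are processed first,
    # so alpha[src] is complete before it is propagated.
    for src, dst, p in sorted(arcs, key=lambda a: a[0]):
        alpha[dst] += alpha[src] * p
    # Backward pass: arcs entering high-numbered nodes are processed first.
    for src, dst, p in sorted(arcs, key=lambda a: a[1], reverse=True):
        beta[src] += p * beta[dst]
    total = alpha[end]  # probability of all paths through the lattice
    posteriors = [alpha[s] * p * beta[d] / total for s, d, p in arcs]
    return alpha, beta, posteriors


# Toy lattice with two competing paths: 0-1-3 (0.6) and 0-2-3 (0.2).
arcs = [(0, 1, 0.6), (0, 2, 0.4), (1, 3, 1.0), (2, 3, 0.5)]
alpha, beta, posteriors = forward_backward(arcs, start=0, end=3)
```

The forward probability of the end node and the backward probability of the start node both equal the total path probability, which is the quantity used to normalize the arc posteriors.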

The MPE accumulator computation submodule uses the forward probability P_{α}(q) and the backward probability P_{β}(q) to compute the posterior probability γ_{q} of all paths containing the q-th arc: γ_{q} = P_{α}(q) P_{β}(q) / P(O), where P(O) is the total probability of all paths in the lattice, taken as the backward probability of the start node. It computes the average correctness c(q) of all sentence hypotheses containing the q-th arc: c(q) = A_{α}(q) + A_{β}(q) + Acc(q), where Acc(q) is the correctness obtained by comparing the q-th arc with the reference transcription. It computes the average correctness c_{avg} of all paths in the lattice from the backward correctness of the start node or the forward correctness of the end node. Then, according to the formula

γ_{q}^{MPE} = γ_{q} (c (q) - c_{avg}),

it computes the MPE arc accumulator γ_{q}^{MPE}.
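As a small worked sketch of the quantities γ_{q}, c(q) and γ_{q}^{MPE} defined above: the function name `mpe_arc_accumulator` and all numeric inputs are assumptions for illustration, and the forward-backward values are taken as given rather than computed from a lattice.

```python
def mpe_arc_accumulator(p_alpha, p_beta, p_total, a_alpha, a_beta, acc, c_avg):
    """Return (gamma_q, c_q, gamma_q_mpe) for one arc (hypothetical helper)."""
    gamma_q = p_alpha * p_beta / p_total  # arc posterior
    c_q = a_alpha + a_beta + acc          # average correctness of paths through the arc
    return gamma_q, c_q, gamma_q * (c_q - c_avg)


# An arc whose paths are more correct than the lattice average ...
g_good, c_good, mpe_good = mpe_arc_accumulator(
    p_alpha=0.6, p_beta=1.0, p_total=0.8,
    a_alpha=1.0, a_beta=0.0, acc=1.0, c_avg=1.5)

# ... and a competing arc whose paths are less correct than average.
g_bad, c_bad, mpe_bad = mpe_arc_accumulator(
    p_alpha=0.4, p_beta=0.5, p_total=0.8,
    a_alpha=0.0, a_beta=0.0, acc=0.0, c_avg=1.5)
```

The sign of the accumulator carries the discriminative information: arcs lying on better-than-average paths get a positive value and arcs on worse-than-average paths a negative one.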

The weight iterative update submodule takes γ_{q}^{MPE} from the MPE accumulator computation submodule as input and obtains the discriminatively trained model probability weights, as follows:

η_{m, i}^{'} = \frac{κ γ_{q}^{MPE} η_{m, i} \log (O_{i} | ξ_{i}) |_{η} + C η_{m, i}}{\underset{i}{Σ} (κ γ_{q}^{MPE} η_{m, i} \log (O_{i} | ξ_{i}) |_{η} + C η_{m, i})} \qquad (1)

\underset{i}{Σ} η_{m, i} = 1 .

κ is a scaling constant that reduces the dynamic range of the probabilities; \log (O_{i} | ξ_{i}) is the log probability of model ξ_{i} (spectral feature model or tone model), O_{i} is the observation of model ξ_{i} (spectral feature or tone feature), and C is an empirically chosen smoothing control constant, selected as follows:

C = E Σ_{i} | κ γ_{q}^{MPE} η_{m, i} \log (O_{i} | ξ_{i}) |_{η} |,

Wherein: E is an empirically chosen smoothing control constant. Part of the training data is set aside as validation data, and E is determined empirically by assessing the convergence speed of the MPE objective function on the validation data; E = 100 gives the best recognition result in this example. κ is set empirically to 15 in this example. The above steps are repeated to update the model weights until the objective function converges.
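The choice of C can be sketched as follows, using the values E = 100 and κ = 15 from this example. The function name `smoothing_constant`, the weights, the log probabilities and the accumulator value are assumptions for illustration, and the formula is applied to a single arc as written.

```python
def smoothing_constant(eta, log_probs, gamma_mpe, kappa, E):
    """C = E * sum_i |kappa * gamma_mpe * eta_i * log_prob_i| (hypothetical helper)."""
    return E * sum(abs(kappa * gamma_mpe * e * lp)
                   for e, lp in zip(eta, log_probs))


eta = [0.5, 0.5]
log_probs = [-40.0, -10.0]
gamma_mpe, kappa = 0.02, 15.0

C = smoothing_constant(eta, log_probs, gamma_mpe, kappa, E=100.0)

# With this C, every numerator of update (1) stays positive for these inputs,
# so the updated weights remain valid positive weights.
numerators = [kappa * gamma_mpe * e * lp + C * e
              for e, lp in zip(eta, log_probs)]
```

Because C scales with the magnitude of the discriminative term, a larger E makes each update more conservative, which is consistent with E being tuned through the convergence behaviour on validation data.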

The model probability weight smoothing module receives the four context-dependent weight sets produced by the discriminative model probability weight training module and smooths between the various context-dependent model probability weight sets to obtain smoothed model probability weight sets. For example, after the discriminative training module has obtained the final-model-dependent and model-combination-dependent weights, the weights for the two cases are added in proportion according to formula (2) to produce the smoothed model weights. The smoothing factor ρ is selected in the interval [0, 1] empirically from the objective function value on the validation data; ρ = 0.35 gives the best result in this embodiment. The smoothed weight sets are passed to the discriminative fusion recognition module.
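The interpolation η_{smooth} = ρ η_{FMD} + (1 - ρ) η_{MCD} with ρ = 0.35 can be sketched as below. The function name `smooth_weights` and the two example weight pairs are assumptions for illustration only.

```python
def smooth_weights(eta_fmd, eta_mcd, rho=0.35):
    """Interpolate two weight groups for the same arc context (hypothetical helper)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return [rho * a + (1.0 - rho) * b for a, b in zip(eta_fmd, eta_mcd)]


# Illustrative (spectral, tone) weight pairs from the two strategies.
eta_fmd = [0.6, 0.4]  # final-model-dependent weights
eta_mcd = [0.4, 0.6]  # model-combination-dependent weights
eta_smooth = smooth_weights(eta_fmd, eta_mcd)
```

Since both inputs sum to 1 and the interpolation is convex, the smoothed weights also sum to 1.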

The discriminative fusion speech recognition module weights the model scores to obtain the total acoustic score, where the global spectral feature model weight α, the global tone model weight β and the word penalty WP in the model integration formula (3) are selected on a validation set.

In this embodiment, for the tonal syllable output task, the global spectral feature model weight, tone model weight and word penalty are α = 1, β = 4.5 and WP = 35, respectively; for the Chinese character output task, α = 1, β = 2.2 and WP = 20;

The total score of every arc is computed from these weights with the model integration formula (3);
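Formula (3) can be written as a short function. The defaults below are the tonal syllable task settings from this embodiment (α = 1, β = 4.5, WP = 35); the function name `arc_score`, the weight pair and the three log probabilities are assumptions for illustration.

```python
def arc_score(eta_spec, eta_tone, logp_spec, logp_tone, logp_lm,
              alpha=1.0, beta=4.5, wp=35.0):
    """Combined log score of one lattice arc following formula (3)."""
    return (eta_spec * (alpha * logp_spec)
            + eta_tone * (beta * logp_tone)
            + logp_lm
            + wp)


# Illustrative arc: equal context-dependent weights for both models.
score = arc_score(eta_spec=0.5, eta_tone=0.5,
                  logp_spec=-200.0, logp_tone=-10.0, logp_lm=-5.0)
```

For the Chinese character output task the same function would be called with `beta=2.2` and `wp=20.0`.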

The Viterbi method is as follows. First, the forward probabilities are computed on the lattice of the test data, as in the forward-backward computation submodule of the discriminative model probability weight training module. Then, starting from the end node of the lattice, the arc most likely to lead to this node is found (that is, the arc for which the sum of the forward probability of its predecessor node and the model probability of the arc is largest). Next, the same computation is applied to the predecessor node of this arc, and so on until the start node of the lattice is reached. Finally, the path formed by all arcs visited in this process is output as the result.

The recognition results of the system under the tonal syllable output task and the Chinese character output task are given below.

Table 2 gives the tonal syllable output results for continuous speech. It first gives the results of integrating the tone model with only a global tone model weight, as in the traditional system, for different combinations of spectral feature model and tone model. In Table 2, the MSR (Microsoft Research) baseline uses a spectral feature model trained by maximum likelihood estimation, and MPE (minimum phone error) uses a spectral feature model trained discriminatively with the minimum phone error method. The results show that adding the tone model to continuous speech recognition decoding (with global model weights) significantly lowers the error rate: from 48.7% to 41.3% for the MSR baseline system and from 40.9% to 34.8% for the MPE system.
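The backtrace described above can be sketched on a toy lattice. Note the swap: this sketch uses the standard best-score (max) formulation of Viterbi in the log domain rather than summed forward probabilities. It also assumes that nodes are integers numbered in topological order and that each arc already carries its combined log score; the function name `best_path` and the arc labels are hypothetical.

```python
def best_path(arcs, start, end):
    """arcs: list of (src, dst, label, log_score) with src < dst."""
    best = {start: 0.0}  # best log score of any path reaching each node
    back = {}            # best incoming arc of each node
    # Arcs leaving low-numbered nodes first, so best[src] is already final.
    for arc in sorted(arcs, key=lambda a: a[0]):
        src, dst, _label, log_score = arc
        if src not in best:
            continue  # node not reachable from the start node
        candidate = best[src] + log_score
        if dst not in best or candidate > best[dst]:
            best[dst] = candidate
            back[dst] = arc
    # Trace back from the end node to the start node.
    labels, node = [], end
    while node != start:
        arc = back[node]
        labels.append(arc[2])
        node = arc[0]
    labels.reverse()
    return labels, best[end]


arcs = [
    (0, 1, "xing1", -3.0), (0, 1, "xing2", -4.0),
    (1, 2, "dian3", -2.0), (1, 2, "dian4", -2.5),
]
labels, total = best_path(arcs, start=0, end=2)
```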

Table 2 band tuning joint output continuous speech recognition result

The second half of Table 2 gives the recognition results with the discriminatively trained model weights of this system. The spectral feature model is in all cases the MPE spectral feature model. In the tonal syllable recognition task, only the first three weight sets are tested. Every weight strategy is initialized from the global model weights. As Table 2 shows, with the three weight sets (tonal-syllable-dependent, final-model-dependent, model-combination-dependent) the tonal syllable error rate drops from 34.8% with global model weights alone to 34.1%, 32.9% and 32.5%, respectively. This shows that the proposed discriminative model probability weight training module and model probability weight allocation module effectively improve the performance of the recognition system.

In the discriminative fusion speech recognition module, a model weight that was never trained is given the default global weight. The final-model-dependent weight set brings a 1.2% performance improvement over the tonal-syllable-dependent weight set, which shows that assigning weights according to the types of the surrounding initials yields a clear improvement in recognition rate. The model-combination-dependent strategy in turn gives a consistent reduction in error rate (0.4%) over the final-model-dependent strategy. With the model probability weight smoothing module, the tonal syllable error rate is 31.5%, a further improvement of nearly 1.0% over using the model-combination-dependent weights alone. This shows that the smoothed weight sets produced by the smoothing module reduce training overfitting and further improve the recognition results.

For Chinese character output, recognition proceeds as follows: the spectral feature model and the language model are first used to produce a word-based lattice, and tone model scores are then computed for every arc of the lattice and used in a second decoding pass to obtain the Chinese character output sequence. The language model is a bigram trained on a 50M-word Chinese text corpus from the People's Daily, January to June 1998, and the recognition dictionary contains 28,000 words. Table 3 gives the Chinese character output results. The MPE spectral feature model is trained on word lattices, starting from the maximum likelihood spectral feature model.

Table 3 continuous speech Chinese character output recognition result

The Chinese character output results show that for both the MSR baseline spectral feature model and the MPE spectral feature model, that is, for spectral feature models trained by maximum likelihood and by minimum phone error, adding the tone model with global model weights significantly lowers the character error rate, from 16.0% to 13.9% and from 14.8% to 12.9%, respectively. This shows that adding tone information effectively improves performance on the continuous speech character output task. Integrating the tone model with discriminatively trained model weights lowers the error rate on the character output task further. The word-dependent weight strategy gives the better recognition results, and smoothing between the final-model-dependent weights and the word-dependent weights gives a system character error rate of 12.3%, a 0.6% improvement over the global model weight method.

The results for tonal syllable output and Chinese character output show that the system obtains relative error rate reductions of 9.5% and 4.7%, respectively, over the traditional system. This demonstrates the effectiveness of the Chinese speech recognition system based on discriminative fusion of heterogeneous models.
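The two relative figures can be checked against the absolute error rates reported above, assuming the baselines are the global-weight results with the MPE spectral feature model (34.8% and 12.9%) and the final results are the smoothed-weight results (31.5% and 12.3%). The function name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, new):
    """Relative error rate reduction in percent."""
    return (baseline - new) / baseline * 100.0


syllable_gain = relative_reduction(34.8, 31.5)   # tonal syllable output task
character_gain = relative_reduction(12.9, 12.3)  # Chinese character output task
```

Rounded to one decimal place these give 9.5 and 4.7, matching the reported relative reductions.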

Claims

1, a kind of Chinese speech recognition system based on heterogeneous model differentiated fusion, it is characterized in that, comprise: the sound identification module of model probability weight allocation module, the property distinguished model probability weight training module, the level and smooth module of model probability weight and the property distinguished fusion, wherein:

2, according to the described Chinese speech recognition system of claim 1 based on heterogeneous model differentiated fusion, it is characterized in that, described model probability weight allocation module, produce weight sets according to lattice phonetics/semantic context of co-text, context of co-text comprises the sight of band tuning joint type, initial consonant model, rhythm pattern master and the Chinese character speech of current syllable, and model probability weight allocation module common property is given birth to four kinds of weight sets:

3. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the discriminative model probability weight training module comprises: a forward-backward data calculation submodule, a minimum phone error accumulator calculation submodule, and a model probability weight update submodule, wherein:

γ_{q}^{MPE} = γ_{q} (c (q) - c_{avg}),

where c_{avg} is the average correctness of all paths in the lattice;

η_{m, i}^{'} = \frac{κ γ_{q}^{MPE} η_{m, i} \log (O_{i} | ξ_{i}) |_{h} + C η_{m, i}}{\sum_{i} (κ γ_{q}^{MPE} η_{m, i} \log (O_{i} | ξ_{i}) |_{h} + C η_{m, i})}

\sum_{i} η_{m, i} = 1,

where κ is a scaling constant that reduces the dynamic range of the probabilities; \log (O_{i} | ξ_{i}) is the log probability of model ξ_{i}; O_{i} is the observation of model ξ_{i}; and C is an empirically chosen smoothing control constant. The discriminative model probability weight training module repeats the processes of the above three submodules, updating iteratively until the objective function converges, and outputs the final η_{m, i}^{'}.
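By way of non-limiting illustration, one iteration of the weight update in claim 3 may be sketched as follows for a single weight group m and a single arc q. The function names, the numeric values, and the restriction to one arc are assumptions made only for this example; a practical system would accumulate statistics over all arcs of all training lattices, and the sketch does not interpret the |_{h} notation of the update expression.

```python
def mpe_occupancy(gamma_q: float, c_q: float, c_avg: float) -> float:
    """gamma_q^MPE = gamma_q * (c(q) - c_avg)."""
    return gamma_q * (c_q - c_avg)


def update_weights(eta, log_probs, gamma_mpe, kappa, C):
    """One normalized update of the weights eta[i] of a single group m.

    eta       : current weights, one per heterogeneous model, summing to 1
    log_probs : log probability of each model's observation
    C must be large enough that every unnormalized term stays positive.
    """
    raw = [kappa * gamma_mpe * e * lp + C * e for e, lp in zip(eta, log_probs)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical values: two models (spectral feature model, tone model).
gamma_mpe = mpe_occupancy(gamma_q=0.2, c_q=0.9, c_avg=0.4)
new_eta = update_weights(
    eta=[0.5, 0.5], log_probs=[-2.0, -4.0], gamma_mpe=gamma_mpe, kappa=0.1, C=1.0
)
print(new_eta)
```

Because each term is divided by the sum over i, the updated weights again satisfy the constraint that they sum to 1.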

4. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the model probability weight smoothing module smooths among the four context-dependent model probability weight sets output by the discriminative model probability weight training module, in order to overcome the tendency of weight training to overfit, specifically: as the number of parameters of a context-dependent model probability weight set increases, the recognition rate on the training set improves while the recognition rate on the test set declines; the model probability weight smoothing module therefore interpolates between two or more of the four model probability weight sets obtained by the discriminative model probability weight training module to produce smoothed model weights, according to the expression η_{smooth} = ρ η_{FMD} + (1 - ρ) η_{MCD}, wherein: η_{smooth} is the weight obtained by smoothing, ρ is the smoothing factor, η_{FMD} is the final-model-dependent model weight, and η_{MCD} is the model-combination-dependent model weight.
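By way of non-limiting illustration, the interpolation of claim 4 may be sketched as follows. The function name `smooth_weights` and the numeric values are hypothetical, and the sketch assumes that the smoothing factor ρ lies between 0 and 1, which the claim does not state explicitly.

```python
def smooth_weights(eta_fmd, eta_mcd, rho):
    """eta_smooth = rho * eta_FMD + (1 - rho) * eta_MCD, applied element-wise."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho is assumed to lie in [0, 1]")
    return [rho * a + (1.0 - rho) * b for a, b in zip(eta_fmd, eta_mcd)]


# Hypothetical weight pairs (spectral feature model, tone model).
eta_smooth = smooth_weights(eta_fmd=[0.6, 0.4], eta_mcd=[0.5, 0.5], rho=0.5)
print(eta_smooth)
```

Under that assumption the result is a convex combination, so if each input weight set sums to 1 the smoothed set does as well.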

5. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the discriminative-fusion speech recognition module uses a spectral feature model trained on spectral features to recognize the data to be recognized and generate a lattice; computes the spectral feature model probability and the tone model probability for every arc in the lattice; selects, according to the acoustic/semantic context of each arc in the lattice, a weight from the discriminative weight sets produced by the discriminative model probability weight training module; has the selected weights smoothed by the model probability weight smoothing module; weights the spectral feature model score and the tone model score to obtain a total acoustic score; and finally finds the path with the highest probability in the lattice by the Viterbi method as the output result.
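By way of non-limiting illustration, the final search for the highest-scoring path may be sketched as a Viterbi-style dynamic program over a lattice. The lattice representation used here (integer node identifiers in topological order, arcs given as tuples with a precomputed combined log score) and the pinyin labels are assumptions made only for this example.

```python
def best_path(arcs, start, end):
    """Return (best log score, labels) of the best path from start to end.

    arcs: list of (src, dst, label, log_score); node ids are integers with
    src < dst, so visiting arcs in order of src is a topological order.
    """
    best = {start: (0.0, [])}
    for src, dst, label, score in sorted(arcs, key=lambda a: a[0]):
        if src not in best:
            continue  # source node not reachable from the start node
        cand = best[src][0] + score
        if dst not in best or cand > best[dst][0]:
            best[dst] = (cand, best[src][1] + [label])
    return best[end]


# Hypothetical three-node lattice with tonal syllable labels.
arcs = [
    (0, 1, "ma1", -3.0),
    (0, 1, "ma3", -2.5),
    (1, 2, "shang4", -1.0),
    (0, 2, "mang2", -4.0),
]
best = best_path(arcs, start=0, end=2)
print(best)  # (-3.5, ['ma3', 'shang4'])
```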

6. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the discriminative-fusion speech recognition module weights the spectral feature model score and the tone model score to obtain the total acoustic score, specifically:

\log p (q) = η_{m, 1}^{'} [α \log p (O_{q}^{S} | θ_{q}^{S})] + η_{m, 2}^{'} [β \log p (O_{q}^{T} | θ_{q}^{T})] + \log p_{LM} + WP

wherein: \log p (q) is the combined score of the q-th arc in the lattice; \log p (O_{q}^{S} | θ_{q}^{S}) is the spectral feature log probability of this arc, O_{q}^{S} is the spectral feature observation sequence corresponding to this arc, and θ_{q}^{S} is the spectral feature model corresponding to this arc; \log p (O_{q}^{T} | θ_{q}^{T}) is the log probability that the tone model θ_{q}^{T} of the q-th arc produces the tone feature (sequence) O_{q}^{T}; α and β are the preset global spectral feature model weight and tone model weight; \log p_{LM} is the language model log probability; WP is the word penalty; α, β and WP are chosen empirically; and η_{m, 1}^{'} and η_{m, 2}^{'} are the m-th group of model probability weights η_{m, i}^{'} in the smoothed weight set, where i = 1, 2.
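By way of non-limiting illustration, the combined arc score of claim 6 may be computed as in the sketch below. The function name, the parameter names, and all numeric values are hypothetical and chosen only to show how the terms combine.

```python
def arc_score(eta_spec, eta_tone, log_p_spec, log_p_tone,
              alpha, beta, log_p_lm, word_penalty):
    """log p(q) = eta'_{m,1} [alpha log p(O^S | theta^S)]
                + eta'_{m,2} [beta  log p(O^T | theta^T)]
                + log p_LM + WP
    """
    return (eta_spec * (alpha * log_p_spec)
            + eta_tone * (beta * log_p_tone)
            + log_p_lm
            + word_penalty)


# Hypothetical values for one arc.
score = arc_score(
    eta_spec=0.55, eta_tone=0.45,
    log_p_spec=-100.0, log_p_tone=-10.0,
    alpha=1.0, beta=2.0,
    log_p_lm=-5.0, word_penalty=-1.0,
)
print(score)
```

With these values the spectral term contributes about -55, the tone term about -9, and the language model and word penalty terms -6, for a total of about -70.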