CN101334998A - Chinese speech recognition system based on heterogeneous model differentiated fusion - Google Patents

Publication number
CN101334998A
Authority
CN
China
Prior art keywords
model
weight
probability
module
arc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100414660A
Other languages
Chinese (zh)
Inventor
朱杰
黄浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CNA2008100414660A priority Critical patent/CN101334998A/en
Publication of CN101334998A publication Critical patent/CN101334998A/en
Pending legal-status Critical Current

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a Chinese speech recognition system based on discriminative fusion of heterogeneous models, in the field of speech recognition technology. The system comprises a model probability weight allocation module, a discriminative model probability weight training module, a model probability weight smoothing module, and a discriminative-fusion speech recognition module. The weight allocation module generates and initializes context-dependent model probability weight sets for the linguistic context of every arc of a lattice. The discriminative weight training module trains on the outputs of the heterogeneous models under the minimum phone error (MPE) criterion to obtain MPE accumulated statistics, from which the discriminative model probability weight sets are derived. The weight smoothing module smooths the context-dependent model probability weight sets it receives. The discriminative-fusion speech recognition module produces the recognition output using the smoothed weight sets. The system reduces the speech recognition error rate in relative terms.

Description

Chinese speech recognition system based on heterogeneous model differentiated fusion
Technical field
The present invention relates to a system in the field of speech recognition technology, specifically a Chinese speech recognition system based on discriminative fusion of heterogeneous models.
Background technology
Large-vocabulary continuous speech recognition systems are increasingly moving toward the fusion of multiple modalities and multiple information sources, and using several heterogeneous models to reduce the confusability of a recognition system is an important means of improving recognition performance. Chinese speech recognition is a special case of using multiple heterogeneous models: one major difference between Chinese and English speech recognition is that Chinese is a tonal language. The national standard lists 6763 commonly used Chinese characters. The syllable is the natural unit of Chinese speech, and each Chinese character corresponds to one syllable. Standard Chinese has 1282 tonal syllables and 412 toneless syllables (syllables sharing the same initial-final combination, hereinafter called base syllables). Every syllable in Chinese therefore carries a tone, and standard Chinese has five tones in total: the high level tone, the rising tone, the falling-rising (third) tone, the falling tone, and the neutral tone. Syllables formed from the same initial and final but with different tones usually correspond to different characters, so tone plays an important role in word formation and in distinguishing meaning. The tone model thus offers an effective way to distinguish homophonous characters and words. In natural spoken language in particular, utterances are often ungrammatical, disfluent, or grammatically ambiguous, and in such cases the tone model can effectively reduce the perplexity of spoken language recognition.
In Chinese large-vocabulary continuous speech recognition, one way of using tone information to improve performance is as follows. Continuous speech is first modeled with hidden Markov models over spectral features, giving the spectral feature model, and a tone model is built from tone features. During recognition, the spectral feature model is first used to decode the speech and output a lattice. For every arc in the lattice, Viterbi alignment gives the start and end times of the voiced segment, and a tone score is computed for each voiced segment. On the basis of the lattice structure, the various models (spectral feature model, tone model) are then fused, and the error rate is reduced in a second decoding pass.
A search of the prior art literature finds Lei Xin et al., "Improved Tone Modeling for Mandarin Broadcast News Speech Recognition", International Conference on Speech and Language Processing, pp. 1277-1280, Sep. 2006, and Wang Huanliang et al., "Improved Mandarin Speech Recognition by Lattice Rescoring with Enhanced Tone Models", The 5th International Symposium on Chinese Spoken Language Processing, pp. 445-443, 2006. Both adopt heuristic methods: the global weights of the spectral feature model and the tone model used to fuse the heterogeneous models are chosen empirically or by search. Such methods usually cannot achieve the best continuous speech recognition performance, because the spectral feature model and the tone model are trained independently and do not match each other well during continuous speech recognition. In addition, global model weights cannot model specific phonetic or semantic contexts. Moreover, as the number of heterogeneous models grows, the search space grows exponentially, which makes manual selection harder.
Summary of the invention
The object of the invention is to overcome the deficiencies of existing systems by providing a Chinese speech recognition system based on discriminative fusion of heterogeneous models, so that in a recognition system where several kinds of models act together, the models match one another better and the best recognition result is reached.
The invention is achieved by the following technical solution. The invention comprises a model probability weight allocation module, a discriminative model probability weight training module, a model probability weight smoothing module, and a discriminative-fusion speech recognition module, wherein:
The model probability weight allocation module generates context-dependent model probability weight sets for the linguistic context of every arc of the lattice and initializes them;
The discriminative model probability weight training module receives the initialized model probability weight sets, computes the forward-backward data, discriminatively trains on the outputs of the heterogeneous models under the minimum phone error (MPE) criterion to obtain the MPE accumulated statistics, and derives the discriminative model probability weight sets from those statistics;
The model probability weight smoothing module smooths between the context-dependent model probability weight sets it receives to obtain the smoothed model probability weight sets;
The discriminative-fusion speech recognition module produces the speech recognition output using the smoothed weight sets.
The model probability weight allocation module generates weight sets according to the phonetic/semantic context in the lattice. The context comprises the tonal syllable type, the initial model, the final model, and the Chinese word context of the current syllable. The module generates four kinds of weight sets in total:
Tonal-syllable-dependent weight set: each tonal syllable is given a pair of model probability weights;
Final-model-dependent weight set: each distinct final triphone model is given a group of model probability weights;
Model-combination-dependent weight set: each initial-final triphone model combination is given a pair of model probability weights;
Word-dependent weight set: each tonal syllable of each character in each whole Chinese word is given a pair of model probability weights.
The discriminative model probability weight training module comprises a forward-backward data computation submodule, an MPE accumulated statistics computation submodule, and a model probability weight update submodule, wherein:
The forward-backward data computation submodule takes the initial weight sets as input and runs the forward-backward computation over the lattice. For every arc q it computes the forward probability $P_\alpha(q)$ of all paths from the start node to the head node of the arc and the backward probability $P_\beta(q)$ of all paths from the end node to the tail node of the arc. It also computes the average forward accuracy $A_\alpha(q)$ of all paths from the start node to the head node of the arc and the average backward accuracy $A_\beta(q)$ of all paths from the end node to the tail node of the arc;
The MPE accumulated statistics computation submodule uses the outputs $P_\alpha(q)$ and $P_\beta(q)$ of the forward-backward submodule to obtain the posterior probability $\gamma_q$ of passing through each arc q, uses $A_\alpha(q)$ and $A_\beta(q)$ to obtain the average correctness $c(q)$ of all paths passing through arc q, and from these obtains the MPE accumulated arc statistic $\gamma_q^{MPE} = \gamma_q \left( c(q) - c_{avg} \right)$, where $c_{avg}$ is the average correctness of all paths in the lattice;
The model probability weight update submodule uses the output $\gamma_q^{MPE}$ of the MPE accumulated statistics computation submodule to update the model probability weights iteratively, as follows:
$$\eta'_{m,i} = \frac{\kappa\,\gamma_q^{MPE}\,\eta_{m,i}\,\log p(O_i \mid \xi_i)\big|_{\eta} + C\,\eta_{m,i}}{\sum_i \left( \kappa\,\gamma_q^{MPE}\,\eta_{m,i}\,\log p(O_i \mid \xi_i)\big|_{\eta} + C\,\eta_{m,i} \right)} \qquad (1)$$
where $\eta'_{m,i}$ is the updated model probability weight, $\eta_{m,i}$ is the model probability weight from the previous iteration, i denotes the i-th heterogeneous model on arc q, and m denotes the m-th group of model probability weights to which arc q belongs. The weights satisfy $\eta_{m,i} > 0$ and $\sum_i \eta_{m,i} = 1$. $\kappa$ is a scaling constant that reduces the dynamic range of the probabilities; $\log p(O_i \mid \xi_i)$ is the log probability of model $\xi_i$ (spectral feature model or tone model); $O_i$ is the observation of model $\xi_i$ (spectral features or tone features); and C is an empirically chosen smoothing control constant. The discriminative model probability weight training module repeats the three submodule steps above, updating iteratively until the objective function converges, and outputs the final $\eta'_{m,i}$.
The model probability weight smoothing module smooths between the four kinds of context-dependent model probability weight sets output by the discriminative model probability weight training module, in order to overcome the tendency of weight training to overfit. Specifically, as the number of parameters in the context-dependent weight sets grows, the recognition rate on the training set improves while the recognition rate on the test set may instead fall. The smoothing module therefore interpolates between two or more of the four weight sets obtained by the training module to produce smoothed model weights:
$$\eta_{smooth} = \rho\,\eta_{FMD} + (1-\rho)\,\eta_{MCD} \qquad (2)$$
where $\eta_{smooth}$ is the smoothed weight, $\rho$ is the smoothing factor, $\eta_{FMD}$ is the final-model-dependent model weight, and $\eta_{MCD}$ is the model-combination-dependent model weight.
The discriminative-fusion speech recognition module uses the spectral feature model, trained on spectral features, to decode the data to be recognized and generate a lattice. For every arc in the lattice it computes the spectral feature model and tone model probabilities. According to the phonetic/semantic context of each arc, it selects the corresponding weights from the discriminative weight sets produced by the training module, after these have been smoothed by the model probability weight smoothing module, and weights the spectral feature model and tone model scores to obtain the total acoustic score. Finally, it uses the Viterbi method to find the highest-probability path in the lattice as the output.
The weighting of the spectral feature model and tone model scores into a total acoustic score is as follows:
$$\log p(q) = \eta'_{m,1} \left[ \alpha \log p(O_q^S \mid \theta_q^S) \right] + \eta'_{m,2} \left[ \beta \log p(O_q^T \mid \theta_q^T) \right] + \log p_{LM} + WP \qquad (3)$$
where $\log p(q)$ is the combined acoustic score of the q-th arc in the lattice; $\log p(O_q^S \mid \theta_q^S)$ is the spectral feature log probability of the arc, $O_q^S$ is the spectral feature observation sequence of the arc, and $\theta_q^S$ is the corresponding spectral feature model; $\log p(O_q^T \mid \theta_q^T)$ is the log probability that the tone model $\theta_q^T$ of the q-th arc generates the tone feature (sequence) $O_q^T$; $\alpha$ and $\beta$ are the preset global spectral feature model and tone model weights; $\log p_{LM}$ is the language model log probability; and WP is the word penalty. $\alpha$, $\beta$ and WP are chosen empirically. $\eta'_{m,1}$ and $\eta'_{m,2}$ are the m-th group of model probability weights $\eta'_{m,i}$, i = 1, 2, in the smoothed weight set.
Compared with the prior art, the invention has the following beneficial effects. Using the discriminative information available under heterogeneous models in Chinese speech recognition, the invention trains discriminative weights to obtain the best match between the multiple models, and uses context-dependent model probability weights to capture the phonetic and semantic context during recognition. A significant improvement in recognition rate is obtained when integrating the Chinese tone model: recognition results for both tonal syllable output and Chinese character output show that discriminative model weights give relative error rate reductions of 9.5% and 4.7%, respectively, over global model weights. Results on a large-vocabulary speech recognition system show that weight smoothing overcomes the overfitting caused by the growing number of trainable weights and thus further improves recognition performance. The invention provides a key technology for bringing Chinese large-vocabulary continuous speech recognition systems into practical use.
Description of drawings
Fig. 1 is a system architecture diagram of the present invention.
Embodiment
An embodiment of the invention is described in detail below. The embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation and concrete operating procedure are given, but the scope of protection of the invention is not limited to the following embodiment.
The embodiment is described for a speaker-independent Chinese large-vocabulary speech recognition system with a 28,000-word vocabulary, under tonal syllable output and Chinese character output.
As shown in Fig. 1, the embodiment comprises a model probability weight allocation module, a discriminative model probability weight training module, a model probability weight smoothing module, and a discriminative-fusion speech recognition module, wherein:
The model probability weight allocation module generates context-dependent model probability weight sets for the linguistic context of every arc of the lattice and initializes them;
The discriminative model probability weight training module receives the initialized model probability weight sets, computes the forward-backward data, discriminatively trains on the outputs of the heterogeneous models under the minimum phone error (MPE) criterion to obtain the MPE accumulated statistics, and derives the discriminative model probability weight sets from those statistics;
The model probability weight smoothing module smooths between the context-dependent model probability weight sets it receives to obtain the smoothed model probability weight sets;
The discriminative-fusion speech recognition module produces the speech recognition output using the smoothed weight sets.
The model probability weight allocation module generates and initializes a context-dependent model probability weight set for the linguistic context of every arc of the lattice. In this embodiment the context comprises the tonal syllable type, the final model, the model combination, and the Chinese word context of the current syllable;
Table 1 shows context-dependent triphone modeling of a four-character Chinese word (rendered "fragmentarily" in translation), assuming silence before and after the word. The pronunciation of each character is a tonal syllable, and each tonal syllable is split into an initial and a final. Depending on its context, each part is represented by a context-dependent triphone model. The four model-dependent weight allocation methods of the model probability weight allocation module are illustrated as follows:
For the character "star", the tonal syllable is [xing1]. Under the tonal-syllable-dependent weight strategy, the first "star" and the second "star" are the same tonal syllable and are given the same pair of model weights.
The final models of the first and second "star" are [x-ing1+x] and [x-ing1+d], respectively. Under the final-model-dependent weight strategy, although the two characters are pronounced identically, they are modeled with different context-dependent models and are therefore given different model weights. This strategy models the initial and final type of the current syllable and the initial type of the following syllable at the same time;
The model combinations of the first and second "star" are [sil-x+ing1 x-ing1+x] and [ing1-x+ing1 x-ing1+d], respectively. These are two different triphone model combinations, and under the model-combination-dependent strategy each is given its own pair of model weights;
Under the word-dependent strategy, each tonal syllable within the word "fragmentarily" is given a pair of spectral feature model and tone model weights, in order to model tone coarticulation within whole Chinese words.
Table 1. Example of context-dependent model weight allocation
Figure A20081004146600101
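The four allocation strategies can be illustrated with a minimal Python sketch that derives one lookup key per strategy for each syllable of a word. The function name `context_keys`, the key formats, and the word identifier are hypothetical and not part of the patent. The last two syllables in the usage example are placeholders, since the text only states that the syllable following the second [xing1] begins with [d]. Silence is assumed before and after the word, as in Table 1.

```python
from typing import Dict, List, Tuple


def context_keys(syllables: List[Tuple[str, str]], word_id: str) -> List[Dict[str, str]]:
    """Return one dict of weight-set keys per syllable.

    syllables: (initial, tonal_final) pairs, e.g. ("x", "ing1").
    word_id:   any identifier of the whole word (hypothetical).
    """
    keys = []
    for k, (initial, final) in enumerate(syllables):
        # Left context of the initial is the previous final, or silence.
        prev_final = syllables[k - 1][1] if k > 0 else "sil"
        # Right context of the final is the next initial, or silence.
        next_initial = syllables[k + 1][0] if k + 1 < len(syllables) else "sil"
        initial_triphone = f"{prev_final}-{initial}+{final}"
        final_triphone = f"{initial}-{final}+{next_initial}"
        keys.append({
            "tonal_syllable": initial + final,
            "final_model": final_triphone,
            "model_combination": f"{initial_triphone} {final_triphone}",
            "word": f"{word_id}#{k}:{initial}{final}",
        })
    return keys


# Two [xing1] syllables followed by assumed placeholder syllables starting with [d].
example = context_keys([("x", "ing1"), ("x", "ing1"), ("d", "ian3"), ("d", "ian3")], "word_A")
```

Both [xing1] syllables share the tonal-syllable key, while their final-model and model-combination keys differ, which is the distinction the examples above draw.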
The discriminative model probability weight training module uses the MPE criterion to obtain the model probability weights for fusing the heterogeneous models during recognition. It comprises a forward-backward computation submodule, an MPE accumulated statistics computation submodule, and a model probability weight update submodule, wherein:
The forward-backward data computation submodule computes the forward-backward data. For every arc q, the forward probabilities of all predecessor nodes are multiplied by the probabilities of the arcs linking each predecessor node to the head node of q and summed, giving the forward probability $P_\alpha(q)$ of all paths from the lattice start node to the head node of the arc. The backward probabilities of all successor nodes are multiplied by the probabilities of the arcs linking each successor node to the tail node of q and summed, giving the backward probability $P_\beta(q)$ from the end node to the arc. The correctness of each predecessor node plus the correctness of the arc linking it to the current node, averaged with the arc posterior probabilities as weights, gives the forward correctness $A_\alpha(q)$ of all paths from the start node to the head node of the arc. Likewise, the correctness of each successor node plus the correctness of the arc linking it to the current node, averaged with the arc posterior probabilities as weights, gives the backward correctness $A_\beta(q)$ of all paths from the end node to the tail node of the arc.
The MPE accumulated statistics computation submodule uses the forward probability $P_\alpha(q)$ and the backward probability $P_\beta(q)$ to compute the posterior probability of all paths containing the q-th arc, $\gamma_q = P_\alpha(q) P_\beta(q) / P(O)$, where $P(O)$ is the total probability of all paths in the lattice, taken as the backward probability $P_\beta$ of the start node. It computes the average correctness of all sentence hypotheses containing the q-th arc, $c(q) = A_\alpha(q) + A_\beta(q) + Acc(q)$, where $Acc(q)$ is the correctness of the q-th arc obtained by comparison with the reference transcription. It computes the average correctness $c_{avg}$ of all paths in the lattice from the backward correctness of the start node or the forward correctness of the end node. Finally, for every arc q it computes the MPE accumulated arc statistic $\gamma_q^{MPE} = \gamma_q \left( c(q) - c_{avg} \right)$.
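As a rough illustration of the forward-backward and MPE statistics steps, the following Python sketch computes the per-arc statistic on a toy lattice. It rests on several assumptions that the patent does not state: nodes are numbered in topological order with node 0 as start and the last node as end, every node lies on some start-to-end path, and probabilities are kept in the linear domain (a practical system would work with log probabilities). The names `Arc` and `mpe_arc_statistics` are hypothetical.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    start: int    # node the arc leaves
    end: int      # node the arc enters
    prob: float   # arc probability (linear domain, assumed)
    acc: float    # Acc(q): correctness of the arc against the reference


def mpe_arc_statistics(arcs: List[Arc], num_nodes: int) -> List[float]:
    """Return gamma_q * (c(q) - c_avg) for every arc, in input order."""
    # Forward pass: path probability mass and accuracy-weighted mass per node.
    alpha = [0.0] * num_nodes
    fsum = [0.0] * num_nodes
    alpha[0] = 1.0
    for a in sorted(arcs, key=lambda a: a.start):
        w = alpha[a.start] * a.prob
        alpha[a.end] += w
        fsum[a.end] += w * (fsum[a.start] / alpha[a.start] + a.acc)

    # Backward pass, mirrored from the end node.
    beta = [0.0] * num_nodes
    bsum = [0.0] * num_nodes
    beta[-1] = 1.0
    for a in sorted(arcs, key=lambda a: a.end, reverse=True):
        w = a.prob * beta[a.end]
        beta[a.start] += w
        bsum[a.start] += w * (a.acc + bsum[a.end] / beta[a.end])

    total = beta[0]            # P(O): backward probability of the start node
    c_avg = bsum[0] / beta[0]  # average correctness of all lattice paths
    stats = []
    for a in arcs:
        # Here alpha[a.start] * a.prob plays the role of P_alpha(q).
        gamma = alpha[a.start] * a.prob * beta[a.end] / total
        c_q = fsum[a.start] / alpha[a.start] + a.acc + bsum[a.end] / beta[a.end]
        stats.append(gamma * (c_q - c_avg))
    return stats


# Two competing first arcs (one correct, one wrong) followed by a shared arc.
toy = [Arc(0, 1, 0.6, 1.0), Arc(0, 1, 0.4, 0.0), Arc(1, 2, 1.0, 1.0)]
toy_stats = mpe_arc_statistics(toy, 3)
```

In the toy lattice the correct first arc receives a positive statistic, the wrong one a negative statistic of equal size, and the shared arc receives zero because every path passes through it.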
The weight iteration update submodule takes as input the $\gamma_q^{MPE}$ from the MPE accumulated statistics computation submodule and obtains the discriminatively trained model probability weights as follows:
$$\eta'_{m,i} = \frac{\kappa\,\gamma_q^{MPE}\,\eta_{m,i}\,\log p(O_i \mid \xi_i)\big|_{\eta} + C\,\eta_{m,i}}{\sum_i \left( \kappa\,\gamma_q^{MPE}\,\eta_{m,i}\,\log p(O_i \mid \xi_i)\big|_{\eta} + C\,\eta_{m,i} \right)} \qquad (1)$$
where $\eta'_{m,i}$ is the updated model probability weight, $\eta_{m,i}$ is the model probability weight from the previous iteration, i denotes the i-th heterogeneous model on arc q, and m denotes the m-th group of model probability weights to which arc q belongs. The weights satisfy $\eta_{m,i} > 0$ and $\sum_i \eta_{m,i} = 1$. $\kappa$ is a scaling constant that reduces the dynamic range of the probabilities; $\log p(O_i \mid \xi_i)$ is the log probability of model $\xi_i$ (spectral feature model or tone model); $O_i$ is the observation of model $\xi_i$ (spectral features or tone features); and C is the smoothing control constant, chosen as:
$$C = E \sum_i \left| \kappa\,\gamma_q^{MPE}\,\eta_{m,i}\,\log p(O_i \mid \xi_i)\big|_{\eta} \right|$$
where E is an empirically chosen smoothing control constant. Part of the training data is held out as validation data, and E is determined empirically by assessing the convergence speed of the MPE objective function on the validation data; in this example E = 100 gives the best recognition result. $\kappa$ is chosen empirically as 15 in this example. The above submodule steps are repeated to update the model weights until the objective function converges.
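A minimal Python sketch of one update step of formula (1), with C chosen as above, is given here for orientation. Formula (1) is written for a single arc q; summing the statistic over all arcs that share a weight group is an assumption of this sketch, not something the text states. The function names are hypothetical, and the constants are the values reported for this example.

```python
from typing import List, Sequence

KAPPA = 15.0      # scaling constant reported in this example
E_CONST = 100.0   # smoothing control constant E reported in this example


def group_statistic(gamma_mpe: Sequence[float], log_probs: Sequence[float],
                    kappa: float = KAPPA) -> float:
    """kappa * sum over arcs of gamma_q^MPE * log p(O_i | xi_i) for one model i.

    Accumulating over the arcs of a weight group is an assumption.
    """
    return kappa * sum(g * lp for g, lp in zip(gamma_mpe, log_probs))


def update_weights(eta: Sequence[float], stats: Sequence[float],
                   e_const: float = E_CONST) -> List[float]:
    """One iteration of formula (1) for a single weight group."""
    terms = [w * s for w, s in zip(eta, stats)]
    c = e_const * sum(abs(t) for t in terms)        # smoothing constant C
    unnormalised = [t + c * w for t, w in zip(terms, eta)]
    total = sum(unnormalised)
    if total == 0.0:                                # no evidence: keep weights
        return list(eta)
    return [u / total for u in unnormalised]        # renormalise to sum to 1


# Toy statistics: the spectral model is favoured, the tone model penalised.
gamma_mpe = [0.24, -0.24, 0.0]
stats = [
    group_statistic(gamma_mpe, [-10.0, -12.0, -8.0]),  # spectral feature model
    group_statistic(gamma_mpe, [-3.0, -1.0, -2.0]),    # tone model
]
new_eta = update_weights([0.5, 0.5], stats)
```

With E = 100 each iteration moves the weights only slightly (here from 0.5 to about 0.505 and 0.495), which is consistent with C acting as a step-size control.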
The model probability weight smoothing module receives the four kinds of context-dependent weight sets output by the discriminative model probability weight training module and smooths between them to obtain smoothed model probability weight sets. For example, once the training module has obtained the final-model-dependent weights and the model-combination-dependent weights, the two are added in proportion according to formula (2) to produce smoothed model weights. The smoothing factor ρ is selected empirically in the interval [0, 1] according to the objective function on the validation data; ρ = 0.35 gives the best result in this embodiment. The smoothed weight sets are output to the discriminative-fusion recognition module.
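The interpolation of formula (2) can be sketched in a few lines of Python. The function name `smooth_weights` and the example weight values are hypothetical; only ρ = 0.35 comes from the embodiment.

```python
from typing import List, Sequence


def smooth_weights(eta_fmd: Sequence[float], eta_mcd: Sequence[float],
                   rho: float = 0.35) -> List[float]:
    """Interpolate final-model-dependent and model-combination-dependent weights."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return [rho * f + (1.0 - rho) * m for f, m in zip(eta_fmd, eta_mcd)]


# Hypothetical (spectral, tone) weight pairs for one arc context.
smoothed = smooth_weights([0.6, 0.4], [0.4, 0.6])
```

Because each input pair sums to 1, the interpolated pair also sums to 1, so no renormalisation is needed after smoothing.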
The discriminative-fusion speech recognition module uses the spectral feature model, trained on spectral features, to decode the data to be recognized and generate a lattice. For every arc in the lattice it computes the spectral feature model and tone model probabilities. According to the phonetic/semantic context of each arc, it selects the corresponding weights from the discriminative weight sets produced by the training module, after these have been smoothed by the model probability weight smoothing module, and weights the spectral feature model and tone model scores to obtain the total acoustic score. Finally, it uses the Viterbi method to find the highest-probability path in the lattice as the output.
In weighting to obtain the total acoustic score, the global spectral feature model probability weight α, the tone model probability weight β, and the word penalty WP in the model integration formula (3) are selected on a validation set.
In this embodiment, for the tonal syllable output task, the global spectral feature model probability weight, tone model probability weight, and word penalty are chosen as α = 1, β = 4.5, WP = 35; for the Chinese character output task, α = 1, β = 2.2, WP = 20;
The total score of every arc is computed from these weights using the model integration formula (3);
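A minimal Python sketch of the arc score in formula (3) follows. The function name `arc_score` and the example log probabilities are hypothetical; the default α, β and WP are the values given for the tonal syllable output task, and WP is added with the sign shown in formula (3).

```python
from typing import Sequence


def arc_score(logp_spectral: float, logp_tone: float, logp_lm: float,
              eta: Sequence[float],
              alpha: float = 1.0, beta: float = 4.5, wp: float = 35.0) -> float:
    """Combined score of one lattice arc following formula (3).

    eta holds the smoothed context-dependent pair (eta'_{m,1}, eta'_{m,2}).
    """
    return (eta[0] * alpha * logp_spectral
            + eta[1] * beta * logp_tone
            + logp_lm
            + wp)


# Hypothetical arc: spectral log prob -100, tone log prob -10, LM log prob -5.
score = arc_score(-100.0, -10.0, -5.0, eta=(0.5, 0.5))
```

For the Chinese character output task the same function would be called with `beta=2.2` and `wp=20.0`.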
The Viterbi method is as follows. First, the forward probabilities are computed on the lattice of the test data, as in the forward-backward computation submodule of the discriminative model probability weight training module. Then, starting from the end node of the lattice, the arc most likely to lead to this node is found (that is, the arc for which the sum of the forward probability of its predecessor node and the model probability of the arc is largest). Next, the same computation is applied to the predecessor node of that arc, and so on until the start node of the lattice is reached. Finally, the path formed by all arcs visited in this process is taken as the output.
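The search can be sketched in Python as a best-path pass over the lattice. This sketch uses the standard max recursion with back-pointers in the log domain, which is an assumption about what the description intends; it also assumes nodes are numbered in topological order with node 0 as start and the last node as end. The function name `best_path` and the arc tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

# (start_node, end_node, log_score, label); scores would come from formula (3).
LatticeArc = Tuple[int, int, float, str]


def best_path(arcs: List[LatticeArc], num_nodes: int) -> Tuple[List[str], float]:
    """Return the labels and total score of the highest-scoring path."""
    neg_inf = float("-inf")
    best = [neg_inf] * num_nodes
    back: List[Optional[int]] = [None] * num_nodes
    best[0] = 0.0
    # Visiting arcs by start node guarantees best[start] is final when used.
    for idx, (s, e, score, _label) in sorted(enumerate(arcs), key=lambda t: t[1][0]):
        if best[s] == neg_inf:
            continue
        candidate = best[s] + score
        if candidate > best[e]:
            best[e] = candidate
            back[e] = idx
    # Trace back from the end node to the start node.
    labels: List[str] = []
    node = num_nodes - 1
    while node != 0:
        idx = back[node]
        if idx is None:
            raise ValueError("end node is not reachable from the start node")
        labels.append(arcs[idx][3])
        node = arcs[idx][0]
    labels.reverse()
    return labels, best[num_nodes - 1]


toy_arcs = [(0, 1, -1.0, "a"), (0, 1, -2.0, "b"), (1, 2, -0.5, "c"), (0, 2, -3.0, "d")]
path, path_score = best_path(toy_arcs, 3)
```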
The concrete recognition results of the system under the tonal syllable output and Chinese character output tasks are given below.
Table 2 gives the continuous speech recognition results for tonal syllable output. It first gives the results of a conventional system that integrates the tone model using only a global tone model weight, for different combinations of spectral feature model and tone model. In Table 2, the MSR (Microsoft Research) baseline uses a spectral feature model trained by maximum likelihood estimation, and MPE (minimum phone error) uses a spectral feature model trained discriminatively with the minimum phone error method. The results show that adding the tone model to the continuous speech recognition decoding process (with global model weights) lowers the error rate markedly: from 48.7% to 41.3% for the MSR baseline system, and from 40.9% to 34.8% for the MPE system.
Table 2. Continuous speech recognition results for tonal syllable output
Figure A20081004146600131
Table 2 latter half provides the recognition result that uses the training of the native system property distinguished model weight.The spectrum signature model all adopts minimum phone error spectrum characteristic model.In band tuning joint identification mission, only test first three and plant weight sets.The equal initialization of initial value of each model weight strategy is from world model's weight, adopts three kinds of weight sets (band tuning joint is relevant, the rhythm pattern master is relevant, model combination relevant) band tuning joint misclassification rate significantly to be reduced to 34.1%, 32.9% and 32.5% respectively from only using 34.8% of world model's weight as can be seen from Table 1.This shows the differentiation model probability weight training module of system's proposition and the performance that model probability weight allocation module can effectively improve recognition system.
In the sound identification module of the property distinguished fusion, run into the model weight of not training and then give default overall weight.Rhythm pattern master associated weight collection has brought 1.2% improvement in performance with tuning joint associated weight collection, and system assigns weight according to the influence of front and back initial consonant type and can obtain tangible discrimination and improve.On the other hand, the model weight that adopts model combination corresponding strategies to adopt the simple or compound vowel of a Chinese syllable corresponding strategies to obtain has obtained misclassification rate consistance decline (0.4%).By the level and smooth module of model probability weight, band tuning joint misclassification rate is 31.5%, and single model combination associated weight that adopts obtains nearly 1.0% further performance boost.Show that the level and smooth weight sets that the level and smooth module of model probability weight obtains can reduce the phenomenon of training over-fitting and further improve the system identification result.
For Chinese character output recognition, the spectral feature model and the language model are first used to produce a lattice based on Chinese words; the tone model score is then computed for every arc of the lattice, and a second decoding pass yields the Chinese character output sequence. The language model is a bigram language model trained on a 50M-word Chinese text corpus from the People's Daily, January to June 1998, and the recognition dictionary contains 28,000 words in total. Table 3 gives the recognition results for Chinese character output. The MPE spectral feature model is obtained by training on the Chinese word lattices, starting from the maximum likelihood spectral feature model.
Table 3. Chinese character output recognition results for continuous speech
Figure A20081004146600141
The Chinese character output results show that, for both the Microsoft baseline spectral feature model (MSR baseline, maximum likelihood training) and the MPE spectral feature model (minimum phone error training), adding the tone model with global model weights significantly reduces the character error rate, from 16.0% and 14.8% to 13.9% and 12.9%, respectively. This shows that adding tone information effectively improves recognition performance in the continuous speech character output task. Integrating the tone model with discriminatively trained model weights significantly reduces the error rate in the Chinese character output task, and the word-dependent weight strategy gives the better recognition results. Smoothing between the final-model-dependent weights and the word-dependent weights gives a system character error rate of 12.3%, a 0.6% improvement over the global model weight method.
The recognition results for both tonal syllable output and Chinese character output show that the present system achieves relative error rate reductions of 9.5% and 4.7%, respectively, compared with the conventional system. This demonstrates that the Chinese speech recognition system based on heterogeneous model differentiated fusion obtains the best recognition performance.
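The two relative figures can be checked against the absolute error rates quoted above. The short calculation below assumes that the baselines are the global-weight systems (34.8% for tonal syllables, 12.9% for Chinese characters) and that the final results are the smoothed systems (31.5% and 12.3%); the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error rate reduction in percent."""
    return (baseline - improved) / baseline * 100.0


# Absolute error rates (percent) taken from the results discussed above.
syllable = relative_reduction(34.8, 31.5)
character = relative_reduction(12.9, 12.3)

print(f"tonal syllable: {syllable:.1f}%  Chinese character: {character:.1f}%")
```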

Claims (6)

1. A Chinese speech recognition system based on heterogeneous model differentiated fusion, characterized in that it comprises a model probability weight allocation module, a discriminative model probability weight training module, a model probability weight smoothing module and a discriminative-fusion speech recognition module, wherein:
The model probability weight allocation module generates context-dependent model probability weight sets for the linguistic context of every arc of the lattice and initializes them;
The discriminative model probability weight training module receives the initialized model probability weight sets, produces the forward-backward data, discriminatively trains on the outputs of the heterogeneous models using the minimum phone error (MPE) criterion to obtain the MPE accumulated statistics, and obtains the discriminative model probability weight sets from these accumulated statistics;
The model probability weight smoothing module performs smoothing between the input context-dependent model probability weight sets to obtain the smoothed model probability weight sets;
The discriminative-fusion speech recognition module uses the smoothed weight sets to perform speech recognition and output the result.
2. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the model probability weight allocation module generates weight sets according to the phonetic/semantic context of the lattice, the context comprising the tonal syllable type of the current syllable, the initial model, the final model and the Chinese word context, and the model probability weight allocation module generates four kinds of weight sets in total:
A tonal-syllable-dependent weight set, which assigns a pair of model probability weights to each tonal syllable;
A final-model-dependent weight set, which assigns a group of model probability weights to each distinct final triphone model;
A model-combination-dependent weight set, which assigns a pair of model probability weights to each initial-final triphone model combination;
A word-dependent weight set, which assigns a pair of model probability weights to each tonal syllable corresponding to each character in each whole Chinese word.
3. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the discriminative model probability weight training module comprises a forward-backward data computation submodule, an MPE accumulated statistics computation submodule and a model probability weight update submodule, wherein:
The forward-backward data computation submodule takes the initial weight sets as input and performs the forward-backward computation over the lattice. For every arc $q$ it computes: the forward probability $P_\alpha(q)$ of all paths from the start node to the head node of the arc; the backward probability $P_\beta(q)$ of all paths from the end node to the tail node of the arc; the average forward accuracy $A_\alpha(q)$ of all paths from the start node to the head node of the arc; and the average backward accuracy $A_\beta(q)$ of all paths from the end node to the tail node of the arc;
The MPE accumulated statistics computation submodule uses the outputs $P_\alpha(q)$ and $P_\beta(q)$ of the forward-backward data computation submodule to obtain the posterior probability $\gamma_q$ of passing through arc $q$, uses $A_\alpha(q)$ and $A_\beta(q)$ to obtain the average correctness $c(q)$ of all paths passing through arc $q$, and from these obtains the MPE accumulated arc statistic $\gamma_q^{MPE}$:
$$\gamma_q^{MPE} = \gamma_q \left( c(q) - c_{avg} \right)$$
where $c_{avg}$ is the average correctness of all paths in the lattice;
The model probability weight update submodule uses the output $\gamma_q^{MPE}$ of the MPE accumulated statistics computation submodule to update the model probability weights iteratively, as follows:
$$\eta'_{m,i} = \frac{\kappa \gamma_q^{MPE} \eta_{m,i} \log p(O_i \mid \xi_i)\big|_h + C \eta_{m,i}}{\sum_i \left( \kappa \gamma_q^{MPE} \eta_{m,i} \log p(O_i \mid \xi_i)\big|_h + C \eta_{m,i} \right)}$$
where $\eta'_{m,i}$ is the updated model probability weight, $\eta_{m,i}$ is the model probability weight of the previous iteration, $i$ denotes the $i$-th heterogeneous model belonging to arc $q$, and $m$ denotes the $m$-th group of model probability weights to which arc $q$ belongs, subject to the conditions $\eta_{m,i} > 0$ and $\sum_i \eta_{m,i} = 1$; $\kappa$ is a balancing constant that reduces the dynamic range of the probabilities; $\log p(O_i \mid \xi_i)$ is the log probability of model $\xi_i$, $O_i$ is the observation of model $\xi_i$, and $C$ is an empirically chosen smoothing control constant. The discriminative model probability weight training module repeats the processing of the three submodules above, updating iteratively until the objective function converges, and outputs the final $\eta'_{m,i}$.
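The MPE arc statistic and the weight update of claim 3 can be sketched for a single arc as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, the statistic of one arc is used exactly as the formula is written, the evaluation marker $|_h$ is not modelled, and the positivity check reflects the stated condition that the weights remain positive. All numeric values are invented for the example.

```python
from typing import List


def mpe_arc_statistic(gamma_q: float, c_q: float, c_avg: float) -> float:
    """gamma_q^MPE = gamma_q * (c(q) - c_avg)."""
    return gamma_q * (c_q - c_avg)


def update_weights(
    eta: List[float],        # eta_{m,i}: weights of the previous iteration
    log_probs: List[float],  # log p(O_i | xi_i) for each heterogeneous model i
    gamma_mpe: float,        # gamma_q^MPE of the arc
    kappa: float,            # balancing constant
    C: float,                # smoothing control constant
) -> List[float]:
    """One update step: numerator per model, normalized to sum to one."""
    numer = [kappa * gamma_mpe * e * lp + C * e for e, lp in zip(eta, log_probs)]
    if any(n <= 0.0 for n in numer):
        # Assumption: C must be large enough to keep every weight positive.
        raise ValueError("C is too small to keep all weights positive")
    total = sum(numer)
    return [n / total for n in numer]


# Illustrative values: two models (spectral feature model, tone model).
g = mpe_arc_statistic(gamma_q=0.4, c_q=0.9, c_avg=0.7)
new_eta = update_weights([0.5, 0.5], [-10.0, -20.0], g, kappa=0.1, C=2.0)
print(g, new_eta)
```

In this example the arc is more correct than the lattice average, so the model with the higher log probability on that arc ends up with the larger weight after normalization.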
4. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the model probability weight smoothing module performs smoothing between the four kinds of context-dependent model probability weight sets output by the discriminative model probability weight training module, in order to overcome the over-fitting to which weight training is prone. Specifically, as the number of parameters in the context-dependent model probability weight sets grows, the recognition rate on the training set improves while the recognition rate on the test set declines. The model probability weight smoothing module therefore interpolates between two or more of the four model probability weight sets obtained by the discriminative model probability weight training module to produce smoothed model weights, according to the expression
$$\eta_{smooth} = \rho\,\eta_{FMD} + (1 - \rho)\,\eta_{MCD}$$
where $\eta_{smooth}$ is the weight obtained by smoothing, $\rho$ is the smoothing factor, $\eta_{FMD}$ is the final-model-dependent model weight, and $\eta_{MCD}$ is the model-combination-dependent model weight.
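The interpolation of claim 4 amounts to a convex combination of two weight vectors. The sketch below is illustrative only: the function name `smooth_weights`, the restriction of the smoothing factor to the interval [0, 1] and the numeric values are assumptions.

```python
from typing import List


def smooth_weights(eta_fmd: List[float], eta_mcd: List[float], rho: float) -> List[float]:
    """eta_smooth = rho * eta_FMD + (1 - rho) * eta_MCD, element by element."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho is assumed to lie in [0, 1]")
    return [rho * a + (1.0 - rho) * b for a, b in zip(eta_fmd, eta_mcd)]


# Illustrative weight pairs (spectral feature model, tone model).
eta_smooth = smooth_weights([0.6, 0.4], [0.7, 0.3], rho=0.5)
print(eta_smooth)
```

Because both inputs sum to one and the combination is convex, the smoothed weights also sum to one.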
5. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the discriminative-fusion speech recognition module uses the spectral feature model trained on spectral features to recognize the input data and generate a lattice, computes the spectral feature model and tone model probabilities for every arc in the lattice, selects, according to the acoustic/semantic context of every arc in the lattice, the corresponding weights from the discriminative weight sets produced by the discriminative model probability weight training module, has these weights smoothed by the model probability weight smoothing module, weights the spectral feature model and tone model scores to obtain the total acoustic score, and finally finds the path with the highest probability in the lattice by the Viterbi method as the output result.
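The final Viterbi step of claim 5 can be illustrated with a best-path search over a small lattice. This is a simplified sketch, not the decoder of the patent: the lattice is assumed to be a directed acyclic graph whose node numbers increase along every arc, each arc already carries its total weighted log score, and the function name `best_path` and the example arcs are hypothetical.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str, float]  # (source node, target node, label, log score)


def best_path(arcs: List[Arc], start: int, end: int) -> Tuple[float, List[str]]:
    """Return the highest-scoring path from start to end as (score, labels)."""
    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    # Processing arcs by increasing source node is a valid topological order
    # under the assumption that every arc goes from a lower to a higher node.
    for src, dst, label, score in sorted(arcs, key=lambda a: a[0]):
        if src not in best:
            continue  # source node not reachable from the start node
        candidate = best[src][0] + score
        if dst not in best or candidate > best[dst][0]:
            best[dst] = (candidate, best[src][1] + [label])
    return best[end]  # raises KeyError if the end node is unreachable


# Illustrative lattice with competing tonal syllable hypotheses.
lattice = [
    (0, 1, "ma1", -4.0),
    (0, 1, "ma3", -3.5),
    (1, 2, "shang4", -2.0),
    (0, 2, "ma3shang4", -6.0),
]
score, labels = best_path(lattice, start=0, end=2)
print(score, labels)
```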
6. The Chinese speech recognition system based on heterogeneous model differentiated fusion according to claim 1, characterized in that the discriminative-fusion speech recognition module weights the spectral feature model and tone model scores to obtain the total acoustic score, specifically:
$$\log p(q) = \eta'_{m,1} \left[ \alpha \log p(O_q^S \mid \theta_q^S) \right] + \eta'_{m,2} \left[ \beta \log p(O_q^T \mid \theta_q^T) \right] + \log p_{LM} + WP$$
where $\log p(q)$ is the combined acoustic score of the $q$-th arc in the lattice; $\log p(O_q^S \mid \theta_q^S)$ is the spectral feature log probability of this arc, $O_q^S$ is the spectral feature observation sequence corresponding to this arc, and $\theta_q^S$ is the spectral feature model corresponding to this arc; $\log p(O_q^T \mid \theta_q^T)$ is the log probability that the tone model $\theta_q^T$ of the $q$-th arc produces the tone feature (sequence) $O_q^T$; $\alpha$ and $\beta$ are the predefined global spectral feature model and tone model weights; $\log p_{LM}$ is the language model log probability; $WP$ is the word penalty; $\alpha$, $\beta$ and $WP$ are chosen empirically; and $\eta'_{m,1}$ and $\eta'_{m,2}$ are the $m$-th group of model probability weights $\eta'_{m,i}$, $i = 1, 2$, in the smoothed weight set.
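The combined arc score of claim 6 is a direct weighted sum and can be written out as below. The function name `arc_log_score` and all numeric values are hypothetical and serve only to show how the terms combine.

```python
def arc_log_score(
    eta1: float,       # eta'_{m,1}: context-dependent spectral model weight
    eta2: float,       # eta'_{m,2}: context-dependent tone model weight
    alpha: float,      # global spectral feature model weight
    beta: float,       # global tone model weight
    logp_spec: float,  # log p(O_q^S | theta_q^S)
    logp_tone: float,  # log p(O_q^T | theta_q^T)
    logp_lm: float,    # language model log probability
    wp: float,         # word penalty
) -> float:
    """log p(q) = eta1*[alpha*logp_spec] + eta2*[beta*logp_tone] + logp_lm + WP."""
    return eta1 * (alpha * logp_spec) + eta2 * (beta * logp_tone) + logp_lm + wp


# Illustrative values only.
total = arc_log_score(0.6, 0.4, alpha=1.0, beta=2.0,
                      logp_spec=-50.0, logp_tone=-5.0, logp_lm=-3.0, wp=-1.0)
print(total)
```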
CNA2008100414660A 2008-08-07 2008-08-07 Chinese speech recognition system based on heterogeneous model differentiated fusion Pending CN101334998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100414660A CN101334998A (en) 2008-08-07 2008-08-07 Chinese speech recognition system based on heterogeneous model differentiated fusion

Publications (1)

Publication Number Publication Date
CN101334998A true CN101334998A (en) 2008-12-31

Family

ID=40197555

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100414660A Pending CN101334998A (en) 2008-08-07 2008-08-07 Chinese speech recognition system based on heterogeneous model differentiated fusion

Country Status (1)

Country Link
CN (1) CN101334998A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996629B (en) * 2009-08-21 2012-10-03 通用汽车有限责任公司 Method of recognizing speech
CN101996629A (en) * 2009-08-21 2011-03-30 通用汽车有限责任公司 Method of recognizing speech
CN102968989A (en) * 2012-12-10 2013-03-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition
CN102968989B (en) * 2012-12-10 2014-08-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition
WO2014101826A1 (en) * 2012-12-28 2014-07-03 安徽科大讯飞信息科技股份有限公司 Method and system for improving accuracy of voice recognition
CN106384587B (en) * 2015-07-24 2019-11-15 科大讯飞股份有限公司 A kind of audio recognition method and system
CN106384587A (en) * 2015-07-24 2017-02-08 科大讯飞股份有限公司 Voice recognition method and system thereof
CN107093422A (en) * 2017-01-10 2017-08-25 上海优同科技有限公司 A kind of audio recognition method and speech recognition system
CN108630197A (en) * 2017-03-23 2018-10-09 三星电子株式会社 Training method and equipment for speech recognition
CN108630197B (en) * 2017-03-23 2023-10-31 三星电子株式会社 Training method and device for speech recognition
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN107123417B (en) * 2017-05-16 2020-06-09 上海交通大学 Customized voice awakening optimization method and system based on discriminant training
CN110364162A (en) * 2018-11-15 2019-10-22 腾讯科技(深圳)有限公司 A kind of remapping method and device, storage medium of artificial intelligence
CN110070857A (en) * 2019-04-25 2019-07-30 北京梧桐车联科技有限责任公司 The model parameter method of adjustment and device, speech ciphering equipment of voice wake-up model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20081231