CN1296887C - Training method for embedded automatic sound identification system - Google Patents

Training method for embedded automatic sound identification system

Info

Publication number
CN1296887C
CN1296887C CNB2004100667948A CN200410066794A
Authority
CN
China
Prior art keywords
training
template
distinctiveness
voice
section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004100667948A
Other languages
Chinese (zh)
Other versions
CN1588538A (en)
Inventor
朱杰 (Zhu Jie)
蔡铁 (Cai Tie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CNB2004100667948A priority Critical patent/CN1296887C/en
Publication of CN1588538A publication Critical patent/CN1588538A/en
Application granted granted Critical
Publication of CN1296887C publication Critical patent/CN1296887C/en

Abstract

The present invention relates to a training method for an embedded automatic speech recognition system in the technical field of intelligent information processing. The training method comprises two steps: improved multi-section vector quantization template training, and generalized probabilistic descent discriminative training. In the first step, the dynamic time warping method is used to divide training utterances belonging to the same class into several speech sections, so that the most correlated speech frames are aggregated in one section; in accordance with the temporal structure of speech, the statistical properties of each section, and the syllable structure of Chinese, the total number of template sections is set according to the number of syllables in the command word to be recognized. In the second step, combined with the multi-section vector quantization speech template, a generalized probabilistic descent discriminative training algorithm is embedded in a recognizer based on the dynamic time warping method; the distance between each training utterance and a reference template serves as the discriminant function, and the reference template set is discriminatively trained on the training set. After several rounds of discriminative training, the discriminative power of the templates is enhanced and optimized speech templates are obtained.

Description

Training method for an embedded automatic speech recognition system
Technical field
The present invention relates to a training method for speech recognition systems in the technical field of intelligent information processing, and specifically to a training method for an embedded automatic speech recognition system.
Background technology
The speech model (or template) adopted by a speech recognition system must rationally reflect the acoustic features of speech; how effectively it describes the probability distribution of the speech feature space determines recognition performance. To suit miniaturized, portable applications, embedded speech recognition systems are mostly implemented on dedicated hardware such as MCUs, DSPs, and speech-recognition chips. Because system resources are limited and recognition must be real-time and reliable, the storage space occupied by the template of each recognition unit must be as small as possible while template quality remains high, and a dynamic time warping (DTW) recognizer is well suited to these constraints.
A literature search found that L. Zhou et al., in "Multisegment Multiple VQ Codebooks-Based Speaker Independent Isolated Word Recognition Using Unbiased Mel Cepstrum" (IEICE Trans. on Information and Systems, Vol. E78-D, No. 9, pp. 1178-1187, Sep. 1995), train speech templates with the multi-section vector quantization (MSVQ) method. Compared with the standard VQ method, which quantizes the whole word, the MSVQ method preserves the temporal structure of speech, which greatly benefits recognition. Under limited training data, a Chinese isolated-word recognition system based on MSVQ outperforms one based on CDHMM. MSVQ template generation can be summarized in two steps: utterances belonging to the same class are first divided in time into several sections, and a standard VQ codebook is then generated for each section with the LBG method. However, MSVQ segments the utterance evenly in chronological order; this even segmentation fails to fully account for the statistical properties of the different speech sections, which degrades template performance and limits further improvement of the recognition rate. Moreover, a speech template is usually obtained by taking one typical utterance or by clustering all the training data of the word. Training is generally based on maximum likelihood estimation (MLE), whose goal is to maximize the likelihood of the training samples under the template. This training approach has a limitation: because each reference template is produced only from the training utterances of its own word, similar portions in the pronunciations of different words are not discriminated, so during recognition the key portions that distinguish one word's pronunciation from the others receive insufficient attention, making a high recognition rate hard to achieve. In particular, when easily confusable words are present, the recognition rate drops sharply. To improve the discriminative power of the templates and achieve a high recognition rate, template performance must be improved further.
Summary of the invention
In view of the above deficiencies of the prior art, the present invention proposes a training method for embedded automatic speech recognition systems. It applies an improved multi-section vector quantization (MSVQ) method and, matching the DTW recognition method adopted by such systems, proposes a generalized probabilistic descent (GPD) discriminative training method suited to MSVQ speech templates and the DTW algorithm, to further improve template performance.
The present invention is achieved by the following technical solution, which comprises two parts, improved MSVQ template training and generalized probabilistic descent (GPD) discriminative training:
(1) Improved MSVQ template training: following the idea of dynamic programming, the DTW method is used to divide training utterances belonging to the same class in time into several speech sections, so that the most correlated speech frames are aggregated in one section, fully accounting for the temporal structure of speech and the statistical properties of the different sections. Considering the syllable composition of Chinese, the total number of template sections is set according to the number of syllables in the command word to be recognized.
(2) Generalized probabilistic descent (GPD) discriminative training: combined with the MSVQ speech template, the GPD discriminative training algorithm is embedded in a DTW-based recognizer. The distance between a training utterance and a reference template is defined as the discriminant function, and the reference template set (the MSVQ templates) is discriminatively trained on the training set so that the recognition error rate is minimized. Repeated discriminative training increases the discriminability between templates and yields better-optimized speech templates.
To remedy the failure of the multi-section vector quantization (MSVQ) method to consider the statistical properties of the different speech sections, the present invention adopts a minimum-distortion criterion and dynamic programming to obtain speech sections of variable length, improving the rationality of the segmentation in the MSVQ method, so that the most correlated frames are aggregated in one section for template training. In addition, because the total number of sections is related to the number of syllables the word contains, and in Chinese each syllable usually consists of 3 to 4 phonemes, each syllable is divided into sections according to the number of phonemes it contains, with one template section per phoneme. The template trained by the improved MSVQ method therefore captures the phonetic features of all speakers in the training set while preserving the temporal structure of speech; it is highly representative and yields a higher recognition rate. At the same time the template is small, suiting embedded recognition systems with very limited resources.
To improve the discriminative power of the templates, the present invention further optimizes the speech templates obtained by the MSVQ method with discriminative training. From the standpoint of minimum classification error (MCE), the main concern is the discriminative power of the templates, minimizing recognition errors rather than describing the training data as accurately as possible. Combined with the MSVQ template, the generalized probabilistic descent (GPD) discriminative training algorithm is embedded in a recognizer based on the DTW recognition method, yielding better-optimized MSVQ speech templates.
For embedded automatic speech recognition systems, the present invention thus provides a complete speech-template training method. Templates trained with this method are small, highly discriminative, and high-performing, which is key to guaranteeing real-time recognition and a high recognition rate in embedded automatic speech recognition systems.
Description of drawings
Fig. 1 is a schematic diagram of the discriminative training procedure
Fig. 2 shows the results of repeated discriminative training experiments
Embodiment
For a better understanding of the technical solution of the present invention, it is further described below with reference to the accompanying drawings and a specific embodiment.
The present invention first trains basic templates with the improved MSVQ method, which comprises two steps: first, following the idea of dynamic programming, training utterances belonging to the same class are divided in time into several sections with the DTW algorithm, so that the most correlated frames are aggregated together; the total number of sections is determined by the number of syllables the word contains. Then a standard VQ codebook is generated for each section with the LBG method. On the basis of the MSVQ templates, the templates are then optimized with the generalized probabilistic descent (GPD) discriminative training algorithm, which increases the discriminability of the templates from the standpoint of minimum classification error (MCE) and substantially raises the recognition rate of the system.
Embodiment
1. Improved MSVQ template training
A speech signal T frames long is usually represented by a feature-vector sequence X = {x_1, x_2, …, x_T}. To aggregate the most correlated frames into one section, the segmentation is based on a minimum-distortion criterion. The total number of sections N_s is related to the number of syllables the word contains; in Chinese, each syllable usually consists of 3 to 4 phonemes (here each syllable is divided into 3 sections, one section per phoneme). First define the distortion D_l within the l-th section, whose boundaries are t_l and t_{l+1} − 1, as:

$$D_l = \sum_{t=t_l}^{t_{l+1}-1} d(x_t, c_l)$$
where c_l is the centroid of the section and d(·, ·) is the distortion measure; the mean of all vectors in the section is taken as the centroid. The distortion D_l reflects how much the feature vectors vary within the l-th section. Then, for L consecutive non-overlapping sections, the total distortion D is:

$$D = \sum_{l=1}^{L} D_l = \sum_{l=1}^{L} \sum_{t=t_l}^{t_{l+1}-1} d(x_t, c_l), \qquad t_1 = 1,\; t_{L+1} = T + 1$$
D is minimized by varying the section boundaries t_l. Following the idea of dynamic programming, this optimization problem can be solved effectively with the DTW algorithm: first a typical utterance is taken and divided into several sections, the frames of each section are condensed together, and a typical template is formed; this template is then DTW-matched against the other training utterances of the same class. Along the optimal DTW path, each training utterance is thus divided into the same number of sections; the number of frames per section differs from utterance to utterance, but the frames of corresponding sections share similar statistical properties and phonetic features.
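To make the criterion concrete, here is a minimal Python sketch (not from the patent; `section_distortion` and `optimal_segmentation` are hypothetical names) that finds boundaries minimizing the total distortion D for a single utterance by exhaustive dynamic programming, assuming d(·, ·) is the squared Euclidean distance to the section centroid. The patent's procedure instead segments one typical utterance and transfers the segmentation to the other utterances via DTW matching, but the objective being optimized is the same:

```python
import numpy as np

def section_distortion(frames: np.ndarray) -> float:
    """D_l: summed squared distance of the section's frames to their centroid."""
    centroid = frames.mean(axis=0)
    return float(((frames - centroid) ** 2).sum())

def optimal_segmentation(x: np.ndarray, num_sections: int) -> list:
    """Boundaries minimizing total distortion D.

    x: (T, S) feature-vector sequence; returns L+1 boundary indices
    (0-based, end-exclusive), with bounds[0] = 0 and bounds[-1] = T.
    """
    T, L = len(x), num_sections
    cost = np.full((L + 1, T + 1), np.inf)  # cost[s, t]: best split of x[:t] into s sections
    back = np.zeros((L + 1, T + 1), dtype=int)
    cost[0, 0] = 0.0
    for s in range(1, L + 1):
        for t in range(s, T + 1):
            for u in range(s - 1, t):       # last section is x[u:t]
                c = cost[s - 1, u] + section_distortion(x[u:t])
                if c < cost[s, t]:
                    cost[s, t], back[s, t] = c, u
    bounds, t = [T], T                      # backtrack from cost[L, T]
    for s in range(L, 0, -1):
        t = back[s, t]
        bounds.append(t)
    return bounds[::-1]
```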
Once reasonable segmentation information has been obtained, a VQ codebook is designed for each section with the LBG algorithm. To reduce the template size, the size of each section's VQ codebook is set to 1, i.e., the mean (centroid) of all vectors in the section is taken as the codeword for that section. This yields a speech template that is small yet performs well.
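Continuing the sketch above, with each section's VQ codebook size set to 1 the LBG step reduces to taking the section centroid, so template construction is simply (`build_msvq_template` is a hypothetical name):

```python
def build_msvq_template(x: np.ndarray, bounds: list) -> np.ndarray:
    """One codeword per section: the centroid of the section's frames -> (L, S)."""
    return np.stack([x[bounds[l]:bounds[l + 1]].mean(axis=0)
                     for l in range(len(bounds) - 1)])
```

When every training utterance of the word has been segmented, the frames of corresponding sections from all utterances would be pooled before each centroid is taken.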
2. Implementation of the discriminative training algorithm in the MSVQ-based DTW recognition system
To increase the discriminative power of the templates, sufficient attention must be paid to the key portions of a pronunciation that distinguish it from the pronunciations of other words, and discriminative training meets this requirement. The generalized probabilistic descent (GPD) discriminative training algorithm is an efficient and simple method, well suited to minimizing the misclassification rate. Below, combined with the MSVQ template, the GPD discriminative training algorithm is embedded in a recognizer based on the DTW recognition method to obtain better-optimized MSVQ templates.
Given a training utterance set Ω = {x_1, x_2, …, x_N}, each x_i belongs to one of the M words C_i, i = 1, 2, …, M. Each utterance x_i = {x_{p,s}^i, p = 1, 2, …, P_i, s = 1, 2, …, S} consists of P_i frames, each frame being an S-dimensional speech feature vector, usually composed of cepstral coefficients. Each word is represented by one reference template. The reference template set is Λ = {λ_i = (R_i, W_i), i = 1, 2, …, M}, where R_i = {r_{q,s}^i, q = 1, 2, …, Q_i, s = 1, 2, …, S} is the cepstral-coefficient sequence and W_i = {w_q^i, q = 1, 2, …, Q_i} is a discriminative weighting function used to modify the template's distance score. Following the GPD algorithm, the reference template set Λ is discriminatively trained on the training set Ω so that the recognition error rate is minimized. The flow of discriminative training is shown in Fig. 1.
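As a concrete (illustrative, not the patent's) data layout for λ_i = (R_i, W_i):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ReferenceTemplate:
    """lambda_i = (R_i, W_i) for one word."""
    R: np.ndarray  # cepstral-coefficient sequence, shape (Q_i, S)
    W: np.ndarray  # discriminative weights, shape (Q_i,); e.g. initialized to ones
```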
1) Define the distance between a training utterance x and the reference template of word C_j as the discriminant function:

$$g_j(x; \Lambda) = \sum_{q=1}^{Q_j} w_q^j \, \delta_{p_q}^j$$
where w_q^j is the discriminative weight of word C_j's reference template, and δ_{p_q}^j is the distance, along the optimal path obtained after DTW matching, between the q-th frame of word C_j's reference template and the corresponding p_q-th frame of x. The Euclidean distance is adopted here:

$$\delta_{p_q}^j = \sum_{s=1}^{S} \left( r_{q,s}^j - x_{p_q,s} \right)^2$$
The above definition yields a continuous discriminant function g_k(x; Λ) on which gradient operations can be performed.
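A sketch of this discriminant function, assuming a DTW aligner `dtw_align(x, template)` (a hypothetical helper) that returns, for each template frame q, the index p_q of the utterance frame matched to it on the optimal path:

```python
def discriminant(x, template, weights, dtw_align):
    """g_j(x; Lambda) = sum_q w_q^j * delta_{p_q}^j, squared Euclidean frame distance."""
    path = dtw_align(x, template)                  # path[q] = p_q
    delta = np.array([((template[q] - x[p]) ** 2).sum()
                      for q, p in enumerate(path)])
    return float((weights * delta).sum()), delta, path
```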
2) Define the misclassification measure, embedding the recognition result in it:

$$d_k(x) = g_k(x; \Lambda) - \ln\left[\left\{ \frac{1}{M-1} \sum_{j,\, j \neq k} e^{-g_j(x; \Lambda)\,\eta} \right\}^{-1/\eta}\right]$$
where η is a positive real number.
3) Define the loss function:

$$l_k(d_k) = \frac{1}{1 + e^{-d_k}}$$
It closely approximates the recognition error rate.
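Both definitions can be sketched directly; for the misclassification measure, the equivalent form d_k = g_k + (1/η)·ln[(1/(M−1))·Σ_{j≠k} e^{−η·g_j}] is used:

```python
def misclassification_measure(g, k, eta):
    """d_k(x), where g is the vector of discriminant values g_j(x; Lambda)."""
    g = np.asarray(g, dtype=float)
    others = np.delete(g, k)                       # g_j for j != k
    return g[k] + np.log(np.exp(-eta * others).mean()) / eta

def loss(d_k):
    """Sigmoid loss l_k(d_k): a smooth approximation of the recognition error."""
    return 1.0 / (1.0 + np.exp(-d_k))
```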
4) Adjust the reference-template parameters adaptively with the GPD algorithm so that the loss function is minimized. Given a training utterance x belonging to word C_k, the adjustment rules for the reference-template parameters are as follows:
For j = k:

$$r_{q,s,t+1}^{k} = r_{q,s,t}^{k} - \epsilon_t v_k \phi_k, \qquad w_{q,t+1}^{k} = w_{q,t}^{k} - \epsilon_t v_k \delta_{p_q}^{k}$$

For j ≠ k:

$$r_{q,s,t+1}^{j} = r_{q,s,t}^{j} + \epsilon_t v_k \pi_{j,k} \phi_j, \qquad w_{q,t+1}^{j} = w_{q,t}^{j} + \epsilon_t v_k \pi_{j,k} \delta_{p_q}^{j}$$
where

$$v_k = l_k(d_k)\left(1 - l_k(d_k)\right)$$

$$\phi_k = 2 w_q^k \left( r_{q,s}^k - x_{p_q,s} \right)$$

$$\pi_{j,k} = \frac{e^{-g_j \eta}}{\sum_{j',\, j' \neq k} e^{-g_{j'} \eta}}$$

$$\epsilon_t = \epsilon_0 \left( 1 - \frac{t}{T} \right)$$
Here t denotes the t-th iteration, T is the maximum number of iterations, and ε_0 is a small positive number. Convergence is generally reached within a few tens of iterations.
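Putting the update rules together, a sketch of one GPD step on a single utterance, continuing the hypothetical helpers above (the signs follow the reconstructed equations: the correct word's template is pulled toward the utterance, competing templates are pushed away):

```python
def gpd_update(x, templates, weights, k, eta, eps_t, dtw_align):
    """One GPD step for utterance x known to belong to word C_k (updates in place)."""
    M = len(templates)
    results = [discriminant(x, templates[j], weights[j], dtw_align)
               for j in range(M)]
    g = np.array([r[0] for r in results])
    l = loss(misclassification_measure(g, k, eta))
    v_k = l * (1.0 - l)                     # v_k = l_k(d_k)(1 - l_k(d_k))
    exp_g = np.exp(-eta * g)
    denom = exp_g.sum() - exp_g[k]          # normalizer of pi_{j,k} over j != k
    for j in range(M):
        _, delta, path = results[j]
        phi = 2.0 * weights[j][:, None] * (templates[j] - x[path])  # rows: phi_j
        if j == k:
            templates[k] -= eps_t * v_k * phi
            weights[k]   -= eps_t * v_k * delta
        else:
            pi = exp_g[j] / denom
            templates[j] += eps_t * v_k * pi * phi
            weights[j]   += eps_t * v_k * pi * delta
```

With the step size ε_t = ε_0(1 − t/T), this update would be applied to every training utterance at each iteration t = 1, …, T.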
Reference templates that have undergone discriminative training are better than the templates before training, but not yet optimal; this is because the initial reference templates and some parameter settings in the algorithm are not optimal. Applying discriminative training to the trained reference templates one or more additional times yields better-optimized templates.
The training method of the present invention for embedded automatic speech recognition systems was compared with the traditional training method, using the system recognition rate as the performance index; the experimental results are shown in Table 1. The clear improvement in recognition rate in Table 1 shows that the method of the present invention achieves higher performance.
Table 1. Experimental results of discriminative training

                                  Test-set recognition rate (%)   Training-set recognition rate (%)   Template size
Before discriminative training              91.0                             86.1                        44 KB
After discriminative training               92.5                             94.3                        44 KB
Relative error-rate reduction               16.7%                            58.9%
The speech database used in the above experiments contains 50 command words, with recordings from 12 speakers, all male, speaking Mandarin at normal speed; each speaker read each word twice. The vocabulary includes many easily confusable words, such as "raise temperature" vs. "lower temperature" and "turn left" vs. "turn right". The recording environment was a laboratory, the sampling frequency was 8 kHz, and the speech after feature extraction served as the experimental data. The recordings of 10 speakers form the training set and the recordings of the other 2 speakers the test set. The recognition rate was also measured on the training set. Front-end processing comprises endpoint detection, pre-emphasis with 1 − 0.95z⁻¹, and a 30 ms Hamming window with a 10 ms frame shift; the feature vector consists of 8-dimensional LPC cepstral coefficients. The reference templates are MSVQ templates with 12 sections.
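A sketch of the front end described above under the stated settings (8 kHz input, pre-emphasis 1 − 0.95z⁻¹, 30 ms Hamming window, 10 ms shift, 8-dimensional LPC cepstrum); endpoint detection is omitted, and the LPC coefficients are obtained by the autocorrelation (Yule-Walker) method:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_cepstrum(frame: np.ndarray, order: int = 8) -> np.ndarray:
    """LPC via a Toeplitz solve, then the standard LPC-to-cepstrum recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])      # a_1 .. a_p
    c = np.zeros(order)
    for n in range(1, order + 1):                      # c_n = a_n + sum_k (k/n) c_k a_{n-k}
        c[n - 1] = a[n - 1] + sum((k / n) * c[k - 1] * a[n - k - 1]
                                  for k in range(1, n))
    return c

def features(signal: np.ndarray, fs: int = 8000) -> np.ndarray:
    """Speech samples -> one 8-dim LPC-cepstral vector per 30 ms frame, 10 ms shift."""
    pre = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])   # 1 - 0.95 z^-1
    win, hop = int(0.030 * fs), int(0.010 * fs)                   # 240 and 80 samples
    window = np.hamming(win)
    frames = [pre[i:i + win] * window
              for i in range(0, len(pre) - win + 1, hop)]
    return np.stack([lpc_cepstrum(f) for f in frames])
```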
The MSVQ templates were then discriminatively trained repeatedly; the experimental results are shown in Fig. 2. The abscissa gives the number of training repetitions, where 0 denotes templates without discriminative training. The following conclusions can be drawn from Fig. 2: after discriminative training the templates are optimized and the system recognition rate improves markedly on both sets; repeating the discriminative training optimizes the templates further and raises the recognition rate further. After 5 rounds of discriminative training, the recognition rate reaches 99.4% on the training set and 94.5% on the test set. The gain from repeated discriminative training is especially evident on the training set.

Claims (2)

1. A training method for an embedded automatic speech recognition system, characterized in that it comprises two parts, improved multi-section vector quantization template training and generalized probabilistic descent discriminative training:
(1) improved multi-section vector quantization template training: the dynamic time warping method is used to divide training utterances belonging to the same class in time into several speech sections, so that the most correlated speech frames are aggregated in one section, in accordance with the temporal structure of speech, the statistical properties of each section, and the syllable structure of Chinese; the total number of template sections is set according to the number of syllables contained in the command word to be recognized;
(2) generalized probabilistic descent discriminative training: combined with the multi-section vector quantization speech template, the generalized probabilistic descent discriminative training algorithm is embedded in a recognizer based on the dynamic time warping method; the distance between a training utterance and a reference template is defined as the discriminant function, and the reference template set is discriminatively trained on the training set so that the recognition error rate is minimized; repeated discriminative training increases the discriminability between templates and yields better-optimized speech templates.
2. The training method for an embedded automatic speech recognition system according to claim 1, characterized in that a minimum-distortion criterion and dynamic programming are adopted to obtain speech sections of variable length, improving the rationality of the segmentation in the multi-section vector quantization method, so that the most correlated frames are aggregated in one section for template training; because the total number of template sections is related to the number of syllables the word contains, and in Chinese each syllable usually consists of 3 to 4 phonemes, each syllable is segmented according to the number of phonemes it contains, with one template section per phoneme.
CNB2004100667948A 2004-09-29 2004-09-29 Training method for embedded automatic sound identification system Expired - Fee Related CN1296887C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100667948A CN1296887C (en) 2004-09-29 2004-09-29 Training method for embedded automatic sound identification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100667948A CN1296887C (en) 2004-09-29 2004-09-29 Training method for embedded automatic sound identification system

Publications (2)

Publication Number Publication Date
CN1588538A CN1588538A (en) 2005-03-02
CN1296887C true CN1296887C (en) 2007-01-24

Family

ID=34604096

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100667948A Expired - Fee Related CN1296887C (en) 2004-09-29 2004-09-29 Training method for embedded automatic sound identification system

Country Status (1)

Country Link
CN (1) CN1296887C (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8762148B2 (en) 2006-02-27 2014-06-24 Nec Corporation Reference pattern adaptation apparatus, reference pattern adaptation method and reference pattern adaptation program
CN1835076B (en) * 2006-04-07 2010-05-12 安徽中科大讯飞信息科技有限公司 Speech evaluating method of integrally operating speech identification, phonetics knowledge and Chinese dialect analysis
CN101577118B (en) * 2009-06-12 2011-05-04 北京大学 Implementation method of voice interaction system facing intelligent service robot
CN103236261B (en) * 2013-04-02 2015-09-16 四川长虹电器股份有限公司 A kind of method of particular person speech recognition
CN104751856B (en) * 2013-12-31 2017-12-22 ***通信集团公司 A kind of speech sentences recognition methods and device
CN109754784B (en) * 2017-11-02 2021-01-29 华为技术有限公司 Method for training filtering model and method for speech recognition
CN110060667B (en) * 2019-03-15 2023-05-30 平安科技(深圳)有限公司 Batch processing method and device for voice information, computer equipment and storage medium
CN112863523B (en) * 2019-11-27 2023-05-16 华为技术有限公司 Voice anti-counterfeiting method and device, terminal equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5440662A (en) * 1992-12-11 1995-08-08 At&T Corp. Keyword/non-keyword classification in isolated word speech recognition
US5613037A (en) * 1993-12-21 1997-03-18 Lucent Technologies Inc. Rejection of non-digit strings for connected digit speech recognition
CN1223739A (en) * 1996-06-28 1999-07-21 微软公司 Method and system for dynamically adjusted training for speech recognition
US6292776B1 (en) * 1999-03-12 2001-09-18 Lucent Technologies Inc. Hierarchial subband linear predictive cepstral features for HMM-based speech recognition
CN1391211A (en) * 2001-04-20 2003-01-15 皇家菲利浦电子有限公司 Exercising method and system to distinguish parameters

Also Published As

Publication number Publication date
CN1588538A (en) 2005-03-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070124

Termination date: 20091029