CN1296887C - Training method for embedded automatic sound identification system - Google Patents

Training method for embedded automatic sound identification system

Info

Publication number
CN1296887C
CN1296887C CNB2004100667948A CN200410066794A
Authority
CN
China
Prior art keywords
training
template
distinctiveness
voice
section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004100667948A
Other languages
Chinese (zh)
Other versions
CN1588538A (en)
Inventor
朱杰 (Zhu Jie)
蔡铁 (Cai Tie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CNB2004100667948A priority Critical patent/CN1296887C/en
Publication of CN1588538A publication Critical patent/CN1588538A/en
Application granted granted Critical
Publication of CN1296887C publication Critical patent/CN1296887C/en

Abstract

The present invention relates to a training method for an embedded automatic speech recognition system in the technical field of intelligent information processing. The training method comprises two steps: improved multi-section vector quantization template training, and generalized probabilistic descent discriminative training. In the first step, the dynamic time warping method is used to divide training utterances belonging to the same class into several speech sections, so that the most correlated speech frames are aggregated in one section; in accordance with the temporal structure of speech, the statistical properties of each section, and the syllable structure of Chinese, the total number of template sections is set according to the number of syllables in the command word to be recognized. In the second step, combined with the multi-section vector quantization speech template, a generalized probabilistic descent discriminative training algorithm is embedded in a recognizer based on the dynamic time warping method; the distance between each training utterance and a reference template serves as the discriminant function, and the reference template set is discriminatively trained on the training set. After several rounds of discriminative training, the discriminative power of the templates is enhanced and optimized speech templates are obtained.

Description

Training method for an embedded automatic speech recognition system
Technical field
The present invention relates to a training method for speech recognition systems in the technical field of intelligent information processing, and specifically to a training method for an embedded automatic speech recognition system.
Background technology
The speech model (or template) adopted by a speech recognition system must rationally reflect the acoustic features of speech; how effectively it describes the probability distribution of the speech feature space determines recognition performance. To suit miniaturized, portable applications, embedded speech recognition systems are mostly implemented on dedicated hardware such as MCUs, DSPs, and speech-recognition chips. Because system resources are limited and recognition must be real-time and reliable, the storage space occupied by the template of each recognition unit must be as small as possible while template quality remains high, and a dynamic time warping (DTW) recognizer is well suited to these constraints.
A literature search found that L. Zhou et al., in "Multisegment Multiple VQ Codebooks-Based Speaker Independent Isolated Word Recognition Using Unbiased Mel Cepstrum" (IEICE Trans. on Information and Systems, Vol. E78-D, No. 9, pp. 1178-1187, Sep. 1995), train speech templates with the multi-section vector quantization (MSVQ) method. Compared with the standard VQ method, which quantizes the whole word, the MSVQ method preserves the temporal structure of speech, which greatly benefits recognition. Under limited training data, a Chinese isolated-word recognition system based on MSVQ outperforms one based on CDHMM. MSVQ template generation can be summarized in two steps: utterances belonging to the same class are first divided in time into several sections, and a standard VQ codebook is then generated for each section with the LBG method. However, MSVQ segments the utterance evenly in chronological order; this even segmentation fails to fully account for the statistical properties of the different speech sections, which degrades template performance and limits further improvement of the recognition rate. Moreover, a speech template is usually obtained by taking one typical utterance or by clustering all the training data of the word. Training is generally based on maximum likelihood estimation (MLE), whose goal is to maximize the likelihood of the training samples under the template. This training approach has a limitation: because each reference template is produced only from the training utterances of its own word, similar portions in the pronunciations of different words are not discriminated, so during recognition the key portions that distinguish one word's pronunciation from the others receive insufficient attention, making a high recognition rate hard to achieve. In particular, when easily confusable words are present, the recognition rate drops sharply. To improve the discriminative power of the templates and achieve a high recognition rate, template performance must be improved further.
Summary of the invention
In view of the above deficiencies of the prior art, the present invention proposes a training method for embedded automatic speech recognition systems. It applies an improved multi-section vector quantization (MSVQ) method and, matching the DTW recognition method adopted by such systems, proposes a generalized probabilistic descent (GPD) discriminative training method suited to MSVQ speech templates and the DTW algorithm, to further improve template performance.
The present invention is achieved by the following technical solution, which comprises two parts, improved MSVQ template training and generalized probabilistic descent (GPD) discriminative training:
(1) Improved MSVQ template training: following the idea of dynamic programming, the DTW method is used to divide training utterances belonging to the same class in time into several speech sections, so that the most correlated speech frames are aggregated in one section, fully accounting for the temporal structure of speech and the statistical properties of the different sections. Considering the syllable composition of Chinese, the total number of template sections is set according to the number of syllables in the command word to be recognized.
(2) Generalized probabilistic descent (GPD) discriminative training: combined with the MSVQ speech template, the GPD discriminative training algorithm is embedded in a DTW-based recognizer. The distance between a training utterance and a reference template is defined as the discriminant function, and the reference template set (the MSVQ templates) is discriminatively trained on the training set so that the recognition error rate is minimized. Repeated discriminative training increases the discriminability between templates and yields better-optimized speech templates.
To remedy the failure of the multi-section vector quantization (MSVQ) method to consider the statistical properties of the different speech sections, the present invention adopts a minimum-distortion criterion and dynamic programming to obtain speech sections of variable length, improving the rationality of the segmentation in the MSVQ method, so that the most correlated frames are aggregated in one section for template training. In addition, because the total number of sections is related to the number of syllables the word contains, and in Chinese each syllable usually consists of 3 to 4 phonemes, each syllable is divided into sections according to the number of phonemes it contains, with one template section per phoneme. The template trained by the improved MSVQ method therefore captures the phonetic features of all speakers in the training set while preserving the temporal structure of speech; it is highly representative and yields a higher recognition rate. At the same time the template is small, suiting embedded recognition systems with very limited resources.
To improve the discriminative power of the templates, the present invention further optimizes the speech templates obtained by the MSVQ method with discriminative training. From the standpoint of minimum classification error (MCE), the main concern is the discriminative power of the templates, minimizing recognition errors rather than describing the training data as accurately as possible. Combined with the MSVQ template, the generalized probabilistic descent (GPD) discriminative training algorithm is embedded in a recognizer based on the DTW recognition method, yielding better-optimized MSVQ speech templates.
For embedded automatic speech recognition systems, the present invention thus provides a complete speech-template training method. Templates trained with this method are small, highly discriminative, and high-performing, which is key to guaranteeing real-time recognition and a high recognition rate in embedded automatic speech recognition systems.
Description of drawings
Fig. 1 is a schematic diagram of the discriminative training procedure
Fig. 2 shows the results of repeated discriminative training experiments
Embodiment
For a better understanding of the technical solution of the present invention, it is further described below with reference to the accompanying drawings and a specific embodiment.
The present invention first trains basic templates with the improved MSVQ method, which comprises two steps: first, following the idea of dynamic programming, training utterances belonging to the same class are divided in time into several sections with the DTW algorithm, so that the most correlated frames are aggregated together; the total number of sections is determined by the number of syllables the word contains. Then a standard VQ codebook is generated for each section with the LBG method. On the basis of the MSVQ templates, the templates are then optimized with the generalized probabilistic descent (GPD) discriminative training algorithm, which increases the discriminability of the templates from the standpoint of minimum classification error (MCE) and substantially raises the recognition rate of the system.
Embodiment
1. Improved MSVQ template training
A speech signal T frames long is usually represented by a feature-vector sequence X = {x_1, x_2, …, x_T}. To aggregate the most correlated frames into one section, the segmentation is based on a minimum-distortion criterion. The total number of sections N_s is related to the number of syllables the word contains; in Chinese, each syllable usually consists of 3 to 4 phonemes (here each syllable is divided into 3 sections, one section per phoneme). First define the distortion D_l within the l-th section, whose boundaries are t_l and t_{l+1} − 1, as:

$$D_l = \sum_{t=t_l}^{t_{l+1}-1} d(x_t, c_l)$$
where c_l is the centroid of the section and d(·, ·) is the distortion measure; the mean of all vectors in the section is taken as the centroid. The distortion D_l reflects how much the feature vectors vary within the l-th section. Then, for L consecutive non-overlapping sections, the total distortion D is:

$$D = \sum_{l=1}^{L} D_l = \sum_{l=1}^{L} \sum_{t=t_l}^{t_{l+1}-1} d(x_t, c_l), \qquad t_1 = 1,\; t_{L+1} = T + 1$$
D is minimized by varying the section boundaries t_l. Following the idea of dynamic programming, this optimization problem can be solved effectively with the DTW algorithm: first a typical utterance is taken and divided into several sections, the frames of each section are condensed together, and a typical template is formed; this template is then DTW-matched against the other training utterances of the same class. Along the optimal DTW path, each training utterance is thus divided into the same number of sections; the number of frames per section differs from utterance to utterance, but the frames of corresponding sections share similar statistical properties and phonetic features.
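To make the criterion concrete, here is a minimal Python sketch (not from the patent; `section_distortion` and `optimal_segmentation` are hypothetical names) that finds boundaries minimizing the total distortion D for a single utterance by exhaustive dynamic programming, assuming d(·, ·) is the squared Euclidean distance to the section centroid. The patent's procedure instead segments one typical utterance and transfers the segmentation to the other utterances via DTW matching, but the objective being optimized is the same:

```python
import numpy as np

def section_distortion(frames: np.ndarray) -> float:
    """D_l: summed squared distance of the section's frames to their centroid."""
    centroid = frames.mean(axis=0)
    return float(((frames - centroid) ** 2).sum())

def optimal_segmentation(x: np.ndarray, num_sections: int) -> list:
    """Boundaries minimizing total distortion D.

    x: (T, S) feature-vector sequence; returns L+1 boundary indices
    (0-based, end-exclusive), with bounds[0] = 0 and bounds[-1] = T.
    """
    T, L = len(x), num_sections
    cost = np.full((L + 1, T + 1), np.inf)  # cost[s, t]: best split of x[:t] into s sections
    back = np.zeros((L + 1, T + 1), dtype=int)
    cost[0, 0] = 0.0
    for s in range(1, L + 1):
        for t in range(s, T + 1):
            for u in range(s - 1, t):       # last section is x[u:t]
                c = cost[s - 1, u] + section_distortion(x[u:t])
                if c < cost[s, t]:
                    cost[s, t], back[s, t] = c, u
    bounds, t = [T], T                      # backtrack from cost[L, T]
    for s in range(L, 0, -1):
        t = back[s, t]
        bounds.append(t)
    return bounds[::-1]
```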
Once reasonable segmentation information has been obtained, a VQ codebook is designed for each section with the LBG algorithm. To reduce the template size, the size of each section's VQ codebook is set to 1, i.e., the mean (centroid) of all vectors in the section is taken as the codeword for that section. This yields a speech template that is small yet performs well.
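Continuing the sketch above, with each section's VQ codebook size set to 1 the LBG step reduces to taking the section centroid, so template construction is simply (`build_msvq_template` is a hypothetical name):

```python
def build_msvq_template(x: np.ndarray, bounds: list) -> np.ndarray:
    """One codeword per section: the centroid of the section's frames -> (L, S)."""
    return np.stack([x[bounds[l]:bounds[l + 1]].mean(axis=0)
                     for l in range(len(bounds) - 1)])
```

When every training utterance of the word has been segmented, the frames of corresponding sections from all utterances would be pooled before each centroid is taken.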
2. Implementation of the discriminative training algorithm in the MSVQ-based DTW recognition system
To increase the discriminative power of the templates, sufficient attention must be paid to the key portions of a pronunciation that distinguish it from the pronunciations of other words, and discriminative training meets this requirement. The generalized probabilistic descent (GPD) discriminative training algorithm is an efficient and simple method, well suited to minimizing the misclassification rate. Below, combined with the MSVQ template, the GPD discriminative training algorithm is embedded in a recognizer based on the DTW recognition method to obtain better-optimized MSVQ templates.
Given a training utterance set Ω = {x_1, x_2, …, x_N}, each x_i belongs to one of the M words C_i, i = 1, 2, …, M. Each utterance x_i = {x_{p,s}^i, p = 1, 2, …, P_i, s = 1, 2, …, S} consists of P_i frames, each frame being an S-dimensional speech feature vector, usually composed of cepstral coefficients. Each word is represented by one reference template. The reference template set is Λ = {λ_i = (R_i, W_i), i = 1, 2, …, M}, where R_i = {r_{q,s}^i, q = 1, 2, …, Q_i, s = 1, 2, …, S} is the cepstral-coefficient sequence and W_i = {w_q^i, q = 1, 2, …, Q_i} is a discriminative weighting function used to modify the template's distance score. Following the GPD algorithm, the reference template set Λ is discriminatively trained on the training set Ω so that the recognition error rate is minimized. The flow of discriminative training is shown in Fig. 1.
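As a concrete (illustrative, not the patent's) data layout for λ_i = (R_i, W_i):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ReferenceTemplate:
    """lambda_i = (R_i, W_i) for one word."""
    R: np.ndarray  # cepstral-coefficient sequence, shape (Q_i, S)
    W: np.ndarray  # discriminative weights, shape (Q_i,); e.g. initialized to ones
```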
1) Define the distance between a training utterance x and the reference template of word C_j as the discriminant function:

$$g_j(x; \Lambda) = \sum_{q=1}^{Q_j} w_q^j \, \delta_{p_q}^j$$
where w_q^j is the discriminative weight of word C_j's reference template, and δ_{p_q}^j is the distance, along the optimal path obtained after DTW matching, between the q-th frame of word C_j's reference template and the corresponding p_q-th frame of x. The Euclidean distance is adopted here:

$$\delta_{p_q}^j = \sum_{s=1}^{S} \left( r_{q,s}^j - x_{p_q,s} \right)^2$$
The above definition yields a continuous discriminant function g_k(x; Λ) on which gradient operations can be performed.
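A sketch of this discriminant function, assuming a DTW aligner `dtw_align(x, template)` (a hypothetical helper) that returns, for each template frame q, the index p_q of the utterance frame matched to it on the optimal path:

```python
def discriminant(x, template, weights, dtw_align):
    """g_j(x; Lambda) = sum_q w_q^j * delta_{p_q}^j, squared Euclidean frame distance."""
    path = dtw_align(x, template)                  # path[q] = p_q
    delta = np.array([((template[q] - x[p]) ** 2).sum()
                      for q, p in enumerate(path)])
    return float((weights * delta).sum()), delta, path
```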
2) Define the misclassification measure, embedding the recognition result in it:

$$d_k(x) = g_k(x; \Lambda) - \ln\left[\left\{ \frac{1}{M-1} \sum_{j,\, j \neq k} e^{-g_j(x; \Lambda)\,\eta} \right\}^{-1/\eta}\right]$$
where η is a positive real number.
3) Define the loss function:

$$l_k(d_k) = \frac{1}{1 + e^{-d_k}}$$
It closely approximates the recognition error rate.
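Both definitions can be sketched directly; for the misclassification measure, the equivalent form d_k = g_k + (1/η)·ln[(1/(M−1))·Σ_{j≠k} e^{−η·g_j}] is used:

```python
def misclassification_measure(g, k, eta):
    """d_k(x), where g is the vector of discriminant values g_j(x; Lambda)."""
    g = np.asarray(g, dtype=float)
    others = np.delete(g, k)                       # g_j for j != k
    return g[k] + np.log(np.exp(-eta * others).mean()) / eta

def loss(d_k):
    """Sigmoid loss l_k(d_k): a smooth approximation of the recognition error."""
    return 1.0 / (1.0 + np.exp(-d_k))
```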
4) Adjust the reference-template parameters adaptively with the GPD algorithm so that the loss function is minimized. Given a training utterance x belonging to word C_k, the adjustment rules for the reference-template parameters are as follows:
For j = k:

$$r_{q,s,t+1}^{k} = r_{q,s,t}^{k} - \epsilon_t v_k \phi_k, \qquad w_{q,t+1}^{k} = w_{q,t}^{k} - \epsilon_t v_k \delta_{p_q}^{k}$$

For j ≠ k:

$$r_{q,s,t+1}^{j} = r_{q,s,t}^{j} + \epsilon_t v_k \pi_{j,k} \phi_j, \qquad w_{q,t+1}^{j} = w_{q,t}^{j} + \epsilon_t v_k \pi_{j,k} \delta_{p_q}^{j}$$
where

$$v_k = l_k(d_k)\left(1 - l_k(d_k)\right)$$

$$\phi_k = 2 w_q^k \left( r_{q,s}^k - x_{p_q,s} \right)$$

$$\pi_{j,k} = \frac{e^{-g_j \eta}}{\sum_{j',\, j' \neq k} e^{-g_{j'} \eta}}$$

$$\epsilon_t = \epsilon_0 \left( 1 - \frac{t}{T} \right)$$
Here t denotes the t-th iteration, T is the maximum number of iterations, and ε_0 is a small positive number. Convergence is generally reached within a few tens of iterations.
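Putting the update rules together, a sketch of one GPD step on a single utterance, continuing the hypothetical helpers above (the signs follow the reconstructed equations: the correct word's template is pulled toward the utterance, competing templates are pushed away):

```python
def gpd_update(x, templates, weights, k, eta, eps_t, dtw_align):
    """One GPD step for utterance x known to belong to word C_k (updates in place)."""
    M = len(templates)
    results = [discriminant(x, templates[j], weights[j], dtw_align)
               for j in range(M)]
    g = np.array([r[0] for r in results])
    l = loss(misclassification_measure(g, k, eta))
    v_k = l * (1.0 - l)                     # v_k = l_k(d_k)(1 - l_k(d_k))
    exp_g = np.exp(-eta * g)
    denom = exp_g.sum() - exp_g[k]          # normalizer of pi_{j,k} over j != k
    for j in range(M):
        _, delta, path = results[j]
        phi = 2.0 * weights[j][:, None] * (templates[j] - x[path])  # rows: phi_j
        if j == k:
            templates[k] -= eps_t * v_k * phi
            weights[k]   -= eps_t * v_k * delta
        else:
            pi = exp_g[j] / denom
            templates[j] += eps_t * v_k * pi * phi
            weights[j]   += eps_t * v_k * pi * delta
```

With the step size ε_t = ε_0(1 − t/T), this update would be applied to every training utterance at each iteration t = 1, …, T.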
Reference templates that have undergone discriminative training are better than the templates before training, but not yet optimal; this is because the initial reference templates and some parameter settings in the algorithm are not optimal. Applying discriminative training to the trained reference templates one or more additional times yields better-optimized templates.
The training method of the present invention for embedded automatic speech recognition systems was compared with the traditional training method, using the system recognition rate as the performance index; the experimental results are shown in Table 1. The clear improvement in recognition rate in Table 1 shows that the method of the present invention achieves higher performance.
Table 1. Experimental results of discriminative training

                                  Test-set recognition rate (%)   Training-set recognition rate (%)   Template size
Before discriminative training              91.0                             86.1                        44 KB
After discriminative training               92.5                             94.3                        44 KB
Relative error-rate reduction               16.7%                            58.9%
The speech database used in the above experiments contains 50 command words, with recordings from 12 speakers, all male, speaking Mandarin at normal speed; each speaker read each word twice. The vocabulary includes many easily confusable words, such as "raise temperature" vs. "lower temperature" and "turn left" vs. "turn right". The recording environment was a laboratory, the sampling frequency was 8 kHz, and the speech after feature extraction served as the experimental data. The recordings of 10 speakers form the training set and the recordings of the other 2 speakers the test set. The recognition rate was also measured on the training set. Front-end processing comprises endpoint detection, pre-emphasis with 1 − 0.95z⁻¹, and a 30 ms Hamming window with a 10 ms frame shift; the feature vector consists of 8-dimensional LPC cepstral coefficients. The reference templates are MSVQ templates with 12 sections.
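A sketch of the front end described above under the stated settings (8 kHz input, pre-emphasis 1 − 0.95z⁻¹, 30 ms Hamming window, 10 ms shift, 8-dimensional LPC cepstrum); endpoint detection is omitted, and the LPC coefficients are obtained by the autocorrelation (Yule-Walker) method:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_cepstrum(frame: np.ndarray, order: int = 8) -> np.ndarray:
    """LPC via a Toeplitz solve, then the standard LPC-to-cepstrum recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])      # a_1 .. a_p
    c = np.zeros(order)
    for n in range(1, order + 1):                      # c_n = a_n + sum_k (k/n) c_k a_{n-k}
        c[n - 1] = a[n - 1] + sum((k / n) * c[k - 1] * a[n - k - 1]
                                  for k in range(1, n))
    return c

def features(signal: np.ndarray, fs: int = 8000) -> np.ndarray:
    """Speech samples -> one 8-dim LPC-cepstral vector per 30 ms frame, 10 ms shift."""
    pre = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])   # 1 - 0.95 z^-1
    win, hop = int(0.030 * fs), int(0.010 * fs)                   # 240 and 80 samples
    window = np.hamming(win)
    frames = [pre[i:i + win] * window
              for i in range(0, len(pre) - win + 1, hop)]
    return np.stack([lpc_cepstrum(f) for f in frames])
```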
The MSVQ templates were then discriminatively trained repeatedly; the experimental results are shown in Fig. 2. The abscissa gives the number of training repetitions, where 0 denotes templates without discriminative training. The following conclusions can be drawn from Fig. 2: after discriminative training the templates are optimized and the system recognition rate improves markedly on both sets; repeating the discriminative training optimizes the templates further and raises the recognition rate further. After 5 rounds of discriminative training, the recognition rate reaches 99.4% on the training set and 94.5% on the test set. The gain from repeated discriminative training is especially evident on the training set.

Claims (2)

1. A training method for an embedded automatic speech recognition system, characterized in that it comprises two parts, improved multi-section vector quantization template training and generalized probabilistic descent discriminative training:
(1) improved multi-section vector quantization template training: the dynamic time warping method is used to divide training utterances belonging to the same class in time into several speech sections, so that the most correlated speech frames are aggregated in one section, in accordance with the temporal structure of speech, the statistical properties of each section, and the syllable structure of Chinese; the total number of template sections is set according to the number of syllables contained in the command word to be recognized;
(2) generalized probabilistic descent discriminative training: combined with the multi-section vector quantization speech template, the generalized probabilistic descent discriminative training algorithm is embedded in a recognizer based on the dynamic time warping method; the distance between a training utterance and a reference template is defined as the discriminant function, and the reference template set is discriminatively trained on the training set so that the recognition error rate is minimized; repeated discriminative training increases the discriminability between templates and yields better-optimized speech templates.
2. The training method for an embedded automatic speech recognition system according to claim 1, characterized in that a minimum-distortion criterion and dynamic programming are adopted to obtain speech sections of variable length, improving the rationality of the segmentation in the multi-section vector quantization method, so that the most correlated frames are aggregated in one section for template training; because the total number of template sections is related to the number of syllables the word contains, and in Chinese each syllable usually consists of 3 to 4 phonemes, each syllable is segmented according to the number of phonemes it contains, with one template section per phoneme.
CNB2004100667948A 2004-09-29 2004-09-29 Training method for embedded automatic sound identification system Expired - Fee Related CN1296887C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100667948A CN1296887C (en) 2004-09-29 2004-09-29 Training method for embedded automatic sound identification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100667948A CN1296887C (en) 2004-09-29 2004-09-29 Training method for embedded automatic sound identification system

Publications (2)

Publication Number Publication Date
CN1588538A CN1588538A (en) 2005-03-02
CN1296887C true CN1296887C (en) 2007-01-24

Family

ID=34604096

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100667948A Expired - Fee Related CN1296887C (en) 2004-09-29 2004-09-29 Training method for embedded automatic sound identification system

Country Status (1)

Country Link
CN (1) CN1296887C (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8762148B2 (en) 2006-02-27 2014-06-24 Nec Corporation Reference pattern adaptation apparatus, reference pattern adaptation method and reference pattern adaptation program
CN1835076B (en) * 2006-04-07 2010-05-12 安徽中科大讯飞信息科技有限公司 Speech evaluating method of integrally operating speech identification, phonetics knowledge and Chinese dialect analysis
CN101577118B (en) * 2009-06-12 2011-05-04 北京大学 Implementation method of voice interaction system facing intelligent service robot
CN103236261B (en) * 2013-04-02 2015-09-16 四川长虹电器股份有限公司 A kind of method of particular person speech recognition
CN104751856B (en) * 2013-12-31 2017-12-22 ***通信集团公司 A kind of speech sentences recognition methods and device
CN109754784B (en) * 2017-11-02 2021-01-29 华为技术有限公司 Method for training filtering model and method for speech recognition
CN110060667B (en) * 2019-03-15 2023-05-30 平安科技(深圳)有限公司 Batch processing method and device for voice information, computer equipment and storage medium
CN112863523B (en) * 2019-11-27 2023-05-16 华为技术有限公司 Voice anti-counterfeiting method and device, terminal equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5440662A (en) * 1992-12-11 1995-08-08 At&T Corp. Keyword/non-keyword classification in isolated word speech recognition
US5613037A (en) * 1993-12-21 1997-03-18 Lucent Technologies Inc. Rejection of non-digit strings for connected digit speech recognition
CN1223739A (en) * 1996-06-28 1999-07-21 微软公司 Method and system for dynamically adjusted training for speech recognition
US6292776B1 (en) * 1999-03-12 2001-09-18 Lucent Technologies Inc. Hierarchial subband linear predictive cepstral features for HMM-based speech recognition
CN1391211A (en) * 2001-04-20 2003-01-15 皇家菲利浦电子有限公司 Exercising method and system to distinguish parameters

Also Published As

Publication number Publication date
CN1588538A (en) 2005-03-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070124

Termination date: 20091029