CN1267384A

CN1267384A - Method for determining representative speech sound block from voice signal comprising speech units

Info

Publication number: CN1267384A
Application number: CN98808350A
Authority: CN
Inventors: M. Holzapfel
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1997-08-21
Filing date: 1998-07-27
Publication date: 2000-09-20
Anticipated expiration: 2018-07-27
Also published as: EP1005694A1; DE59801989D1; JP2001514400A; CN1115664C; ES2167945T3; WO1999010878A1; EP1005694B1

Abstract

After segmenting a voice signal into individual speech units, said units representing a speech sound block are assembled in a group. These multiple speech units included in a group describe distinctively well a sound block. Different selection criteria to evaluate the usability of individual speech units are provided. One advantage of combining the selection criteria is that different criteria can be taken into account when selecting a representative speech unit. Each selection criterion includes a membership function which indicates the 'usability' of individual speech units to be selected as a representative of the group. Preferably, the speech unit representing a maximum amongst the speech units of the group according to the selection criteria indicated by the membership function is selected as the representative of the corresponding sound block.

Description

Method for determining a representative of a block of speech of a language from a voice signal that comprises voice segments

The present invention relates to a method for determining a representative (Repraesentant) of a block of speech (Sprachbaustein) of a language from a voice signal that comprises voice segments (Lautabschnitt).

It is known to the person skilled in the art that a signal spoken by a person, i.e. a voice signal, can be divided into voice segments (segmentation), each voice segment comprising a part of the voice signal.

A language, for its part, can be described as a combination of many modular blocks of speech.

A membership function indicates the membership measure with which a voice segment represents a corresponding block of speech.

Many methods exist for selecting blocks of speech from a database. They optimize according to prosodic [1], linguistic [2] or continuity [3] criteria. Document [4] describes an automatically generated database.

Hidden Markov models (HMMs) are known from document [5].

A voice signal can be segmented with a "fast Viterbi alignment" by means of HMMs trained on the voice signal (see document [4]).

Dividing a voice signal into individual voice segments manually is unsatisfactory, because it requires great expense and experience and must be done separately for each speaker.

A further, more significant disadvantage is that the suitability of the selected representative (Repraesentant) is not tested, so that a poor representative chosen for a block of speech leads to a correspondingly poor speech synthesis result.

The object underlying the invention is to specify a method for determining a representative of a block of speech of a language from a voice signal that comprises voice segments, which avoids the above disadvantages and ensures an improved selection of the representative.

A segmentation is evaluated statistically on the basis of the individual voice segments, so that a statistically "good" representative can be determined for the relevant voice segments.

This object is achieved according to the features of claim 1.

According to the invention, a method is specified for determining a representative of a block of speech of a language from a voice signal that comprises voice segments. In this method, the voice segments of the voice signal are each combined, according to their membership of a block of speech, into the group belonging to that block of speech. This yields, for a plurality of blocks of speech, one group each containing at least one voice segment. A selection criterion is used to obtain selection values for the voice segments from the voice signal, and the frequency of the selection values obtained for the voice segments of the group concerned is determined. From the frequencies obtained in this way a membership function is determined. For each voice segment in the group it yields a membership measure, which indicates whether that voice segment can be used as a representative of the selected block of speech. A voice segment whose membership measure lies above a predetermined threshold is then determined as the representative of the group for the selected block of speech.

A major advantage of this method is that the representative is not an arbitrary voice segment taken from the group of the selected block of speech, but one that describes the selected block of speech with a sufficiently high quality (a sufficiently high membership measure).

The voice segments belonging to the group of a block of speech are statistically scattered within the voice signal with regard to their usability. The voice signal is preferably natural spoken language, supplied to a computer as a long speech sample. There are therefore so-called "good" and "bad" voice segments for the selected block of speech. The invention in particular avoids determining a bad voice segment as the representative of the selected block of speech.

In a development of the invention, at least one further selection criterion is applied to the voice segments, yielding at least one further selection value for each voice segment. For each group of voice segments (i.e. for each selected block of speech), the frequencies of all selection values are obtained, and a membership function is derived from these frequencies as described above.

In an additional development, the representative of the selected block of speech is determined from the group of voice segments by multiplying the individual membership measures (each selection criterion yields a membership function with one membership measure) into an overall measure. If the overall measure of a voice segment lies above a predetermined total threshold, this voice segment, which belongs to the selected block of speech, is suitable as its representative and is selected from the group of voice segments.

The advantage of using several selection criteria to determine the representative is that none of the selection values can be too poor. Multiplying the membership measures into the overall measure corresponds to an AND operation on the probability density functions. The representative then satisfies all selection criteria with sufficient quality.

A further development of the invention is that a voice segment is a phoneme, diphone, triphone, syllable, demisyllable or word of the language. Combinations of these voice segments are also possible.

Another development is that a voice segment is a single state of a hidden Markov model (HMM).

In a further development, the selection criterion is determined by one of the following quantities:

a) the energy of the respective voice segment;

b) the length of the respective voice segment;

c) the fundamental frequency of the respective voice segment;

d) the length control of the respective voice segment;

e) a statistical measure of the suitability of the respective voice segment.

A particular development of the invention is that synthetic speech is produced from the representatives obtained. Once the representatives of the blocks of speech have been obtained according to the invention, the blocks of speech of the language can be combined with them in entirely new arrangements. This yields a synthetic speech output in which the blocks of speech embodied by the respective representatives (voice segments) are output in a new order.
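The output of representatives in a new order can be pictured with a minimal Python sketch. It assumes that each representative is available as a plain list of samples and that the segments are simply joined end to end. The names `synthesize` and `representatives` and the sample values are hypothetical and do not come from the patent text.

```python
def synthesize(block_sequence, representatives):
    """Concatenate the representative samples of each requested block of speech."""
    samples = []
    for block in block_sequence:
        # Look up the single representative chosen for this block of speech.
        samples.extend(representatives[block])
    return samples


# Hypothetical representatives: block label -> list of audio samples.
representatives = {"h": [0.1, 0.2], "a": [0.3, 0.4, 0.5], "t": [0.0]}
out = synthesize(["h", "a", "t"], representatives)
```

A practical concatenative synthesizer would normally also smooth the joins between segments; that step is left out here.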

A further development of the invention is that a voice segment is determined as the representative of the selected block of speech if its membership measure has the highest value or, when several selection criteria are considered, if its overall measure has the highest value. In this way the "best" voice segment in the group for the selected block of speech is obtained.
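Choosing the "best" voice segment amounts to taking the maximum over the group. The following sketch assumes the membership measures of each voice segment are already known; the segment identifiers and numbers are invented for illustration only.

```python
from math import prod


def best_representative(memberships):
    """Return the segment id whose product of membership measures is highest."""
    return max(memberships, key=lambda segment: prod(memberships[segment]))


# Hypothetical membership measures, one list entry per selection criterion.
memberships = {
    "s1": [0.6, 0.9],
    "s2": [0.8, 0.8],
    "s3": [1.0, 0.3],
}
best = best_representative(memberships)
```

With a single selection criterion each list holds one value, and the same function returns the segment with the highest membership measure.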

Developments of the invention also follow from the dependent claims.

Embodiments of the invention are described in more detail below with reference to the following figures.

The figures show:

Figure 1: a block diagram of the steps of a method for determining a representative of a block of speech of a language from a voice signal that comprises voice segments,

Figure 2: a sketch of the structure of a language and its mapping onto a voice signal, in particular a text read aloud,

Figure 3: a sketch of the 'length control' selection criterion,

Figure 4: a sketch of the 'fundamental frequency' selection criterion,

Figure 5: a sketch of the 'energy' selection criterion,

Figure 6: a sketch of the 'SCORE' selection criterion.

Determining blocks of speech from a voice signal, preferably from a sufficiently long speech sample of one speaker, is important for concatenative speech synthesis, in which the blocks of speech found are rearranged into speech with a new meaning. The more precisely the individual voice segments are "cut out" of the voice signal, the higher the quality of the synthetic speech.

Figure 1 shows the individual steps of a method for determining a representative of a block of speech of a language from a voice signal that comprises voice segments. In step 101, the voice segments of the voice signal are combined, according to their membership of a block of speech, into one group per block of speech. This combination can be carried out automatically and is described, for example, in document [4]. Preferably, an HMM (hidden Markov model) is trained on the voice signal. The voice signal can be any speech sample with a length of about one to three hours. After step 101 the voice segments are combined into groups, each group containing at least one voice segment that belongs to a predetermined block of speech of the language.
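Step 101 can be pictured as follows. This is only a sketch: it assumes that a prior segmentation has already labelled each voice segment with its block of speech, and the function name, labels and segment identifiers are hypothetical.

```python
from collections import defaultdict


def group_segments(labelled_segments):
    """Collect voice segments into one group per block of speech."""
    groups = defaultdict(list)
    for block_label, segment in labelled_segments:
        groups[block_label].append(segment)
    return dict(groups)


# Hypothetical output of an automatic segmentation: (block label, segment id).
labelled = [("a", "seg0"), ("t", "seg1"), ("a", "seg2"), ("a", "seg3")]
groups = group_segments(labelled)
```

Every group produced this way contains at least one voice segment, because a group is only created when a segment carrying that label is encountered.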

Most such groups contain several voice segments, and for speech synthesis one representative has to be determined from each group. The voice segments in a group are not all alike but follow a statistical distribution. Knowledge of this distribution is used below to find and select a suitable representative among the voice segments of a group.

For this purpose the voice segments are evaluated according to predetermined selection criteria, each voice segment receiving one selection value per selection criterion. Preferably several different selection criteria are evaluated, so that each voice segment receives its own selection value for each criterion (see step 102).

For each group, the frequency of the selection values obtained for all voice segments of the group is determined (see step 103). This corresponds to a two-dimensional plot whose abscissa is the selection value and whose ordinate is the frequency. One such plot is produced for each selection criterion over all voice segments in the group. It represents the statistical distribution of the voice segments with respect to that selection criterion.
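One simple way to obtain such a frequency plot is a histogram with equal-width bins. The sketch below is an assumption about how step 103 might be carried out, not a procedure given in the text; the bin count and the example durations are invented.

```python
def selection_value_histogram(values, num_bins):
    """Count how often selection values fall into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return counts, edges


# Hypothetical selection values: segment durations in milliseconds for one group.
durations_ms = [80, 85, 90, 90, 95, 100, 100, 100, 105, 140]
counts, edges = selection_value_histogram(durations_ms, num_bins=4)
```

The returned `edges` correspond to the abscissa of the plot and `counts` to the ordinate.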

In the next step 104, the frequencies obtained are used to determine a membership function (one for each of the above plots). The membership function is preferably the envelope of the frequencies of the statistical distribution of the selection values. This step, too, is carried out for each selection criterion of each group. As mentioned above, a group comprises all voice segments that express the predetermined block of speech. From the membership function, a membership measure can be obtained for each voice segment. The membership measure indicates how usable the voice segment is as a representative of its group with respect to the selection criterion concerned.
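A minimal sketch of step 104 follows. It approximates the envelope by linear interpolation between histogram bin centres and scales the peak to 1. Both choices are assumptions made for illustration, as are the function name and the example histogram.

```python
def envelope_membership(counts, edges):
    """Return Z(value): linear interpolation between bin centres, peak scaled to 1."""
    peak = max(counts)
    centres = [(edges[i] + edges[i + 1]) / 2 for i in range(len(counts))]
    heights = [c / peak for c in counts]

    def membership(value):
        # Outside the outermost bin centres the envelope is held constant.
        if value <= centres[0]:
            return heights[0]
        if value >= centres[-1]:
            return heights[-1]
        for i in range(len(centres) - 1):
            left, right = centres[i], centres[i + 1]
            if left <= value <= right:
                t = (value - left) / (right - left)
                return heights[i] + t * (heights[i + 1] - heights[i])

    return membership


# Hypothetical histogram of one selection criterion within one group.
z = envelope_membership(counts=[1, 4, 2], edges=[0.0, 1.0, 2.0, 3.0])
```

Calling `z` with the selection value of a voice segment gives that segment's membership measure for this criterion. Values near the most frequent selection value score close to 1.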

Subsequently, in step 105, a voice segment whose membership measure lies above a predetermined threshold is selected as the representative. As mentioned above, several selection criteria are preferably used, which yields several membership measures for each voice segment. These membership measures are multiplied together to give an overall measure. Accordingly, the voice segment whose overall measure lies above a predetermined total threshold is then selected as the representative of the group.
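Step 105 with several selection criteria can be sketched as below. The segment identifiers reuse the LA1-j naming of Figure 2, but the membership measures and the threshold value are invented for illustration.

```python
from math import prod


def select_representatives(memberships, total_threshold):
    """Return segment ids whose product of membership measures exceeds the threshold."""
    selected = []
    for segment_id, measures in memberships.items():
        overall = prod(measures)  # overall measure: product of all membership measures
        if overall > total_threshold:
            selected.append(segment_id)
    return selected


# Hypothetical membership measures for three criteria per voice segment.
memberships = {
    "LA1-1": [0.9, 0.8, 0.9],  # good on every criterion
    "LA1-2": [1.0, 1.0, 0.1],  # one poor criterion drags the product down
    "LA1-3": [0.7, 0.9, 0.8],
}
chosen = select_representatives(memberships, total_threshold=0.5)
```

The second segment shows why the product behaves like an AND operation: two perfect measures cannot compensate for a single poor one.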

For clarity, Figure 2 shows the relation between a language SPR, which contains blocks of speech SBSi (i=1,2,...,n), and a voice signal SSI, which contains voice segments LAi-j (j=1,2,...,n) combined into groups GRi.

Relation 201 indicates that block of speech SBS1 can be expressed by the voice segments LA1-1, LA1-2, LA1-3, ..., LA1-m. These voice segments belonging to block of speech SBS1 are combined in group GR1. Each voice segment in group GR1 was obtained from the voice signal, and all of them describe block of speech SBS1. Depending on the voice signal, each voice segment has a different quality with respect to the various selection criteria. The aim is therefore to obtain a "usable" representative from the voice segments of group GR1. This representative can realize block of speech SBS1 in speech synthesis.

The same applies analogously to relation 202. Any block of speech SBSn can be expressed by a number (here 'p') of voice segments combined in a group GR2.

The selection criteria mentioned above are examined next. Many such selection criteria are possible, and a selection of them is proposed here. These criteria can be used individually, combined with one another, or combined with other selection criteria in order to determine a representative from a group of voice segments advantageously.

Figure 3 shows length control as a selection criterion, i.e. a measure of the duration of the synthesized voice segment relative to the original duration of the voice segment. Deviations up to a lower threshold L_{UG} and an upper threshold L_{OG} are regarded as unproblematic. Beyond these thresholds, i.e. below the lower threshold L_{UG} or above the upper threshold L_{OG}, the membership function Z_{L_syn} falls off exponentially. The membership function Z_{L_syn} is given by the following formula:

(1).

Normalizing by the average length l_Φ to 1 makes the deviation relative. The membership function Z_{L_syn} is also normalized to 1. ZG denotes the membership measure.
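Formula (1) itself is not reproduced in this text. The qualitative behaviour described for length control (a plateau of 1 between the two thresholds and an exponential fall-off outside them) can still be sketched as follows. The threshold values and the decay constant are placeholders chosen here, not values from the patent.

```python
import math


def length_control_membership(ratio, l_ug=0.8, l_og=1.25, decay=5.0):
    """Plateau of 1 between the thresholds, exponential fall-off outside them.

    ratio: synthesized duration relative to the normalized average length.
    l_ug, l_og, decay: assumed placeholder values for illustration only.
    """
    if ratio < l_ug:
        return math.exp(-decay * (l_ug - ratio))
    if ratio > l_og:
        return math.exp(-decay * (ratio - l_og))
    return 1.0


z_inside = length_control_membership(1.0)
z_short = length_control_membership(0.6)
z_long = length_control_membership(2.0)
```

Because the exponent is zero exactly at each threshold, the function is continuous there and never exceeds 1, which matches the normalization to 1 stated above.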

Figure 4 shows fundamental-frequency control as a selection criterion. Here the deviation of the fundamental frequency of the voice segment from a target fundamental frequency (in speech synthesis) should be minimal. The membership function Z_{L_syn} has the following form:

(2).

Here too, for clarity, the frequency f is normalized to the average frequency f_Φ, and the membership function Z_{L_syn} is likewise normalized to 1. The upper threshold of the frequency is denoted f_{OG} and the lower threshold f_{UG}.

Figure 5 shows the energy of the voice segment as a selection criterion. The relative deviation of this energy from a mean value of the energy is the criterion for the membership function Z_{E_al}:

(3).

The mean value of the energy E is E_Φ (expected value), E_{UG} is a lower threshold of the energy, E_{OG} is an upper threshold of the energy, and σ_E is the variance of the energy. The membership function Z_{E_al} is normalized to 1.

If the length of the voice segment is used as the selection criterion instead of the energy, a membership function Z_{L_al} analogous to Figure 5 results, which evaluates the relative deviation of the voice segment length. With an upper threshold L_{OG}, a lower threshold L_{UG} and a variance σ_l of the length, the membership function Z_{L_al} is:

(4).

Figure 6 shows the score SCORE as a selection criterion. The score SCORE is a measure of how suitable a voice segment is as a representative, that is, of whether a candidate voice segment is a typical, characteristically articulated voice segment and is therefore "suited" to represent the corresponding block of speech.

Between the voice segment with the "best" score (Z_{S}(s_max) = 1) and the voice segment with the "worst" score (Z_{S}(s_min) = 1 - s_G), the membership function Z_{S}(s) for the SCORE selection criterion is assumed to be linear (see the curve Z_{S}(s) in Figure 6). This membership function Z_{S}(s) can be determined by the following formula:
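The formula is not reproduced in this text. From the stated conditions alone, namely a linear course with the value 1 at s_max and the value 1 - s_G at s_min, one form that satisfies both endpoints would be the following. It is offered as a reconstruction from those conditions, not as the original equation.

```latex
Z_S(s) = 1 - s_G \, \frac{s_{\max} - s}{s_{\max} - s_{\min}},
\qquad s_{\min} \le s \le s_{\max}
```

Substituting s = s_max gives 1, and substituting s = s_min gives 1 - s_G, in agreement with the two values stated above.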

To judge whether a voice segment is suitable as a representative of the corresponding block of speech, several of the membership functions set out above are preferably considered. To ensure that no membership function of a selected representative has a value below the predetermined threshold, the individual membership measures are combined by an AND operation. This is done by multiplying the membership measures into an overall measure. With the membership functions listed above, this gives:

The multiplication over all states for the membership functions Z_{E_al} and Z_{L_al} refers to the individual states of the HMM used to describe a voice segment. Depending on the modelling, HMMs with different numbers of states can be used, in which case all states of each voice segment enter individually into the overall measure Z_{ges} formed from the membership functions.
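The equation for the overall measure is missing from this text. Read from the description alone, a product of this kind would take roughly the following shape. The number of HMM states K, the state index k, and the label Z_F for the fundamental-frequency membership measure are notational assumptions made here and are not symbols taken from the original.

```latex
Z_{ges} = Z_{L\_syn} \cdot Z_{F} \cdot Z_{S} \cdot
          \prod_{k=1}^{K} Z_{E\_al}^{(k)} \, Z_{L\_al}^{(k)}
```

Since every factor is normalized to at most 1, a single small factor, including that of a single HMM state, lowers the overall measure, which is the AND behaviour described above.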

The following documents are cited in this document:

[1] Nick Campbell, Alan W. Black: "Prosody and the Selection of Source Units for Concatenative Synthesis", in Progress in Speech Synthesis, ISBN 0-387-94701-9, Springer Verlag New York, 1997, pp. 279-292.

[2] Andrew J. Hunt, Alan W. Black: "Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database", Proc. EUROSPEECH 1995, Madrid, pp. 373-376.

[3] Alistair D. Conkie, Stephen Isard: "Optimal Coupling of Diphones", in Progress in Speech Synthesis, ISBN 0-387-94701-9, Springer Verlag New York, 1997, pp. 293-304.

[4] R. E. Donovan, P. C. Woodland: "Improvements in an HMM-based Speech Synthesiser", Proc. ICASSP 1995, Michigan, pp. 573-576.

[5] G. Ruske: "Automatische Spracherkennung: Methoden der Klassifikation u. Merkmalsextraktion", Oldenbourg Verlag, Muenchen, 1988, pp. 160-171.

Claims

1. A method for determining a representative of a predetermined block of speech of a language from a voice signal that comprises voice segments,

a) wherein the voice segments of the voice signal are combined into one group for each block of speech of the language,

b) wherein selection values are obtained from the voice signal for the voice segments of each group according to a predetermined selection criterion,

c) wherein the frequency of the selection values of the group is determined,

d) wherein a membership function is determined from the frequency, this membership function indicating a membership measure for the usability of the respective voice segment of the respective group,

e) wherein, from the group of voice segments of the selected block of speech, that voice segment whose membership measure lies above a predetermined threshold is determined as the representative.

2. The method according to claim 1,

wherein further selection values of the voice segments in the group are obtained by means of at least one further selection criterion, further frequencies of the further selection values are determined, and for each further frequency a further membership function with a corresponding further membership measure is determined.

3. The method according to claim 2,

wherein the individual membership measures are multiplied into an overall measure, and that voice segment whose overall measure lies above a predetermined total threshold is obtained from the group of voice segments as the representative.

4. The method according to one of the preceding claims,

wherein a voice segment is a phoneme, diphone, triphone, syllable, demisyllable or word of the language, or a combination of these.

5. The method according to one of the preceding claims,

wherein a voice segment is a single state of a hidden Markov model.

6. The method according to one of the preceding claims,

wherein the selection criterion is one of the following quantities:

a) the energy of the respective voice segment;

b) the length of the respective voice segment;

c) the fundamental frequency of the respective voice segment;

d) the length control of the respective voice segment;

e) a statistical measure of the suitability of the respective voice segment.

7. The method according to one of the preceding claims,

wherein speech is assembled from the representatives obtained.

8. The method according to one of the preceding claims,

wherein that voice segment is determined as the representative of the block of speech whose membership measure has the highest value or, if several selection criteria are considered, whose overall measure has the highest value.