CN1267384A - Method for determining representative speech sound block from voice signal comprising speech units - Google Patents

Method for determining representative speech sound block from voice signal comprising speech units

Info

Publication number
CN1267384A
Authority
CN
China
Prior art keywords
voice segments
speech
group
representative
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN98808350A
Other languages
Chinese (zh)
Other versions
CN1115664C (en)
Inventor
M. Holzapfel (M·霍泽普菲尔)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Publication of CN1267384A publication Critical patent/CN1267384A/en
Application granted granted Critical
Publication of CN1115664C publication Critical patent/CN1115664C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

After a voice signal has been segmented into individual speech units, the units representing a given speech sound block are assembled into a group. The multiple speech units contained in a group describe that sound block particularly well. Different selection criteria are provided for evaluating the usability of the individual speech units. One advantage of combining selection criteria is that several different criteria can be taken into account when selecting a representative speech unit. For each selection criterion, a membership function indicates the 'usability' of the individual speech units as candidates for the representative of the group. Preferably, the speech unit that attains the maximum among the speech units of the group, according to the selection criteria expressed by the membership functions, is selected as the representative of the corresponding sound block.

Description

Method for determining a representative of a speech building block of a language from a speech signal comprising speech segments
The present invention relates to a method for determining a representative (Repraesentant) of a speech building block (Sprachbaustein) of a language from a speech signal comprising speech segments (Lautabschnitte).
It is known to the expert that a signal spoken by a person, i.e. a speech signal, can be divided into speech segments (segmentation), each speech segment comprising a part of the speech signal.
A language, in turn, can be described as a combination of a large number of modular speech building blocks.
A membership function indicates with what degree of membership a speech segment represents a corresponding speech building block.
There are many methods for selecting speech building blocks from a database: an optimization is carried out by means of prosodic criteria in [1], linguistic criteria in [2], or continuity criteria in [3]. The automatic generation of such a database is described in document [4].
Hidden Markov models (HMMs) are known from document [5].
The segmentation of a speech signal can be carried out by means of HMMs trained on the speech signal, using a "fast Viterbi alignment" (see document [4]).
Segmenting a speech signal into individual speech segments by manual methods is unsatisfactory, because it requires great effort and experience and must be carried out separately for each speaker.
A further, more important disadvantage is that the suitability of the selected representative (Repraesentant) is not tested; if a poor representative is chosen for a speech building block, the result of the speech synthesis is correspondingly poor.
The object underlying the present invention is to provide a method for determining a representative of a speech building block of a language from a speech signal comprising speech segments, which avoids the above-mentioned disadvantages and guarantees an improved selection of the representative.
A segmentation is evaluated statistically by means of the individual speech segments, so that a statistically "good" representative of the relevant speech segments can be determined for a segment.
The object of the invention is achieved according to the features of claim 1.
According to the invention, a method is provided for determining a representative of a speech building block of a language from a speech signal comprising speech segments. In this method, the speech segments of the speech signal are combined, according to their membership of a speech building block, into a group belonging to the selected speech building block. For a plurality of speech building blocks, one thus obtains a group each containing at least one speech segment. A selection criterion is used to determine a selection value for each speech segment from the speech signal, and the frequencies of the selection values obtained for the relevant group are determined. A membership function is determined by means of the frequencies thus obtained; for each speech segment of the group, the membership function yields a membership measure, which indicates whether this speech segment can be used as a representative (i.e. as the selected speech segment). A speech segment whose membership measure lies above a predetermined threshold is then determined as the representative of the group belonging to the selected speech building block.
A great advantage of this method is that it does not take an arbitrary representative from the group of the selected speech building block, but obtains a representative whose quality measure (a sufficiently high membership measure) is high enough to describe the selected speech building block.
The speech segments belonging to a group of a speech building block are, with respect to their usability, statistically distributed over the speech signal. The speech signal is preferably a long speech sample of natural spoken language recorded for the computer. For the selected speech building block there are therefore so-called "good" and "bad" speech segments. With the invention it can be avoided, in particular, that a bad speech segment is determined as the representative of the selected speech building block.
In a development of the invention, at least one further selection criterion for the speech segments is used. At least one further selection value is determined for each speech segment. For each group of speech segments (i.e. for each selected speech building block), the frequencies of all selection values are determined, and a membership function is derived from these frequencies as described above.
In a further development, the representative of the selected speech building block is determined from the group of speech segments by multiplying the individual membership measures (one membership measure per selection criterion, derived from the respective membership function) to form an overall measure. If the overall measure of a speech segment lies above a predetermined overall threshold, this speech segment is suitable as a representative of the selected speech building block and is selected from the group of speech segments belonging to this selected speech building block.
The advantage of determining the representative with a plurality of selection criteria is that no excessively poor selection value can occur. Multiplying the membership measures into the overall measure corresponds to an AND operation on the probability density functions. The representative then satisfies all selection criteria with a sufficient quality measure.
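A minimal sketch of this AND-combination, assuming the per-criterion membership measures have already been computed; the segment names, criterion names and threshold value are illustrative and not taken from the patent:

```python
# Sketch: combine per-criterion membership measures by multiplication (fuzzy AND)
# and keep only segments whose overall measure exceeds an overall threshold.

def overall_measure(membership_measures: dict) -> float:
    """Multiply the membership measures of all selection criteria."""
    result = 1.0
    for value in membership_measures.values():
        result *= value
    return result

# Illustrative candidate segments of one group with assumed membership measures.
candidates = {
    "LA1-1": {"length_control": 0.9, "f0": 0.8, "energy": 0.95},
    "LA1-2": {"length_control": 0.4, "f0": 0.9, "energy": 0.7},
}
OVERALL_THRESHOLD = 0.5  # illustrative value

suitable = {}
for segment, measures in candidates.items():
    z = overall_measure(measures)
    if z >= OVERALL_THRESHOLD:
        suitable[segment] = z
print(suitable)
```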
A further development of the invention is that the speech segments are phonemes, diphones, triphones, syllables, half-syllables or words of the language. Combinations of these speech segments are also possible.
A further development is that the speech segments are individual states of a hidden Markov model (HMM).
A further development is that the selection criterion is determined by one of the following quantities (a sketch of simple stand-ins for these quantities follows the list):
a) the energy of each speech segment;
b) the length of each speech segment;
c) the fundamental frequency of each speech segment;
d) the length control of each speech segment;
e) a statistical measure of the suitability of each speech segment.
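By way of illustration only, the following sketch computes simple stand-ins for quantities a) to c) from a segment's samples. The exact definitions used in the patent, and quantities d) and e), which depend on the synthesis target and on a statistical model, are not specified here, so these helpers are assumptions:

```python
import numpy as np

def segment_energy(samples: np.ndarray) -> float:
    """a) Energy of a segment: here the sum of squared samples (one common convention)."""
    return float(np.sum(samples.astype(np.float64) ** 2))

def segment_length(samples: np.ndarray, sample_rate: int) -> float:
    """b) Length of a segment in seconds."""
    return len(samples) / sample_rate

def segment_f0(samples: np.ndarray, sample_rate: int,
               f_min: float = 60.0, f_max: float = 400.0) -> float:
    """c) Crude fundamental-frequency estimate via autocorrelation (illustrative only)."""
    x = samples.astype(np.float64) - np.mean(samples)
    corr = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo = int(sample_rate / f_max)
    hi = min(int(sample_rate / f_min), len(corr) - 1)
    if hi <= lo:
        return 0.0  # segment too short for this crude estimate
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag
```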
A special development of the invention is that synthesized speech is produced from the representatives determined. If the representatives of the speech building blocks are obtained according to the invention, completely new combinations of the speech building blocks that make up the language can be generated by means of these representatives. This yields a synthesized speech output in which the speech building blocks embodied by the individual representatives (speech segments) are output in a new order.
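Assuming each representative is available as a waveform array, concatenative output then amounts to joining the representatives in the order required by the new utterance; a minimal sketch (real systems would additionally smooth the joins):

```python
import numpy as np

def synthesize(building_block_sequence, representatives):
    """Concatenate the representative waveforms of the requested building blocks.

    `representatives` maps a building-block label to a 1-D waveform array
    (the representative segment selected by the method described here).
    """
    return np.concatenate([representatives[b] for b in building_block_sequence])
```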
A further development of the invention is that a speech segment is determined as the representative of the selected speech building block if its membership measure has the highest value or, if several selection criteria are considered, if its overall measure has the highest value. In this way the "best" speech segment within the group of speech segments of the relevant selected speech building block is obtained.
Developments of the invention also follow from the dependent claims.
Embodiments of the invention are explained in more detail below with reference to the accompanying figures.
The figures show:
Figure 1: a block diagram of the individual steps of the method for determining a representative of a speech building block of a language from a speech signal comprising speech segments;
Figure 2: a sketch of the structure of a language and its mapping onto a speech signal, in particular a read-aloud text;
Figure 3: a sketch of the "length control" selection criterion;
Figure 4: a sketch of the "fundamental frequency" selection criterion;
Figure 5: a sketch of the "energy" selection criterion;
Figure 6: a sketch of the "SCORE" selection criterion.
Determining speech building blocks from a speech signal, preferably from a sufficiently long speech sample of a single speaker, is important for concatenative speech synthesis, in which the speech building blocks found are rearranged into speech with a new meaning. The more accurately the individual speech segments are "cut out" of the speech signal, the higher the quality of the synthesized speech.
Figure 1 shows the individual steps of the method for determining a representative of a speech building block of a language from a speech signal comprising speech segments. In step 101, the speech segments of the speech signal are combined, according to their membership of a speech building block, into groups, one group per speech building block. This grouping can be carried out automatically and is described, for example, in document [4]. An HMM (hidden Markov model) training is preferably carried out on the speech signal. The speech signal can be an arbitrary speech sample with a length of approximately one to three hours. After step 101 the speech segments are combined into groups, each group containing at least one speech segment that belongs to a predetermined speech building block of the language.
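Step 101 can be sketched as a simple grouping operation, assuming the segmentation has already labelled every segment with the building block it realizes (the data layout of (label, segment) pairs is an assumption):

```python
from collections import defaultdict

def group_segments(labelled_segments):
    """Step 101 (sketch): collect speech segments per speech building block.

    `labelled_segments` is an iterable of (building_block_label, segment) pairs,
    e.g. produced by an HMM-based segmentation of the speech signal.
    """
    groups = defaultdict(list)
    for label, segment in labelled_segments:
        groups[label].append(segment)
    return groups
```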
Each such group usually contains several speech segments, and a representative is now to be determined from each group for the speech synthesis. The speech segments within a group are not all identical but follow a statistical distribution. This knowledge of the distribution is used below in order to find and select a suitable representative among the speech segments of a group.
To this end, the speech segments are evaluated according to predetermined selection criteria, a selection value being obtained for each speech segment and each selection criterion. Preferably, each speech segment is evaluated by several different selection criteria, a specific selection value being obtained for each selection criterion (for each speech segment) (see step 102).
For each group, the frequencies of the selection values obtained for all speech segments of this group are determined (see step 103). This corresponds to a two-dimensional plot in which the abscissa represents the value of the selection criterion and the ordinate the frequency. Such a plot is produced for each selection criterion over all speech segments of a group; it represents a statistical distribution of the speech segments evaluated according to that selection criterion.
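Step 103 corresponds to a histogram of the selection values of one group; a sketch (the bin count is an arbitrary choice):

```python
import numpy as np

def selection_value_histogram(selection_values, bins=20):
    """Step 103 (sketch): frequencies of the selection values of one group."""
    counts, edges = np.histogram(selection_values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])  # abscissa: selection value
    return centers, counts                    # ordinate: frequency
```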
In the next step 104, the determined frequencies are used to determine a membership function (for each of the above plots). The membership function is preferably an envelope drawn over the frequencies of the statistical distribution of the selection values. This step is likewise carried out for each selection criterion of each group. As mentioned above, a group contains all speech segments expressing a predetermined speech building block. From the membership function, a membership measure can be obtained for each speech segment. The membership measure represents a measure of the usability of the individual speech segment within the group with respect to the respective selection criterion.
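One possible reading of step 104, sketched under the assumption that the envelope is a smoothed, normalized histogram that is then interpolated at each segment's selection value; the smoothing and interpolation choices are assumptions, the patent only requires an envelope over the frequencies:

```python
import numpy as np

def membership_function(centers, counts, smooth=3):
    """Step 104 (sketch): envelope over the frequency distribution, normalized to 1."""
    kernel = np.ones(smooth) / smooth
    envelope = np.convolve(counts.astype(float), kernel, mode="same")
    envelope /= envelope.max() if envelope.max() > 0 else 1.0

    def membership(value: float) -> float:
        """Membership measure of a segment with the given selection value."""
        return float(np.interp(value, centers, envelope))

    return membership
```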
In step 105, a speech segment whose membership measure lies above a predetermined threshold is then selected as the representative. As mentioned above, several selection criteria are preferably used, so that several membership measures result for each speech segment. The membership measures are multiplied together to form an overall measure. Accordingly, a speech segment whose overall measure lies above a predetermined overall threshold is selected as the representative of the group.
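Putting steps 102 to 105 together for one group, reusing the two helpers sketched above; `criteria` maps a criterion name to a function that computes a selection value for a segment, and the overall threshold is illustrative:

```python
import numpy as np

def select_representative(segments, criteria, overall_threshold=0.5):
    """Steps 102-105 (sketch): pick a segment whose overall measure exceeds the threshold."""
    # Step 102: selection values per criterion and per segment.
    values = {name: np.array([fn(seg) for seg in segments])
              for name, fn in criteria.items()}
    # Steps 103-104: one membership function per criterion.
    memberships = {}
    for name, vals in values.items():
        centers, counts = selection_value_histogram(vals)
        memberships[name] = membership_function(centers, counts)
    # Step 105: multiply membership measures; keep the best segment above the threshold.
    best, best_measure = None, 0.0
    for i, seg in enumerate(segments):
        measure = 1.0
        for name in criteria:
            measure *= memberships[name](values[name][i])
        if measure >= overall_threshold and measure > best_measure:
            best, best_measure = seg, measure
    return best, best_measure
```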
For clarity, Figure 2 shows the relation between a language SPR made up of speech building blocks SBSi (i = 1, 2, ..., n) and a speech signal SSI whose speech segments LAi-j (j = 1, 2, ...) are combined into groups GRi.
The relation 201 indicates that the speech building block SBS1 can be expressed by the speech segments LA1-1, LA1-2, LA1-3, ..., LA1-m. The speech segments assigned to the speech building block SBS1 are combined in the group GR1. Each speech segment of the group GR1 is obtained from the speech signal and describes the speech building block SBS1. Depending on the speech signal, the individual speech segments have different quality measures with respect to the different selection criteria. The goal is therefore to obtain a "usable" representative from the speech segments of the group GR1; this representative realizes the speech building block SBS1 in the synthesized speech.
The same applies analogously to the relation 202: an arbitrary speech building block SBSn can be expressed by a number (here "p") of speech segments combined in a group GR2.
The above-mentioned selection criteria are now examined in more detail. There are many possibilities for such selection criteria; a selection of them is presented here. These criteria can be used individually, combined with one another, or combined with further selection criteria, so that a representative can advantageously be determined from the group of speech segments.
Figure 3 shows the length control as a selection criterion, i.e. a measure of the duration of the synthesized speech segment relative to the original duration of the speech segment. Deviations up to a lower threshold L_UG and an upper threshold L_OG are regarded as unproblematic. Beyond these thresholds, i.e. below the lower threshold L_UG or above the upper threshold L_OG, the membership function Z_L_syn falls off exponentially. The membership function Z_L_syn is determined by formula (1), which is given only as an image in the original publication. By normalizing the average length l_Φ to 1, the deviation is expressed as a relative quantity. The membership function Z_L_syn is likewise normalized to 1. ZG denotes the membership measure.
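Since formula (1) is not reproduced in this text, the following sketch assumes one plausible shape that is consistent with the description: membership 1 between the thresholds and an exponential fall-off outside them; the decay constant is an arbitrary choice, not the patent's value:

```python
import math

def z_length_control(l_rel, l_ug, l_og, decay=5.0):
    """Sketch of Z_L_syn: l_rel is the synthetic/original duration ratio, normalized
    so that the average length corresponds to 1. Deviations inside [l_ug, l_og] are
    unproblematic; outside, membership decays exponentially (assumed decay constant)."""
    if l_ug <= l_rel <= l_og:
        return 1.0
    distance = (l_ug - l_rel) if l_rel < l_ug else (l_rel - l_og)
    return math.exp(-decay * distance)
```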
Figure 4 shows the fundamental frequency control as a selection criterion. Here the deviation of the fundamental frequency of the speech segment from a target fundamental frequency (in the synthesized speech) should be as small as possible. The corresponding membership function has the form of formula (2), which is given only as an image in the original publication. For clarity, the frequency f is normalized to the average frequency f_Φ, and the membership function is likewise normalized to 1. The upper frequency threshold is denoted f_OG and the lower frequency threshold f_UG.
Figure 5 shows the energy of the speech segment as a selection criterion. The relative deviation of the energy from a mean value of the energy is the criterion for the membership function Z_E_al, given by formula (3) (an image in the original publication). The mean value of the energy E is E_Φ (the expected value), E_UG is a lower threshold of the energy, E_OG an upper threshold of the energy, and σ_E is the variance of the energy. The membership function Z_E_al is normalized to 1.
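Formula (3) is likewise not reproduced here. Since the description names a mean, a lower and an upper threshold, and a variance σ_E, a Gaussian-shaped fall-off outside the threshold band is one plausible reading; the sketch below is an assumption in that sense, not the patent's exact formula. The same shape can serve analogously, with L_UG, L_OG and σ_l, for the length criterion described next:

```python
import math

def z_band_with_gaussian_tails(x, lower, upper, sigma):
    """Sketch of Z_E_al (and analogously Z_L_al): membership 1 inside
    [lower, upper], Gaussian-shaped decay governed by sigma outside."""
    if lower <= x <= upper:
        return 1.0
    d = (lower - x) if x < lower else (x - upper)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))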
If the length of the speech segment is used as a selection criterion instead of the energy, a membership function Z_L_al analogous to Figure 5 results, which evaluates the relative deviation of the speech segment length. With an upper threshold L_OG, a lower threshold L_UG and a variance σ_l of the length, the membership function Z_L_al is given by formula (4) (an image in the original publication).
Figure 6 shows the score SCORE as a selection criterion. The score SCORE is a measure of how well a speech segment is suited as a representative, i.e. of whether a candidate speech segment is a typical, characteristic realization of the pronounced sound and is therefore "suitable" as the representative of the corresponding speech building block.
Between the speech segment with the "best" score (Z_S(s_max) = 1) and the speech segment with the "worst" score (Z_S(s_min) = 1 - s_G), the membership function Z_S(s) is assumed to be linear (see the curve Z_S(s) in Figure 6). This membership function Z_S(s) can be determined by formula (5), which is not reproduced in this text.
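Although formula (5) is not reproduced, the two stated endpoints, Z_S(s_max) = 1 and Z_S(s_min) = 1 - s_G, together with the stated linearity, fix the function up to those symbols; a sketch:

```python
def z_score(s, s_min, s_max, s_g):
    """Sketch of Z_S(s): linear between the worst score s_min (membership 1 - s_g)
    and the best score s_max (membership 1)."""
    if s_max == s_min:
        return 1.0
    return 1.0 - s_g * (s_max - s) / (s_max - s_min)
```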
To judge whether a speech segment is suitable as a representative of the corresponding speech building block, several of the membership functions established above are preferably considered. To ensure that no membership function of a selected representative has a value below the predetermined threshold, the individual membership measures are combined by an AND operation. This is realized by multiplying the individual membership measures to form an overall measure. Taking the membership functions listed above into account, this yields the overall measure Z_ges (the corresponding formula is given only as an image in the original publication).
The product over all states in the membership functions Z_E_al and Z_L_al refers to the individual states of the HMMs used to describe a speech segment. Depending on the modelling, HMMs with different numbers of states can be used, all states of a speech segment entering individually into the overall measure Z_ges.
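A sketch of the overall measure, assuming the membership measures above and assuming that the energy and length measures are evaluated per HMM state and multiplied over all states of the segment; the parameter names are illustrative:

```python
def z_overall(z_length_control_value, z_f0_value, z_score_value,
              z_energy_per_state, z_length_per_state):
    """Sketch of Z_ges: product (fuzzy AND) of all membership measures, with the
    state-wise measures multiplied over every HMM state of the segment."""
    z = z_length_control_value * z_f0_value * z_score_value
    for z_e, z_l in zip(z_energy_per_state, z_length_per_state):
        z *= z_e * z_l
    return z
```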
The following documents are cited in this description:
[1] Nick Campbell, Alan W. Black: "Prosody and the Selection of Source Units for Concatenative Synthesis", in Progress in Speech Synthesis, ISBN 0-387-94701-9, Springer Verlag, New York, 1997, pp. 279-292.
[2] Andrew J. Hunt, Alan W. Black: "Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database", Proc. EUROSPEECH 1995, Madrid, pp. 373-376.
[3] Alistair D. Conkie, Stephen Isard: "Optimal Coupling of Diphones", in Progress in Speech Synthesis, ISBN 0-387-94701-9, Springer Verlag, New York, 1997, pp. 293-304.
[4] R. E. Donovan, P. C. Woodland: "Improvements in an HMM-based Speech Synthesiser", Proc. ICASSP 1995, Michigan, pp. 573-576.
[5] G. Ruske: "Automatische Spracherkennung: Methoden der Klassifikation und Merkmalsextraktion", Oldenbourg Verlag, München, 1988, pp. 160-171.

Claims (8)

1. A method for determining a representative of a predetermined speech building block of a language from a speech signal comprising speech segments,
a) in which the speech segments of the speech signal are combined into groups, each group corresponding to a speech building block of the language,
b) in which selection values are determined from the speech signal for the speech segments of a group according to a predetermined selection criterion,
c) in which the frequencies of the selection values of the group are determined,
d) in which a membership function is determined by means of the frequencies, the membership function indicating a membership measure for the usability of the relevant speech segment of the relevant group,
e) in which, from the group of speech segments of the selected speech building block, that speech segment whose membership measure lies above a predetermined threshold is determined as the representative.
2. The method according to claim 1, in which further selection values of the speech segments of the group are determined by means of at least one further selection criterion, further frequencies of the further selection values are determined and, for each further frequency, a further membership function with a corresponding further membership measure is determined.
3. The method according to claim 2, in which the individual membership measures are multiplied to form an overall measure and that speech segment of the group whose overall measure lies above a predetermined overall threshold is determined as the representative.
4. The method according to one of the preceding claims, in which the speech segments are phonemes, diphones, triphones, syllables, half-syllables, words of the language or combinations thereof.
5. The method according to one of the preceding claims, in which the speech segments are individual states of a hidden Markov model.
6. The method according to one of the preceding claims, in which the selection criterion is one of the following quantities:
a) the energy of each speech segment;
b) the length of each speech segment;
c) the fundamental frequency of each speech segment;
d) the length control of each speech segment;
e) a statistical measure of the suitability of each speech segment.
7. The method according to one of the preceding claims, in which speech is composed from the representatives determined.
8. The method according to one of the preceding claims, in which that speech segment whose membership measure has the highest value or, if several selection criteria are considered, whose overall measure has the highest value is determined as the representative of the speech building block.
CN98808350A 1997-08-21 1998-07-27 Method for determining representative speech sound block from voice signal comprising speech units Expired - Fee Related CN1115664C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE19736465.9 1997-08-21
DE19736465 1997-08-21

Publications (2)

Publication Number Publication Date
CN1267384A true CN1267384A (en) 2000-09-20
CN1115664C CN1115664C (en) 2003-07-23

Family

ID=7839772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN98808350A Expired - Fee Related CN1115664C (en) 1997-08-21 1998-07-27 Method for determining representative speech sound block from voice signal comprising speech units

Country Status (6)

Country Link
EP (1) EP1005694B1 (en)
JP (1) JP2001514400A (en)
CN (1) CN1115664C (en)
DE (1) DE59801989D1 (en)
ES (1) ES2167945T3 (en)
WO (1) WO1999010878A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10120513C1 (en) 2001-04-26 2003-01-09 Siemens Ag Method for determining a sequence of sound modules for synthesizing a speech signal of a tonal language
US8918316B2 (en) * 2003-07-29 2014-12-23 Alcatel Lucent Content identification system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2590414B2 (en) * 1991-03-12 1997-03-12 科学技術庁長官官房会計課長 Fuzzy pattern recognition method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108269589A (en) * 2016-12-31 2018-07-10 ***通信集团贵州有限公司 For the speech quality assessment method and its device of call
CN108269589B (en) * 2016-12-31 2021-01-29 ***通信集团贵州有限公司 Voice quality evaluation method and device for call
CN110246490A (en) * 2019-06-26 2019-09-17 合肥讯飞数码科技有限公司 Voice keyword detection method and relevant apparatus

Also Published As

Publication number Publication date
EP1005694A1 (en) 2000-06-07
DE59801989D1 (en) 2001-12-06
JP2001514400A (en) 2001-09-11
CN1115664C (en) 2003-07-23
ES2167945T3 (en) 2002-05-16
WO1999010878A1 (en) 1999-03-04
EP1005694B1 (en) 2001-10-31

Similar Documents

Publication Publication Date Title
Lluís et al. End-to-end music source separation: Is it possible in the waveform domain?
CN1152365C (en) Apparatus and method for pitch tracking
CN1162839C (en) Method and device for producing acoustics model
CN1169115C (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
DE60020434T2 (en) Generation and synthesis of prosody patterns
EP1213705B1 (en) Method and apparatus for speech synthesis
CN1275746A (en) Equipment for converting text into audio signal by using nervus network
US20090254349A1 (en) Speech synthesizer
CN101064104A (en) Emotion voice creating method based on voice conversion
CN1750120A (en) Indexing apparatus and indexing method
CN101075432A (en) Speech synthesis apparatus and method
CN101051462A (en) Feature-vector compensating apparatus and feature-vector compensating method
JPH0782348B2 (en) Subword model generation method for speech recognition
CN1835075A (en) Speech synthetizing method combined natural sample selection and acaustic parameter to build mould
CN106295717A (en) A kind of western musical instrument sorting technique based on rarefaction representation and machine learning
CN1308911C (en) Method and system for identifying status of speaker
CN1924994A (en) Embedded language synthetic method and system
WO2014183411A1 (en) Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound
CN1115664C (en) Method for determining representative speech sound block from voice signal comprising speech units
Steffman et al. An automated method for detecting F0 measurement jumps based on sample-to-sample differences
CN1787072A (en) Method for synthesizing pronunciation based on rhythm model and parameter selecting voice
Jacewicz et al. Variability in within-category implementation of stop consonant voicing in American English-speaking children
US7454347B2 (en) Voice labeling error detecting system, voice labeling error detecting method and program
CN1238805C (en) Method and apparatus for compressing voice library
CN105719641A (en) Voice selection method and device used for waveform splicing of voice synthesis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee