CN1604185A

CN1604185A - Speech synthesis system and method using variable-length sub-words

Info

Publication number: CN1604185A
Application number: CN 03164848
Authority: CN
Inventors: 祖漪清; 陈桂林; 俞振利; 岳东剑
Original assignee: Motorola Inc
Current assignee: Serenes Operations
Priority date: 2003-09-29
Filing date: 2003-09-29
Publication date: 2005-04-06
Anticipated expiration: 2023-09-29
Also published as: CN1604185B

Abstract

A system and method for synthesizing speech from input text. The method comprises: receiving an input text string; comparing the input text string with an indexed sound inventory; retrieving from the sound inventory complete sub-word waveforms corresponding to the input text string; retrieving from the sound inventory phoneme string waveforms corresponding to the input text string; retrieving from the sound inventory single phoneme waveforms corresponding to the input text string; concatenating the waveforms; and generating synthesized speech corresponding to the input text string.

Description

Speech synthesis system and method using variable-length sub-words

Technical field

The present invention relates generally to a method and system for speech synthesis using a relatively small sound inventory. The invention is particularly, but not exclusively, suited to speech synthesis on handheld devices such as mobile phones and personal digital assistants.

Background art

Well-known sophisticated speech synthesis techniques use a concatenative method. Such a technique uses actual recordings of human utterances stored in a pronunciation database. Portions of the utterances are recombined, or concatenated, to generate various spoken phrases. The recombined portions can be complete words, word segments, or even smaller segments such as single syllables. The larger the concatenated segments, the more natural the resulting synthesized speech sounds. However, using larger segments requires a large-capacity memory to store the audio data if the audio database is to support synthesis of a reasonably large vocabulary.

The size of the audio database can be reduced by storing only smaller segments, such as diphones or single phones; however, the quality of the resulting synthesized speech usually degrades as well. This is because it is difficult to form correct intonation and transition durations between very short speech segments, and hence difficult to produce natural-sounding speech. Sophisticated techniques exist for analyzing small phoneme-chain units such as CV and VCV (where C denotes a consonant and V a vowel). However, the algorithms that implement such techniques are very complex and require a more powerful processor.

Other methods for reducing the size of the audio database of a speech synthesis system include a technique known as formant synthesis. With formant synthesis, no audio database is needed, because the human voice is simulated using only a filtered electronic excitation signal. However, the resulting synthesized speech usually sounds very unnatural and "robotic".

The popularity of portable electronic devices such as mobile phones and personal digital assistants (PDAs) has increased the demand for high-quality speech synthesizers. A handheld device with a built-in speech synthesizer would be considerably more convenient. For example, e-mail and text messages, such as SMS messages, could be synthesized into speech and listened to by the user of a mobile phone. However, the storage and processing resources of such handheld electronic devices are usually very limited. A speech synthesizer built into such a device must therefore use a compact and efficient audio database.

There is therefore a need for an improved speech synthesis method and system that uses a compact audio database while still producing natural-sounding speech.

Summary of the invention

According to one aspect, the present invention is a method of speech synthesis comprising: receiving an input text string; comparing the input text string with an indexed sound inventory; retrieving from the sound inventory complete sub-word waveforms corresponding to the input text string; retrieving from the sound inventory phoneme string waveforms corresponding to the input text string; retrieving from the sound inventory single phoneme waveforms corresponding to the input text string; and concatenating the waveforms to produce synthesized speech corresponding to the input text string.

Preferably, the method includes the step of generating the sound inventory by performing a statistical analysis on a large text corpus to determine common words, and dividing the common words into positional syllables.

The step of generating the sound inventory may further include classifying the phonemes of the positional syllables and discarding the phonemes with low definition.

The step of generating the sound inventory may further include calculating the frequency of CV-like units in the large text corpus and selecting the sub-words most commonly used in the large text corpus.

The step of concatenating the waveforms may include hard concatenation of the sub-word waveforms (concatenation requiring almost no signal processing), or may include modified concatenation of the phoneme string waveforms and the single phoneme waveforms.

Modified concatenation preferably includes changing the duration of the concatenated waveforms.

According to another aspect, the present invention is a system for speech synthesis from input text, comprising a sound inventory containing sub-word waveforms. A multi-level speech unit selector is coupled to the sound inventory, and a multi-layer synthesizer is coupled to the speech unit selector. A level of the speech unit selector is selected according to whether a segment of the input text matches a sub-word waveform in the sound inventory.

The multi-layer synthesizer preferably includes a first layer for performing hard concatenation and a second layer for performing modified concatenation.

The sound inventory may include CV-like unit waveforms, which may be indexed by an annotation file.

The multi-level speech unit selector preferably includes a first level that can be coupled to the first layer of the multi-layer synthesizer to perform hard concatenation, and a second level and a third level that can be coupled to the second layer of the multi-layer synthesizer to perform modified concatenation.

In this specification, including the claims, the words "comprise", "comprising" and similar terms are intended to denote a non-exclusive inclusion, so that a method or apparatus comprising a list of elements does not include only those elements but may also include other elements not listed.

Brief description of the drawings

To make the invention easy to understand and put into practice, preferred embodiments will now be described with reference to the accompanying drawings, in which like reference numerals denote like elements, and in which:

Fig. 1 is a schematic diagram of the functional components of a speech synthesis system according to the present invention;

Fig. 2 is a flow chart of how a sound inventory is generated according to the present invention; and

Fig. 3 is a flow chart of a speech synthesis method according to the present invention.

Detailed description of the preferred embodiments

Referring to Fig. 1, there is shown a schematic diagram of the functional components of a system 100 for speech synthesis according to the present invention. A sound inventory 110 comprises a plurality of sub-word components 120, for example consonant initials, consonant endings and CV-like units. The sub-word components 120 are classified using an index 130.

The sound inventory 110 interfaces with a multi-level unit selector 140. The unit selector 140 determines which of three levels is used to synthesize the text input to the system 100. The first level of the unit selector 140 is selected when a segment of the input text string can be divided into sub-words whose waveforms are contained in the sound inventory 110. The second level is selected when the sub-words needed to synthesize the segment are not contained in the sound inventory 110 but phoneme strings in the sound inventory 110 can be used to synthesize it. Finally, the third level is selected when the segment can be synthesized only from single phonemes contained in the sound inventory 110.

The unit selector 140 interfaces with a two-layer synthesizer 150, which synthesizes the speech output by the system 100. A first layer 160 performs hard concatenation synthesis on sub-words from the first level of the unit selector 140. A second layer 170 of the synthesizer 150 performs modified concatenation synthesis on speech components received from the second or third level of the unit selector 140. Hard concatenation and modified concatenation are described in detail later in this description. The dotted arrows in Fig. 1 indicate that speech components received from the second or third level of the unit selector 140 can also be joined using hard concatenation.

Referring to Fig. 2, there is shown a flow chart of a method 200 for generating the sound inventory 110. In step 205, a statistical analysis is performed on a large text corpus. The analysis includes determining the words that make up the significant majority of the words in any given typical input text. Most Western languages, for example English, have more than 150,000 words comprising at least 41,000 positional syllables. Then, in step 210, the common words from step 205 are divided into positional syllables. A positional syllable is defined as a syllable carrying a word-position tag, as follows:

Ws: a syllable in a monosyllabic word;

Wo: a syllable in a polysyllabic word, other than the final syllable of the word; and

Wf: the final syllable in a polysyllabic word.
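By way of illustration only, the position tags defined above can be assigned as in the following minimal Python sketch. The function name tag_positions, and the assumption that each word arrives already split into syllables, are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple


def tag_positions(syllables: List[str]) -> List[Tuple[str, str]]:
    """Attach a word-position tag (Ws, Wo or Wf) to each syllable of one word.

    `syllables` is assumed to be the word already split into syllables;
    the splitting itself is outside the scope of this sketch.
    """
    if not syllables:
        return []
    if len(syllables) == 1:
        # Only syllable of a monosyllabic word.
        return [(syllables[0], "Ws")]
    # Every syllable except the last one in a polysyllabic word.
    tagged = [(s, "Wo") for s in syllables[:-1]]
    # Final syllable of a polysyllabic word.
    tagged.append((syllables[-1], "Wf"))
    return tagged


print(tag_positions(["b'ae", "tax", "riy"]))
print(tag_positions(["l'ow"]))
```

The three-syllable example is tagged Wo, Wo, Wf, and the single-syllable example is tagged Ws.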

Method 200 then proceeds to step 215, where the phonemes in each syllable are classified. Phonemes can be broadly divided into four classes: consonants, semivowels, vowels and voiced (nasal) endings. The classes differ in how sharply they are defined, so in step 220 phonemes with low definition can be discarded. The definition of a speech unit according to the present invention is therefore syllable-based, and the length of a speech unit varies from one syllable to four or more syllables. This means that the following combinations can be omitted from the sound inventory 110: consonant to consonant, vowel to consonant, semivowel to consonant, and nasal ending to consonant. However, the following combinations are taken into account when concatenating speech units: consonant to vowel, semivowel to vowel, and vowel to semivowel. The endings of consonant strings can be shared by different words. As a result, the more than 41,000 positional syllables mentioned above are reduced to only 16,000 CV-like units. Table 1 below gives an example of how the above sub-word units are used to describe, for instance, the syllable conversion of "Battery level is low":

Table 1

Syllable conversion in " Battery level is low "

Word	CV-like unit
Battery	b’ae(Wo)+tax(Wo)+riy(Wf)
Level	l’eh(Wo)+vaxl(Wf)
Is	’Ih(Ws)+s
Low	l’ow(Ws)
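The CV-like unit strings in Table 1 have a regular shape: units are joined by "+", and each unit may be followed by a position tag in parentheses. A minimal sketch of a parser for that notation follows. The function name parse_units and the handling of an untagged unit (such as the trailing "s" in "Is") are assumptions made for illustration, and the sketch writes the stress mark as a plain apostrophe.

```python
import re
from typing import List, Optional, Tuple

# One unit: phoneme symbols, optionally followed by "(Ws)", "(Wo)" or "(Wf)".
_UNIT = re.compile(r"^(?P<phones>[^()]+)(?:\((?P<pos>Ws|Wo|Wf)\))?$")


def parse_units(spec: str) -> List[Tuple[str, Optional[str]]]:
    """Split a Table 1 style string into (phonemes, position tag) pairs.

    The tag is None for a unit written without one.
    """
    units = []
    for part in spec.split("+"):
        match = _UNIT.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognised unit: {part!r}")
        units.append((match.group("phones"), match.group("pos")))
    return units


print(parse_units("b'ae(Wo)+tax(Wo)+riy(Wf)"))
print(parse_units("'Ih(Ws)+s"))
```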

Method 200 then proceeds to step 225, where the frequency of each CV-like unit is calculated from the word frequencies and the unit frequencies in a dictionary (which, according to a preferred embodiment of the invention, contains more than 190,000 entries). Statistical analysis of English text shows that about 6,900 words cover about 90% of input text and about 4,100 words cover about 85% of input text. The frequency, or number of occurrences, of each sub-word is defined as follows:

$$n_i = n_{1i} + n_{2i}$$

where $n_i$ is the number of occurrences of the $i$-th sub-word, $n_{1i}$ is the number of occurrences of words containing the $i$-th sub-word, and $n_{2i}$ is the number of times the $i$-th sub-word occurs in the dictionary. The frequency of each sub-word can be calculated from $n_i$ for $i = 1, 2, \ldots, N$, where $N$ is the number of sub-words in the dictionary.
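The count defined above can be sketched in Python as follows. This is an illustrative reading only: it assumes that the first term sums the corpus frequency of every word containing the sub-word and that the second term counts the sub-word's occurrences across dictionary entries. The names subword_counts, corpus_word_freq and lexicon, and the toy data, are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def subword_counts(corpus_word_freq: Dict[str, int],
                   lexicon: Dict[str, List[str]]) -> Counter:
    """Return n_i = n_1i + n_2i for every sub-word i found in the lexicon.

    corpus_word_freq: word -> number of times the word occurs in the corpus.
    lexicon:          word -> list of sub-word units making up the word.
    """
    n1: Counter = Counter()  # corpus occurrences of words containing the unit
    n2: Counter = Counter()  # occurrences of the unit across dictionary entries
    for word, units in lexicon.items():
        for unit in set(units):          # count each word once per unit for n1
            n1[unit] += corpus_word_freq.get(word, 0)
        for unit in units:
            n2[unit] += 1
    return n1 + n2


# Toy data, invented for the example.
lexicon = {
    "low": ["l'ow(Ws)"],
    "level": ["l'eh(Wo)", "vaxl(Wf)"],
    "levy": ["l'eh(Wo)", "viy(Wf)"],
}
corpus_word_freq = {"low": 5, "level": 3, "levy": 1}
counts = subword_counts(corpus_word_freq, lexicon)
print(counts["l'eh(Wo)"])  # (3 + 1) corpus occurrences + 2 dictionary entries = 6
```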

Finally, in step 230, the most frequently used sub-words, which will cover most of the expected input text, are selected. For English, the result of the above calculation shows that 20% of the sub-words cover more than 85% of English text. Accordingly, about 2,400 sub-words are selected to form the speech unit catalogue. The speech waveform associated with each sub-word is extracted from a speech corpus to form the sound inventory 110. The method 200 thus significantly reduces redundancy in the sound inventory 110.
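One plausible way to carry out the selection of step 230 is to take sub-words in order of decreasing count until a target share of all occurrences is covered. The sketch below assumes this greedy cut-off and uses share of sub-word occurrences as a stand-in for text coverage; the function name select_by_coverage and the toy counts are hypothetical.

```python
from typing import Dict, List


def select_by_coverage(counts: Dict[str, int], target: float) -> List[str]:
    """Pick the most frequent sub-words until `target` of all occurrences is covered."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    # Most frequent first; ties broken alphabetically so the result is stable.
    for unit, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= target * total:
            break
        selected.append(unit)
        covered += n
    return selected


# Toy counts, invented for the example: 100 occurrences in total.
toy = {"a": 50, "b": 30, "c": 10, "d": 6, "e": 4}
print(select_by_coverage(toy, 0.85))  # ['a', 'b', 'c'] covers 90 of 100
```

In this toy case three of the five units are enough to pass the 85% target, which mirrors the kind of reduction the description reports for English.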

The speech waveform associated with each sub-word in the sound inventory 110 is indexed by the index 130. The index 130 may comprise a simple annotation file accompanying the recorded speech waveforms. The index 130 is thus used to identify the phoneme strings and single phonemes contained in the sub-word waveforms.
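The specification does not give the layout of the annotation file. Purely as an illustration, the sketch below assumes that each sub-word waveform is annotated with a list of (phoneme, start sample, end sample) labels, and shows how such an index could locate a phoneme string or a single phoneme inside a stored sub-word waveform. The label layout, the sample offsets and the name find_phoneme_string are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumed annotation layout: (phoneme, start sample, end sample) per label.
Annotation = List[Tuple[str, int, int]]


def find_phoneme_string(index: Dict[str, Annotation],
                        phonemes: Sequence[str]
                        ) -> Optional[Tuple[str, int, int]]:
    """Return (sub-word, start sample, end sample) of the first stored
    waveform region whose labels match `phonemes`, or None if there is none."""
    want = list(phonemes)
    if not want:
        return None
    for unit, labels in index.items():
        names = [name for name, _, _ in labels]
        for i in range(len(names) - len(want) + 1):
            if names[i:i + len(want)] == want:
                start = labels[i][1]
                end = labels[i + len(want) - 1][2]
                return unit, start, end
    return None


# Invented sample offsets for one sub-word from Table 1.
index = {"vaxl(Wf)": [("v", 0, 400), ("ax", 400, 1300), ("l", 1300, 2000)]}
print(find_phoneme_string(index, ["ax", "l"]))  # ('vaxl(Wf)', 400, 2000)
```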

Referring to Fig. 3, there is shown a flow chart of a speech synthesis method 300 according to the present invention. Method 300 is invoked at an initial step 305, for example when the user of a handheld device receives a text message and wants it synthesized into speech. In step 310, the speech synthesis system 100 receives an input text string, for example the text message mentioned above. In step 315, the input text string is pre-processed. Pre-processing divides the input text string into segments that include position information associated with each segment. Then, in step 320, a segment of the input text string is compared with the sound inventory 110. In step 325, it is determined whether a complete sub-word waveform in the sound inventory 110 matches the current segment of the input text string. If so, method 300 executes step 330 and retrieves the matching sub-word waveform from the sound inventory 110. Next, in step 360, the sub-word waveforms are concatenated. Steps 330 and 360 are associated with the first level of the unit selector 140, and the sub-words are joined by hard concatenation in the first layer 160 of the two-layer synthesizer 150. Hard concatenation is described in detail below. Next, in step 335, it is determined whether the input text string has further segments to be compared with the sound inventory 110. If so, method 300 returns to step 320, where the next segment of the input text string is compared with the sound inventory 110; otherwise, method 300 ends at step 340.

If it is determined in step 325 that the sound inventory 110 contains no complete sub-word waveform matching the current segment of the input text string, method 300 proceeds to step 345 to determine whether the sound inventory 110 contains multiple phoneme string waveforms matching the current segment. If so, method 300 proceeds to step 350 and retrieves the matching phoneme string waveforms from the sound inventory 110. Next, in step 365, the phoneme string waveforms are concatenated. Steps 350 and 365 are associated with the second level of the unit selector 140, and the phoneme strings are joined by modified concatenation in the second layer 170 of the synthesizer 150. Modified concatenation is also described in detail below. Method 300 then returns to step 335 to determine whether the input text string has further segments to be compared with the sound inventory 110.

If it is determined in step 345 that the sound inventory 110 contains no phoneme string waveforms matching the current segment of the input text string, method 300 proceeds to step 355 and retrieves single phoneme waveforms from the sound inventory 110. Then, in step 365, the single phoneme waveforms are concatenated so as to correspond as closely as possible to the current segment of the input text string. Here, steps 355 and 365 are associated with the third level of the unit selector 140, and the single phonemes are again joined by modified concatenation in the second layer 170 of the synthesizer 150. Method 300 then returns to step 335 to determine whether the input text string has further segments to be compared with the sound inventory 110. Once all segments of the input text string have been compared with the indexed sound inventory 110, method 300 ends at step 340.
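The control flow of steps 320 to 365 can be summarised by the following sketch. It is a simplified reading, not the patented implementation: segments are assumed to be tuples of phoneme symbols, the inventory is assumed to be three plain dictionaries, and the second level is assumed to succeed when the segment can be split entirely into stored phoneme strings. All names (retrieve, synthesize, split_into_strings, modified_join) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Phones = Tuple[str, ...]
Wave = List[float]


def split_into_strings(seg: Phones,
                       strings: Dict[Phones, Wave]) -> Optional[List[Phones]]:
    """Split `seg` into stored phoneme strings, or return None if impossible."""
    best: List[Optional[List[Phones]]] = [None] * (len(seg) + 1)
    best[0] = []
    for end in range(1, len(seg) + 1):
        for start in range(end):
            prefix = best[start]
            piece = seg[start:end]
            if prefix is not None and piece in strings:
                best[end] = prefix + [piece]
                break  # smallest start first, so the last piece is the longest
    return best[len(seg)]


def retrieve(seg: Phones, subwords: Dict[Phones, Wave],
             strings: Dict[Phones, Wave],
             phonemes: Dict[str, Wave]) -> Tuple[int, List[Wave]]:
    """Return (selector level, waveforms) for one segment."""
    if seg in subwords:                          # steps 325 and 330
        return 1, [subwords[seg]]
    pieces = split_into_strings(seg, strings)    # steps 345 and 350
    if pieces:
        return 2, [strings[p] for p in pieces]
    return 3, [phonemes[p] for p in seg]         # step 355 (KeyError if missing)


def synthesize(segments: Sequence[Phones], subwords: Dict[Phones, Wave],
               strings: Dict[Phones, Wave], phonemes: Dict[str, Wave],
               modified_join: Callable[[List[Wave]], Wave]
               ) -> Tuple[Wave, List[int]]:
    """Loop over segments (steps 320 and 335) and build the output waveform."""
    speech: Wave = []
    levels: List[int] = []
    for seg in segments:
        level, waves = retrieve(seg, subwords, strings, phonemes)
        levels.append(level)
        if level == 1:
            speech.extend(waves[0])              # step 360: spliced unchanged
        else:
            speech.extend(modified_join(waves))  # step 365: second layer
    return speech, levels
```

A caller supplies modified_join, for example a function that joins the pieces and then adjusts their duration. Passing a plain join reduces the second layer to hard concatenation, which corresponds to the dotted-arrow path of Fig. 1.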

Thus, the method 300 according to the invention concatenates waveforms from the sound inventory 110 on the basis of a "best fit" analysis of the segments of the input text string. The hard concatenation performed by the first layer 160 of the two-layer synthesizer 150 means that waveforms from the sound inventory 110 are simply spliced together without modification. When the concatenated waveforms are large enough that their total duration is very close to the natural speaking duration of the corresponding input text string segment, this process yields natural-sounding speech.

On the other hand, when hard concatenation cannot produce natural-sounding speech, modified concatenation is used. The second layer 170 of the synthesizer 150 performs modified concatenation, in which the duration of the concatenated waveforms is adjusted to obtain more natural-sounding speech.
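The difference between the two layers can be sketched as follows. Hard concatenation is a plain splice. For the duration adjustment, the specification does not name an algorithm, so the sketch substitutes naive linear-interpolation resampling as a stand-in. This changes pitch along with duration, and a practical synthesizer would likely use a pitch-preserving time-scale method instead. All function names are hypothetical.

```python
from typing import Iterable, List

Wave = List[float]


def hard_concatenate(waves: Iterable[Wave]) -> Wave:
    """First layer: splice waveforms end to end without modification."""
    out: Wave = []
    for wave in waves:
        out.extend(wave)
    return out


def rescale_duration(wave: Wave, scale: float) -> Wave:
    """Resample `wave` to about len(wave) * scale samples (stand-in only)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    if not wave:
        return []
    n_out = max(1, round(len(wave) * scale))
    if len(wave) == 1 or n_out == 1:
        return [wave[0]] * n_out
    step = (len(wave) - 1) / (n_out - 1)
    out: Wave = []
    for k in range(n_out):
        pos = k * step
        i = min(int(pos), len(wave) - 2)   # left neighbour, kept in range
        frac = pos - i
        out.append(wave[i] * (1.0 - frac) + wave[i + 1] * frac)
    return out


def modified_concatenate(waves: Iterable[Wave], scale: float) -> Wave:
    """Second layer: splice, then adjust the duration of the result."""
    return rescale_duration(hard_concatenate(waves), scale)


# Two 4-sample pieces shortened by 25 %: 8 samples become 6.
print(len(modified_concatenate([[0.0] * 4, [1.0] * 4], 0.75)))  # 6
```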

Modified concatenation can be better understood with reference to Table 2 below.

Table 2

Type	Case	Left text	Right text	Duration result
Intra-syllable, 1+1 <= 2	1	Vowel	Semivowel/nasal ending	1+1 < 2
Intra-syllable, 1+1 <= 2	2	Vowel	Vowel	1+1 < 2
Intra-syllable, 1+1 <= 2	3	Vowel	Consonant ending	1+1 = 2
Intra-syllable, 1+1 <= 2	4	Consonant initial	Semivowel/nasal	1+1 ~= 2
Intra-syllable, 1+1 <= 2	5	Consonant initial	Vowel	1+1 ~= 2
Intra-syllable, 1+1 <= 2	6	Semivowel/nasal initial	Vowel	1+1 = 2
Inter-syllable, 1+1 >= 2	7	Vowel	Semivowel/nasal	1+1 >= 2
Inter-syllable, 1+1 >= 2	8	Semivowel/nasal	Semivowel/nasal	1+1 >= 2
Inter-syllable, 1+1 >= 2	9	Vowel	Consonant	1+1 = 2
Inter-syllable, 1+1 >= 2	10	Consonant ending	Consonant initial	1+1 = 2

Table 2 gives examples of ten different cases, in which the sub-word components 120 of the sound inventory 110 are divided into left and right text. The rightmost column of Table 2 describes the type of concatenation required to produce natural-sounding synthesized speech when the sub-word components 120 are concatenated. For example, case 2 in Table 2 indicates that when modified concatenation is used to join two vowel waveforms from the sound inventory 110, the duration of the concatenated waveform must be reduced by 25% to obtain natural-sounding speech.

By contrast, case 9 in Table 2 indicates that when two waveforms consisting of a vowel and a consonant are concatenated, the duration of the concatenated waveform need not be modified. The first layer 160 of the synthesizer 150 will therefore perform this hard concatenation.
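The ten cases of Table 2 lend themselves to a simple lookup. The sketch below encodes the table as read from the description and returns the duration relation for a pair of units, together with the synthesizer layer that would handle it. Treating only an exact "=" as eligible for the hard layer, and therefore sending the "~=" cases to the second layer, is an assumption of this sketch, as are the names TABLE_2, duration_relation and layer_for.

```python
from typing import Dict, Tuple

# Key: (scope, left class, right class); value: relation between the
# duration of the joined waveform and the sum of the two parts.
Key = Tuple[str, str, str]

TABLE_2: Dict[Key, str] = {
    ("intra", "vowel", "semivowel/nasal ending"): "<",            # case 1
    ("intra", "vowel", "vowel"): "<",                             # case 2
    ("intra", "vowel", "consonant ending"): "=",                  # case 3
    ("intra", "consonant initial", "semivowel/nasal"): "~=",      # case 4
    ("intra", "consonant initial", "vowel"): "~=",                # case 5
    ("intra", "semivowel/nasal initial", "vowel"): "=",           # case 6
    ("inter", "vowel", "semivowel/nasal"): ">=",                  # case 7
    ("inter", "semivowel/nasal", "semivowel/nasal"): ">=",        # case 8
    ("inter", "vowel", "consonant"): "=",                         # case 9
    ("inter", "consonant ending", "consonant initial"): "=",      # case 10
}


def duration_relation(scope: str, left: str, right: str) -> str:
    """Look up the Table 2 duration relation; raises KeyError if not listed."""
    return TABLE_2[(scope, left, right)]


def layer_for(relation: str) -> int:
    """Return 1 (hard layer 160) when no duration change is needed,
    otherwise 2 (modified layer 170). Assumption: '~=' goes to layer 2."""
    return 1 if relation == "=" else 2


print(layer_for(duration_relation("inter", "vowel", "consonant")))  # case 9 -> 1
print(layer_for(duration_relation("intra", "vowel", "vowel")))      # case 2 -> 2
```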

The present invention is thus an improved method and system for speech synthesis that uses a relatively small sound inventory 110. A properly constructed sound inventory 110 yields an indexed set of waveforms from which about 85% of input text strings can be synthesized by hard concatenation. The remaining 15% of input text strings can be synthesized using the modified concatenation technique described above. The sound inventory 110 is therefore highly compact and contains minimally redundant waveforms, making it particularly suitable for handheld devices with limited memory. In addition, the reduced size of the sound inventory 110 makes the search algorithm of the invention faster and more efficient.

The foregoing detailed description provides only a preferred embodiment and is not intended to limit the scope, applicability or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements and steps without departing from the spirit and scope of the invention as set forth in the claims.

Claims

1. A speech synthesis method, comprising:

receiving an input text string;

comparing the input text string with an indexed sound inventory;

retrieving from the sound inventory complete sub-word waveforms corresponding to the input text string;

retrieving from the sound inventory phoneme string waveforms corresponding to the input text string;

retrieving from the sound inventory single phoneme waveforms corresponding to the input text string; and

concatenating the waveforms to provide synthesized speech corresponding to the input text string.

2. The method according to claim 1, further comprising the step of generating the sound inventory by:

performing a statistical analysis on a large text corpus to determine common words; and

dividing the common words into positional syllables.

3. The method according to claim 2, wherein the step of generating the sound inventory further comprises the steps of:

classifying the phonemes from the positional syllables; and

discarding the phonemes with low definition.

4. The method according to claim 2, wherein the step of generating the sound inventory further comprises the steps of:

calculating the frequency of CV-like units in the large text corpus; and

selecting the sub-words most frequently used in the large text corpus.

5. The method according to claim 1, wherein the step of concatenating the waveforms comprises hard concatenation of the sub-word waveforms.

6. The method according to claim 1, wherein the step of concatenating the waveforms comprises modified concatenation of the phoneme string waveforms and the single phoneme waveforms.

7. The method according to claim 6, wherein the modified concatenation comprises changing the duration of the concatenated waveforms.

8. A system for speech synthesis from input text, comprising:

a sound inventory containing sub-word waveforms;

a multi-level speech unit selector connectable to the sound inventory; and

a multi-layer synthesizer connectable to the speech unit selector, wherein a level of the speech unit selector is selected according to whether a segment of the input text corresponds to a sub-word waveform in the sound inventory.

9. The system according to claim 8, wherein the multi-layer synthesizer comprises a first layer for performing hard concatenation and a second layer for performing modified concatenation.

10. The system according to claim 8, wherein the sound inventory comprises CV-like unit waveforms.

11. The system according to claim 10, wherein the CV-like unit waveforms are indexed using an annotation file.

12. The system according to claim 8, wherein the multi-level speech unit selector comprises:

a first level connectable to the first layer of the multi-layer synthesizer, for performing hard concatenation; and

a second level and a third level connectable to the second layer of the multi-layer synthesizer, for performing modified concatenation.