CN105895075A - Method and system for improving synthetic voice rhythm naturalness - Google Patents

Info

Publication number
CN105895075A
Authority
CN
China
Prior art keywords
weak
weak reading
synthesis unit
syllable
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510038454.2A
Other languages
Chinese (zh)
Other versions
CN105895075B (en)
Inventor
祖漪清
王祖燕
黄维
邵鹏飞
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201510038454.2A
Publication of CN105895075A
Application granted
Publication of CN105895075B
Legal status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention discloses a method and system for improving the prosodic naturalness of synthesized speech. The method comprises: receiving a text to be synthesized; determining the basic synthesis unit sequence corresponding to the text, the sequence comprising one or more basic synthesis units; determining whether each basic synthesis unit is weakly read; obtaining the synthesis parameter model corresponding to each basic synthesis unit and, if the unit is weakly read, applying weak-reading processing to that model to obtain an updated synthesis parameter model; generating the synthesis parameter model sequence corresponding to the basic synthesis unit sequence; and generating continuous speech from the synthesis parameter model sequence. The method and system improve the naturalness of continuous synthesized speech simply and effectively.

Description

Method and system for improving the prosodic naturalness of synthesized speech
Technical field
The present invention relates to the field of speech synthesis, and in particular to a method and system for improving the prosodic naturalness of synthesized speech.
Background
Achieving humanized, intelligent, and effective human-machine interaction, and building an efficient and natural human-machine communication environment, has become an urgent need in the application and development of information technology. Speech synthesis converts text into natural speech signals, so that arbitrary text can be converted in real time. It replaces the cumbersome traditional approach of making machines speak by playing back recordings, and it saves system storage space. As the volume of information grows, it plays an increasingly important role, particularly in dynamic query applications whose content changes frequently.
In recent years, as the demands of the information society have grown, users have placed higher requirements on human-machine interaction, and highly natural synthesized speech has become an important hallmark of a high-performance speech synthesis system. Prosodic issues that reflect the cadence and rhythm of speech, such as prosodic breaks (break) and word stress (focus), are receiving attention from a growing number of researchers. Prosodic breaks can be resolved by analyzing syntactic information such as part of speech, and with sufficient training data an accuracy above 80% can be reached, which meets functional requirements. Word stress, however, involves semantic focus analysis and is still not well solved. Many speech synthesis systems therefore avoid providing a word stress function altogether, so the synthesized speech lacks rise and fall in pitch and rhythm, which impairs the naturalness of the synthesis.
The prior art generally uses stress prediction based on semantic analysis: semantic analysis determines the focus of the continuous input text, the synthesis units that need to be stressed are identified and marked, the corresponding synthesis models are obtained from the stress prediction results and the synthesis features, and a continuous synthesized speech signal is then produced. Stress prediction, however, is highly uncertain, and its results are often not accurate enough, especially for text with unrestricted content. Stress information applied in the wrong place has a clearly negative effect.
Summary of the invention
Embodiments of the present invention provide a method and system for improving the prosodic naturalness of synthesized speech, so as to improve the naturalness of continuous synthesized speech.
To achieve the above object, the technical solution of the present invention is as follows:
A method for improving the prosodic naturalness of synthesized speech, comprising:
Receiving a text to be synthesized;
Determining the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units;
Determining whether each basic synthesis unit is weakly read;
Obtaining the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, applying weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;
Generating the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
Generating continuous speech according to the synthesis parameter model sequence.
Preferably, determining whether the basic synthesis unit is weakly read comprises:
Obtaining the syllable string and/or syllable to which the basic synthesis unit belongs;
Determining whether the syllable string and/or syllable is weakly read and, if so, determining that the basic synthesis unit is weakly read.
Preferably, determining whether the syllable string and/or syllable is weakly read comprises:
Checking whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary;
If so, determining that the basic synthesis unit is weakly read;
Otherwise, checking whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary;
If the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary, extracting the prosodic features of the syllable, and then determining whether the syllable is weakly read according to those prosodic features and a pre-built weak-reading decision tree; if the syllable is weakly read, the basic synthesis unit is weakly read, and otherwise the basic synthesis unit is not weakly read;
If the syllable to which the basic synthesis unit belongs is not in the preset weak-reading vocabulary, determining that the basic synthesis unit is not weakly read.
Preferably, building the weak-reading vocabulary comprises:
Obtaining candidate weak-reading words to form a weak-reading word set;
Obtaining a corpus;
Calculating, in turn, the weak-reading frequency in the corpus of each candidate weak-reading word in the weak-reading word set;
If the weak-reading frequency is greater than a frequency threshold, determining that the candidate weak-reading word is a weak-reading word;
Generating the weak-reading vocabulary from the weak-reading words so determined.
Preferably, building the weak-reading decision tree comprises:
Obtaining a large amount of text based on the weak-reading vocabulary as training data;
Performing word segmentation on the training data and determining the syllables contained in each word;
Prosodically annotating each syllable, the prosodic annotation including weak-reading information;
Training the weak-reading decision tree from the training data and the prosodic annotation of each corresponding syllable.
Preferably, applying weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain the updated synthesis parameter model comprises:
Obtaining the model parameters of the synthesis parameter model, the model parameters comprising a duration parameter, a fundamental frequency (F0) parameter, and an energy parameter;
Updating the model parameters according to mapping rules obtained by training in advance, to obtain the updated synthesis parameter model.
A system for improving the prosodic naturalness of synthesized speech, the system comprising:
A receiving module, configured to receive a text to be synthesized;
A basic synthesis unit sequence determining module, configured to determine the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units;
A weak-reading prediction module, configured to determine whether each basic synthesis unit is weakly read;
A synthesis parameter model obtaining module, configured to obtain the synthesis parameter model corresponding to the basic synthesis unit;
A weak-reading processing module, configured to apply weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit when the basic synthesis unit is weakly read, to obtain an updated synthesis parameter model;
A synthesis parameter model sequence generating module, configured to generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
A synthesis module, configured to generate continuous speech according to the synthesis parameter model sequence.
Preferably, the weak-reading prediction module comprises:
An obtaining unit, configured to obtain the syllable string and/or syllable to which each basic synthesis unit belongs;
A determining unit, configured to determine whether the syllable string and/or syllable is weakly read and, if so, to determine that the basic synthesis unit is weakly read.
Preferably, the determining unit comprises:
A checking unit, configured to check whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary and, if so, to determine that it is weakly read; otherwise, to check whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary and, if so, to trigger the extracting unit to extract the prosodic features of the syllable, and otherwise to determine that the basic synthesis unit is not weakly read;
An extracting unit, configured to extract the prosodic features of the syllable when triggered by the checking unit;
A judging unit, configured to determine whether the syllable is weakly read according to the prosodic features extracted by the extracting unit and a pre-built weak-reading decision tree and, if the syllable is weakly read, to determine that the basic synthesis unit is weakly read, and otherwise to determine that the basic synthesis unit is not weakly read.
Preferably, the system further comprises a weak-reading vocabulary building module, configured to build the weak-reading vocabulary.
Preferably, the system further comprises a weak-reading decision tree building module, configured to build the weak-reading decision tree.
Preferably, the weak-reading processing module comprises:
A model parameter obtaining unit, configured to obtain the model parameters of the synthesis parameter model, the model parameters comprising a duration parameter, a fundamental frequency (F0) parameter, and an energy parameter;
A parameter updating unit, configured to update the model parameters according to mapping rules obtained by training in advance, to obtain the updated synthesis parameter model.
The method and system provided by the embodiments of the present invention handle the comparatively tractable weak-reading phenomenon to give continuous speech an overall rise and fall, filling the gap left by current language understanding technology, which has not yet reached a practical level for stress prediction in speech synthesis. Moreover, compared with the prior art, the scheme of the embodiments predicts weak reading both accurately and efficiently, and markedly improves the naturalness of continuous synthesized speech.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 shows a flow chart of the method for improving the prosodic naturalness of synthesized speech in an embodiment of the present invention;
Fig. 2 shows a flow chart of weak-reading prediction for a basic synthesis unit in an embodiment of the present invention;
Fig. 3 shows a flow chart of building the weak-reading decision tree in an embodiment of the present invention;
Fig. 4 shows a flow chart of applying weak-reading processing to a synthesis parameter model in an embodiment of the present invention;
Fig. 5 shows a structural block diagram of the system for improving the prosodic naturalness of synthesized speech in an embodiment of the present invention;
Fig. 6 shows a structural block diagram of the weak-reading processing module in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Existing stress prediction methods based on semantic analysis are highly uncertain, and their predictions are often not accurate enough, mainly for the following reasons:
1. In general, most content words in the dictionary (such as nouns and verbs) may be stressed, so listing them exhaustively is an impossible task.
2. Control at the syntactic level alone can hardly determine which word is stressed. Stress can only be determined once semantic information is available, which requires higher-level intelligent processing, and the prior art's capability for intelligent semantic processing is still very limited.
3. The feature parameters currently used for stress prediction are mainly semantics-independent parameters such as part of speech (POS), word length, and the position of the word in the prosodic structure. They offer no direct guidance for the prediction, so predictions based on them are unreliable.
Based on the above analysis, and in view of both the need for pitch rise and fall in the speech produced by continuous speech synthesis systems and the inability of the prior art to judge stress accurately, the embodiments of the present invention propose a method and system for weak-reading prediction on the text to be synthesized, which predict weak reading (that is, reduced, unstressed pronunciation) efficiently and accurately. Correspondingly, a speech synthesis method and system based on weak-reading prediction is also proposed: by handling the comparatively tractable weak-reading phenomenon, that is, by using the "light" to set off the "heavy", it solves the problem of pitch rise and fall. Specifically, the scheme of the embodiments applies weak-reading processing to some of the words in continuous text to give the synthesized continuous speech a natural rise and fall, and thereby markedly improves the naturalness of continuous synthesized speech.
Weak reading shows up in different words and features in different languages, for example the unstressed syllables common in Chinese words, the function words of Tibetan, and the function words (prepositions, conjunctions, etc.) of English and many Western languages. The role of a weakly read element in a sentence is relatively clear and can usually be determined from part of speech, or even from pronunciation alone. It generally does not go beyond the syntactic level, that is, it does not involve semantics. Handling weak reading is therefore much cheaper than handling stress.
To this end, the method and system of the embodiments of the present invention determine the weakly read units in the text to be synthesized efficiently and accurately on the basis of weak-reading prediction, thereby providing accurate prosodic information for speech synthesis. On this basis, during speech synthesis, if the prosodic features of a basic synthesis unit include a weak-reading feature, the weak-reading synthesis parameter model or weak-reading speech segment corresponding to that unit is obtained; if they do not, the conventional synthesis parameter model or conventional speech segment corresponding to the unit is obtained. Continuous speech is then generated from these synthesis parameter models or speech segments, which effectively solves the problem of pitch rise and fall.
Fig. 1 shows the flow of the method for improving the prosodic naturalness of synthesized speech in an embodiment of the present invention, which comprises the following steps:
Step 101: receive a text to be synthesized.
Step 102: determine the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units.
Specifically, the basic synthesis units corresponding to the text can be obtained by character-to-sound conversion, and these basic synthesis units then form the basic synthesis unit sequence corresponding to the text.
A basic synthesis unit is the smallest synthesis unit. Western languages generally use the phoneme as the basic synthesis unit; for example, the English word "tone" comprises three phonemes: t, ow, ng. Syllable-based tone languages can use the initial and the final as basic synthesis units; for example, the initial-final sequence of the Chinese word "shengmu" ("initial consonant") is sh, eng, m, u, where the final eng comprises two phonemes, e and ng.
Step 103: determine whether each basic synthesis unit is weakly read.
Specifically, the syllable string and/or syllable to which each basic synthesis unit belongs can be obtained, and it is then determined whether the syllable string and/or syllable is weakly read; if so, the basic synthesis unit is determined to be weakly read.
The syllable is the basic unit of phonetic structure. In Chinese, the pronunciation of one Chinese character is generally one syllable. In English, a single vowel phoneme can form a syllable, and a vowel phoneme combined with one or more consonant phonemes can also form a syllable.
Note that one syllable can correspond to one or more basic synthesis units. For example, "shengmu" is one word comprising two syllables, each consisting of an initial and a final (sh, eng, m, u), so the word comprises four basic synthesis units. Correspondingly, if a syllable string or syllable is weakly read, all of its basic synthesis units are weakly read.
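The propagation of a weak-reading decision from a syllable to its basic synthesis units can be sketched as follows. This is a minimal illustration only: the table `SYLLABLE_UNITS` and the function `units_with_weak_flags` are hypothetical names, and the table contains just the "shengmu" example from the text.

```python
# Hypothetical lookup table: syllable -> its basic synthesis units
# (initial and final), following the "shengmu" example in the text.
SYLLABLE_UNITS = {
    "sheng": ["sh", "eng"],
    "mu": ["m", "u"],
}

def units_with_weak_flags(syllables, weak_syllables):
    """Expand syllables into basic synthesis units; every unit inherits
    the weak-reading status of the syllable it belongs to."""
    flagged = []
    for syllable in syllables:
        is_weak = syllable in weak_syllables
        for unit in SYLLABLE_UNITS[syllable]:
            flagged.append((unit, is_weak))
    return flagged
```

If, purely for illustration, the second syllable were marked as weakly read, both of its units (m and u) would carry the weak flag while sh and eng would not.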
Step 104: obtain the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, apply weak-reading processing to that synthesis parameter model to obtain an updated synthesis parameter model.
The synthesis parameter model is an acoustic model. Note that the same basic synthesis unit may be weakly read in one context and not in another. In the embodiments of the present invention, therefore, the synthesis parameter model of a basic synthesis unit that needs to be weakly read undergoes weak-reading processing, so that the model parameters better reflect the rise and fall of speech. The synthesis parameter model of a basic synthesis unit that is not weakly read does not undergo weak-reading processing.
The process of applying weak-reading processing to a synthesis parameter model is described in detail later.
Step 105: generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence.
That is, the synthesis parameter models corresponding to the basic synthesis units in the basic synthesis unit sequence are arranged in order to obtain the synthesis parameter model sequence. The sequence contains both models that have not undergone weak-reading processing and models that have. In other words, if a basic synthesis unit is weakly read, its synthesis parameter model is the one obtained after weak-reading processing; if it is not weakly read, its model is the originally obtained one, which can be regarded as the synthesis parameter model for normal pronunciation.
Step 106: generate continuous speech according to the synthesis parameter model sequence.
The method provided by the embodiments of the present invention thus handles the comparatively tractable weak-reading phenomenon, using the "light" to set off the "heavy", which effectively solves the problem of pitch rise and fall and achieves an overall rise and fall in continuous speech.
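Steps 104 and 105 can be sketched as a small loop. This is a hedged illustration, not the patent's implementation: `ParamModel`, `build_model_sequence`, and `halve_duration` are hypothetical names, the model carries only a duration so the example stays short, and halving the duration is an arbitrary stand-in for weak-reading processing.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ParamModel:
    """Toy stand-in for a synthesis parameter (acoustic) model."""
    unit: str
    duration_ms: float
    weakened: bool = False

def build_model_sequence(units, weak_flags, lookup, weaken):
    """Steps 104-105: fetch each unit's model, apply weak-reading
    processing only to weakly read units, and keep the original order."""
    sequence = []
    for unit, is_weak in zip(units, weak_flags):
        model = lookup(unit)          # originally obtained model
        if is_weak:
            model = weaken(model)     # updated model after weak-reading processing
        sequence.append(model)
    return sequence

def halve_duration(model):
    # Assumed placeholder rule; the text derives real rules from training data.
    return replace(model, duration_ms=model.duration_ms * 0.5, weakened=True)
```

Passing the lookup and the weakening rule as callables keeps the loop independent of how models are stored and of which mapping rules are used.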
Fig. 2 is a flow chart of weak-reading prediction for a basic synthesis unit in an embodiment of the present invention.
Note that every basic synthesis unit in the basic synthesis unit sequence must be checked in turn to determine whether it is weakly read. The procedure comprises the following steps:
Step 201: obtain the basic synthesis unit currently being checked.
Step 202: check whether there is a syllable string to which the basic synthesis unit belongs; if so, perform step 203; otherwise, perform step 204.
Specifically, the text to be synthesized can be segmented into words, and the syllable strings and/or syllables contained in each resulting word determined, thereby obtaining the syllable string or syllable to which the basic synthesis unit belongs.
Step 203: check whether the syllable string is in the preset weak-reading vocabulary; if so, perform step 208; otherwise, perform step 204.
Step 204: obtain the syllable to which the basic synthesis unit belongs.
Step 205: check whether the syllable is in the preset weak-reading vocabulary; if so, perform step 206; otherwise, perform step 209.
Weakly read syllables are easy to identify and few in number, so they are relatively easy to enumerate. In the embodiments of the present invention, the weak-reading vocabulary can be built in advance from corpus statistics, specifically as follows:
(1) Obtain candidate weak-reading words to form a weak-reading word set. In practice, all function words can be taken as candidate weak-reading words.
(2) Obtain a corpus.
(3) Calculate, in turn, the weak-reading frequency in the corpus of each candidate weak-reading word in the weak-reading word set.
(4) If the weak-reading frequency is greater than a frequency threshold, determine that the candidate weak-reading word is a weak-reading word.
(5) Generate the weak-reading vocabulary from the weak-reading words so determined.
Of course, in practice the weak-reading vocabulary can also be built by other methods, such as statistical model methods; the embodiments of the present invention place no limitation on this.
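The vocabulary-building steps (1) to (5) above can be sketched as follows. The text does not define "weak-reading frequency" precisely; this sketch assumes it is the fraction of a candidate word's corpus occurrences that are labeled as weakly read, and it assumes the corpus is available as `(word, is_weak)` pairs. The function name `build_weak_vocab` is hypothetical.

```python
from collections import Counter

def build_weak_vocab(candidates, labeled_corpus, threshold):
    """Keep the candidate words whose weak-reading frequency exceeds
    the threshold. labeled_corpus yields (word, is_weak) pairs."""
    total = Counter()
    weak = Counter()
    for word, is_weak in labeled_corpus:
        if word in candidates:
            total[word] += 1
            if is_weak:
                weak[word] += 1
    # Assumption: frequency = weakly read occurrences / all occurrences.
    return {
        word
        for word in candidates
        if total[word] > 0 and weak[word] / total[word] > threshold
    }
```

Candidates that never occur in the corpus are left out rather than guessed at, since no frequency can be computed for them.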
Step 206: extract the prosodic features of the syllable.
The prosodic features of the syllable can include one or more of the following: the part of speech of the word containing the syllable, the position of the syllable within that word, and so on.
Step 207: determine whether the basic synthesis unit is weakly read according to the prosodic features of the syllable and the pre-built weak-reading decision tree.
Specifically, it is first determined whether the syllable is weakly read according to its prosodic features and the pre-built weak-reading decision tree; if the syllable is weakly read, the basic synthesis unit is weakly read, and otherwise the basic synthesis unit is not weakly read.
Step 208: determine that the basic synthesis unit is weakly read.
The same word has different functions in different contexts and, particularly when it takes different parts of speech, often differs in expressiveness, so whether it is weakly read is somewhat uncertain. For this reason, the embodiments of the present invention further use the pre-built weak-reading decision tree to determine whether the syllable currently being checked is weakly read in its context.
The process of building the weak-reading decision tree, and of using it to determine whether a syllable is weakly read, is described in detail later.
Step 209: determine that the basic synthesis unit is not weakly read.
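The branching in steps 202 to 209 can be condensed into one function. This is a minimal sketch: the name `is_unit_weak` is hypothetical, and `extract_features` and `tree_says_weak` are assumed callables standing in for feature extraction (step 206) and the decision tree (step 207).

```python
def is_unit_weak(syllable_string, syllable, weak_vocab,
                 extract_features, tree_says_weak):
    """Return True if the basic synthesis unit is weakly read.

    syllable_string may be None when the unit belongs to no syllable string.
    """
    # Steps 202-203: a syllable string found in the vocabulary is weak (step 208).
    if syllable_string is not None and syllable_string in weak_vocab:
        return True
    # Steps 204-205: a syllable outside the vocabulary is not weak (step 209).
    if syllable not in weak_vocab:
        return False
    # Steps 206-207: the decision tree resolves context-dependent cases.
    return tree_says_weak(extract_features(syllable))
```

The decision tree is consulted only for syllables already in the vocabulary, so most units are resolved by a cheap set lookup.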
Fig. 3 shows the flow of building the weak-reading decision tree in an embodiment of the present invention, which comprises the following steps:
Step 301: obtain a large amount of text based on the weak-reading vocabulary as training data.
Step 302: perform word segmentation on the training data and determine the syllables contained in each word.
Step 303: prosodically annotate each syllable, the prosodic annotation including weak-reading information.
Specifically, each syllable can be prosodically annotated according to the speech data corresponding to the training data.
In practice, the prosodic annotation can further include the position of the weakly read syllable within its word, the part of speech of the word containing the weakly read syllable, and so on.
Step 304: train the weak-reading decision tree from the training data and the prosodic annotation of each corresponding syllable.
Specifically, the weak-reading decision tree is first initialized. Then, starting from the root node, each non-leaf node is examined in turn according to a pre-established question set (which contains all information relevant to weak reading). If the node currently examined needs to be split, it is split, yielding child nodes and the training data corresponding to each child node; otherwise, the node currently examined is marked as a leaf node. Once all non-leaf nodes have been examined, the weak-reading decision tree is obtained.
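A toy version of this question-set-driven tree building can be sketched as follows. The text does not say how to decide whether a node "needs to be split"; this sketch assumes a node is split while it is impure and some unused question separates its data, and it picks the question that leaves the fewest misclassified samples. All names (`train_tree`, `classify`, the sample layout) are hypothetical.

```python
from collections import Counter

def _majority(samples):
    return Counter(weak for _, weak in samples).most_common(1)[0][0]

def _errors(samples):
    # Samples misclassified if this node became a majority-vote leaf.
    counts = Counter(weak for _, weak in samples)
    return len(samples) - max(counts.values())

def train_tree(samples, questions):
    """samples: non-empty list of (features, is_weak);
    questions: list of (name, predicate) pairs."""
    if _errors(samples) == 0 or not questions:
        return {"leaf": _majority(samples)}
    best = None
    for index, (_, ask) in enumerate(questions):
        yes = [s for s in samples if ask(s[0])]
        no = [s for s in samples if not ask(s[0])]
        if not yes or not no:
            continue  # this question separates nothing at this node
        cost = _errors(yes) + _errors(no)
        if best is None or cost < best[0]:
            best = (cost, index, yes, no)
    if best is None:
        return {"leaf": _majority(samples)}
    _, index, yes, no = best
    name, ask = questions[index]
    rest = questions[:index] + questions[index + 1:]
    return {"name": name, "ask": ask,
            "yes": train_tree(yes, rest), "no": train_tree(no, rest)}

def classify(tree, features):
    while "leaf" not in tree:
        tree = tree["yes"] if tree["ask"](features) else tree["no"]
    return tree["leaf"]
```

A production system would more likely use an information-based splitting criterion and a stopping threshold; the error-count criterion here is chosen only to keep the sketch short.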
Note that, in practice, the weak-reading decision tree can also be built by other methods; the embodiments of the present invention place no limitation on this.
The following example illustrates weak-reading prediction based on the above weak-reading decision tree.
Suppose the text to be synthesized is: "The red team and the blue team have forty-nine books in total."
Word segmentation yields: red team / and (conjunction) / blue team / in total / have (existential verb) / forty-nine (numeral) / (measure word) / book.
Weak-reading prediction: the syllables for "and", "have", and "ten" are in the weak-reading vocabulary, so only these three syllables need to be judged.
The weak-reading decision tree makes the following judgments:
(1) Is the word containing the candidate syllable a function word? If so, the syllable is weakly read. "And" satisfies this condition and is determined to be weakly read.
(2) Is the word containing the candidate syllable an existential verb? If so, is it preceded by a negation word? If so, the syllable is weakly read. "Have" is an existential verb but is not preceded by a negation word, so it is determined not to be weakly read.
(3) Is the word containing the candidate syllable a numeral? If so, is the syllable in the middle of the word? If so, the syllable is weakly read. The word containing "ten" is a numeral and the syllable is in the middle of the word, so it is determined to be weakly read.
If a syllable is weakly read, all of its basic synthesis units are weakly read; if it is not, none of them are.
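The three judgments in this example can be written out as explicit rules. This sketch hard-codes only the branches named in the example; a trained tree would contain whatever questions the data supports. The function name and the feature encodings (`"function"`, `"existential_verb"`, `"numeral"`, `"middle"`) are assumptions made for illustration.

```python
def weak_by_example_rules(pos, negation_before=False, position="single"):
    """Apply the three example judgments to one candidate syllable.

    pos: part of speech of the word containing the syllable.
    negation_before: whether a negation word precedes that word.
    position: position of the syllable in the word.
    """
    if pos == "function":
        return True                      # judgment (1)
    if pos == "existential_verb":
        return negation_before           # judgment (2)
    if pos == "numeral":
        return position == "middle"      # judgment (3)
    return False                         # not covered by the example
```

For the sentence above, the conjunction is weakly read, the existential verb without a preceding negation is not, and the middle syllable of the numeral is.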
It should be noted that the synthesis parameter model described in the embodiments of the present invention is an acoustic model.
Compared with normal pronunciation, a weakly read basic synthesis unit has the following characteristics:
(1) the speech duration of a weakly read basic synthesis unit is shorter;
(2) the fundamental frequency curve of a weakly read basic synthesis unit tends toward the middle of the pitch range: for a speech unit whose fundamental frequency curve was originally high, the curve is relatively lowered, and for a speech unit whose fundamental frequency curve was originally low, the curve is relatively raised;
(3) the energy of a weakly read basic synthesis unit is lower.
Based on these characteristics, in the embodiments of the present invention, the acoustic model corresponding to each weakly read basic synthesis unit can first be trained and compared acoustically with the corresponding non-weak basic synthesis unit, to determine how weak and non-weak readings differ in duration, energy, and fundamental frequency. Then, when weak reading processing is applied to a synthesis parameter model, the model parameters are updated according to rules such as shortening the duration, lowering or raising the fundamental frequency, and reducing the energy, so as to achieve the weak reading effect.
Figure 4 is a flowchart of the weak reading processing applied to a synthesis parameter model in an embodiment of the present invention, which comprises the following steps:
Step 401: obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
Step 402: update the model parameters according to mapping rules obtained by training in advance, to obtain an updated synthesis parameter model.
The training process of the above mapping rules is as follows:
In practical applications, the mapping rules for the duration parameter, the fundamental frequency parameter, and the energy parameter of the synthesis parameter model can be trained separately, as follows:
1. Duration parameter mapping rule
(1) obtain training data;
(2) determine the weakly read basic synthesis units in the training data;
(3) calculate the ratio between the durations of these basic synthesis units in the weak and non-weak cases, and use it as the duration parameter mapping rule.
Since one syllable corresponds to one or more basic synthesis units, the mapping rule can be made more accurate by calculating, separately for each position in the syllable (i.e., syllable-initial, syllable-medial, and syllable-final), the mean duration of the basic synthesis units in the weak and non-weak cases, and then calculating the duration ratio between the weak and non-weak cases from these means.
Based on the above duration parameter mapping rule, when weak reading processing is applied to a synthesis parameter model, the duration parameter in that model can be adjusted by the above duration ratio according to the position of the basic synthesis unit in the syllable.
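The position-dependent duration ratio described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `(position, is_weak, duration)` sample format and the sample values are invented for illustration, and the ratio is taken as weak mean divided by non-weak mean, a direction the specification does not fix.

```python
from collections import defaultdict
from statistics import mean

def duration_ratios(samples):
    """Return {position: mean weak duration / mean non-weak duration}.

    `samples` is an iterable of (position, is_weak, duration) tuples, where
    position is "initial", "medial", or "final" (assumed labels).
    Positions lacking either weak or non-weak samples are skipped.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for position, is_weak, duration in samples:
        buckets[position][is_weak].append(duration)
    return {
        position: mean(groups[True]) / mean(groups[False])
        for position, groups in buckets.items()
        if groups[True] and groups[False]
    }

# Invented durations in milliseconds, for illustration only.
samples = [
    ("initial", True, 40.0), ("initial", True, 60.0), ("initial", False, 100.0),
    ("final", True, 90.0), ("final", False, 120.0), ("final", False, 180.0),
    ("medial", True, 30.0),  # no non-weak counterpart, so no ratio is produced
]
ratios = duration_ratios(samples)
print(ratios)
```

With these invented samples the syllable-initial ratio is 50/100 and the syllable-final ratio is 90/150. The same averaging could be applied to mean fundamental frequency or mean energy.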
2. Fundamental frequency parameter mapping rule
Duration is a scalar, whereas fundamental frequency is a vector: each basic synthesis unit corresponds to one fundamental frequency curve. To simplify the rule, the mean fundamental frequency of the basic synthesis unit can be used for the parameter mapping, as follows:
(1) obtain training data;
(2) determine the weakly read basic synthesis units in the training data;
(3) calculate the ratio between the mean fundamental frequencies of these basic synthesis units in the weak and non-weak cases, and use it as the fundamental frequency parameter mapping rule.
Based on the above fundamental frequency parameter mapping rule, when weak reading processing is applied to a synthesis parameter model, the fundamental frequency parameter in that model can be adjusted by the above fundamental frequency ratio according to the position of the basic synthesis unit in the syllable.
3. Energy parameter mapping rule
Energy is also a vector: each basic synthesis unit corresponds to one energy curve. The energy parameter mapping can be performed with the same method as the fundamental frequency parameter mapping rule, which is not repeated here.
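Steps 401 and 402 can be sketched as scaling each model parameter by its trained ratio. This is only an illustration: `UnitParams` is a simplified stand-in for a real synthesis parameter model, which would hold distributions rather than three numbers, and the ratio values in `MAPPING_RULES` are invented, not trained.

```python
from dataclasses import dataclass, replace
from typing import Dict

@dataclass(frozen=True)
class UnitParams:
    """Simplified stand-in for one unit's synthesis parameter model."""
    duration: float  # duration, e.g. in milliseconds
    f0: float        # mean fundamental frequency in Hz
    energy: float    # mean energy, arbitrary units

# Hypothetical ratios (weak / non-weak), keyed by position in the syllable.
MAPPING_RULES: Dict[str, Dict[str, float]] = {
    "initial": {"duration": 0.7, "f0": 0.95, "energy": 0.8},
    "medial": {"duration": 0.6, "f0": 0.9, "energy": 0.75},
    "final": {"duration": 0.8, "f0": 0.95, "energy": 0.85},
}

def apply_weak_reading(params: UnitParams, position: str,
                       rules: Dict[str, Dict[str, float]]) -> UnitParams:
    """Return a copy of `params` with each parameter scaled by its ratio."""
    ratio = rules[position]
    return replace(
        params,
        duration=params.duration * ratio["duration"],
        f0=params.f0 * ratio["f0"],
        energy=params.energy * ratio["energy"],
    )

original = UnitParams(duration=100.0, f0=200.0, energy=1.0)
updated = apply_weak_reading(original, "medial", MAPPING_RULES)
print(updated)
```

Returning a new object leaves the non-weak model untouched, so the same base model can still serve units that are not weakly read.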
The method for improving the prosodic naturalness of synthesized speech provided by the embodiments of the present invention addresses the need, in continuous speech synthesis systems, for the synthesized speech to rise and fall in pitch. Based on the prediction of weakly read syllables, it applies weak reading processing to the synthesis parameter models of the basic synthesis units corresponding to those syllables, thereby giving continuous speech an overall rise-and-fall effect. By handling the weak reading phenomenon, which is relatively easy to process, the scheme uses the "light" to set off the "heavy". It fills the gap left by current semantic understanding technology, which has not yet achieved practical stress prediction in speech synthesis, and substantially improves the naturalness of continuous synthesized speech.
In addition, it should be noted that weak reading and stress can both be taken into account in speech synthesis to further improve the naturalness of continuous synthesized speech.
Correspondingly, an embodiment of the present invention also provides a speech synthesis system; Figure 5 is a structural block diagram of this system.
In this embodiment, the system includes:
a receiving module 501, configured to receive text to be synthesized;
a basic synthesis unit sequence determination module 502, configured to determine the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence including one or more basic synthesis units;
a weak reading prediction module 503, configured to determine whether each basic synthesis unit is weakly read;
a synthesis parameter model acquisition module 504, configured to obtain the synthesis parameter model corresponding to the basic synthesis unit;
a weak reading processing module 505, configured to, when the basic synthesis unit is weakly read, apply weak reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;
a synthesis parameter model sequence generation module 506, configured to generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
a synthesis module 507, configured to generate continuous speech according to the synthesis parameter model sequence.
The weak reading prediction module 503 may use the weak reading prediction method described earlier to determine whether each basic synthesis unit is weakly read. One specific structure of the weak reading prediction module 503 may include the following units:
an acquisition unit, configured to obtain the syllable string and/or syllable to which each basic synthesis unit belongs;
a determination unit, configured to determine whether the syllable string and/or syllable is weakly read and, if so, to determine that the basic synthesis unit is weakly read.
The determination unit may include:
an inspection unit, configured to check whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary; if so, to determine that the syllable is weakly read; otherwise, to check whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary; if so, to trigger an extraction unit to extract the prosodic features of the syllable; otherwise, to determine that the basic synthesis unit is not weakly read;
the extraction unit, configured to extract the prosodic features of the syllable when triggered by the inspection unit;
a judging unit, configured to determine whether the syllable is weakly read according to the prosodic features extracted by the extraction unit and a pre-built weak-reading decision tree, and to determine that the basic synthesis unit is weakly read if the syllable is weakly read, and otherwise that the basic synthesis unit is not weakly read.
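The cooperation of the inspection, extraction, and judging units can be sketched as a three-way cascade. This is a hedged sketch: the function name `predict_weak`, the callable parameters, and the placeholder vocabulary entries are assumptions for illustration, and the feature extractor and decision tree are passed in as stubs rather than implemented.

```python
from typing import Callable, Dict, Set

def predict_weak(
    syllable_string: str,
    syllable: str,
    weak_vocab: Set[str],
    extract_features: Callable[[str], Dict[str, object]],
    tree_decides_weak: Callable[[Dict[str, object]], bool],
) -> bool:
    """Decide whether a basic synthesis unit is weakly read.

    1. Syllable string in the vocabulary: weakly read.
    2. Syllable not in the vocabulary: not weakly read.
    3. Otherwise: extract prosodic features and ask the decision tree.
    """
    if syllable_string in weak_vocab:
        return True
    if syllable not in weak_vocab:
        return False
    return tree_decides_weak(extract_features(syllable))

# Placeholder vocabulary and stub components, for illustration only.
vocab = {"s1 s2", "s3"}
stub_features = lambda syl: {"in_function_word": syl == "s3"}
stub_tree = lambda feats: bool(feats["in_function_word"])

results = [
    predict_weak("s1 s2", "s1", vocab, stub_features, stub_tree),  # string hit
    predict_weak("s3 s4", "s3", vocab, stub_features, stub_tree),  # tree decides
    predict_weak("s5 s6", "s5", vocab, stub_features, stub_tree),  # not listed
]
print(results)  # [True, True, False]
```

Only the third branch pays the cost of feature extraction, which is why the vocabulary lookups come first.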
The weak-reading vocabulary and the weak-reading decision tree may be built by the speech synthesis system of the present invention or by another system; the embodiments of the present invention place no limitation on this. If they are built by the speech synthesis system of the present invention, the system may further include a weak-reading vocabulary construction module and a weak-reading decision tree construction module, configured to build the weak-reading vocabulary and the weak-reading decision tree respectively. Depending on the specific construction method, the two modules may each have a correspondingly adapted structure, which is not limited here.
One specific structure of the weak reading processing module 505 is shown in Figure 6 and includes:
a model parameter acquisition unit 601, configured to obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
a parameter updating unit 602, configured to update the model parameters according to mapping rules obtained by training in advance, to obtain an updated synthesis parameter model.
In practical applications, the mapping rules may be trained in advance by the system of the present invention or by another system.
If they are trained by the system of the present invention, the system further includes a mapping rule training module (not shown), configured to build mapping rules that reflect the correspondence between non-weak synthesis parameters and weak synthesis parameters.
The mapping rule training module can train a duration parameter mapping rule, a fundamental frequency parameter mapping rule, and an energy parameter mapping rule separately for the model parameters of the synthesis parameter model. For the specific training process, see the description in the method embodiment above, which is not repeated here.
Correspondingly, the parameter updating unit 602 updates each model parameter according to its corresponding mapping rule.
The system for improving the prosodic naturalness of synthesized speech provided by the embodiments of the present invention handles, during speech synthesis, the weak reading phenomenon, which is relatively easy to process, and thereby gives continuous speech an overall rise-and-fall effect. It fills the gap left by current semantic understanding technology, which has not yet achieved practical stress prediction in speech synthesis, and substantially improves the naturalness of continuous synthesized speech.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, see the description of the method embodiment. The system embodiments described above are merely illustrative: the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
The structure, features, and effects of the present invention have been described in detail above with reference to the embodiments shown in the drawings. The foregoing is only a preferred embodiment of the present invention, and the scope of implementation of the present invention is not limited to what is shown in the drawings. Any change made according to the concept of the present invention, or any modification into an equivalent embodiment, that does not depart from the spirit covered by the specification and drawings shall fall within the scope of protection of the present invention.

Claims (12)

1. A method for improving the prosodic naturalness of synthesized speech, characterized by comprising:
receiving text to be synthesized;
determining the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence including one or more basic synthesis units;
determining whether each basic synthesis unit is weakly read;
obtaining the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, applying weak reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;
generating the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
generating continuous speech according to the synthesis parameter model sequence.
2. The method according to claim 1, characterized in that determining whether the basic synthesis unit is weakly read comprises:
obtaining the syllable string and/or syllable to which the basic synthesis unit belongs;
determining whether the syllable string and/or syllable is weakly read and, if so, determining that the basic synthesis unit is weakly read.
3. The method according to claim 2, characterized in that determining whether the syllable string and/or syllable is weakly read comprises:
checking whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary;
if so, determining that the basic synthesis unit is weakly read;
otherwise, checking whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary;
if the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary, extracting the prosodic features of the syllable, and then determining whether the syllable is weakly read according to the prosodic features of the syllable and a pre-built weak-reading decision tree; if the syllable is weakly read, the basic synthesis unit is weakly read, and otherwise the basic synthesis unit is not weakly read;
if the syllable to which the basic synthesis unit belongs is not in the preset weak-reading vocabulary, determining that the basic synthesis unit is not weakly read.
4. The method according to claim 3, characterized in that the process of building the weak-reading vocabulary comprises:
obtaining candidate weak-reading words to form a weak-reading word set;
obtaining a training corpus;
calculating, in turn, the weak-reading frequency in the training corpus of each candidate weak-reading word in the weak-reading word set;
if the weak-reading frequency is greater than a frequency threshold, determining that the candidate weak-reading word is a weak-reading word;
generating the weak-reading vocabulary from the determined weak-reading words.
5. The method according to claim 3, characterized in that the process of building the weak-reading decision tree comprises:
obtaining a large amount of text based on the weak-reading vocabulary as training data;
performing word segmentation on the training data, and determining the syllables contained in each segmented word;
performing prosodic labeling on each syllable, the prosodic labeling information including weak-reading information;
training the weak-reading decision tree according to the training text data and the corresponding prosodic labeling information of each syllable.
6. The method according to any one of claims 1 to 5, characterized in that applying weak reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model comprises:
obtaining the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
updating the model parameters according to mapping rules obtained by training in advance, to obtain the updated synthesis parameter model.
7. A system for improving the prosodic naturalness of synthesized speech, characterized in that the system comprises:
a receiving module, configured to receive text to be synthesized;
a basic synthesis unit sequence determination module, configured to determine the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence including one or more basic synthesis units;
a weak reading prediction module, configured to determine whether each basic synthesis unit is weakly read;
a synthesis parameter model acquisition module, configured to obtain the synthesis parameter model corresponding to the basic synthesis unit;
a weak reading processing module, configured to, when the basic synthesis unit is weakly read, apply weak reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;
a synthesis parameter model sequence generation module, configured to generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
a synthesis module, configured to generate continuous speech according to the synthesis parameter model sequence.
8. The system according to claim 7, characterized in that the weak reading prediction module comprises:
an acquisition unit, configured to obtain the syllable string and/or syllable to which each basic synthesis unit belongs;
a determination unit, configured to determine whether the syllable string and/or syllable is weakly read and, if so, to determine that the basic synthesis unit is weakly read.
9. The system according to claim 8, characterized in that the determination unit comprises:
an inspection unit, configured to check whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary; if so, to determine that the syllable is weakly read; otherwise, to check whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary; if so, to trigger an extraction unit to extract the prosodic features of the syllable; otherwise, to determine that the basic synthesis unit is not weakly read;
the extraction unit, configured to extract the prosodic features of the syllable when triggered by the inspection unit;
a judging unit, configured to determine whether the syllable is weakly read according to the prosodic features of the syllable extracted by the extraction unit and a pre-built weak-reading decision tree, and to determine that the basic synthesis unit is weakly read if the syllable is weakly read, and otherwise that the basic synthesis unit is not weakly read.
10. The system according to claim 9, characterized in that the system further comprises: a weak-reading vocabulary construction module, configured to build the weak-reading vocabulary.
11. The system according to claim 9, characterized in that the system further comprises: a weak-reading decision tree construction module, configured to build the weak-reading decision tree.
12. The system according to any one of claims 7 to 11, characterized in that the weak reading processing module comprises:
a model parameter acquisition unit, configured to obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
a parameter updating unit, configured to update the model parameters according to mapping rules obtained by training in advance, to obtain the updated synthesis parameter model.
CN201510038454.2A 2015-01-26 2015-01-26 Method and system for improving synthetic voice rhythm naturalness Active CN105895075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510038454.2A CN105895075B (en) 2015-01-26 2015-01-26 Improve the method and system of synthesis phonetic-rhythm naturalness

Publications (2)

Publication Number Publication Date
CN105895075A true CN105895075A (en) 2016-08-24
CN105895075B CN105895075B (en) 2019-11-15

Family

ID=56999749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510038454.2A Active CN105895075B (en) 2015-01-26 2015-01-26 Method and system for improving synthetic voice rhythm naturalness

Country Status (1)

Country Link
CN (1) CN105895075B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087627A (en) * 2018-10-16 2018-12-25 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN110751940A (en) * 2019-09-16 2020-02-04 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for generating voice packet

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1604184A (en) * 2003-09-29 2005-04-06 摩托罗拉公司 Transformation from characters to sound for synthesizing text paragraph pronunciation
CN1664922A (en) * 2004-03-05 2005-09-07 雅马哈株式会社 Pitch model production device, method and pitch model production program
CN1685396A (en) * 2002-09-23 2005-10-19 因芬尼昂技术股份公司 Method for computer-aided speech synthesis of a stored electronic text into an analog speech signal, speech synthesis device and telecommunication apparatus
CN101123089A (en) * 2006-08-08 2008-02-13 苗玉水 Voice mixing method for Chinese voice code
CN101271687A (en) * 2007-03-20 2008-09-24 株式会社东芝 Method and device for pronunciation conversion estimation and speech synthesis
CN101894547A (en) * 2010-06-30 2010-11-24 北京捷通华声语音技术有限公司 Speech synthesis method and system

Also Published As

Publication number Publication date
CN105895075B (en) 2019-11-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant