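WO2015025788A1 - Quantitative F0 pattern generation device and method, and model learning device and method for F0 pattern generation - Google Patents
Quantitative F0 pattern generation device and method, and model learning device and method for F0 pattern generation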
- Publication number
- WO2015025788A1 (PCT/JP2014/071392)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- pattern
- component
- accent
- phrase
- generating
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/086—Detection of language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
Definitions
- The present invention relates to speech synthesis technology, and in particular to technology for synthesizing fundamental frequency patterns during speech synthesis.
- The time-change pattern of the fundamental frequency of speech (the F0 pattern) helps to clarify sentence breaks, express accent positions, and distinguish words.
- The F0 pattern also plays a major role in conveying non-verbal information such as the emotions that accompany an utterance.
- The F0 pattern has a great influence on the naturalness of speech. In particular, a sentence must be uttered with appropriate intonation to make the position of focus and the structure of the sentence clear. If the F0 pattern is inappropriate, the intelligibility of the synthesized speech is impaired. How to synthesize a desired F0 pattern is therefore a major problem in speech synthesis.
- One method for synthesizing the F0 pattern is the Fujisaki model, disclosed in Non-Patent Document 1 (cited later).
- The Fujisaki model is an F0 pattern generation process model that quantitatively describes an F0 pattern with a small number of parameters.
- The F0 pattern generation process model 30 expresses the F0 pattern as the sum of a phrase component, an accent component, and a base component Fb.
- The phrase component is the component of an utterance that rises to a peak immediately after the start of a phrase and then changes gradually toward the end of the phrase.
- The accent component is the component represented by local rises and falls corresponding to words.
- The phrase component is represented as the response of the phrase control mechanism 42 to an impulse-like phrase command 40 generated at the beginning of the phrase.
- The accent component is similarly represented as the response of the accent control mechanism 46 to a step-like accent command 44.
- The logarithmic representation log_e F0(t) of the F0 pattern 50 is obtained by the adder 48, which sums the phrase component, the accent component, and the logarithm log_e Fb of the base component Fb.
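- As an illustration of the summation described above, the following minimal sketch evaluates log_e F0(t) with the response functions commonly cited for the Fujisaki model. The function names, the response formulas, and the constants alpha = 3.0, beta = 20.0, and gamma = 0.9 are assumptions taken from the usual textbook formulation; none of them is stated in this document.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (assumed form)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma (assumed form)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def log_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + phrase component + accent component.

    phrase_cmds: list of (magnitude, onset_time) impulses
    accent_cmds: list of (amplitude, onset_time, offset_time) steps
    """
    phrase = sum(a * phrase_response(t - t0) for a, t0 in phrase_cmds)
    accent = sum(a * (accent_response(t - t1) - accent_response(t - t2))
                 for a, t1, t2 in accent_cmds)
    return math.log(fb) + phrase + accent
```

- For example, `math.exp(log_f0(0.5, 100.0, [(0.5, 0.0)], [(0.4, 0.2, 0.6)]))` gives F0 in Hz half a second after a phrase command, on a 100 Hz base component, while one accent command is active.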
- A typical method for constructing a model from a large amount of collected speech data uses an HMM (Hidden Markov Model) trained on F0 patterns observed in a speech corpus, as described in Non-Patent Document 2 (cited later).
- A conventional speech synthesis system 70 includes a model learning unit 80 that learns an HMM for F0 pattern synthesis from a speech corpus, and a speech synthesizer 82 that synthesizes a synthesized speech signal 118 corresponding to input text according to the F0 pattern obtained with the learned HMM.
- The model learning unit 80 includes: a speech corpus storage device 90 that stores a speech corpus with phoneme context labels; an F0 extraction unit 92 that extracts F0 from the speech signal of each utterance in the speech corpus stored in the speech corpus storage device 90; a spectral parameter extraction unit 94 that similarly extracts mel-cepstrum parameters as spectral parameters from each utterance; and an HMM learning unit 96 that trains the HMM using the F0 pattern extracted by the F0 extraction unit 92 and the data in the speech corpus storage device 90.
- The context label is a control symbol for speech synthesis: a label that attaches various kinds of linguistic information (context), such as the phonemic environment, to a phoneme.
- The speech synthesizer 82 includes: an HMM storage device 110 that stores the parameters of the HMM trained by the HMM learning unit 96; a text analysis unit 112 that, when given text to be synthesized, analyzes the text, identifies the words and phonemes to be uttered, determines the accents, the pause insertion positions, the sentence type, and so on, and outputs a label string representing the utterance; a parameter generation unit 114 that, on receiving the label string from the text analysis unit 112, collates it with the HMM stored in the HMM storage device 110 and generates and outputs the most probable combination of F0 pattern and mel-cepstrum sequence for an utterance of the original text; and a speech synthesizer 116 that synthesizes the speech represented by the mel-cepstrum parameters according to the F0 pattern, both given by the parameter generation unit 114, and outputs a synthesized speech signal 118.
- With this speech synthesis system 70, a wide variety of F0 patterns can be output across a wide range of contexts, based on a large amount of speech data.
- An object of the present invention is to provide an apparatus and method for synthesizing an F0 pattern in which, when the F0 pattern is generated by a statistical model, the correspondence between the linguistic information and the F0 pattern is clear while accuracy is maintained.
- Another object of the present invention is to provide an apparatus and method in which, when the F0 pattern is generated by a statistical model, the focus of a sentence can be set easily and the correspondence between the linguistic information and the F0 pattern is clear, while accuracy is maintained.
- The quantitative F0 pattern generation device generates the accent component of an F0 pattern using a given number of target points for each accent phrase of an utterance obtained by text analysis.
- Each accent phrase is described by three or four target points. Two of these points are low targets indicating the low-frequency portions of the F0 pattern of the accent phrase, and the remaining one or two are high targets indicating the high-frequency portion. If there are two high targets, their strengths may be the same.
- The means for generating the F0 pattern generates a continuous F0 pattern.
- The quantitative F0 pattern generation method includes the steps of: generating the accent component of an F0 pattern using a given number of target points for each accent phrase of an utterance obtained by text analysis; generating the phrase component of the F0 pattern using a limited number of target points, by dividing the utterance into groups each containing one or more accent phrases according to linguistic information including the structure of the utterance; and generating an F0 pattern based on the accent component and the phrase component.
- The quantitative F0 pattern generation device includes: model storage means that stores the parameters of a generation model for generating target parameters of the phrase component of the F0 pattern and of a generation model for generating target parameters of the accent component of the F0 pattern; text analysis means that receives text input for speech synthesis, analyzes the text, and outputs a control symbol string for speech synthesis; phrase component generation means that generates the phrase component from the control symbol string output by the text analysis means; accent component generation means that generates the accent component; and F0 pattern generation means.
- The model storage means may further store the parameters of a generation model for estimating the micro-prosody component of the F0 pattern.
- In that case, the F0 pattern generation apparatus further includes a micro-prosody component output unit that outputs the micro-prosody component of the F0 pattern by collating the control symbol string output from the text analysis means with the generation model for the micro-prosody component.
- The F0 pattern generation means includes means for generating an F0 pattern by synthesizing the phrase component generated by the phrase component generation means, the accent component generated by the accent component generation means, and the micro-prosody component.
- The quantitative F0 pattern generation method generates a quantitative F0 pattern using model storage means that stores the parameters of a generation model for generating target parameters of the phrase component of the F0 pattern and of a generation model for generating target parameters of the accent component of the F0 pattern.
- The method includes: a text analysis step of receiving text input for speech synthesis, analyzing the text, and outputting a control symbol string for speech synthesis; a phrase component generation step of generating the phrase component of the F0 pattern by collating the control symbol string output in the text analysis step with the generation model for the phrase component stored in the storage means; and an accent component generation step of collating the control symbol string output in the text analysis step with the generation model for the accent component stored in the storage means.
- A model learning apparatus for F0 pattern generation includes: F0 pattern extraction means that extracts an F0 pattern from a speech data signal; parameter estimation means that estimates target parameters representing a phrase component and target parameters representing an accent component, so that an F0 pattern fitting the extracted F0 pattern is expressed as the superposition of the phrase component and the accent component; and model learning means that learns an F0 generation model using, as learning data, the continuous F0 pattern represented by the phrase-component and accent-component target parameters estimated by the parameter estimation means.
- The F0 generation model may include a generation model for generating a phrase component and a generation model for generating an accent component.
- In that case, the model learning means includes first model learning means that learns the generation model for the phrase component and the generation model for the accent component using, as learning data, the time-change pattern of the phrase component represented by the phrase-component target parameters estimated by the parameter estimation means and the time-change pattern of the accent component represented by the accent-component target parameters.
- The model learning apparatus may further include second model learning means that separates the micro-prosody component from the F0 pattern extracted by the F0 pattern extraction means and learns the generation model for the micro-prosody component using that component as learning data.
- A model learning method for F0 pattern generation includes: an F0 pattern extraction step of extracting an F0 pattern from a speech data signal; a parameter estimation step of estimating target parameters representing a phrase component and target parameters representing an accent component, so that an F0 pattern fitting the F0 pattern extracted in the F0 pattern extraction step is expressed as the superposition of the phrase component and the accent component; and a model learning step of learning an F0 generation model using, as learning data, the continuous F0 pattern represented by the phrase-component and accent-component target parameters estimated in the parameter estimation step.
- The F0 generation model may include a generation model for generating a phrase component and a generation model for generating an accent component.
- In that case, the model learning step includes a step of learning the generation model for the phrase component and the generation model for the accent component using, as learning data, the time-change pattern of the phrase component represented by the phrase-component target parameters estimated in the parameter estimation step and the time-change pattern of the accent component represented by the accent-component target parameters.
- FIG. 11 is a block diagram showing the hardware configuration of the computer in the computer system shown in FIG.
- An HMM is used as the F0 pattern generation model, but the model is not limited to an HMM.
- CART (Classification and Regression Tree): J. H. Friedman, R. A. Olshen, C. J. Stone.
- Simulated annealing: C. D. Gelatt, Jr., M. P. Vecchi, "Optimization by simulated annealing," IBM Thomas J. Watson Research Center, Yorktown Heights, NY, 1982.
- The basic concept of the present invention is as follows. First, an F0 pattern is extracted from the speech corpus to create an observed F0 pattern 130. The observed F0 pattern is usually discontinuous. The discontinuous F0 pattern is made continuous and smoothed to generate a continuous F0 pattern 132. Everything up to this point can be realized with the prior art.
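- The text does not say how the discontinuous F0 pattern is made continuous. One simple possibility, shown here only as a sketch, is linear interpolation across unvoiced frames. The function name `fill_unvoiced`, the convention that unvoiced frames are marked 0.0, and the choice of linear interpolation are all assumptions.

```python
def fill_unvoiced(f0):
    """Linearly interpolate over unvoiced frames (assumed to be marked 0.0)."""
    voiced = [i for i, v in enumerate(f0) if v > 0.0]
    out = list(f0)
    if not voiced:
        return out  # nothing to interpolate from
    # Hold the first and last voiced values at the edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate linearly inside each gap between consecutive voiced frames.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            out[i] = (1.0 - w) * f0[a] + w * f0[b]
    return out
```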
- The continuous F0 pattern 132 is fitted by a combination of a phrase component and an accent component, and the fitted F0 pattern 133 is estimated.
- HMM learning is performed by the same method as in Non-Patent Document 2, and the HMM parameters after learning are stored in the HMM storage device 139.
- The F0 pattern 145 can be estimated in the same manner as in Non-Patent Document 2.
- The feature vector here has as its elements 40 mel-cepstrum parameters (including the 0th order), the logarithm of F0, and their delta and delta-delta values.
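- A minimal sketch of how such a feature vector could be assembled follows. It reads the description as 40 mel-cepstrum coefficients plus log F0 (41 static values), giving 41 x 3 = 123 elements once delta and delta-delta are appended. The function names and the central-difference delta are assumptions; the document does not specify the delta window.

```python
def deltas(seq):
    """Central-difference delta per frame, repeating the edge frames (assumed window)."""
    n = len(seq)
    out = []
    for t in range(n):
        prev = seq[max(t - 1, 0)]
        nxt = seq[min(t + 1, n - 1)]
        out.append([0.5 * (b - a) for a, b in zip(prev, nxt)])
    return out

def build_features(mcep, log_f0):
    """Stack static, delta, and delta-delta values for each frame.

    mcep:   list of frames, each a list of 40 mel-cepstrum coefficients
    log_f0: list of log F0 values, one per frame
    """
    static = [list(m) + [lf] for m, lf in zip(mcep, log_f0)]
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```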
- The obtained continuous F0 pattern 132 is decomposed into an accent component 134, a phrase component 136, and a micro-prosody component (hereinafter also referred to as the "micro component") 138.
- HMMs 140, 142, and 144 are trained separately for these components.
- The HMMs 140, 142, and 144 are trained with a feature vector that is integrated into a multi-stream format for the three HMMs.
- The structure of the feature vector is the same as in the first embodiment.
- The accent component HMM 140, the phrase component HMM 142, and the micro component HMM 144 are used to generate the accent component 146, the phrase component 148, and the micro component 150 of the F0 pattern individually. These are added by the adder 152 to generate the final F0 pattern 154.
- The micro component can be regarded as what remains after the accent component and the phrase component are removed from the F0 pattern. The problem is therefore how to obtain the accent component and the phrase component.
- Both the accent component and the phrase component are described by target points: one accent or phrase is described by three or four points. Two of these points represent low targets and the remaining one or two represent high targets. These are called target points. When there are two high targets, both have the same strength.
- A continuous F0 pattern 174 is generated from the observed F0 pattern 170. The continuous F0 pattern 174 is then divided into phrase components 220 and 222 and accent components 200, 202, 204, 206, and 208, each described by target points.
- A target point for an accent is called an accent target.
- A target point for a phrase is called a phrase target.
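- The target-point description can be pictured as a small data structure. The sketch below is only illustrative: the names `TargetPoint` and `is_valid_description` and the field layout are hypothetical, and the check covers only the counting rule stated above (three or four points, two low targets, one or two high targets, equal strength when there are two high targets).

```python
from dataclasses import dataclass

@dataclass
class TargetPoint:
    time: float      # position on the time axis, in seconds
    strength: float  # target value
    high: bool       # True for a high target, False for a low target

def is_valid_description(points):
    """Check the three-or-four-point rule for one accent or phrase."""
    highs = [p for p in points if p.high]
    lows = [p for p in points if not p.high]
    if len(points) not in (3, 4) or len(lows) != 2 or len(highs) not in (1, 2):
        return False
    # Two high targets must share the same strength.
    if len(highs) == 2 and highs[0].strength != highs[1].strength:
        return False
    return True
```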
- The continuous F0 pattern 174 is represented as accent components superimposed on the phrase component 172.
- The accent component and the phrase component are described by target points in this way so that the nonlinear interaction between them can be handled appropriately by defining each in relation to the other. In addition, target points are relatively easy to find from the F0 pattern.
- The transition of F0 between target points can be represented by interpolation based on a Poisson process (Non-Patent Document 3).
- The F0 pattern is modeled here by a two-level mechanism.
- First, the accent component and the phrase component are generated by a mechanism using a Poisson process.
- Then, these are synthesized by a mechanism using resonance to generate the F0 pattern.
- The micro component is obtained by removing the accent component and the phrase component from the continuous F0 pattern obtained at the outset.
- Mapping using resonance (Non-Patent Document 4) is applied, and potential interference between the accent component and the phrase component is handled as a kind of topology conversion.
- ⁇ f ⁇ 1 ( ⁇ ) is the inverse mapping of the above mapping.
- Equation 4 represents the decomposition of ln f0 on the time axis. More specifically, ⁇_{f0r} represents the phrase component (handled as a reference value), and ⁇_{f0|f0r} represents the accent component.
- The nonlinear interference between the accent component and the phrase component can thus be handled by a mechanism using resonance, and the two components can be integrated to obtain the F0 pattern.
- A model expressing the F0 pattern as a function of time t can be written, in logarithmic representation, as the superposition by resonance of the accent component Ca(t) on the phrase component Cp(t).
- The model parameters representing the F0 pattern of an utterance are as follows.
- The phrase target ⁇_pi is defined as an F0 value in the range [f0b, f0t] in logarithmic representation.
- The accent target ⁇_ai is expressed in the range (0, 1.5), with 0.5 as the zero point. If the accent target ⁇_ai is less than 0.5, the accent component bites into the phrase component (part of the phrase component is removed) and lowers the end of the F0 pattern, as is observed in natural speech. In other words, the accent component is superimposed on the phrase component, but the accent component is allowed to remove part of the phrase component.
- FIG. 5 shows, in flowchart form, the control structure of a program that performs: extraction of the F0 pattern from the observed F0 pattern 130 shown in FIG. 3; smoothing and continuation of the extracted F0 pattern to obtain the continuous F0 pattern 132; estimation of the target parameters for representing the continuous F0 pattern 132 as the sum of a phrase component and an accent component, each represented by target points; and generation of the F0 pattern 133 fitted to the continuous F0 pattern 132 from the estimated target parameters.
- The program includes step 340 of smoothing the observed discontinuous F0 pattern, making it continuous, and outputting the continuous F0 pattern, and step 342 of dividing the continuous F0 pattern output in step 340 into N groups.
- Each of the divided groups corresponds to an exhalation paragraph (breath group).
- The continuous F0 pattern is smoothed using a long window width, a designated number of locations where the F0 pattern forms a valley are detected, and the F0 pattern is divided at those locations.
- The program further includes: step 344 of substituting 0 for the repetition control variable k; step 346 of initializing the phrase component P; step 348 of estimating the target parameters of the accent component A and the target parameters of the phrase component P so as to minimize the error between the continuous F0 pattern and the combination of the phrase component P and the accent component A; step 354 of adding 1 to the repetition control variable k after step 348; step 356 of determining whether the value of the variable k is smaller than a predetermined number of repetitions n and, if the determination is YES, returning the control flow to step 346; and step 358 of, when the determination in step 356 is NO, optimizing the accent target parameters obtained by repeating steps 346 to 356 and outputting the optimized accent targets and phrase targets.
- The error between the F0 pattern represented by these targets and the original continuous F0 pattern corresponds to the micro-prosody component.
- Step 348 includes a step 350 for estimating an accent target parameter and a step 352 for estimating the target parameter of the phrase component P using the accent target parameter estimated in step 350.
- Pre-processing: the F0 pattern is converted into ⁇_{f0|f0r} with f0r = f0b and smoothed with each of two window sizes (short-term: 10 points; long-term: 80 points) (step 340). In view of the rise-(flat)-fall characteristic of Japanese accent, the influence of micro-prosody (changes in F0 caused by phoneme segments) is removed. The smoothed pattern is converted back to F0 using equation (5) for parameter extraction.
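- The two-window smoothing can be sketched as follows. The window lengths of 10 and 80 points come from the text; the use of a plain moving average, the edge handling, and the function names are assumptions, since the document does not name the smoothing filter.

```python
def moving_average(x, width):
    """Moving average over `width` samples; the window is truncated at the edges."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - width // 2)
        hi = min(len(x), i - width // 2 + width)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def smooth_two_scales(x, short=10, long=80):
    """Return the short-term and long-term smoothed versions of one contour."""
    return moving_average(x, short), moving_average(x, long)
```

- The long-term output is the one the text uses to divide exhalation paragraphs and to initialize the phrase targets; the short-term output keeps the accent-level movement.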
- Parameter extraction: a segment between pauses longer than 0.3 seconds is regarded as an exhalation paragraph, and each exhalation paragraph is further divided into N groups using the F0 pattern smoothed with the long-term window (step 342). The following processing is applied to each group, using the criterion of minimizing the absolute value of the F0 error. The repetition control variable k is then set to 0 so that step 348 can be executed repeatedly (step 344).
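- The first part of this step, cutting the utterance at pauses longer than 0.3 seconds, can be sketched as below. The per-frame speech/pause flags, the `frame_shift` parameter, and the function name are hypothetical; only the 0.3-second threshold comes from the text.

```python
def exhalation_paragraphs(is_speech, frame_shift, min_pause=0.3):
    """Return (start, end) frame ranges separated by pauses longer than min_pause.

    is_speech:   one boolean per frame, False during a pause
    frame_shift: frame period in seconds
    """
    groups, start, last_speech = [], None, None
    for i, speech in enumerate(is_speech):
        if not speech:
            continue
        if start is None:
            start = i
        elif (i - last_speech - 1) * frame_shift > min_pause:
            # The pause before this frame is long enough to close the segment.
            groups.append((start, last_speech + 1))
            start = i
        last_speech = i
    if start is not None:
        groups.append((start, last_speech + 1))
    return groups
```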
- (a) As an initial value, a phrase component P with three target points, two low and one high, is prepared (step 346). This phrase component P has, for example, the same shape as the left half of the phrase component P graph at the bottom of FIG. 4.
- The timing of the high target point is aligned with the start of the second mora, and the first low target point is shifted 0.3 seconds earlier.
- The timing of the second low target point is aligned with the end of the exhalation paragraph.
- The initial value of the phrase target intensity ⁇_pi is determined using the F0 pattern smoothed with the long-term window.
- (b) The accent component A is calculated from the smoothed F0 pattern and the current phrase component P according to equation (4), and the accent target points are estimated from the current accent component A.
- (c) The target points are adjusted so that ⁇_ai lies in the range [0.9, 1.1] for all high target points and in the range [0.4, 0.6] for all low target points.
- The accent component A is recalculated using the adjusted target points (step 350).
- (d) The phrase targets are re-estimated taking the current accent component A into account (step 352).
- (e) To repeat from (b) until a predetermined number of repetitions is reached, 1 is added to the variable k (step 354).
- (f) A high phrase target point is inserted if inserting it reduces the error between the generated F0 pattern and the smoothed F0 pattern by more than a certain threshold.
- (g) To determine whether to return to (b), 1 is added to the variable k in step 354. If the value of k has not reached n, control returns to step 346. This processing yields, for example, a phrase component P like the right half of the lower part of FIG. 4. When the value of k reaches n, the accent parameters are optimized in step 358.
- Step 358: on the premise of the estimated phrase component P, the accent target points are optimized so as to minimize the error between the generated F0 pattern and the observed F0 pattern. As a result, target points of the phrase component P and the accent component A are obtained that can generate an F0 pattern fitting the smoothed F0 pattern.
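- The adjustment of accent targets during the iteration can be sketched as a simple clamp. This assumes the ranges quoted in the procedure mean that high targets are kept within [0.9, 1.1] and low targets within [0.4, 0.6]; the function name and the clamping behavior itself are an interpretation, not something the text spells out.

```python
def clamp_accent_target(value, is_high):
    """Clamp one accent target to the range assumed for its type."""
    lo, hi = (0.9, 1.1) if is_high else (0.4, 0.6)
    return min(max(value, lo), hi)

def adjust_accent_targets(targets):
    """Apply the clamp to a list of (value, is_high) pairs."""
    return [(clamp_accent_target(v, h), h) for v, h in targets]
```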
- The micro-prosody component M is obtained as the difference between the smoothed F0 pattern and the F0 pattern generated from the phrase component P and the accent component A.
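- As a sketch of this residual definition, the micro-prosody component can be computed frame by frame, and adding it back recovers the original contour. The function names are hypothetical, and both inputs are assumed to be sampled on the same frames and expressed in the same domain (for example, log F0).

```python
def micro_prosody(smoothed, generated):
    """Residual between the smoothed contour and the contour generated from P and A."""
    return [s - g for s, g in zip(smoothed, generated)]

def recombine(generated, micro):
    """Add the residual back to the generated contour."""
    return [g + m for g, m in zip(generated, micro)]
```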
- FIG. 6 shows an example in which the phrase component P and the accent component A are synthesized according to the result of text analysis and the resulting F0 pattern is fitted to the observed F0 pattern.
- FIG. 6 shows two cases superimposed.
- The target F0 pattern 240 (the observed F0 pattern) is represented by a sequence of "+" symbols.
- In the first case shown in FIG. 6, a fitted F0 pattern 246 is obtained by synthesizing the phrase component 242, indicated by a broken line, with the accent component 250, also indicated by a broken line.
- In the second case, an F0 pattern 246 is obtained by synthesizing the phrase component 244, indicated by a thin line, with the accent component 252, also indicated by a thin line.
- The accent component 250 and the accent component 252 almost coincide, but in the accent component 250 the high target point and the trailing low target point of the first accent element are lower than in the accent component 252.
- A phrase component 242 composed of two phrases is adopted as the phrase component and synthesized with the accent component 252 obtained from the Japanese accent pattern. If the result of text analysis is that there are three exhalation paragraphs, the phrase component 244 and the accent component 250 are synthesized.
- Both the phrase component 242 and the phrase component 244 have a phrase boundary between the third and fourth accent elements.
- When the phrase component 244 is employed, the high target point and the trailing low target point of the accent element located immediately before this position are pulled down, as in the accent component 250.
- In this way, the F0 pattern can be fitted accurately to the result of text analysis even when that result contains three phrases.
- The linguistic information underlying the utterance is represented by the utterance structure and the accent types, so the correspondence between the linguistic information and the F0 pattern is clear.
- The F0 pattern synthesis section 359 includes: a parameter estimation unit 366 that, for the continuous F0 pattern 132 obtained by smoothing and making continuous the observed F0 pattern 130 from each of the many speech signals in the speech corpus, estimates the target parameters defining the phrase component P and the target parameters defining the accent component A, based on the given accent boundaries; an F0 pattern fitting unit 368 that generates a fitted F0 pattern that fits the continuous F0 pattern by synthesizing the phrase component P and the accent component A estimated by the parameter estimation unit 366; an HMM learning unit 369 that performs HMM learning in the conventional manner using the fitted F0 patterns; and an HMM storage device 370 that stores the HMM parameters after learning.
- the process of synthesizing the F0 pattern 372 using the HMM stored in the HMM storage device 370 can be realized by an apparatus similar to the speech synthesis unit
- the system according to the first embodiment operates as follows. For each of the observed F0 patterns 130, a continuous F0 pattern 132 is obtained by smoothing and continuation.
- the parameter estimation unit 366 decomposes the continuous F0 pattern 132 into the phrase component P and the accent component A, and estimates each target parameter by the method described above.
- the F0 pattern fitting unit 368 combines the phrase component P and the accent component A expressed by the estimated target parameter, and obtains the F0 pattern after fitting that fits the observed F0 pattern. This system performs such an operation for each observation F0 pattern 130.
- the HMM learning unit 369 performs HMM learning by using a number of F0 patterns after fitting obtained in this way, using a method similar to the conventional one.
- the HMM storage device 370 stores the HMM parameters after learning. After the learning of the HMM is completed, when a text is given, the text is analyzed and the F0 pattern 372 is synthesized by using the HMM stored in the HMM storage device 370 according to the result. By using this F0 pattern 372 and a speech parameter string such as a mel cepstrum selected according to the phoneme of the text, a speech signal can be obtained in the same manner as in the prior art.
- HMM learning was performed according to the first embodiment, and a subjective evaluation (preference evaluation) test was performed on speech synthesized using the F0 pattern synthesized using the learned HMM.
- the experiment of this evaluation test was conducted using 503 utterances included in the voice corpus ATR503set created and published by the applicant. Of the 503 utterances, 490 utterances were used for HMM learning, and the rest were used for testing.
- the speech signal was sampled at a sampling rate of 16 kHz, and the spectral envelope was extracted by STRAIGHT analysis with a 5 ms frame shift.
- the feature vector consists of 40 mel cepstrum parameters including 0th order, log F0, and their delta and delta delta. A 5-state left-to-right unidirectional HMM model topology was used.
- (1) F0 pattern obtained from the speech waveform (Original)
- (2) F0 pattern generated by the first embodiment (Proposed)
- (3) the voiced portion is the original, and the unvoiced portion is the F0 pattern generated by the method of the first embodiment (Prop.+MP (Micro-prosody))
- (4) the voiced portion is the original, and the unvoiced portion is the F0 pattern obtained by spline interpolation (Spl+MP)
- of the above four patterns, (2) to (4) are continuous F0 patterns. Note that (2) contains neither micro-prosody nor F0 extraction errors, whereas (3) and (4) contain both.
- a continuous F0 pattern was synthesized using a continuous F0 pattern HMM, and voiced / unvoiced were judged using an MSD-HMM.
- the phrase component P and the accent component A are represented by target points, and the F0 pattern is fitted by combining them.
- the idea of using target points is not limited to this first embodiment.
- the F0 pattern observed by the method described above is separated into a phrase component P, an accent component A, and a micro-prosody component M, and HMM learning is performed for each of these time change patterns.
- the time change patterns of the phrase component P, the accent component A, and the micro-prosody component M are obtained using the learned HMM, and the F0 pattern is estimated by further combining them.
- the speech synthesis system 270 includes a model learning unit 280 that performs HMM learning for speech synthesis, and a speech synthesis unit 282 that, when text is input, synthesizes its speech using the HMM trained by the model learning unit 280 and outputs the synthesized speech signal 284.
- the model learning unit 280 includes a speech corpus storage device 90, an F0 extraction unit 92, and a spectrum parameter extraction unit 94, like the model learning unit 80 of the conventional speech synthesis system 70. In place of the HMM learning unit 96 of the model learning unit 80, however, the model learning unit 280 includes an F0 smoothing unit 290 that smooths the discontinuous F0 pattern 93 output from the F0 extraction unit 92 and outputs a continuous F0 pattern 291.
- the model learning unit 280 also includes an F0 separation unit 292 that separates the continuous F0 pattern output by the F0 smoothing unit 290 into a phrase component P, an accent component A, and a micro-prosody component M, and generates a time change pattern for each component.
- the model learning unit 280 further includes an HMM learning unit 294 that performs HMM learning from a multi-stream HMM learning data vector 293 (40 mel cepstrum parameters including the 0th order, the time change patterns of the three F0 components, and their delta and delta-delta), which is built from the mel cepstrum parameters 95 output from the spectrum parameter extraction unit 94 and the output of the F0 separation unit 292, based on the phoneme context labels corresponding to the learning data vector 293 read from the speech corpus storage device 90.
- the speech synthesis unit 282 includes an HMM storage device 310 that stores the HMM learned by the HMM learning unit 294, the same text analysis unit 112 as that shown in FIG. 2, a parameter generation unit 312 that, when given a context label string from the text analysis unit 112, uses the HMM stored in the HMM storage device 310 to estimate and output the time change patterns of the phrase component P, the accent component A, and the micro-prosody component M together with the mel cepstrum parameters, an F0 pattern synthesis unit 314 that generates and outputs an F0 pattern by synthesizing the time change patterns of the phrase component P, the accent component A, and the micro-prosody component M output by the parameter generation unit 312, and the speech synthesizer 116, which synthesizes speech from the mel cepstrum parameters output by the parameter generation unit 312 and the F0 pattern output by the F0 pattern synthesis unit 314.
- the control structure of the computer program for realizing the F0 smoothing unit 290, the F0 separating unit 292, and the HMM learning unit 294 shown in FIG. 9 is the same as that shown in FIG.
- the speech synthesis system 270 operates as follows.
- the voice corpus storage device 90 stores a large amount of speech signals.
- the speech signal is stored in units of frames, and a phoneme context label is attached to each phoneme.
- the F0 extraction unit 92 outputs a discontinuous F0 pattern 93 from the utterance signal of each utterance.
- the F0 smoothing unit 290 smoothes the discontinuous F0 pattern 93 and outputs a continuous F0 pattern 291.
- the F0 separation unit 292 receives the continuous F0 pattern 291 and the discontinuous F0 pattern 93 output from the F0 extraction unit 92 and, following the above-described method, gives the HMM learning unit 294 a learning data vector 293 consisting, for each frame, of the time change patterns of the phrase component P, the accent component A, and the micro-prosody component M, the information F0 (U/V), obtained from the discontinuous F0 pattern 93, that indicates whether the frame is voiced or unvoiced, and the mel cepstrum parameters calculated by the spectrum parameter extraction unit 94.
- for each frame of the speech signal of each utterance, the HMM learning unit 294 uses as learning data a feature vector of the above-described configuration, built from the label read from the speech corpus storage device 90, the learning data vector 293 given from the F0 separation unit 292, and the mel cepstrum parameters from the spectrum parameter extraction unit 94.
- statistical HMM learning is performed so that, when the context label of the frame to be estimated is given, the HMM outputs the probabilities of the values of the time change patterns of the phrase component P, the accent component A, and the micro-prosody component M of that frame, and of the mel cepstrum parameters.
- the parameters of the HMM are stored in the HMM storage device 310.
- when a text to be synthesized is given, the speech synthesis unit 282 operates as follows.
- the text analysis unit 112 analyzes the given text, generates a context label string indicating the speech to be synthesized, and provides the parameter generation unit 312 with the context label string.
- the parameter generation unit 312 refers to the HMM storage device 310 for each label included in the label string and estimates the parameter strings (the phrase component P, the accent component A, the micro-prosody component M, and the mel cepstrum parameters) that have the highest probability of being speech that generates such a label string.
- the phrase component P, the accent component A, and the micro-prosody component M are given to the F0 pattern synthesis unit 314, and the mel cepstrum parameters are given to the speech synthesizer 116.
- the F0 pattern synthesis unit 314 synthesizes the time change patterns of the phrase component P, the accent component A, and the micro-prosody component M, and gives them to the speech synthesizer 116 as an F0 pattern.
- the phrase component P, the accent component A, and the micro-prosody component M are all expressed logarithmically during HMM learning. Therefore, in the synthesis of the F0 pattern synthesis unit 314, these may be added to each other after being converted from logarithmic expressions to normal frequency components. At this time, since the zero point of each component is moved during learning, an operation to restore the zero point is also necessary.
- the speech synthesizer 116 synthesizes a speech signal according to the F0 pattern output from the F0 pattern synthesis unit 314, performs signal processing equivalent to modulating that signal according to the mel cepstrum parameters provided by the parameter generation unit 312, and outputs a synthesized speech signal 284.
- the F0 pattern is decomposed into a phrase component P, an accent component A, and a micro-prosody component M, and separate HMM learning is performed using them.
- the phrase component P, the accent component A, and the micro-prosody component M are separately generated using these HMMs based on the result of text analysis.
- by synthesizing the generated phrase component P, accent component A, and micro-prosody component M, an F0 pattern can be generated. If the F0 pattern obtained in this way is used, natural speech can be obtained as in the first embodiment.
- because the correspondence between the accent component A and the F0 pattern is clear, a specific word can easily be given focus by widening the range of the accent component A for that word. This can be seen from, for example, the operation of lowering the frequency of the component immediately before the vertical line 254 in the accent component 250 of FIG. 6 and the operation of lowering the frequency at the end of the F0 pattern in the accent components 250 and 252 of FIG. 6.
- Both the F0 pattern synthesis unit according to the first embodiment and the second embodiment can be realized by computer hardware and a computer program executed on the computer hardware.
- FIG. 10 shows the external appearance of this computer system 530
- FIG. 11 shows the internal configuration of the computer system 530.
- the computer system 530 includes a computer 540 having a memory port 552 and a DVD (Digital Versatile Disc) drive 550, a keyboard 546, a mouse 548, and a monitor 542.
- the computer 540 includes a CPU (Central Processing Unit) 556, a bus 566 connected to the CPU 556, the memory port 552, and the DVD drive 550, a read only memory (ROM) 558 that stores a boot program and the like, and a random access memory (RAM) 560 that is connected to the bus 566 and stores program instructions, system programs, work data, and the like.
- Computer system 530 further includes a network interface (I / F) 544 that provides a connection to a network 568 that allows communication with other terminals.
- a computer program for causing the computer system 530 to function as each functional unit of the F0 pattern synthesis unit according to the above-described embodiments is stored on the DVD 562 or the removable memory 564 mounted in the DVD drive 550 or the memory port 552, and is further transferred to the hard disk 554.
- the program may be transmitted to the computer 540 through the network 568 and stored in the hard disk 554.
- the program is loaded into the RAM 560 when executed.
- the program may be loaded directly into the RAM 560 from the DVD 562, from the removable memory 564, or via the network 568.
- This program includes an instruction sequence including a plurality of instructions for causing the computer 540 to function as each functional unit of the F0 pattern synthesis unit according to the above embodiment.
- Some of the basic functions necessary to cause the computer 540 to perform this operation are provided by the operating system or third-party programs that run on the computer 540, or by various programming toolkits or program libraries installed on the computer 540. Therefore, this program itself does not necessarily include all functions necessary for realizing the system and method of this embodiment.
- the program need only include instructions that realize the functions of the system described above by dynamically calling, at run time and in a controlled manner, the appropriate functions or the appropriate programs in a programming toolkit or program library so as to obtain the desired result. Of course, all necessary functions may instead be provided by the program alone.
- the present invention can be used to provide services using speech synthesis and to manufacture devices using speech synthesis.
Abstract
Description
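Referring to FIG. 3, the basic idea of the present invention is as follows. First, F0 patterns are extracted from a speech corpus to create an observed F0 pattern 130. This observed F0 pattern is usually discontinuous. The discontinuous F0 pattern is made continuous and smoothed to generate a continuous F0 pattern 132. Up to this point, the processing can be realized with prior art.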
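The step of turning a discontinuous observed F0 pattern into a continuous one can be illustrated with a minimal sketch. The text only says that prior art is used for this step, so the choice of linear interpolation, the convention that unvoiced frames are marked `0.0`, and the function name `interpolate_unvoiced` are all assumptions made for illustration.

```python
def interpolate_unvoiced(f0):
    """Fill unvoiced frames (marked 0.0) by linear interpolation.

    Assumption: linear interpolation stands in for whatever prior-art
    method is actually used. Leading and trailing unvoiced runs are
    filled with the nearest voiced value.
    """
    voiced = [i for i, v in enumerate(f0) if v > 0.0]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    out = list(f0)
    # Extend the first and last voiced values to the edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate inside each gap between consecutive voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for i in range(left + 1, right):
            w = (i - left) / gap
            out[i] = (1.0 - w) * f0[left] + w * f0[right]
    return out


# Hypothetical observed pattern in Hz; 0.0 marks an unvoiced frame.
observed = [0.0, 120.0, 0.0, 0.0, 150.0, 0.0]
continuous = interpolate_unvoiced(observed)
```

The result still needs smoothing before it plays the role of the continuous F0 pattern 132; this sketch covers only the gap filling.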
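F0 arises from the vibration of the vocal folds. A resonance mechanism is known to be effective for manipulating F0 patterns. Here, a resonance-based mapping (Non-Patent Literature 4) is applied, and latent interference between the accent component and the phrase component is handled by treating it as a kind of topological transformation.
Equation 4 expresses the decomposition of lnf0 on the time axis. More specifically, αf0r represents the phrase component (treated as a reference value), and φf0|f0r represents the accent component. With the accent component expressed as φf0|f0r and the phrase component as αf0r, lnf0 can be calculated by the following Equation (5).
A model that expresses the F0 pattern as a function of time t can be represented, in logarithmic form, as the resonance-based superposition of an accent component Ca(t) onto a phrase component Cp(t).
Given information on accentual phrase boundaries, an algorithm was developed for estimating the parameters of the target points (target parameters) from the F0 patterns observed for Japanese utterances. The parameters f0b and f0t are matched to the F0 range of the set of observed F0 patterns. In Japanese, an accentual phrase carries an accent (accent type 0, 1, 2, ...). The algorithm is as follows.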
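The F0 pattern is converted to φf0|f0r with f0r = f0b and smoothed with each of two window sizes (short term: 10 points, long term: 80 points) (step 340). Taking into account the overall rise-(flat)-fall characteristic of the Japanese accent, this removes the influence of micro-prosody (the modification of F0 by phoneme segments). The smoothed F0 pattern is converted back to F0 using Equation (5) for parameter extraction.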
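The two-window smoothing in step 340 can be sketched as follows. The text gives only the window sizes (10 and 80 points), so the centered moving average used here is an assumed filter, and `moving_average` and `smooth_two_scales` are hypothetical names. The sketch operates on whatever sequence it is given; the conversion to φf0|f0r and back is not shown.

```python
SHORT_WINDOW = 10  # short-term window size from the text (points)
LONG_WINDOW = 80   # long-term window size from the text (points)


def moving_average(values, window):
    """Centered moving average (assumed filter type).

    Each output averages the samples within window // 2 positions on
    either side, so a 10-point window spans up to 11 samples. The span
    is clipped at the edges of the sequence.
    """
    half = window // 2
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def smooth_two_scales(contour):
    """Return the short-term and long-term smoothed versions of a contour."""
    return moving_average(contour, SHORT_WINDOW), moving_average(contour, LONG_WINDOW)


# Hypothetical flat contour with a single-frame bump standing in for
# a micro-prosodic perturbation.
contour = [5.0] * 100
contour[50] = 6.0
short_term, long_term = smooth_two_scales(contour)
```

The long-term window flattens the bump more than the short-term one, which is why the text uses the long-term result for the coarse steps that follow.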
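A segment between pauses that is longer than 0.3 seconds is regarded as a breath group, and each breath group is further divided into N groups using the F0 pattern smoothed with the long-term window (step 342). The following processing is applied to each group, using the criterion of minimizing the absolute value of the F0 error. Next, an iteration control variable k is set to 0 so that step 348 can be executed repeatedly (step 344). (a) As an initial value, a three-target-point phrase component P having two low target points and one high target point is prepared (step 346). This phrase component P has a shape similar to, for example, the left half of the graph of the phrase component P at the bottom of FIG. 4. The timing of the high target point is aligned with the start of the second mora, and the first low target point is shifted 0.3 seconds earlier. The timing of the second low target point is aligned with the end of the breath group. The initial value of the phrase target strength γpi is determined using the F0 pattern smoothed with the long-term window.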
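The first part of step 342, picking out inter-pause segments longer than 0.3 seconds, can be sketched as below. The representation of pauses as sorted `(start, end)` pairs in seconds and the function name `breath_groups` are assumptions; the further division into N groups and the initialization of the phrase target points are not covered.

```python
MIN_BREATH_GROUP_SEC = 0.3  # threshold stated in the text


def breath_groups(pauses, utterance_end):
    """Return (start, end) spans between pauses that exceed the threshold.

    Assumption: `pauses` is a sorted list of non-overlapping (start, end)
    intervals in seconds, and the utterance starts at time 0.0.
    """
    groups = []
    cursor = 0.0
    for start, end in pauses:
        if start - cursor > MIN_BREATH_GROUP_SEC:
            groups.append((cursor, start))
        cursor = end
    if utterance_end - cursor > MIN_BREATH_GROUP_SEC:
        groups.append((cursor, utterance_end))
    return groups


# Hypothetical pause annotation for a 4-second utterance.
pauses = [(0.0, 0.2), (1.5, 1.8), (2.0, 2.4)]
groups = breath_groups(pauses, utterance_end=4.0)
```

In this example the stretch between 1.8 s and 2.0 s is too short to count, so only two breath groups are returned.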
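Given the estimated phrase component P, the accent target points are optimized so as to minimize the error between the generated F0 pattern and the observed F0 pattern. As a result, target points of the phrase component P and the accent component A are obtained that can generate an F0 pattern fitting the smoothed F0 pattern.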
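The idea of holding the phrase component fixed and adjusting an accent target point to reduce the absolute error can be sketched as a small grid search. This is an illustration only: the linear interpolation between target points, the plain addition of the two components, the single adjustable peak, and the names `render`, `abs_error`, and `fit_accent_peak` are all assumptions, and the optimization actually used is not specified in this passage.

```python
def render(targets, n_frames):
    """Linearly interpolate sorted (frame, value) target points (assumed shape).

    Frames outside the first and last target hold the nearest target value.
    """
    contour = []
    for t in range(n_frames):
        if t <= targets[0][0]:
            contour.append(targets[0][1])
        elif t >= targets[-1][0]:
            contour.append(targets[-1][1])
        else:
            for (t0, v0), (t1, v1) in zip(targets, targets[1:]):
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0)
                    contour.append(v0 + w * (v1 - v0))
                    break
    return contour


def abs_error(generated, observed):
    """Sum of absolute per-frame differences."""
    return sum(abs(g - o) for g, o in zip(generated, observed))


def fit_accent_peak(phrase, observed, low_frames, peak_frame, candidates):
    """Pick the accent peak height with the smallest absolute error.

    The phrase contour stays fixed; only the height of one high target
    point between two zero-valued low target points is searched.
    """
    best_height, best_err = None, float("inf")
    for height in candidates:
        accent = render(
            [(low_frames[0], 0.0), (peak_frame, height), (low_frames[1], 0.0)],
            len(observed),
        )
        generated = [p + a for p, a in zip(phrase, accent)]
        err = abs_error(generated, observed)
        if err < best_err:
            best_height, best_err = height, err
    return best_height, best_err


# Hypothetical data: a flat phrase contour plus a triangular accent.
phrase = [5.0] * 11
true_accent = render([(0, 0.0), (5, 0.4), (10, 0.0)], 11)
observed = [p + a for p, a in zip(phrase, true_accent)]
height, err = fit_accent_peak(phrase, observed, (0, 10), 5, [0.0, 0.2, 0.4, 0.6])
```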
<Configuration>
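Referring to FIG. 7, the F0 pattern synthesis unit 359 according to the first embodiment includes: a parameter estimation unit 366 that, for the continuous F0 pattern 132 obtained by smoothing and making continuous the observed F0 pattern 130 observed from each of the many speech signals in the speech corpus, estimates, based on given accent boundaries and in accordance with the principle described above, the target points that define the phrase component P and the target parameters that define the accent component A; an F0 pattern fitting unit 368 that synthesizes the phrase component P and the accent component A estimated by the parameter estimation unit 366 to generate a fitted F0 pattern that fits the continuous F0 pattern; an HMM learning unit 369 that performs HMM learning in the conventional manner using the fitted F0 pattern; and an HMM storage device 370 that stores the HMM parameters after learning. The process of synthesizing the F0 pattern 372 using the HMM stored in the HMM storage device 370 can be realized by a device similar to the speech synthesis unit 82 shown in FIG. 2.
Referring to FIG. 7, the system of the first embodiment operates as follows. Each observed F0 pattern 130 is smoothed and made continuous to obtain a continuous F0 pattern 132. The parameter estimation unit 366 decomposes this continuous F0 pattern 132 into a phrase component P and an accent component A and estimates the target parameters of each by the method described above. The F0 pattern fitting unit 368 synthesizes the phrase component P and the accent component A expressed by the estimated target parameters to obtain a fitted F0 pattern that fits the observed F0 pattern. The system performs this operation for each of the observed F0 patterns 130.
HMM learning was performed according to the first embodiment, and a subjective evaluation (preference) test was conducted on speech synthesized using F0 patterns synthesized with the learned HMM.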
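(2) F0 pattern generated by Embodiment 1 (Proposed)
(3) Voiced portions are the original; unvoiced portions are the F0 pattern generated by the method of Embodiment 1 (Prop.+MP (Micro-prosody))
(4) Voiced portions are the original; unvoiced portions are the F0 pattern obtained by spline interpolation (Spl+MP)
Of the four patterns above, (2) to (4) are continuous F0 patterns. Note that (2) contains neither micro-prosody nor F0 extraction errors, whereas (3) and (4) contain both.
(2) Proposed vs. Prop+MP
(3) Proposed vs. Spl+MP
(4) Prop+MP vs. Spl+MP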
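In the first embodiment, the phrase component P and the accent component A are represented by target points, and the F0 pattern is fitted by synthesizing them. However, the idea of using target points is not limited to the first embodiment. In the second embodiment, the F0 pattern observed by the method described above is separated into a phrase component P, an accent component A, and a micro-prosody component M, and HMM learning is performed on the time change pattern of each. When F0 is generated, the time change patterns of the phrase component P, the accent component A, and the micro-prosody component M are obtained using the learned HMMs and are then synthesized to estimate the F0 pattern.
Referring to FIG. 9, the speech synthesis system 270 according to this embodiment includes a model learning unit 280 that performs HMM learning for speech synthesis, and a speech synthesis unit 282 that, when text is input, synthesizes the corresponding speech using the HMM trained by the model learning unit 280 and outputs it as a synthesized speech signal 284.
The speech synthesis system 270 operates as follows. The speech corpus storage device 90 stores a large number of utterance signals. The utterance signals are stored frame by frame, and a phoneme context label is attached to each phoneme. The F0 extraction unit 92 outputs a discontinuous F0 pattern 93 from the utterance signal of each utterance. The F0 smoothing unit 290 smooths the discontinuous F0 pattern 93 and outputs a continuous F0 pattern 291. The F0 separation unit 292 receives the continuous F0 pattern 291 and the discontinuous F0 pattern 93 output by the F0 extraction unit 92 and, following the method described earlier, gives the HMM learning unit 294 a learning data vector 293 that consists, for each frame, of the time change pattern of the phrase component P, the time change pattern of the accent component A, the time change pattern of the micro-prosody component M, the information F0 (U/V), obtained from the discontinuous F0 pattern 93, indicating whether the frame lies in a voiced or an unvoiced section, and the mel cepstrum parameters calculated by the spectrum parameter extraction unit 94 for each frame of the speech signal of each utterance.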
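In this second embodiment, the F0 pattern is decomposed into a phrase component P, an accent component A, and a micro-prosody component M, and separate HMMs are trained on them. At synthesis time, the phrase component P, the accent component A, and the micro-prosody component M are generated separately with these HMMs based on the result of text analysis. The F0 pattern is then generated by synthesizing the generated phrase component P, accent component A, and micro-prosody component M. Using the F0 pattern obtained in this way yields natural speech, as in the first embodiment. Furthermore, because the correspondence between the accent component A and the F0 pattern is clear, a specific word can easily be given focus by widening the range of the accent component A for that word. This can be seen from, for example, the operation of lowering the frequency of the component immediately before the vertical line 254 in the accent component 250 of FIG. 6, and the operation of lowering the frequency at the end of the F0 pattern in the accent components 250 and 252 of FIG. 6.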
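The final synthesis of the three generated components can be sketched as follows. Elsewhere this document notes that the components are expressed logarithmically during HMM learning, that they may be added after conversion back to ordinary frequency values, and that a shifted zero point must be restored. How the zero point is shifted is not stated, so the form `stored = ln(value + shift)` and the names `from_log`, `synthesize_f0`, and `shifts` are assumptions made for illustration.

```python
import math


def from_log(stored, shift):
    """Invert the assumed training-time transform stored = ln(value + shift)."""
    return [math.exp(s) - shift for s in stored]


def synthesize_f0(phrase_log, accent_log, micro_log, shifts):
    """Convert the three log-domain components to Hz and add them per frame.

    `shifts` holds the hypothetical zero-point offsets for P, A, and M.
    """
    p = from_log(phrase_log, shifts["P"])
    a = from_log(accent_log, shifts["A"])
    m = from_log(micro_log, shifts["M"])
    return [pi + ai + mi for pi, ai, mi in zip(p, a, m)]


# Hypothetical two-frame example. The accent and micro-prosody values can
# be zero or negative, which is why a shift is assumed before the log.
shifts = {"P": 0.0, "A": 100.0, "M": 100.0}
phrase_hz, accent_hz, micro_hz = [120.0, 110.0], [30.0, 0.0], [-2.0, 1.0]
phrase_log = [math.log(v + shifts["P"]) for v in phrase_hz]
accent_log = [math.log(v + shifts["A"]) for v in accent_hz]
micro_log = [math.log(v + shifts["M"]) for v in micro_hz]
f0 = synthesize_f0(phrase_log, accent_log, micro_log, shifts)
```

Keeping the components separate until this last step is what makes it possible to rescale the accent component for a single word without touching the phrase contour.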
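The F0 pattern synthesis units according to the first and second embodiments can both be realized by computer hardware and a computer program executed on that hardware. FIG. 10 shows the external appearance of this computer system 530, and FIG. 11 shows the internal configuration of the computer system 530.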
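40 phrase command
42 phrase control mechanism
44 accent command
46 accent control mechanism
48,152 adder
50 F0 pattern
70,270 speech synthesis system
80,280 model learning unit
82,282 speech synthesis unit
90 speech corpus storage device
92 F0 extraction unit
93 discontinuous F0 pattern
94 spectrum parameter extraction unit
95 mel cepstrum parameters
96,294,369 HMM learning unit
110,310,139,370 HMM storage device
112 text analysis unit
114 parameter generation unit
116 speech synthesizer
130,170 observed F0 pattern
132,174,291 continuous F0 pattern
134,146,200,202,204,206,208,250,252 accent component
136,148,220,222,242,244 phrase component
138,150 micro-prosody component
140,142,144 HMM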
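154,240,246 F0 pattern
172 phrase component
290 F0 smoothing unit
292 F0 separation unit
293 learning data vector
312 parameter generation unit
314,359 F0 pattern synthesis unit
366 parameter estimation unit
368 F0 pattern fitting unit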
Claims (8)
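- A quantitative F0 pattern generation device comprising:
means for generating an accent component of an F0 pattern, using a given number of target points, for accent phrases of an utterance obtained by text analysis;
means for generating a phrase component of the F0 pattern, using a limited number of target points, by dividing the utterance into groups each including one or more accent phrases in accordance with linguistic information including the structure of the utterance; and
means for generating the F0 pattern based on the accent component and the phrase component.
- A quantitative F0 pattern generation method comprising the steps of:
generating an accent component of an F0 pattern, using a given number of target points, for accent phrases of an utterance obtained by text analysis;
generating a phrase component of the F0 pattern, using a limited number of target points, by dividing the utterance into groups each including one or more accent phrases in accordance with linguistic information including the structure of the utterance; and
generating the F0 pattern based on the accent component and the phrase component.
- A quantitative F0 pattern generation device comprising:
model storage means for storing parameters of a generation model for generating target parameters of a phrase component of an F0 pattern and of a generation model for generating target parameters of an accent component of the F0 pattern;
text analysis means for receiving input of text to be speech-synthesized, analyzing the text, and outputting a control symbol string for speech synthesis;
phrase component generating means for generating the phrase component of the F0 pattern by matching the control symbol string output by the text analysis means against the generation model for phrase component generation;
accent component generating means for generating the accent component of the F0 pattern by matching the control symbol string output by the text analysis means against the generation model for accent component generation; and
F0 pattern generating means for generating the F0 pattern by synthesizing the phrase component generated by the phrase component generating means and the accent component generated by the accent component generating means.
- A quantitative F0 pattern generation method using model storage means that stores parameters of a generation model for generating target parameters of a phrase component of an F0 pattern and of a generation model for generating target parameters of an accent component of the F0 pattern, the method comprising:
a text analysis step of receiving input of text to be speech-synthesized, analyzing the text, and outputting a control symbol string for speech synthesis;
a phrase component generating step of generating the phrase component of the F0 pattern by matching the control symbol string output in the text analysis against the generation model for phrase component generation stored in the storage means;
an accent component generating step of generating the accent component of the F0 pattern by matching the control symbol string output in the text analysis step against the generation model for accent component generation stored in the storage means; and
an F0 pattern generating step of generating the F0 pattern by synthesizing the phrase component generated in the phrase component generating step and the accent component generated in the accent component generating step.
- A model learning device for F0 pattern generation, comprising:
F0 pattern extraction means for extracting an F0 pattern from a speech data signal;
parameter estimation means for estimating target parameters representing a phrase component and target parameters representing an accent component, in order to represent an F0 pattern that fits the extracted F0 pattern as a superposition of the phrase component and the accent component; and
model learning means for training an F0 generation model using, as learning data, the continuous F0 pattern represented by the target parameters of the phrase component and the target parameters of the accent component estimated by the parameter estimation means.
- The model learning device for F0 pattern generation according to claim 5, wherein the F0 generation model includes a generation model for phrase component generation and a generation model for accent component generation, and
the model learning means includes means for training the generation model for phrase component generation and the generation model for accent component generation using, as respective learning data, the time change pattern of the phrase component represented by the target parameters of the phrase component estimated by the parameter estimation means and the time change pattern of the accent component represented by the target parameters of the accent component.
- A model learning method for F0 pattern generation, comprising:
an F0 pattern extraction step of extracting an F0 pattern from a speech data signal;
a parameter estimation step of estimating target parameters representing a phrase component and target parameters representing an accent component, in order to represent an F0 pattern that fits the F0 pattern extracted in the F0 pattern extraction step as a superposition of the phrase component and the accent component; and
a model learning step of training an F0 generation model using, as learning data, the continuous F0 pattern represented by the target parameters of the phrase component and the target parameters of the accent component estimated in the parameter estimation step.
- The model learning method for F0 pattern generation according to claim 7, wherein the F0 generation model includes a generation model for phrase component generation and a generation model for accent component generation, and
the model learning step includes a step of training the generation model for phrase component generation and the generation model for accent component generation using, as learning data, the time change pattern of the phrase component represented by the target parameters of the phrase component estimated in the parameter estimation step and the time change pattern of the accent component represented by the target parameters of the accent component.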
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201480045803.7A CN105474307A (zh) | 2013-08-23 | 2014-08-13 | Quantitative F0 contour generation device and method, and model learning device and method for generating F0 contour |
US14/911,189 US20160189705A1 (en) | 2013-08-23 | 2014-08-13 | Quantitative f0 contour generating device and method, and model learning device and method for f0 contour generation |
EP14837587.6A EP3038103A4 (en) | 2013-08-23 | 2014-08-13 | Quantitative f0 pattern generation device and method, and model learning device and method for generating f0 pattern |
KR1020167001355A KR20160045673A (ko) | 2013-08-23 | 2014-08-13 | Quantitative F0 pattern generation device and method, and model learning device and method for F0 pattern generation |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2013173634A JP5807921B2 (ja) | 2013-08-23 | 2013-08-23 | Quantitative F0 pattern generation device and method, model learning device for F0 pattern generation, and computer program |
JP2013-173634 | 2013-08-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015025788A1 true WO2015025788A1 (ja) | 2015-02-26 |
Family
ID=52483564
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2014/071392 WO2015025788A1 (ja) | 2013-08-23 | 2014-08-13 | Quantitative F0 pattern generation device and method, and model learning device and method for F0 pattern generation |
Country Status (6)
Country | Link |
---|---|
US (1) | US20160189705A1 (ja) |
EP (1) | EP3038103A4 (ja) |
JP (1) | JP5807921B2 (ja) |
KR (1) | KR20160045673A (ja) |
CN (1) | CN105474307A (ja) |
WO (1) | WO2015025788A1 (ja) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6468518B2 (ja) * | 2016-02-23 | 2019-02-13 | 日本電信電話株式会社 | 基本周波数パターン予測装置、方法、及びプログラム |
JP6472005B2 (ja) * | 2016-02-23 | 2019-02-20 | 日本電信電話株式会社 | 基本周波数パターン予測装置、方法、及びプログラム |
JP6468519B2 (ja) * | 2016-02-23 | 2019-02-13 | 日本電信電話株式会社 | 基本周波数パターン予測装置、方法、及びプログラム |
JP6876641B2 (ja) * | 2018-02-20 | 2021-05-26 | 日本電信電話株式会社 | 音声変換学習装置、音声変換装置、方法、及びプログラム |
CN112530213B (zh) * | 2020-12-25 | 2022-06-03 | 方湘 | 一种汉语音调学习方法及*** |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH02113299A (ja) * | 1988-10-22 | 1990-04-25 | Hiroya Fujisaki | 基本周波数パタン生成装置 |
JPH06332490A (ja) * | 1993-05-20 | 1994-12-02 | Meidensha Corp | 音声合成装置のアクセント成分基本テーブルの作成方法 |
JPH0990970A (ja) * | 1995-09-20 | 1997-04-04 | Atr Onsei Honyaku Tsushin Kenkyusho:Kk | 音声合成装置 |
JPH09198073A (ja) * | 1996-01-11 | 1997-07-31 | Secom Co Ltd | 音声合成装置 |
JP2003005775A (ja) * | 2001-06-26 | 2003-01-08 | Oki Electric Ind Co Ltd | テキスト音声変換装置における高速読上げ制御方法 |
JP2008191525A (ja) * | 2007-02-07 | 2008-08-21 | Nippon Telegr & Teleph Corp <Ntt> | F0値時系列生成装置、その方法、そのプログラム、及びその記録媒体 |
Family Cites Families (55)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3704345A (en) * | 1971-03-19 | 1972-11-28 | Bell Telephone Labor Inc | Conversion of printed text into synthetic speech |
US5475796A (en) * | 1991-12-20 | 1995-12-12 | Nec Corporation | Pitch pattern generation apparatus |
WO2000058943A1 (fr) * | 1999-03-25 | 2000-10-05 | Matsushita Electric Industrial Co., Ltd. | Systeme et procede de synthese de la parole |
CN1207664C (zh) * | 1999-07-27 | 2005-06-22 | 国际商业机器公司 | 对语音识别结果中的错误进行校正的方法和语音识别*** |
WO2001035389A1 (en) * | 1999-11-11 | 2001-05-17 | Koninklijke Philips Electronics N.V. | Tone features for speech recognition |
US6810379B1 (en) * | 2000-04-24 | 2004-10-26 | Sensory, Inc. | Client/server architecture for text-to-speech synthesis |
US20080147404A1 (en) * | 2000-05-15 | 2008-06-19 | Nusuara Technologies Sdn Bhd | System and methods for accent classification and adaptation |
US6856958B2 (en) * | 2000-09-05 | 2005-02-15 | Lucent Technologies Inc. | Methods and apparatus for text to speech processing using language independent prosody markup |
CN1187693C (zh) * | 2000-09-30 | 2005-02-02 | 英特尔公司 | 以自底向上方式将声调集成到汉语连续语音识别***中的方法和*** |
US7263488B2 (en) * | 2000-12-04 | 2007-08-28 | Microsoft Corporation | Method and apparatus for identifying prosodic word boundaries |
US6845358B2 (en) * | 2001-01-05 | 2005-01-18 | Matsushita Electric Industrial Co., Ltd. | Prosody template matching for text-to-speech systems |
WO2002073595A1 (fr) * | 2001-03-08 | 2002-09-19 | Matsushita Electric Industrial Co., Ltd. | Dispositif generateur de prosodie, procede de generation de prosodie, et programme |
US7035794B2 (en) * | 2001-03-30 | 2006-04-25 | Intel Corporation | Compressing and using a concatenative speech database in text-to-speech systems |
US20030055640A1 (en) * | 2001-05-01 | 2003-03-20 | Ramot University Authority For Applied Research & Industrial Development Ltd. | System and method for parameter estimation for pattern recognition |
JP4056470B2 (ja) * | 2001-08-22 | 2008-03-05 | インターナショナル・ビジネス・マシーンズ・コーポレーション | イントネーション生成方法、その方法を用いた音声合成装置及びボイスサーバ |
US7136802B2 (en) * | 2002-01-16 | 2006-11-14 | Intel Corporation | Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system |
US20030191645A1 (en) * | 2002-04-05 | 2003-10-09 | Guojun Zhou | Statistical pronunciation model for text to speech |
US7136816B1 (en) * | 2002-04-05 | 2006-11-14 | At&T Corp. | System and method for predicting prosodic parameters |
US7136818B1 (en) * | 2002-05-16 | 2006-11-14 | At&T Corp. | System and method of providing conversational visual prosody for talking heads |
US7219059B2 (en) * | 2002-07-03 | 2007-05-15 | Lucent Technologies Inc. | Automatic pronunciation scoring for language learning |
US20040030555A1 (en) * | 2002-08-12 | 2004-02-12 | Oregon Health & Science University | System and method for concatenating acoustic contours for speech synthesis |
US7467087B1 (en) * | 2002-10-10 | 2008-12-16 | Gillick Laurence S | Training and using pronunciation guessers in speech recognition |
US8768701B2 (en) * | 2003-01-24 | 2014-07-01 | Nuance Communications, Inc. | Prosodic mimic method and apparatus |
US20050086052A1 (en) * | 2003-10-16 | 2005-04-21 | Hsuan-Huei Shih | Humming transcription system and methodology |
US7315811B2 (en) * | 2003-12-31 | 2008-01-01 | Dictaphone Corporation | System and method for accented modification of a language model |
US20050187772A1 (en) * | 2004-02-25 | 2005-08-25 | Fuji Xerox Co., Ltd. | Systems and methods for synthesizing speech using discourse function level prosodic features |
US20060229877A1 (en) * | 2005-04-06 | 2006-10-12 | Jilei Tian | Memory usage in a text-to-speech system |
US20060259303A1 (en) * | 2005-05-12 | 2006-11-16 | Raimo Bakis | Systems and methods for pitch smoothing for text-to-speech synthesis |
CN101176146B (zh) * | 2005-05-18 | 2011-05-18 | 松下电器产业株式会社 | 声音合成装置 |
CN1945693B (zh) * | 2005-10-09 | 2010-10-13 | 株式会社东芝 | 训练韵律统计模型、韵律切分和语音合成的方法及装置 |
JP4559950B2 (ja) * | 2005-10-20 | 2010-10-13 | 株式会社東芝 | 韻律制御規則生成方法、音声合成方法、韻律制御規則生成装置、音声合成装置、韻律制御規則生成プログラム及び音声合成プログラム |
US7996222B2 (en) * | 2006-09-29 | 2011-08-09 | Nokia Corporation | Prosody conversion |
JP4455610B2 (ja) * | 2007-03-28 | 2010-04-21 | 株式会社東芝 | 韻律パタン生成装置、音声合成装置、プログラムおよび韻律パタン生成方法 |
JP2009047957A (ja) * | 2007-08-21 | 2009-03-05 | Toshiba Corp | ピッチパターン生成方法及びその装置 |
JP5238205B2 (ja) * | 2007-09-07 | 2013-07-17 | ニュアンス コミュニケーションズ,インコーポレイテッド | 音声合成システム、プログラム及び方法 |
US7996214B2 (en) * | 2007-11-01 | 2011-08-09 | At&T Intellectual Property I, L.P. | System and method of exploiting prosodic features for dialog act tagging in a discriminative modeling framework |
JP5025550B2 (ja) * | 2008-04-01 | 2012-09-12 | 株式会社東芝 | 音声処理装置、音声処理方法及びプログラム |
US8374873B2 (en) * | 2008-08-12 | 2013-02-12 | Morphism, Llc | Training and applying prosody models |
US8571849B2 (en) * | 2008-09-30 | 2013-10-29 | At&T Intellectual Property I, L.P. | System and method for enriching spoken language translation with prosodic information |
US8321225B1 (en) * | 2008-11-14 | 2012-11-27 | Google Inc. | Generating prosodic contours for synthesized speech |
US8296141B2 (en) * | 2008-11-19 | 2012-10-23 | At&T Intellectual Property I, L.P. | System and method for discriminative pronunciation modeling for voice search |
JP5293460B2 (ja) * | 2009-07-02 | 2013-09-18 | ヤマハ株式会社 | 歌唱合成用データベース生成装置、およびピッチカーブ生成装置 |
JP5471858B2 (ja) * | 2009-07-02 | 2014-04-16 | ヤマハ株式会社 | 歌唱合成用データベース生成装置、およびピッチカーブ生成装置 |
CN101996628A (zh) * | 2009-08-21 | 2011-03-30 | 索尼株式会社 | 提取语音信号的韵律特征的方法和装置 |
JP5747562B2 (ja) * | 2010-10-28 | 2015-07-15 | ヤマハ株式会社 | 音響処理装置 |
US9286886B2 (en) * | 2011-01-24 | 2016-03-15 | Nuance Communications, Inc. | Methods and apparatus for predicting prosody in speech synthesis |
US9087519B2 (en) * | 2011-03-25 | 2015-07-21 | Educational Testing Service | Computer-implemented systems and methods for evaluating prosodic features of speech |
WO2012164835A1 (ja) * | 2011-05-30 | 2012-12-06 | 日本電気株式会社 | 韻律生成装置、音声合成装置、韻律生成方法および韻律生成プログラム |
US10453479B2 (en) * | 2011-09-23 | 2019-10-22 | Lessac Technologies, Inc. | Methods for aligning expressive speech utterances with text and systems therefor |
JP2014038282A (ja) * | 2012-08-20 | 2014-02-27 | Toshiba Corp | 韻律編集装置、方法およびプログラム |
US9135231B1 (en) * | 2012-10-04 | 2015-09-15 | Google Inc. | Training punctuation models |
US9224387B1 (en) * | 2012-12-04 | 2015-12-29 | Amazon Technologies, Inc. | Targeted detection of regions in speech processing data streams |
US9495955B1 (en) * | 2013-01-02 | 2016-11-15 | Amazon Technologies, Inc. | Acoustic model training |
US9292489B1 (en) * | 2013-01-16 | 2016-03-22 | Google Inc. | Sub-lexical language models with word level pronunciation lexicons |
US9761247B2 (en) * | 2013-01-31 | 2017-09-12 | Microsoft Technology Licensing, Llc | Prosodic and lexical addressee detection |
-
2013
- 2013-08-23 JP JP2013173634A patent/JP5807921B2/ja not_active Expired - Fee Related
-
2014
- 2014-08-13 EP EP14837587.6A patent/EP3038103A4/en not_active Ceased
- 2014-08-13 US US14/911,189 patent/US20160189705A1/en not_active Abandoned
- 2014-08-13 KR KR1020167001355A patent/KR20160045673A/ko not_active Application Discontinuation
- 2014-08-13 CN CN201480045803.7A patent/CN105474307A/zh active Pending
- 2014-08-13 WO PCT/JP2014/071392 patent/WO2015025788A1/ja active Application Filing
Non-Patent Citations (9)
Title |
---|
FUJISAKI, H.; HIROSE, K.: "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", J. ACOUST. SOC. JPN., vol. 5, 1984, pages 233 - 242 |
L. BREIMAN; J.H. FRIEDMAN; R.A. OLSHEN; C.J. STONE: "Classification and Regression Trees", WADSWORTH, 1984 |
NI, J.; NAKAMURA, S.: "Use of Poisson processes to generate fundamental frequency contours", PROC. OF ICASSP2007, 2007, pages 825 - 828 |
NI, J.; SHIGA, Y.; KAWAI, H.; KASHIOKA, H.: "Resonance-based spectral deformation in HMM-based speech synthesis", PROC. OF ISCSLP2012, 2012, pages 88 - 92, XP032321828, DOI: 10.1109/ISCSLP.2012.6423478 |
S. KIRKPATRICK; C.D. GELATT, JR.; M.P. VECCHI: "Optimization by simulated annealing", 1982, IBM THOMAS J. WATSON RESEARCH CENTER |
See also references of EP3038103A4 * |
TETSUYA MATSUDA ET AL.: "HMM-based F0 Contour Synthesis Using the Generation Process Model", IEICE TECHNICAL REPORT, vol. 110, no. 81, 10 June 2010 (2010-06-10), pages 73 - 78, XP008182279 * |
TOKUDA, K.; MASUKO, T.; MIYAZAKI, N.; KOBAYASHI, T.: "Hidden Markov models based on multi-space probability distribution for pitch contour modeling", PROC. OF ICASSP1999, 1999, pages 229 - 232, XP010328034, DOI: 10.1109/ICASSP.1999.758104 |
YUSUKE FURUYAMA ET AL.: "Use of Linguistic Information for Automatic Extraction of F0 Contour Generation Process Model Parameters", IEICE TECHNICAL REPORT, vol. 103, no. 263, 14 August 2003 (2003-08-14), pages 37 - 42, XP007006778 * |
Also Published As
Publication number | Publication date |
---|---|
CN105474307A (zh) | 2016-04-06 |
KR20160045673A (ko) | 2016-04-27 |
EP3038103A1 (en) | 2016-06-29 |
US20160189705A1 (en) | 2016-06-30 |
EP3038103A4 (en) | 2017-05-31 |
JP2015041081A (ja) | 2015-03-02 |
JP5807921B2 (ja) | 2015-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Blaauw et al. | A neural parametric singing synthesizer | |
Tokuda et al. | Speech synthesis based on hidden Markov models | |
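JP4455610B2 (ja) | Prosody pattern generation device, speech synthesis device, program, and prosody pattern generation method | |
JP4328698B2 (ja) | Segment set creation method and apparatus | |
CN107924686B (zh) | Speech processing device, speech processing method, and storage medium | |
JP6802958B2 (ja) | Speech synthesis system, speech synthesis program, and speech synthesis method | |
JP6266372B2 (ja) | Speech synthesis dictionary generation device, speech synthesis dictionary generation method, and program | |
JP6392012B2 (ja) | Speech synthesis dictionary creation device, speech synthesis device, speech synthesis dictionary creation method, and speech synthesis dictionary creation program | |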
EP2070084A2 (en) | Prosody conversion | |
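KR20230056741A (ko) | Synthetic data augmentation using voice conversion and speech recognition models | |
JP5807921B2 (ja) | Quantitative F0 pattern generation device and method, model learning device for F0 pattern generation, and computer program | |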
EP4266306A1 (en) | A speech processing system and a method of processing a speech signal | |
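JP6631883B2 (ja) | Model learning device for cross-lingual speech synthesis, model learning method for cross-lingual speech synthesis, and program | |
JP2016151736A (ja) | Speech processing device and program | |
JP6330069B2 (ja) | Multi-stream spectral representation for statistical parametric speech synthesis | |
JP5574344B2 (ja) | Speech synthesis device, speech synthesis method, and speech synthesis program based on one-model speech recognition and synthesis | |
JP6137708B2 (ja) | Quantitative F0 pattern generation device, model learning device for F0 pattern generation, and computer program | |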
Chunwijitra et al. | A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis | |
JP7357518B2 (ja) | Speech synthesis device and program | |
Takaki et al. | Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2012 | |
Yamagishi et al. | Improved average-voice-based speech synthesis using gender-mixed modeling and a parameter generation algorithm considering GV | |
Cai et al. | Statistical parametric speech synthesis using a hidden trajectory model | |
JP2021099454A (ja) | Speech synthesis device, speech synthesis program, and speech synthesis method | |
Guner et al. | A small footprint hybrid statistical/unit selection text-to-speech synthesis system for agglutinative languages | |
KR100488121B1 (ko) | Speaker verification apparatus and method applying per-speaker cepstrum weights to improve discrimination between speakers |
Legal Events
Code | Title | Description |
---|---|---|
WWE | Wipo information: entry into national phase | Ref document number: 201480045803.7; Country of ref document: CN |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14837587; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 20167001355; Country of ref document: KR; Kind code of ref document: A |
WWE | Wipo information: entry into national phase | Ref document number: 2014837587; Country of ref document: EP |
WWE | Wipo information: entry into national phase | Ref document number: 14911189; Country of ref document: US |
NENP | Non-entry into the national phase | Ref country code: DE |