WO2010050103A1 - Speech synthesis device - Google Patents
Speech synthesis device
- Publication number
- WO2010050103A1 (international application PCT/JP2009/004004)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- prosody
- speech
- information
- candidate
- unit
- Prior art date
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Definitions
- the present invention relates to a speech synthesizer that performs speech synthesis processing for synthesizing speech representing a character string.
- FIG. 1 is a block diagram showing the configuration of this type of speech synthesizer.
- Non-Patent Document 1 to Non-Patent Document 3, Patent Document 1 and Patent Document 2 describe speech synthesis apparatuses having such a configuration.
- the speech synthesizer shown in FIG. 1 includes a language processing unit 901, a prosody estimation unit 902, a segment information storage unit 905, a segment selection unit 906, and a waveform generation unit 908.
- the unit information storage unit 905 stores speech unit information representing speech units generated for each speech synthesis unit and attribute information of each speech unit.
- the speech unit information is information used to generate synthesized speech (speech waveform).
- the speech segment information is often information extracted from speech uttered by humans (natural speech waveform).
- the speech segment information is generated based on information obtained by recording a voice uttered (spoken) by an announcer or a voice actor.
- the person (speaker) who uttered the voice that is the basis of the speech unit information is called the original speaker of the speech unit.
- the speech segment is a speech waveform, a linear prediction analysis parameter, a cepstrum coefficient, or the like divided (cut out) for each speech synthesis unit.
- The attribute information of a speech unit includes the phoneme environment of the speech from which each unit was extracted, phonetic information such as pitch frequency, amplitude, and duration, and prosodic information.
- As a speech synthesis unit, a phoneme, CV, CVC, or VCV (where V is a vowel and C is a consonant) is often used. Details of speech unit length and speech synthesis units are described in Non-Patent Documents 1 to 3.
- The language processing unit 901 performs morphological analysis, syntactic analysis, reading assignment, and the like on the input character string information, and outputs information representing a symbol string indicating the reading (for example, phoneme symbols) and information representing the part of speech, inflection, accent type, and the like to the prosody estimation unit 902 and the segment selection unit 906 as the language analysis result.
- Based on the language analysis result output from the language processing unit 901, the prosody estimation unit 902 estimates the prosody of the synthesized speech (pitch, duration, power, and so on) and outputs prosodic information representing the estimated prosody to the segment selection unit 906 and the waveform generation unit 908.
- Based on the language analysis result and the estimated prosody, the unit selection unit 906 selects speech unit information from the speech unit information stored in the unit information storage unit 905 as described below, and outputs the selected speech unit information and its attribute information to the waveform generation unit 908.
- First, the segment selection unit 906 obtains, for each speech synthesis unit, information representing the characteristics of the speech to be synthesized (hereinafter, the "target segment environment") based on the input language analysis result and the estimated prosody.
- The target segment environment includes the corresponding, preceding, and following phonemes, presence or absence of stress, distance from the accent nucleus, pitch frequency for each speech synthesis unit, power, unit duration, cepstrum, MFCC (Mel Frequency Cepstral Coefficients), and their delta values (amount of change per unit time).
- Next, the segment selection unit 906 acquires, from the segment information storage unit 905, a plurality of pieces of speech unit information representing speech units whose phonemes correspond to (for example, match) specific information (mainly the corresponding phoneme) included in the obtained target segment environment. The acquired speech unit information constitutes the candidates for the speech unit information used to synthesize the speech.
- Then, for each piece of the acquired speech unit information, the segment selection unit 906 calculates a cost, an index indicating how appropriate that speech unit information is for synthesizing the speech.
- The cost decreases as the appropriateness increases; that is, the lower the cost of the speech unit information used, the higher the naturalness of the synthesized speech (the degree to which it resembles speech uttered by a human). The segment selection unit 906 therefore selects the speech unit information with the smallest calculated cost.
- The waveform generation unit 908 generates speech waveforms such that the prosody of the speech units represented by the selected speech unit information becomes the prosody represented by the prosodic information, and outputs the waveform obtained by concatenating the generated waveforms as the synthesized speech.
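- The conventional flow described above can be illustrated with a short sketch. The following Python fragment is only an illustrative simplification of cost-based unit selection; the unit database, cost function, and feature fields are hypothetical placeholders, not those of the cited documents.

```python
# Illustrative sketch of conventional cost-based unit selection (background art).
# The unit database, cost weights, and feature fields are hypothetical examples.

def cost(unit, target):
    """Lower is better: penalize pitch and duration mismatch (simplified)."""
    return abs(unit["pitch"] - target["pitch"]) + abs(unit["duration"] - target["duration"])

def select_units(target_environments, unit_database):
    """For each target segment environment, pick the stored unit with minimum cost."""
    selected = []
    for target in target_environments:
        # Candidates: stored units whose phoneme matches the target phoneme.
        candidates = [u for u in unit_database if u["phoneme"] == target["phoneme"]]
        selected.append(min(candidates, key=lambda u: cost(u, target)))
    return selected

units = [
    {"phoneme": "a", "pitch": 120.0, "duration": 0.09},
    {"phoneme": "a", "pitch": 180.0, "duration": 0.11},
]
targets = [{"phoneme": "a", "pitch": 170.0, "duration": 0.10}]
print(select_units(targets, units))  # -> the 180 Hz unit (closer to the target)
```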
- The speech synthesizer described in Patent Document 3 synthesizes speech so that it has the prosody of speech uttered by the user (the prosody requested by the user, i.e., the required prosody). With this speech synthesizer, the user can bring the prosody of the synthesized speech closer to the prosody of his or her own utterance.
- In general, the stored speech unit information represents speech units which, when used to synthesize speech having the reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness is higher than a predetermined reference value.
- Consequently, when the speech synthesizer synthesizes speech having a prosody that differs significantly from the reference prosody, the naturalness of the synthesized speech is relatively likely to fall below the reference value.
- Moreover, the prosody requested by the user may differ significantly from the reference prosody. The above speech synthesizer therefore has the problem that it may synthesize speech whose naturalness is excessively low (speech that is extremely unlikely to be recognized as speech uttered by a human).
- This problem also occurs when the required prosody is a prosody input (or edited) by the user, or when the required prosody is an artificially generated prosody.
- Accordingly, an object of the present invention is to provide a speech synthesizer capable of solving the above problem of synthesizing speech whose naturalness is excessively low.
- To achieve this object, a speech synthesizer according to the present invention comprises:
- speech unit information storage means for storing speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value;
- required prosody information accepting means for accepting required prosody information representing a required prosody, which is a prosody requested by a user;
- intermediate prosody information generating means for generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and
- speech synthesis means for performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
- A speech synthesis method according to the present invention is a method in which, when speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value, is stored in a storage device, the method accepts required prosody information representing a required prosody, which is a prosody requested by a user, generates intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody, and performs speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
- A speech synthesis program according to the present invention is a program for causing an information processing device to realize:
- speech unit information storage processing means for storing, in a storage device, speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value;
- required prosody information accepting means for accepting required prosody information representing a required prosody, which is a prosody requested by a user;
- intermediate prosody information generating means for generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and
- speech synthesis means for performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
- Because the present invention is configured as described above, the required prosody can be reflected in the synthesized speech while the naturalness of the synthesized speech is prevented from becoming excessively low.
- FIG. 1 is a diagram showing the schematic configuration of a speech synthesizer according to the background art. FIG. 2 is a block diagram showing an outline of the functions of a speech synthesizer according to a first embodiment of the present invention. FIG. 3 is a flowchart showing the speech synthesis program executed by the CPU of the speech synthesizer shown in FIG. 2. FIG. 4 is a graph conceptually showing the relationship among the reference prosody, the required prosody, and candidate prosodies. FIG. 5 is a graph conceptually showing the relationship between the cost and the degree of similarity between a candidate prosody and the reference prosody. FIG. 6 is a flowchart showing the speech synthesis program executed by the CPU of a speech synthesizer according to a second embodiment of the present invention. FIG. 7 is a block diagram showing an outline of the functions of a speech synthesizer according to a third embodiment of the present invention.
- the speech synthesizer 1 is an information processing apparatus.
- the speech synthesizer 1 includes a central processing unit (CPU; Central Processing Unit), a storage device (memory and a hard disk drive (HDD)), an input device, and an output device (not shown).
- the output device has a display and a speaker.
- the output device displays an image made up of characters and graphics on the display based on the image information output by the CPU.
- the output device outputs sound from the speaker based on the sound information generated by the CPU.
- the input device has a mouse, keyboard and microphone.
- the speech synthesizer 1 is configured such that information based on user operations is input via a keyboard and a mouse.
- the voice synthesizer 1 is configured such that input voice information representing the voice around the microphone (that is, outside the voice synthesizer 1) is input via the microphone.
- The functions of the speech synthesizer 1 include a language processing unit 11, a prosody estimation unit 12, a required prosody information accepting unit (required prosody information accepting means) 13, an intermediate prosody information generation unit (intermediate prosody information generating means) 14, a segment information storage unit (speech unit information storage means, speech unit information storage processing means) 15, a segment selection unit (speech unit information selecting means, cost calculating means, part of the speech synthesis means) 16, a prosody specifying unit (part of the speech synthesis means) 17, and a waveform generation unit (part of the speech synthesis means) 18.
- This function is realized by the CPU of the speech synthesizer 1 executing the speech synthesis program shown in FIG. 3 stored in the storage device.
- The segment information storage unit 15 stores in advance, in the storage device, speech unit information representing speech units generated for each speech synthesis unit, together with attribute information of each speech unit.
- the speech segment is a speech waveform divided (cut out) for each speech synthesis unit.
- the speech segment may be a linear prediction analysis parameter, a cepstrum coefficient, or the like.
- the attribute information of the speech unit includes phoneme information such as the phoneme environment, pitch frequency, amplitude, and duration of the speech that is the basis of each speech unit, and prosody information representing the prosody.
- the speech synthesis unit is a phoneme.
- the speech synthesis unit may be CV, CVC, or VCV (V is a vowel and C is a consonant).
- the prosody includes a parameter that represents the pitch (pitch) of the sound, a parameter that represents the length (time length) of the sound, and a parameter that represents the magnitude (power) of the sound.
- the language processing unit 11 receives character string information input by the user.
- the language processing unit 11 performs language analysis processing on the character string represented by the received character string information.
- the language analysis process includes a morphological analysis process, a syntax analysis process, and a reading process.
- The language processing unit 11 transmits, as the language analysis result, information representing a symbol string indicating the reading (for example, phoneme symbols) and information representing the part of speech, inflection, accent type, and the like of each morpheme to the prosody estimation unit 12 and the segment selection unit 16.
- the prosody estimation unit 12 estimates a reference prosody that is a reference prosody based on the language analysis processing result transmitted from the language processing unit 11.
- The reference prosody is a prosody set such that, when speech having the reference prosody is synthesized using the speech unit information stored in the segment information storage unit 15, the naturalness of the synthesized speech is higher than a predetermined reference value.
- In other words, speech unit information that makes the naturalness of the synthesized speech higher than the predetermined reference value is stored in the segment information storage unit 15.
- Here, the naturalness is a value representing the degree of similarity to speech uttered by a human. It can also be said that the reference prosody is a prosody estimated by performing language analysis processing on the character string represented by the character string information.
- the prosody estimation unit 12 transmits reference prosody information representing the estimated reference prosody to the intermediate prosody information generation unit 14.
- The required prosody information accepting unit 13 extracts prosodic information from the input speech information input via the microphone, and thereby accepts the extracted prosodic information as the required prosody information.
- the requested prosody information represents a requested prosody that is a prosody requested by the user. That is, the requested prosody information accepting unit 13 accepts requested prosody information indicating a requested prosody that is a prosody requested by the user.
- the requested prosodic information receiving unit 13 uses a known method used when generating attribute information of speech segments as a method of extracting prosodic information based on input speech information.
- the requested prosodic information receiving unit 13 transmits the received requested prosodic information to the intermediate prosodic information generating unit 14.
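- As one concrete illustration of how a required prosody might be extracted from recorded user speech, a pitch contour can be estimated with an off-the-shelf F0 tracker. The snippet below is only a sketch of that idea; the use of librosa's pyin tracker and the file name are assumptions for illustration, not part of the described embodiment.

```python
# Sketch: derive a required pitch pattern from a user's recorded utterance.
# Assumes librosa is installed; "user_utterance.wav" is a hypothetical file.
import numpy as np
import librosa

y, sr = librosa.load("user_utterance.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)

# Keep voiced frames only and treat them as the required pitch pattern f2(t).
required_pitch = f0[~np.isnan(f0)]
print(f"{len(required_pitch)} voiced frames, mean F0 = {required_pitch.mean():.1f} Hz")
```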
- Based on the reference prosody information transmitted from the prosody estimation unit 12 and the required prosody information transmitted from the required prosody information accepting unit 13, the intermediate prosody information generation unit 14 generates a plurality of pieces of candidate prosody information, each representing a candidate prosody, that is, a candidate for the prosody of the speech to be synthesized.
- the candidate prosodic information includes intermediate prosodic information, which will be described later, and requested prosodic information. Further, the candidate prosody information may include reference prosody information.
- the intermediate prosody information generation unit 14 transmits the generated candidate prosody information to the segment selection unit 16.
- The intermediate prosody information generation unit 14 generates intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody. At this time, the intermediate prosody information generation unit 14 generates a plurality of pieces of intermediate prosody information such that the intermediate prosodies they represent differ from one another in their degree of similarity to the reference prosody (or the required prosody).
- In general, the more similar a prosody is to the reference prosody, the higher the naturalness of speech synthesized with that prosody.
- On the other hand, a prosody that is more similar to the reference prosody is typically less similar to the required prosody, so the user's request is less likely to be satisfied. Therefore, by using a prosody between the reference prosody and the required prosody, it is possible to increase the likelihood that the user's request is satisfied while preventing the naturalness from becoming excessively low.
- the intermediate prosody in this embodiment is a value obtained by internally dividing (interpolating) the reference prosody and the required prosody.
- Suppose the prosody has K elements (K is an integer), such as pitch, duration, and power, and let the reference prosody, the required prosody, and the intermediate prosody be represented by the vectors p = (p(1), p(2), ..., p(K)) ... (1), q = (q(1), q(2), ..., q(K)) ... (2), and r = (r(1), r(2), ..., r(K)) ... (3), respectively.
- r(i) = α(i)·p(i) + (1 − α(i))·q(i) ... (4)
- where i = 1, 2, ..., K, and α(i) is a real number satisfying 0 ≤ α(i) ≤ 1.
- Consider, for example, a pitch pattern as a prosody element.
- If the pitch pattern serving as the reference prosody (the reference pitch pattern) is f1(t) and the pitch pattern serving as the required prosody (the required pitch pattern) is f2(t), a pitch pattern serving as a candidate prosody (a candidate pitch pattern) fn(t) is derived by the following equation (5).
- fn(t) = β(t)·f1(t) + (1 − β(t))·f2(t) ... (5)
- FIG. 4 is a graph showing an example of the reference pitch pattern f1 (t), the required pitch pattern f2 (t), and the candidate pitch patterns fn1 (t) to fn3 (t).
- the solid line represents the reference pitch pattern f1 (t) and the required pitch pattern f2 (t)
- the dotted line represents the candidate pitch patterns fn1 (t) to fn3 (t).
- Among the candidate pitch patterns, fn1(t) is the most similar to the reference pitch pattern f1(t).
- The candidate pitch pattern with the next highest degree of similarity to the reference pitch pattern f1(t) after fn1(t) is fn2(t), followed by fn3(t).
- the pitch pattern fn4 (t) is an example of a prosody that is not an intermediate prosody of the reference pitch pattern f1 (t) and the required pitch pattern f2 (t).
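- The interpolation of equations (4) and (5) can be made concrete with a few lines of code. This is a minimal sketch assuming that a pitch pattern is a sampled array and that a single interpolation weight is applied to the whole pattern; the patterns and weights below are illustrative values, not taken from the embodiment.

```python
# Sketch of equation (5): candidate pitch patterns interpolated between the
# reference pattern f1(t) and the required pattern f2(t). Values are illustrative.
import numpy as np

t = np.linspace(0.0, 1.0, 100)                 # normalized time axis
f1 = 120.0 + 20.0 * np.sin(2 * np.pi * t)      # reference pitch pattern (Hz)
f2 = 180.0 + 60.0 * np.sin(2 * np.pi * t + 1)  # required pitch pattern (Hz)

def candidate_pattern(beta):
    """fn(t) = beta * f1(t) + (1 - beta) * f2(t), with 0 <= beta <= 1."""
    return beta * f1 + (1.0 - beta) * f2

# Larger beta -> closer to the reference prosody (like fn1 in FIG. 4);
# smaller beta -> closer to the required prosody (like fn3 in FIG. 4).
for beta in (0.75, 0.5, 0.25):
    fn = candidate_pattern(beta)
    print(f"beta={beta}: mean pitch {fn.mean():.1f} Hz")
```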
- Candidate prosodies are generated in the same units as the processing that selects speech unit information (for example, for each breath group delimited by punctuation marks such as periods and commas) so that the speech unit information described later can be selected easily.
- However, candidate prosodies need not be generated in the same units as the processing that selects speech unit information.
- For example, prosodies that differ in their degree of similarity to the reference prosody on a per-accent-phrase basis may be generated as candidate prosodies.
- Based on the candidate prosody information transmitted from the intermediate prosody information generation unit 14, the language analysis result transmitted from the language processing unit 11, and the speech unit information stored in the segment information storage unit 15, the segment selection unit 16 selects, for each candidate prosody represented by the candidate prosody information, speech unit information corresponding to that candidate prosody from the stored speech unit information.
- the segment selection unit 16 performs the following processing for each candidate prosody.
- the segment selection unit 16 obtains information (target segment environment) representing the characteristics of the synthesized speech (synthesized speech) for each speech synthesis unit based on the language analysis processing result and the candidate prosody.
- The target segment environment includes the corresponding, preceding, and following phonemes, presence or absence of stress, distance from the accent nucleus, pitch frequency for each speech synthesis unit, power, unit duration, cepstrum, MFCC (Mel Frequency Cepstral Coefficients), and their delta values (amount of change per unit time).
- the unit selection unit 16 selects speech unit information representing a speech unit having a phoneme corresponding to (for example, matching) specific information (mainly corresponding phoneme) included in the target unit environment.
- the segment selection unit 16 calculates the cost based on the selected speech segment information.
- the cost is an index indicating the appropriateness as speech unit information used for synthesizing speech. That is, the cost is a value that changes according to the naturalness of the speech when the speech having the candidate prosody is synthesized.
- The cost includes a parameter indicating the degree of difference between the segment environment of the stored speech unit information and the target segment environment, and a parameter indicating the degree of difference in segment environment between speech units to be connected.
- the cost increases as the degree of difference between the segment environment of the stored speech segment information and the target segment environment increases. Furthermore, the cost increases as the degree of difference in the segment environment between connected speech segments increases. That is, it can be said that the cost is a value that increases as the degree to which the natural level is lower than the reference value increases.
- the cost is calculated using the target segment environment, the pitch frequency at the segment connection boundary, the cepstrum, the MFCC, the short-time autocorrelation, the power, and the ⁇ amount (time variation amount). Details of the cost are disclosed in Japanese Patent Application Laid-Open No. 2006-84854, Japanese Patent Application Laid-Open No. 2005-91551, and the like, and are omitted in this specification.
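- A simplified cost of the kind described above can be written as the sum of a target term (mismatch between a unit and the target segment environment) and a concatenation term (mismatch at the boundary between adjacent units). The following sketch uses hypothetical features and weights; the actual cost definitions are those of the publications cited above.

```python
# Sketch of a unit-selection cost: target mismatch + concatenation mismatch.
# Features and weights are illustrative, not those of the cited publications.
import numpy as np

def target_cost(unit, target, w_pitch=1.0, w_dur=1.0, w_pow=0.5):
    return (w_pitch * abs(unit["pitch"] - target["pitch"])
            + w_dur * abs(unit["duration"] - target["duration"])
            + w_pow * abs(unit["power"] - target["power"]))

def concat_cost(prev_unit, unit):
    # Spectral discontinuity at the join, e.g. distance between boundary cepstra.
    return float(np.linalg.norm(np.asarray(prev_unit["cepstrum_end"])
                                - np.asarray(unit["cepstrum_start"])))

def sequence_cost(units, targets):
    total = sum(target_cost(u, t) for u, t in zip(units, targets))
    total += sum(concat_cost(a, b) for a, b in zip(units, units[1:]))
    return total  # a lower total cost corresponds to higher expected naturalness

units = [{"pitch": 118.0, "duration": 0.10, "power": 0.80,
          "cepstrum_end": [1.0, 0.2], "cepstrum_start": [0.9, 0.1]},
         {"pitch": 122.0, "duration": 0.11, "power": 0.70,
          "cepstrum_end": [0.8, 0.3], "cepstrum_start": [1.1, 0.2]}]
targets = [{"pitch": 120.0, "duration": 0.10, "power": 0.80},
           {"pitch": 121.0, "duration": 0.10, "power": 0.75}]
print(f"sequence cost: {sequence_cost(units, targets):.3f}")
```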
- the segment selection unit 16 selects speech unit information with the smallest calculated cost as the speech unit information corresponding to the candidate prosody from the selected speech unit information.
- the unit selection unit 16 selects speech unit information corresponding to the candidate prosody from the stored speech unit information for each candidate prosody.
- The segment selection unit 16 transmits the selected speech unit information and the cost calculated from that speech unit information, together with the candidate prosody information representing the candidate prosody, to the prosody specifying unit 17.
- The speech unit information selected for each candidate prosody usually differs between candidate prosodies, but it may be the same.
- For example, when the candidate prosodies generated by the intermediate prosody information generation unit 14 are similar to one another, or when the amount of speech unit information stored in the segment information storage unit 15 is small, there is a high possibility that the same speech unit information will be selected for each candidate prosody.
- the prosodic identification unit 17 identifies one of the candidate prosody based on the cost, speech segment information, and candidate prosody information transmitted from the segment selection unit 16.
- the prosody specifying unit 17 specifies the candidate prosody as close as possible to the required prosody as long as the naturalness of the synthesized speech satisfies a preset tolerance level.
- the prosody specifying unit 17 specifies a candidate prosody having the highest degree of similarity to the requested prosody among candidate prosody having a calculated cost smaller than a predetermined threshold.
- the prosody specifying unit 17 specifies the candidate prosody having the largest degree of similarity to the reference prosody when there is no candidate prosody having a cost smaller than the threshold.
- In FIG. 5, the vertical axis represents the cost, and the horizontal axis represents the similarity of the candidate prosody to the reference prosody (the degree of similarity between the candidate prosody and the reference prosody, that is, α in Equation (4)).
- In many cases, the cost decreases as the candidate prosody becomes more similar to the reference prosody (that is, the cost decreases monotonically).
- However, the cost does not necessarily decrease monotonically as the degree of similarity between the candidate prosody and the reference prosody increases.
- In this embodiment, the threshold is a preset (constant) value.
- Alternatively, the threshold may be set based on the costs transmitted from the segment selection unit 16, which allows the threshold to be set appropriately. For example, letting Smax be the maximum value and Smin the minimum value of the calculated costs, the threshold Th may be set as Th = Smax − c·(Smax − Smin) ... (6),
- where c is a real number satisfying 0 ≤ c ≤ 1. When the prosody specifying unit 17 recognizes that the reference prosody is used as a candidate prosody, the cost calculated for that candidate prosody may be used as the minimum value Smin. Similarly, when the prosody specifying unit 17 recognizes that the required prosody is used as a candidate prosody, the cost calculated for that candidate prosody may be used as the maximum value Smax.
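- Equation (6) and the selection rule of the prosody specifying unit 17 can be sketched as follows. The data structure holding each candidate's cost and its interpolation weight α is an assumption made for illustration only.

```python
# Sketch: set the threshold Th = Smax - c*(Smax - Smin) (equation (6)) and pick,
# among candidates whose cost is below Th, the one most similar to the required
# prosody; otherwise fall back to the candidate closest to the reference prosody.
def specify_candidate(candidates, c=0.5):
    """candidates: dicts with 'cost' and 'alpha' (similarity to the reference
    prosody; a smaller alpha means greater similarity to the required prosody)."""
    costs = [cand["cost"] for cand in candidates]
    s_max, s_min = max(costs), min(costs)
    threshold = s_max - c * (s_max - s_min)

    passing = [cand for cand in candidates if cand["cost"] < threshold]
    if passing:
        # Most similar to the required prosody = smallest alpha.
        return min(passing, key=lambda cand: cand["alpha"])
    # No candidate is natural enough: keep the one closest to the reference prosody.
    return max(candidates, key=lambda cand: cand["alpha"])

candidates = [{"alpha": 1.0, "cost": 2.0},
              {"alpha": 0.5, "cost": 3.5},
              {"alpha": 0.0, "cost": 6.0}]
print(specify_candidate(candidates))  # -> {'alpha': 0.5, 'cost': 3.5} when c = 0.5
```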
- the prosody specifying unit 17 transmits the specified candidate prosody information and the speech unit information transmitted together with the candidate prosody information to the waveform generation unit 18.
- The waveform generation unit 18 generates speech waveforms such that the prosody of the speech units represented by the speech unit information becomes the prosody represented by the candidate prosody information, and outputs the waveform obtained by concatenating the generated waveforms as the synthesized speech. That is, the waveform generation unit 18 performs speech synthesis processing that synthesizes speech having the candidate prosody specified by the prosody specifying unit 17.
- the CPU of the speech synthesizer 1 is configured to execute the speech synthesis program shown by the flowchart in FIG. 3 in response to an activation instruction input by the user.
- the CPU waits until character string information is input by the user in step 305.
- the CPU receives the input character string information and performs language analysis processing on the character string represented by the received character string information. Then, the CPU outputs the language analysis processing result (step A1).
- the CPU estimates a reference prosody based on the output language analysis processing result, and outputs reference prosody information representing the estimated reference prosody (step A2).
- the CPU waits until input voice information is input by the user.
- When the user inputs speech, the CPU receives the input speech information and extracts required prosody information based on the received input speech information (step A3, the required prosody information accepting step).
- the CPU generates a plurality of candidate prosody information representing candidate prosody that is a candidate for the prosody of the synthesized speech based on the output reference prosodic information and the extracted required prosodic information (step A4, Intermediate prosodic information generation process).
- Next, for each candidate prosody represented by the candidate prosody information, the CPU selects speech unit information corresponding to that candidate prosody from the stored speech unit information.
- Specifically, for each candidate prosody, the CPU selects speech unit information representing speech units having phonemes corresponding to specific information included in the target segment environment, and calculates a cost based on the selected speech unit information (the cost calculation step). Then, from the selected speech unit information, the CPU selects the speech unit information with the smallest calculated cost as the speech unit information corresponding to that candidate prosody (step A5, the speech unit information selection step).
- the CPU specifies the candidate prosody having the highest degree of similarity to the requested prosody among candidate prosody whose calculated cost is smaller than a predetermined threshold (step A6). Then, the CPU generates a speech waveform such that the prosody of the speech unit represented by the speech unit information selected according to the identified candidate prosody is the identified candidate prosody. Next, the CPU outputs a voice waveform obtained by connecting the generated voice waveforms as synthesized voice from the speaker (step A7, voice synthesis step).
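- Steps A1 to A7 can be summarized as a single end-to-end flow. Every helper in the sketch below is a toy stand-in for the corresponding unit described above (language analysis, prosody estimation and extraction, interpolation, cost calculation, candidate selection, waveform generation), not a real implementation.

```python
# Minimal end-to-end sketch of steps A1-A7; all helpers are toy placeholders.
def analyze_language(text):          # A1: language analysis (toy)
    return {"phonemes": list(text)}

def estimate_reference_prosody(_):   # A2: reference prosody (toy constant pitch)
    return 120.0

def extract_required_prosody(_):     # A3: required prosody from user speech (toy)
    return 200.0

def interpolate(p, q, alpha):        # A4: equation (4) for a single pitch value
    return alpha * p + (1.0 - alpha) * q

def unit_cost(prosody, reference):   # A5: toy cost that grows away from the reference
    return abs(prosody - reference)

def synthesize(text, user_audio, threshold=50.0):
    analysis = analyze_language(text)
    reference = estimate_reference_prosody(analysis)
    required = extract_required_prosody(user_audio)
    candidates = [interpolate(reference, required, a) for a in (1.0, 0.75, 0.5, 0.25, 0.0)]
    passing = [c for c in candidates if unit_cost(c, reference) < threshold]    # A6
    chosen = min(passing, key=lambda c: abs(c - required)) if passing else reference
    return f"waveform with pitch {chosen:.1f} Hz"                               # A7 (toy)

print(synthesize("konnichiwa", user_audio=None))  # -> "waveform with pitch 160.0 Hz"
```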
- As described above, the speech synthesizer 1 according to the first embodiment is configured to synthesize speech based on an intermediate prosody, which is a prosody between the reference prosody and the required prosody.
- the naturalness of synthesized speech can be made higher than when speech having the required prosody is synthesized. That is, the required prosody can be reflected in the synthesized speech while preventing the naturalness of the synthesized speech from becoming excessively low.
- the candidate prosody used for synthesizing the speech is determined based on the cost that changes according to the naturalness. Therefore, it is possible to reliably prevent the naturalness from becoming excessively low.
- the first embodiment it is possible to synthesize a speech having a prosody that is most similar (closest) to the required prosody within a sufficiently natural range. Therefore, it is possible to increase the degree to which the required prosody is reflected in the synthesized speech while preventing the naturalness of the synthesized speech from becoming excessively low. As a result, the possibility that the user's request is satisfied can be increased.
- the speech synthesizer 1 may be configured to generate a plurality of intermediate prosodic information in parallel.
- the speech synthesizer 1 may include a plurality of circuit units for generating one intermediate prosodic information.
- the CPU of the speech synthesizer 1 may perform parallel processing.
- The speech synthesizer according to the second embodiment differs from the speech synthesizer according to the first embodiment in that it calculates costs in order starting from the candidate prosody most similar to the required prosody, and performs the speech synthesis processing using the first candidate prosody whose calculated cost falls below the threshold. Accordingly, the following description focuses on this difference.
- Specifically, the segment selection unit 16 generates (acquires) candidate prosodies one at a time, in descending order of similarity to the required prosody, and calculates the cost for each. When the calculated cost becomes smaller than the threshold, the prosody specifying unit 17 specifies the candidate prosody from which that cost was calculated.
- the CPU of the speech synthesizer 1 according to the second embodiment executes the speech synthesis program shown in FIG. 6 instead of the speech synthesis program of FIG.
- the CPU executes steps A1 to A3 as in the first embodiment.
- Next, the CPU generates only one piece of candidate prosody information (step B4).
- At this time, the CPU generates the candidate prosody information such that the degree of similarity between the candidate prosody it represents and the required prosody decreases (becomes lower) each time step B4 is executed, that is, starting with the candidate prosody most similar to the required prosody.
- Next, based on the generated candidate prosody information, the output language analysis result, and the speech unit information stored in the storage device, the CPU selects speech unit information corresponding to the candidate prosody represented by that candidate prosody information from the stored speech unit information.
- Specifically, the CPU selects speech unit information representing speech units having phonemes corresponding to specific information included in the target segment environment, and calculates a cost based on the selected speech unit information. Then, from the selected speech unit information, the CPU selects the speech unit information with the smallest calculated cost as the speech unit information corresponding to the candidate prosody (step B5).
- The CPU then determines whether the cost calculated for the selected speech unit information is smaller than the threshold (step B6). Suppose for now that the calculated cost is larger than the threshold. In this case, the CPU makes a "No" determination at step B6, returns to step B4, and repeats the processing from step B4 to step B6.
- When the calculated cost becomes smaller than the threshold, the CPU makes a "Yes" determination at step B6 and proceeds to step A7. The CPU then generates speech waveforms such that the prosody of the speech units represented by the speech unit information selected for the most recently generated candidate prosody becomes that candidate prosody. Next, the CPU outputs the waveform obtained by connecting the generated waveforms from the speaker as synthesized speech (step A7).
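- The loop of steps B4 to B6 amounts to an early-stopping search, sketched below. Candidate generation and the cost function are toy placeholders; the point is only that generation stops at the first candidate whose cost drops below the threshold.

```python
# Sketch of the second embodiment (steps B4-B6): generate candidates one at a
# time, starting from the one most similar to the required prosody, and stop at
# the first candidate whose cost falls below the threshold. Numbers are illustrative.
def early_stopping_search(reference, required, threshold, cost, steps=5):
    for k in range(steps + 1):
        alpha = k / steps                      # B4: similarity to the reference grows
        candidate = alpha * reference + (1.0 - alpha) * required
        c = cost(candidate)                    # B5: cost for this candidate only
        if c < threshold:                      # B6: first passing candidate wins
            return candidate, c
    return reference, cost(reference)          # fallback: the reference prosody

toy_cost = lambda pitch: abs(pitch - 120.0)    # lower when closer to the reference
print(early_stopping_search(reference=120.0, required=200.0, threshold=50.0, cost=toy_cost))
# -> (168.0, 48.0): only three costs were computed before stopping
```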
- According to the second embodiment, the same operations and effects as those of the first embodiment can be achieved. Furthermore, the second embodiment makes it possible to avoid calculating costs wastefully. As a result, the processing load on the speech synthesizer 1 for calculating the costs can be reduced.
- The functions of the speech synthesizer 100 according to the third embodiment include a required prosody information accepting unit 113, an intermediate prosody information generation unit 114, a speech unit information storage unit 115, and a speech synthesis unit 116.
- The speech unit information storage unit 115 stores speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value.
- the requested prosody information accepting unit 113 accepts requested prosody information indicating a requested prosody that is a prosody requested by the user.
- the intermediate prosody information generation unit 114 generates intermediate prosody information representing an intermediate prosody that is a prosody between the reference prosody and the required prosody.
- The speech synthesis unit 116 performs speech synthesis processing that synthesizes speech based on the intermediate prosody information generated by the intermediate prosody information generation unit 114 and the speech unit information stored in the speech unit information storage unit 115.
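- The third embodiment is the minimal configuration: four components connected in sequence. The class below is a schematic composition under that reading; the injected callables are placeholders for the means 113 to 116, not an actual implementation.

```python
# Schematic composition of the third embodiment; the injected callables are
# placeholders for the units 113-116 described above.
class SpeechSynthesizer100:
    def __init__(self, accept_required, generate_intermediate, unit_store, synthesize):
        self.accept_required = accept_required                # unit 113
        self.generate_intermediate = generate_intermediate    # unit 114
        self.unit_store = unit_store                          # unit 115 (stored unit info)
        self.synthesize = synthesize                          # unit 116

    def run(self, reference_prosody, user_input):
        required = self.accept_required(user_input)
        intermediate = self.generate_intermediate(reference_prosody, required)
        return self.synthesize(intermediate, self.unit_store)
```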
- the naturalness of the synthesized speech can be made higher than when the speech having the required prosody is synthesized. That is, the required prosody can be reflected in the synthesized speech while preventing the naturalness of the synthesized speech from becoming excessively low.
- Preferably, the speech synthesis means includes: speech unit information selecting means for selecting, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; and cost calculating means for calculating, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and is configured to specify one of the candidate prosodies based on the calculated costs and to perform the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
- the candidate prosody used for synthesizing the speech is determined based on the cost that changes in accordance with the naturalness. Therefore, it is possible to reliably prevent the naturalness from becoming excessively low.
- Preferably, the cost is a value that increases as the degree by which the naturalness falls below the reference value increases,
- and the speech synthesis means is configured to specify, among the candidate prosodies whose calculated cost is smaller than a predetermined threshold, the candidate prosody most similar to the required prosody.
- the speech synthesizer is configured to set the threshold value based on the calculated maximum cost value and the calculated minimum cost value. According to this, the threshold value can be set appropriately.
- Preferably, the cost calculating means is configured to acquire the candidate prosodies one at a time, in descending order of similarity to the required prosody, and to calculate the cost for each acquired candidate prosody,
- and, when the calculated cost becomes smaller than the threshold, the speech synthesis means specifies the candidate prosody from which that cost was calculated and performs the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
- the prosody that has a high degree of similarity to the required prosody is more likely to have a higher cost. Therefore, according to the above configuration, it is possible to prevent the cost from being calculated wastefully. As a result, the processing load for the speech synthesizer to calculate the cost can be reduced.
- the reference prosody is preferably a prosody estimated by performing language analysis processing on a character string.
- In the speech synthesizer, each of the reference prosody and the required prosody preferably includes at least one of a parameter representing pitch, a parameter representing duration, and a parameter representing loudness.
- A speech synthesis method according to the present invention is a method in which, when speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value, is stored in a storage device, the method accepts required prosody information representing a required prosody, which is a prosody requested by a user, generates intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody, and performs speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
- Preferably, the speech synthesis method selects, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; calculates, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and specifies one of the candidate prosodies based on the calculated costs and performs the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
- Preferably, the cost is a value that increases as the degree by which the naturalness falls below the reference value increases, and the candidate prosody most similar to the required prosody is specified among the candidate prosodies whose calculated cost is smaller than a predetermined threshold.
- A speech synthesis program according to the present invention is a program for causing an information processing device to realize:
- speech unit information storage processing means for storing, in a storage device, speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value;
- required prosody information accepting means for accepting required prosody information representing a required prosody, which is a prosody requested by a user;
- intermediate prosody information generating means for generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and
- speech synthesis means for performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
- Preferably, the speech synthesis means includes: speech unit information selecting means for selecting, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; and cost calculating means for calculating, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and is configured to specify one of the candidate prosodies based on the calculated costs and to perform the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
- Preferably, the cost is a value that increases as the degree by which the naturalness falls below the reference value increases,
- and the speech synthesis means is configured to specify, among the candidate prosodies whose calculated cost is smaller than a predetermined threshold, the candidate prosody most similar to the required prosody.
- In the above embodiments, the required prosody information is based on speech uttered by the user, but it may instead be based on information input by the user with an input device (such as a keyboard and mouse). For example, information obtained by the user editing prosodic information stored in the speech synthesizer 1 may be used as the required prosody information.
- the program is stored in the storage device, but may be stored in a computer-readable recording medium.
- the recording medium is a portable medium such as a flexible disk, an optical disk, a magneto-optical disk, and a semiconductor memory.
- the present invention is applicable to a speech synthesizer that performs speech synthesis processing for synthesizing speech representing a character string.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Navigation (AREA)
- Machine Translation (AREA)
Abstract
Description
A speech synthesizer according to the present invention comprises:
speech unit information storage means for storing speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value;
required prosody information accepting means for accepting required prosody information representing a required prosody, which is a prosody requested by a user;
intermediate prosody information generating means for generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and
speech synthesis means for performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
A speech synthesis method according to the present invention is a method in which, when speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value, is stored in a storage device: required prosody information representing a required prosody, which is a prosody requested by a user, is accepted; intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody, is generated; and speech synthesis processing that synthesizes speech is performed based on the generated intermediate prosody information and the stored speech unit information.
A speech synthesis program according to the present invention is a program for causing an information processing device to realize: speech unit information storage processing means for storing, in a storage device, speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value; required prosody information accepting means for accepting required prosody information representing a required prosody, which is a prosody requested by a user; intermediate prosody information generating means for generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and speech synthesis means for performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
(Configuration)
As shown in FIG. 2, the speech synthesizer 1 according to the first embodiment is an information processing apparatus. The speech synthesizer 1 includes a central processing unit (CPU), a storage device (a memory and a hard disk drive (HDD)), an input device, and an output device (none of which are shown).
Next, the functions of the speech synthesizer 1 configured as described above will be described.
The functions of the speech synthesizer 1 include a language processing unit 11, a prosody estimation unit 12, a required prosody information accepting unit (required prosody information accepting means) 13, an intermediate prosody information generation unit (intermediate prosody information generating means) 14, a segment information storage unit (speech unit information storage means, speech unit information storage processing step, speech unit information storage processing means) 15, a segment selection unit (speech unit information selecting means, cost calculating means, part of the speech synthesis means) 16, a prosody specifying unit (part of the speech synthesis means) 17, and a waveform generation unit (part of the speech synthesis means) 18. These functions are realized by the CPU of the speech synthesizer 1 executing the speech synthesis program shown in FIG. 3, which is stored in the storage device.
The prosody estimation unit 12 transmits reference prosody information representing the estimated reference prosody to the intermediate prosody information generation unit 14.
The required prosody information accepting unit 13 transmits the accepted required prosody information to the intermediate prosody information generation unit 14.
p=(p(1),p(2),…,p(K)) …(1)
q=(q(1),q(2),…,q(K)) …(2)
r=(r(1),r(2),…,r(K)) …(3)
r(i)=α(i)・p(i)+(1-α(i))・q(i) …(4)
If the pitch pattern serving as the reference prosody (the reference pitch pattern) is f1(t) and the pitch pattern serving as the required prosody (the required pitch pattern) is f2(t), the pitch pattern serving as a candidate prosody (a candidate pitch pattern) fn(t) is derived by the following equation (5).
fn(t)=β(t)・f1(t)+(1-β(t))・f2(t) …(5)
FIG. 4 is a graph showing an example of the reference pitch pattern f1(t), the required pitch pattern f2(t), and the candidate pitch patterns fn1(t) to fn3(t). The solid lines represent the reference pitch pattern f1(t) and the required pitch pattern f2(t), and the dotted lines represent the candidate pitch patterns fn1(t) to fn3(t).
Based on the language analysis result and the candidate prosody, the segment selection unit 16 obtains, for each speech synthesis unit, information representing the characteristics of the speech to be synthesized (the target segment environment). The target segment environment includes the corresponding, preceding, and following phonemes, presence or absence of stress, distance from the accent nucleus, pitch frequency for each speech synthesis unit, power, unit duration, cepstrum, MFCC (Mel Frequency Cepstral Coefficients), and their delta values (amount of change per unit time). The segment selection unit 16 selects speech unit information representing speech units having phonemes that correspond to (for example, match) specific information (mainly the corresponding phoneme) included in the target segment environment.
Th=Smax-c・(Smax-Smin) …(6)
Next, the operation of the speech synthesizer 1 described above will be described concretely.
The CPU of the speech synthesizer 1 executes the speech synthesis program shown by the flowchart in FIG. 3 in response to a start instruction input by the user.
Next, a speech synthesizer according to a second embodiment of the present invention will be described. The speech synthesizer according to the second embodiment differs from the speech synthesizer according to the first embodiment in that it calculates costs in order starting from the candidate prosody most similar to the required prosody, and performs the speech synthesis processing using the first candidate prosody whose calculated cost falls below the threshold. Accordingly, the following description focuses on this difference.
Furthermore, when the calculated cost becomes smaller than the threshold, the prosody specifying unit 17 specifies the candidate prosody from which that cost was calculated.
Suppose for now that the calculated cost is larger than the threshold. In this case, the CPU makes a "No" determination at step B6, returns to step B4, and repeats the processing from step B4 to step B6.
Next, a speech synthesizer according to a third embodiment of the present invention will be described with reference to FIG. 7.
The functions of the speech synthesizer 100 according to the third embodiment include a required prosody information accepting unit 113, an intermediate prosody information generation unit 114, a speech unit information storage unit 115, and a speech synthesis unit 116.
The intermediate prosody information generation unit 114 generates intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody.
Preferably, the speech synthesis means includes: speech unit information selecting means for selecting, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; and cost calculating means for calculating, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and is configured to specify one of the candidate prosodies based on the calculated costs and to perform the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
Preferably, the cost is a value that increases as the degree by which the naturalness falls below the reference value increases, and the speech synthesis means is configured to specify, among the candidate prosodies whose calculated cost is smaller than a predetermined threshold, the candidate prosody most similar to the required prosody.
With this configuration, the threshold can be set appropriately.
Preferably, the cost calculating means is configured to acquire the candidate prosodies one at a time, in descending order of similarity to the required prosody, and to calculate the cost for each acquired candidate prosody, and the speech synthesis means is configured, when the calculated cost becomes smaller than the threshold, to specify the candidate prosody from which that cost was calculated and to perform the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
Preferably, the reference prosody is a prosody estimated by performing language analysis processing on a character string.
Preferably, each of the reference prosody and the required prosody includes at least one of a parameter representing pitch, a parameter representing duration, and a parameter representing loudness.
A speech synthesis method according to the present invention is a method in which, when speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value, is stored in a storage device: required prosody information representing a required prosody, which is a prosody requested by a user, is accepted; intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody, is generated; and speech synthesis processing that synthesizes speech is performed based on the generated intermediate prosody information and the stored speech unit information.
Preferably, the method selects, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; calculates, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and specifies one of the candidate prosodies based on the calculated costs and performs the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
Preferably, the candidate prosody most similar to the required prosody is specified among the candidate prosodies whose calculated cost is smaller than a predetermined threshold.
A speech synthesis program according to the present invention is a program for causing an information processing device to realize: speech unit information storage processing means for storing, in a storage device, speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value; required prosody information accepting means for accepting required prosody information representing a required prosody, which is a prosody requested by a user; intermediate prosody information generating means for generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and speech synthesis means for performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
Preferably, the speech synthesis means includes: speech unit information selecting means for selecting, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; and cost calculating means for calculating, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and is configured to specify one of the candidate prosodies based on the calculated costs and to perform the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
Preferably, the cost is a value that increases as the degree by which the naturalness falls below the reference value increases, and the speech synthesis means is configured to specify, among the candidate prosodies whose calculated cost is smaller than a predetermined threshold, the candidate prosody most similar to the required prosody.
11 Language processing unit
12 Prosody estimation unit
13 Required prosody information accepting unit
14 Intermediate prosody information generation unit
15 Segment information storage unit
16 Segment selection unit
17 Prosody specifying unit
18 Waveform generation unit
100 Speech synthesizer
113 Required prosody information accepting unit
114 Intermediate prosody information generation unit
115 Speech unit information storage unit
116 Speech synthesis unit
901 Language processing unit
902 Prosody estimation unit
905 Segment information storage unit
906 Segment selection unit
908 Waveform generation unit
Claims (13)
- 1. A speech synthesizer comprising: speech unit information storage means for storing speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value; required prosody information accepting means for accepting required prosody information representing a required prosody, which is a prosody requested by a user; intermediate prosody information generating means for generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and speech synthesis means for performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
- 2. The speech synthesizer according to claim 1, wherein the speech synthesis means includes: speech unit information selecting means for selecting, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; and cost calculating means for calculating, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and is configured to specify one of the candidate prosodies based on the calculated costs and to perform the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
- 3. The speech synthesizer according to claim 2, wherein the cost is a value that increases as the degree by which the naturalness falls below the reference value increases, and the speech synthesis means is configured to specify, among the candidate prosodies whose calculated cost is smaller than a predetermined threshold, the candidate prosody most similar to the required prosody.
- 4. The speech synthesizer according to claim 3, wherein the speech synthesis means is configured to set the threshold based on the maximum value and the minimum value of the calculated costs.
- 5. The speech synthesizer according to claim 3 or claim 4, wherein the cost calculating means is configured to acquire the candidate prosodies one at a time, in descending order of similarity to the required prosody, and to calculate the cost for each acquired candidate prosody, and the speech synthesis means is configured, when the calculated cost becomes smaller than the threshold, to specify the candidate prosody from which that cost was calculated and to perform the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
- 6. The speech synthesizer according to any one of claims 1 to 5, wherein the reference prosody is a prosody estimated by performing language analysis processing on a character string.
- 7. The speech synthesizer according to any one of claims 1 to 6, wherein each of the reference prosody and the required prosody includes at least one of a parameter representing pitch, a parameter representing duration, and a parameter representing loudness.
- 8. A speech synthesis method in which, when speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value, is stored in a storage device, the method comprises: accepting required prosody information representing a required prosody, which is a prosody requested by a user; generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
- 9. The speech synthesis method according to claim 8, comprising: selecting, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; calculating, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and specifying one of the candidate prosodies based on the calculated costs and performing the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
- 10. The speech synthesis method according to claim 9, wherein the cost is a value that increases as the degree by which the naturalness falls below the reference value increases, and the candidate prosody most similar to the required prosody is specified among the candidate prosodies whose calculated cost is smaller than a predetermined threshold.
- 11. A speech synthesis program for causing an information processing device to realize: speech unit information storage processing means for storing, in a storage device, speech unit information representing speech units which, when used to synthesize speech having a reference prosody (a prosody serving as a reference), can synthesize speech whose naturalness, representing the degree of similarity to speech uttered by a human, is higher than a predetermined reference value; required prosody information accepting means for accepting required prosody information representing a required prosody, which is a prosody requested by a user; intermediate prosody information generating means for generating intermediate prosody information representing an intermediate prosody, which is a prosody between the reference prosody and the required prosody; and speech synthesis means for performing speech synthesis processing that synthesizes speech based on the generated intermediate prosody information and the stored speech unit information.
- 12. The speech synthesis program according to claim 11, wherein the speech synthesis means includes: speech unit information selecting means for selecting, for each candidate prosody including the intermediate prosody, speech unit information corresponding to that candidate prosody from the stored speech unit information; and cost calculating means for calculating, for each candidate prosody and based on the selected speech unit information, a cost that varies according to the naturalness of the speech that would result from synthesizing speech having that candidate prosody; and is configured to specify one of the candidate prosodies based on the calculated costs and to perform the speech synthesis processing that synthesizes speech having the specified candidate prosody based on the speech unit information selected for the specified candidate prosody.
- 13. The speech synthesis program according to claim 12, wherein the cost is a value that increases as the degree by which the naturalness falls below the reference value increases, and the speech synthesis means is configured to specify, among the candidate prosodies whose calculated cost is smaller than a predetermined threshold, the candidate prosody most similar to the required prosody.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010535626A JPWO2010050103A1 (ja) | 2008-10-28 | 2009-08-21 | 音声合成装置 |
US13/125,507 US20110196680A1 (en) | 2008-10-28 | 2009-08-21 | Speech synthesis system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008276654 | 2008-10-28 | ||
JP2008-276654 | 2008-10-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010050103A1 true WO2010050103A1 (ja) | 2010-05-06 |
Family
ID=42128477
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/004004 WO2010050103A1 (ja) | 2008-10-28 | 2009-08-21 | 音声合成装置 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20110196680A1 (ja) |
JP (1) | JPWO2010050103A1 (ja) |
WO (1) | WO2010050103A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103137124A (zh) * | 2013-02-04 | 2013-06-05 | 武汉今视道电子信息科技有限公司 | 一种语音合成方法 |
JP2014038208A (ja) * | 2012-08-16 | 2014-02-27 | Toshiba Corp | 音声合成装置、方法及びプログラム |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108040032A (zh) * | 2017-11-02 | 2018-05-15 | 阿里巴巴集团控股有限公司 | 一种声纹认证方法、账号注册方法及装置 |
KR102637341B1 (ko) * | 2019-10-15 | 2024-02-16 | 삼성전자주식회사 | 음성 생성 방법 및 장치 |
US11984124B2 (en) * | 2020-11-13 | 2024-05-14 | Apple Inc. | Speculative task flow execution |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10153998A (ja) * | 1996-09-24 | 1998-06-09 | Nippon Telegr & Teleph Corp <Ntt> | 補助情報利用型音声合成方法、この方法を実施する手順を記録した記録媒体、およびこの方法を実施する装置 |
JPH11175082A (ja) * | 1997-12-10 | 1999-07-02 | Toshiba Corp | 音声対話装置及び音声対話用音声合成方法 |
JPH11259094A (ja) * | 1998-03-10 | 1999-09-24 | Hitachi Ltd | 規則音声合成装置 |
JP2002258885A (ja) * | 2001-02-27 | 2002-09-11 | Sharp Corp | テキスト音声合成装置およびプログラム記録媒体 |
JP2008015424A (ja) * | 2006-07-10 | 2008-01-24 | Nippon Telegr & Teleph Corp <Ntt> | 様式指定型音声合成方法、及び様式指定型音声合成装置とそのプログラムと、その記憶媒体 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4680429B2 (ja) * | 2001-06-26 | 2011-05-11 | Okiセミコンダクタ株式会社 | テキスト音声変換装置における高速読上げ制御方法 |
US8583438B2 (en) * | 2007-09-20 | 2013-11-12 | Microsoft Corporation | Unnatural prosody detection in speech synthesis |
- 2009-08-21: US application 13/125,507 filed (published as US 2011/0196680 A1, abandoned)
- 2009-08-21: PCT application PCT/JP2009/004004 filed (published as WO 2010/050103 A1)
- 2009-08-21: JP application 2010-535626 filed (published as JP WO2010050103 A1, pending)
Also Published As
Publication number | Publication date |
---|---|
US20110196680A1 (en) | 2011-08-11 |
JPWO2010050103A1 (ja) | 2012-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP3913770B2 (ja) | 音声合成装置および方法 | |
JP4246792B2 (ja) | 声質変換装置および声質変換方法 | |
JP4738057B2 (ja) | ピッチパターン生成方法及びその装置 | |
EP3065130B1 (en) | Voice synthesis | |
JP4829477B2 (ja) | 声質変換装置および声質変換方法ならびに声質変換プログラム | |
JP2006309162A (ja) | ピッチパターン生成方法、ピッチパターン生成装置及びプログラム | |
WO2010050103A1 (ja) | 音声合成装置 | |
US11646044B2 (en) | Sound processing method, sound processing apparatus, and recording medium | |
JP6013104B2 (ja) | 音声合成方法、装置、及びプログラム | |
JP6271748B2 (ja) | 音声処理装置、音声処理方法及びプログラム | |
JP5726822B2 (ja) | 音声合成装置、方法及びプログラム | |
WO2012160767A1 (ja) | 素片情報生成装置、音声合成装置、音声合成方法および音声合成プログラム | |
JP5874639B2 (ja) | 音声合成装置、音声合成方法及び音声合成プログラム | |
JP5375612B2 (ja) | 周波数軸伸縮係数推定装置とシステム方法並びにプログラム | |
JP2011141470A (ja) | 素片情報生成装置、音声合成システム、音声合成方法、及び、プログラム | |
KR20100111544A (ko) | 음성인식을 이용한 발음 교정 시스템 및 그 방법 | |
JP2006084854A (ja) | 音声合成装置、音声合成方法および音声合成プログラム | |
JP7106897B2 (ja) | 音声処理方法、音声処理装置およびプログラム | |
JP7200483B2 (ja) | 音声処理方法、音声処理装置およびプログラム | |
JP2018004997A (ja) | 音声合成装置及びプログラム | |
JP2004054063A (ja) | 基本周波数パターン生成方法、基本周波数パターン生成装置、音声合成装置、基本周波数パターン生成プログラムおよび音声合成プログラム | |
Hirose | Use of generation process model for improved control of fundamental frequency contours in HMM-based speech synthesis | |
JP2015219430A (ja) | 音声合成装置、その方法及びプログラム | |
JP2008275698A (ja) | 所望のイントネーションを備えた音声信号を生成するための音声合成装置 | |
WO2014017024A1 (ja) | 音声合成装置、音声合成方法、及び音声合成プログラム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09823220 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13125507 Country of ref document: US |
|
ENP | Entry into the national phase |
Ref document number: 2010535626 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09823220 Country of ref document: EP Kind code of ref document: A1 |