WO2017197809A1 - Speech synthesis method and speech synthesis device - Google Patents

Speech synthesis method and speech synthesis device

Info

Publication number
WO2017197809A1
WO2017197809A1 (PCT/CN2016/098126)
Authority
WO
WIPO (PCT)
Prior art keywords
language type
parameter
model
fundamental frequency
spectral
Prior art date
Application number
PCT/CN2016/098126
Other languages
English (en)
French (fr)
Inventor
李昊
康永国
Original Assignee
Baidu Online Network Technology (Beijing) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology (Beijing) Co., Ltd.
Priority to US16/099,257 (published as US10789938B2)
Publication of WO2017197809A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335Pitch control
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086Detection of language

Definitions

  • the present invention relates to the field of speech synthesis technology, and in particular, to a speech synthesis method and a speech synthesis device.
  • speech synthesis services are increasingly accepted and used by users.
  • a large proportion of users in the speech synthesis service are dual-language or multi-language users, and speech synthesis is increasingly applied to multi-language content. Therefore, there is a need for multilingual speech synthesis, among which Chinese and English mixed reading is the most common.
  • users' usual requirements for multilingual speech synthesis are, first, intelligibility, followed by accurate pronunciation, naturalness, and a uniform timbre.
  • now that speech synthesis technology has largely solved intelligibility, how to synthesize natural, accurate multilingual speech with a uniform timbre has become a technical challenge for speech synthesis.
  • the current problems are: (1) using data from different native speakers for different languages leads to a non-uniform synthesized timbre, which affects the naturalness of the synthesis and the user experience; (2) with the approach of using one multilingual speaker's data, most speakers are not authentic in languages other than their native one and have an accent, differing considerably from native speakers, which degrades the user experience; speech synthesized from such data has non-standard pronunciation in every language except the speaker's native one, while speakers whose pronunciation is standard in several languages are usually professionals, so the cost of data collection is high.
  • the object of the present invention is to solve at least one of the above technical problems to some extent.
  • a first object of the present invention is to propose a speech synthesis method.
  • the method can reduce the data cost and implementation difficulty of building the language basic models, reduce the dependence of multilingual synthesis on professional multilingual speaker data, and effectively synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
  • a second object of the present invention is to provide a speech synthesis apparatus.
  • a third object of the present invention is to propose a terminal.
  • a fourth object of the present invention is to provide a storage medium.
  • the speech synthesis method of the first aspect of the present invention includes: determining a language type to which the text information of a sentence to be synthesized belongs, wherein the language type includes a first language type and a second language type; determining a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, wherein the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module; determining a target timbre, adaptively transforming the first spectral parameter model and the second spectral parameter model according to the target timbre, and training on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate corresponding spectral parameters; training on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate corresponding fundamental frequency parameters, and adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre; and synthesizing the target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
  • the speech synthesis method of the embodiment of the present invention determines which language types the text to be synthesized contains, adaptively trains the spectral parameter model of each language type according to the target timbre, generates the corresponding spectral parameters with the adaptively trained spectral parameter models, and adjusts the generated fundamental frequency parameters of each language type according to the target timbre to obtain multilingual speech with a uniform timbre. It can be understood that the language basic models used are mostly models built from monolingual data, which reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
  • the speech synthesis apparatus of the second aspect of the present invention includes: a first determining module configured to determine a language type to which the text information of a sentence to be synthesized belongs, wherein the language type includes a first language type and a second language type; a second determining module configured to determine a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, wherein the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module; a third determining module configured to determine a target timbre; an adaptive transform module configured to adaptively transform the first spectral parameter model and the second spectral parameter model according to the target timbre; a spectral parameter generation module configured to train on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate corresponding spectral parameters; a fundamental frequency parameter generation module configured to train on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate corresponding fundamental frequency parameters; a fundamental frequency parameter adjustment module configured to adjust the fundamental frequency parameters of the first language type and the second language type according to the target timbre; and a speech synthesis module configured to synthesize the target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
  • the speech synthesis apparatus of the embodiment of the present invention determines which language types the text to be synthesized contains, adaptively trains the spectral parameter model of each language type according to the target timbre, generates the corresponding spectral parameters with the adaptively trained spectral parameter models, and adjusts the generated fundamental frequency parameters of each language type according to the target timbre to obtain multilingual speech with a uniform timbre. It can be understood that the language basic models used are mostly models built from monolingual data, which reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
  • a terminal of a third aspect of the present invention includes: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform the following operations: determining a language type to which the text information of a sentence to be synthesized belongs, wherein the language type includes a first language type and a second language type; determining a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, wherein the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module; determining a target timbre, adaptively transforming the first spectral parameter model and the second spectral parameter model according to the target timbre, and training on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate corresponding spectral parameters; training on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate corresponding fundamental frequency parameters, and adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre; and synthesizing the target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
  • a storage medium for storing an application for performing a speech synthesis method according to the first aspect of the present invention.
  • FIG. 1 is a flow chart of a speech synthesis method in accordance with one embodiment of the present invention.
  • FIG. 2 is a flow chart of a speech synthesis method in accordance with an embodiment of the present invention.
  • FIG. 3 is a diagram showing an example of a speech synthesis method according to an embodiment of the present invention.
  • FIG. 4 is a block diagram showing the structure of a speech synthesis apparatus according to an embodiment of the present invention.
  • FIG. 5 is a structural block diagram of a speech synthesis apparatus according to an embodiment of the present invention.
  • FIG. 6 is a block diagram showing the structure of a speech synthesis apparatus according to another embodiment of the present invention.
  • multilingual speech synthesis applications have gradually become needed in daily life. Taking a news application on a mobile terminal as an example, when a user listens to news through the application's speech synthesis function, the news content, and technology news in particular, is mixed with a large amount of English in addition to Chinese; this application is therefore a typical case of multilingual speech synthesis, and the naturalness, accuracy, and timbre uniformity of the synthesized speech all affect the user experience.
  • the present invention proposes a speech synthesis method and apparatus to effectively solve the problems of pronunciation accuracy and timbre uniformity. Specifically, a speech synthesis method and a speech synthesis apparatus according to embodiments of the present invention are described below with reference to the drawings.
  • the speech synthesis method of the embodiments of the present invention can be applied to electronic devices having a speech synthesis function, such as mobile terminals (e.g., mobile phones, tablet computers, personal digital assistants) and terminals (e.g., PCs), and is applicable to scenarios in which text in multiple languages is synthesized into multilingual speech.
  • the speech synthesis method may include:
  • S101 Determine a language type to which the sentence text information to be synthesized belongs, where the language type includes a first language type and a second language type.
  • the text information of the sentence to be synthesized may be acquired first; it can be understood as the textual content of the sentence text to be synthesized. Language identification may then be performed on this text information to determine the language type to which the sentence text belongs.
  • the textual content in the text information of the sentence to be synthesized may be segmented into clauses according to the characters and context of each language, and the language type of each clause segment determined. The language type may include a first language type and a second language type, where there may be one or more second language types; that is, the sentence text in the text to be synthesized may belong to two languages, or to three or more language types.
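  • as a concrete illustration of this segmentation step, a minimal sketch follows; it splits mixed Chinese-English text purely by Unicode character ranges, whereas the method above also uses context information, and the function name and rule are assumptions for illustration only.

```python
# Hypothetical sketch: split mixed Chinese/English text into language runs
# using character ranges only. The patent's method also uses context; this
# simplification is an assumption made for illustration.
import re

def split_by_language(text):
    """Return a list of (segment, language) pairs, e.g. ('technology news', 'en')."""
    segments = []
    pattern = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z\s'\-]*")
    for match in pattern.finditer(text):
        run = match.group().strip()
        lang = "zh" if re.match(r"[\u4e00-\u9fff]", run) else "en"
        segments.append((run, lang))
    return segments

print(split_by_language("今天的 technology news 很精彩"))
# [('今天的', 'zh'), ('technology news', 'en'), ('很精彩', 'zh')]
```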
  • S102: Determine a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, where the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module.
  • after determining which language types the text information of the sentence to be synthesized belongs to, the language basic models corresponding to those language types can be determined. For example, if the text of the sentence to be synthesized includes mixed Chinese-English sentence text, it can be determined that its language types include the Chinese language type and the English language type, and the Chinese basic model corresponding to Chinese and the English basic model corresponding to English can then be determined.
  • each language basic model may include context-dependent HMMs (Hidden Markov Models) and a state-clustering decision tree corresponding to the HMMs.
  • each state of the HMM is represented as a Gaussian model, and the function of the decision tree is to cluster the training data so that each state obtains sufficient training data.
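  • the sketch below illustrates these two ideas, each HMM emission state modelled as a Gaussian, and a decision-tree split that only separates states when both sides keep enough training data; the class and function names, the single-split tree, and the minimum-data threshold are illustrative assumptions rather than the patent's implementation.

```python
# Illustrative sketch: one Gaussian per HMM state, plus a single decision-tree
# split that clusters context-dependent states so each leaf keeps enough data.
import numpy as np

class GaussianState:
    """One HMM emission state modelled as a diagonal Gaussian."""
    def __init__(self, frames):
        frames = np.asarray(frames, dtype=float)
        self.mean = frames.mean(axis=0)
        self.var = frames.var(axis=0) + 1e-6   # variance floor

def split_states(contexts, frames, question, min_frames=100):
    """Split training frames by a yes/no context question, keeping leaves well-populated."""
    yes = [f for c, f in zip(contexts, frames) if question(c)]
    no = [f for c, f in zip(contexts, frames) if not question(c)]
    if len(yes) < min_frames or len(no) < min_frames:
        return {"all": GaussianState(frames)}   # not enough data: do not split
    return {"yes": GaussianState(yes), "no": GaussianState(no)}
```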
  • the first basic model can be understood as a model built from the speech data of a training speaker whose native language is the first language type; this speaker may be able to speak the second language, but no requirement is placed on how standard the speaker's second-language pronunciation is.
  • the second basic model can be understood as a model built from the speech data of a training speaker whose native language is the second language type; this speaker may be able to speak the first language, but no requirement is placed on how standard the speaker's first-language pronunciation is.
  • when training the multilingual speech synthesis models, it is therefore not necessary for any single speaker to have highly standard bilingual pronunciation; it suffices for one language to be standard, and the basic models of other languages can be trained with data from speakers whose pronunciation of those languages is standard. This reduces the dependence on professional multilingual speaker data in multilingual synthesis, allows more monolingual data to be used, and lowers data cost and implementation difficulty.
  • S103: Determine a target timbre, adaptively transform the first spectral parameter model and the second spectral parameter model according to the target timbre, and train on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate the corresponding spectral parameters.
  • the target timbre can be determined in many ways: for example, by determining the category of the user's native language, by checking which language is selected in the language settings of the electronic device (such as a mobile terminal) used by the user, or by other means not enumerated here.
  • the specific implementation of determining the target timbre may be as follows: acquire the user information of the user (such as a user name or account name), and determine the category of the user's native language according to the user information, wherein the category of the native language is included among the language types; the timbre of the training speaker corresponding to the basic model of the category of the user's native language is then used as the target timbre.
  • for example, the user information of user A is acquired, and it is determined from this information that user A's native language is Chinese; the timbre of the training speaker corresponding to the basic model of user A's native language, Chinese (i.e., the Chinese basic model), can then be used as the target timbre.
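  • a trivial sketch of this lookup is shown below; the model registry and speaker identifiers are hypothetical and stand in for whatever user-profile lookup an implementation would actually use.

```python
# Hypothetical sketch: pick the target timbre as the training speaker of the
# basic model whose language matches the user's native language.
BASE_MODELS = {"zh": {"speaker": "zh_training_speaker"}, "en": {"speaker": "en_training_speaker"}}

def target_timbre_for(native_language):
    return BASE_MODELS[native_language]["speaker"]

print(target_timbre_for("zh"))  # -> 'zh_training_speaker'
```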
  • after the target timbre is determined, the first spectral parameter model and the second spectral parameter model may each be adaptively transformed according to it, so that the transformed first and second spectral parameter models can generate spectral parameters with the same or similar timbre. That is, after the target timbre is determined, the first basic model and the second basic model may be adaptively trained according to it, so that the spectral parameters generated with the adaptively trained first and second basic models are the same or similar. For the specific implementation of the adaptive transform, see the description of the subsequent embodiments.
  • S104: Train on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate the corresponding fundamental frequency parameters, and adjust the fundamental frequency parameters of the first language type and the second language type according to the target timbre.
  • the first and second fundamental frequency parameter modules may be used to train on the sentence text to be synthesized for each language type in the text information, generating the fundamental frequency parameters corresponding to the first-language sentence text and those corresponding to the second-language sentence text.
  • after the fundamental frequency parameters are generated, they may be adjusted according to the target timbre; for example, the global mean and variance of the fundamental frequency curves in the first-language and second-language fundamental frequency parameters may be uniformly adjusted to equal the global mean and variance of the fundamental frequency curve in the parameters corresponding to the target timbre, so that the speech timbre obtained with the first basic model and that obtained with the second basic model are unified into the target timbre, ensuring that multilingual text is synthesized into multilingual speech with a uniform timbre.
  • S105 Synthesize the target voice according to the spectral parameter of the first language type, the spectral parameter of the second language type, the adjusted fundamental frequency parameter of the first language type, and the fundamental frequency parameter of the second language type.
  • the spectral parameter of the first language type, the spectral parameter of the second language type, the adjusted fundamental frequency parameter of the first language type, and the fundamental frequency parameter of the second language type may be combined into a target voice via a vocoder.
  • the target speech is multi-lingual speech.
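  • the sketch below shows this final waveform step in that spirit: the per-segment parameter tracks are concatenated and passed through a vocoder. The WORLD vocoder (pyworld) is used only as an example, since the patent does not name a specific vocoder, and the segment layout is an assumption.

```python
# Sketch only: combine per-segment spectral and F0 tracks and vocode them.
# pyworld is one possible vocoder; the patent does not prescribe one.
import numpy as np
import pyworld

def vocode(segments, fs=16000, frame_period=5.0):
    """segments: list of (f0, spectrogram, aperiodicity) arrays in utterance order."""
    f0 = np.concatenate([s[0] for s in segments])
    sp = np.vstack([s[1] for s in segments])
    ap = np.vstack([s[2] for s in segments])
    return pyworld.synthesize(f0, sp, ap, fs, frame_period)
```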
  • the speech synthesis method of the embodiment of the present invention first determines the language type to which the text information of the sentence to be synthesized belongs, wherein the language type includes a first language type and a second language type; then determines the first basic model corresponding to the first language type and the second basic model corresponding to the second language type; then determines a target timbre, adaptively transforms the first and second basic models according to it, and trains on the text information of the sentence to be synthesized according to the adaptively transformed first and second basic models to generate the corresponding spectral parameters and fundamental frequency parameters; then adjusts the fundamental frequency parameters of the first language type and the second language type according to the target timbre; and finally synthesizes the target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
  • FIG. 2 is a flow chart of a speech synthesis method in accordance with an embodiment of the present invention.
  • the target timbre may be the speaker timbre towards which the user would prefer the synthesized speech to lean; for example, it may be the timbre of the training speaker corresponding to the first basic model, or that of the training speaker corresponding to the second basic model.
  • in the following, the target timbre is taken to be the timbre of the training speaker corresponding to the first basic model as an example. As shown in FIG. 2, when the target timbre is the timbre of the training speaker corresponding to the first basic model, the speech synthesis method can include:
  • S201 Determine a language type to which the sentence text information to be synthesized belongs, where the language type includes a first language type and a second language type.
  • S202: Determine a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, where the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module.
  • S203 Determine a target timbre, and adaptively transform the second spectral parameter model according to the target timbre.
  • when the target timbre is determined to be the timbre of the training speaker corresponding to the first basic model, the parameters generated by the second basic model are adjusted to match that speaker's timbre, while the first basic model can be used directly for parameter generation without adaptive training.
  • as an example, the training speech data of the training speaker corresponding to the first basic model for the second language type may be acquired, and the second spectral parameter model may be adaptively transformed according to this data. It can be understood that the adaptive transformation of the spectral parameter model is done before parameter generation.
  • specifically, when this second-language training speech data (e.g., second-language training sentences and their annotations) has been acquired, it is taken as input and clustered through the decision tree of the second spectral parameter model to obtain the training data of each state; the per-state training data is then used to estimate a transformation matrix for the HMM state of each spectral parameter, so that the state Gaussian models, once the transformation matrices are applied, can subsequently generate spectral parameters close to those of the first-language training speaker.
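  • the following sketch illustrates the kind of transform-matrix estimation described here, in the spirit of MLLR mean adaptation: one shared affine transform is estimated by least squares from paired state means and then applied to the second-language model's state Gaussians. A real system would estimate transforms per regression class with an EM procedure; the names and the least-squares shortcut are assumptions for illustration.

```python
# Illustrative sketch of estimating and applying a shared affine transform
# over HMM state means (MLLR-style mean adaptation, simplified to least squares).
import numpy as np

def estimate_mean_transform(model_means, adapt_means):
    """model_means, adapt_means: (n_states, dim) arrays of paired state means."""
    X = np.hstack([model_means, np.ones((len(model_means), 1))])  # add bias column
    W, *_ = np.linalg.lstsq(X, adapt_means, rcond=None)           # (dim + 1, dim)
    return W

def transform_mean(W, mean):
    """Apply the estimated transform to one state mean."""
    return np.append(mean, 1.0) @ W

# Toy usage with random 2-D "spectral" state means.
rng = np.random.default_rng(0)
mu_model = rng.normal(size=(8, 2))
mu_adapt = mu_model @ np.array([[0.9, 0.1], [0.0, 1.1]]) + 0.5
W = estimate_mean_transform(mu_model, mu_adapt)
print(transform_mean(W, mu_model[0]))
```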
  • S204: Train on the to-be-synthesized sentences of the first language type in the text information according to the first spectral parameter model to generate the spectral parameters of the first language type, and train on the to-be-synthesized sentences of the second language type according to the adaptively transformed second spectral parameter model to generate the spectral parameters of the second language type.
  • that is, the to-be-synthesized sentences corresponding to the first language type may be trained on directly with the first spectral parameter model to generate the first-language spectral parameters, while the to-be-synthesized sentences corresponding to the second language type are trained on with the adaptively transformed second spectral parameter model to generate the second-language spectral parameters.
  • S205: Train on the text information of the sentence to be synthesized according to the first fundamental frequency parameter module and the second fundamental frequency parameter module to generate the corresponding fundamental frequency parameters, and adjust the fundamental frequency parameters of the second language type according to the target timbre.
  • the to-be-synthesized sentences corresponding to the language types in the synthesized sentence text information are correspondingly trained to generate a fundamental frequency parameter corresponding to each language type, that is, The fundamental frequency parameter of the first language type and the fundamental frequency parameter of the second language type.
  • the fundamental frequency parameter of the first language type may not be adjusted, but the fundamental frequency parameter of the second language type needs to be adjusted.
  • the specific implementation of adjusting the fundamental frequency parameters of the second language type according to the target timbre may include: first acquiring the training speech data of the training speaker corresponding to the first basic model for the second language type; then training the second fundamental frequency parameter model according to this training speech data to generate the target-speaker fundamental frequency parameters corresponding to the target timbre; and finally adjusting the fundamental frequency parameters of the second language type according to the target-speaker fundamental frequency parameters.
  • the adjustment of the fundamental frequency parameter is completed after the parameter is generated.
  • specifically, the training speech data of the training speaker corresponding to the first basic model for the second language type (e.g., including second-language training sentences and their annotations) may be acquired first and used as input; it is clustered through the decision tree of the second fundamental frequency parameter model to obtain the training data of each state, and the per-state training data is used to train the HMM state of each fundamental frequency, yielding the Gaussian parameters of the HMM states; the result is called the target-speaker fundamental frequency model.
  • at synthesis time, parameters are first generated with the target-speaker fundamental frequency model, and the global mean and variance of the generated fundamental frequency curve are computed and saved.
  • then the fundamental frequency parameters are generated with the second basic model, and the generated fundamental frequency curve is linearly transformed so that its mean and variance become equal to the saved global mean and variance of the curve generated with the target-speaker fundamental frequency model, completing the adjustment of the fundamental frequency curve.
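  • a minimal sketch of this linear F0 adjustment is given below: the F0 curve generated by the second basic model is shifted and scaled so that its global mean and variance match those saved from the target-speaker fundamental frequency model. Treating zero values as unvoiced frames is a simplifying assumption.

```python
# Sketch: linearly transform an F0 curve so its global mean/variance match the
# statistics of the target-speaker F0 model (zeros treated as unvoiced frames).
import numpy as np

def match_f0_statistics(f0_source, target_mean, target_std):
    f0 = np.asarray(f0_source, dtype=float)
    voiced = f0 > 0
    src_mean, src_std = f0[voiced].mean(), f0[voiced].std()
    adjusted = f0.copy()
    adjusted[voiced] = (f0[voiced] - src_mean) / src_std * target_std + target_mean
    return adjusted

# Toy usage: move an English-model F0 track towards Chinese-speaker statistics.
f0_en = np.array([0.0, 110.0, 118.0, 0.0, 125.0, 130.0])
print(match_f0_statistics(f0_en, target_mean=210.0, target_std=30.0))
```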
  • S206 Synthesize the target voice according to the spectral parameter of the first language type, the spectral parameter of the second language type, the fundamental frequency parameter of the first language type, and the adjusted fundamental frequency parameter of the second language type.
  • the timbre of the second-language speech is thus converted into the timbre of the training speaker corresponding to the first basic model by means of adaptation and fundamental frequency parameter adjustment, while the original duration and intonation information of the second-language speech is retained, so that the second-language speech, which the training speaker of the first basic model does not pronounce authentically, becomes close to the pronunciation of the training speaker corresponding to the second basic model.
  • as an example, suppose the first basic model is a Chinese basic model and the second basic model is an English basic model.
  • the Chinese basic model is a model built from the bilingual speech data of a Chinese-English bilingual speaker whose native language is Chinese, and the English basic model is a model built from the English speech data of a native English speaker; no requirement is placed on how standard the Chinese native speaker's English pronunciation is.
  • as shown in FIG. 3, after the text information of the sentence to be synthesized is acquired, the textual content of the text to be synthesized may be segmented into clauses according to the characters and context of each language, and the language of each sentence segment determined (S301). Since the target timbre needs to be matched to the Chinese speaker's pronunciation, the Chinese basic model is used directly for parameter generation, while the English basic model needs to be converted. That is, before parameter generation, taking the conversion of the English model to the Chinese speaker's timbre as an example, the Chinese speaker's English training sentence data (such as English sentences and their annotations) is acquired as input and clustered through the decision tree of the English spectral parameter model in the English basic model to obtain the training data of each state; this data is used to estimate a transformation matrix for the HMM state of each spectral parameter, so that the state Gaussian models after applying the transformation matrices can generate spectral parameters close to the Chinese speaker's, which are then used for parameter generation (S302).
  • after parameter generation, the Chinese speaker's English training sentence data (such as English sentences and their annotations) is taken as input and clustered through the decision tree of the English fundamental frequency parameter model in the English basic model to obtain the training data of each state, and this data is used to train the HMM state of each fundamental frequency, yielding the Gaussian parameters of the HMM states, called the target-speaker fundamental frequency model. At synthesis time, parameters are first generated with the target-speaker fundamental frequency model, and the global mean and variance of the generated fundamental frequency curve are computed and saved. Then the fundamental frequency parameters are generated with the English basic model, and the generated fundamental frequency curve is linearly transformed so that its mean and variance become equal to the global mean and variance of the curve generated with the target-speaker fundamental frequency model, completing the conversion of the fundamental frequency curve (S303).
  • finally, the spectral parameters and fundamental frequency parameters generated for the Chinese sentence text, the spectral parameters for the English sentence text obtained after adaptation, and the adjusted fundamental frequency parameters for the English sentence text are passed through the vocoder for speech synthesis to obtain the mixed Chinese-English speech (S304).
  • in summary, the speech synthesis method of the embodiment of the present invention synthesizes speech with a uniform timbre and standard pronunciation without relying on data from a single speaker who is standard in multiple languages, which reduces the dependence on professional multilingual speaker data in multilingual synthesis, allows more monolingual data to be used, and lowers data cost and implementation difficulty.
  • the speech synthesis method of the embodiment of the present invention determines which language types the text to be synthesized contains; when the target timbre is determined to be the timbre of the training speaker corresponding to the first basic model, the second spectral parameter model is adaptively transformed according to the target timbre, while the first basic model can be used directly for parameter generation without adaptive training; the corresponding second-language spectral parameters are generated with the adaptively transformed second spectral parameter model, and the generated second-language fundamental frequency parameters are adjusted according to the target timbre, so that the timbre of the second-language speaker is adjusted to be the same as or similar to that of the first-language speaker. This reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
  • corresponding to the speech synthesis methods provided by the above embodiments, an embodiment of the present invention further provides a speech synthesis apparatus; since it corresponds to those methods, the implementations of the foregoing speech synthesis method are also applicable to the speech synthesis apparatus provided in this embodiment and are not described again in detail here.
  • FIG. 4 is a block diagram showing the structure of a speech synthesis apparatus according to an embodiment of the present invention.
  • as shown in FIG. 4, the speech synthesis apparatus may include: a first determining module 10, a second determining module 20, a third determining module 30, an adaptive transform module 40, a spectral parameter generation module 50, a fundamental frequency parameter generation module 60, a fundamental frequency parameter adjustment module 70, and a speech synthesis module 80.
  • the first determining module 10 is configured to determine a language type to which the sentence text information to be synthesized belongs, where the language type includes a first language type and a second language type.
  • the second determining module 20 is configured to determine a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, where the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module.
  • the third determination module 30 can be used to determine a target timbre.
  • the third determining module 30 may include: a first determining unit 31 and a second determining unit 32.
  • the first determining unit 31 is configured to acquire user information of the user, and determine, according to the user information, the category of the user's native language, wherein the category of the native language belongs to the language type.
  • the second determining unit 32 is configured to use the timbre of the training speaker corresponding to the basic model of the category to which the user's native language belongs as the target timbre.
  • the adaptive transform module 40 is configured to adaptively transform the first spectral parameter model and the second spectral parameter model according to the target timbre.
  • the spectral parameter generation module 50 is configured to train the synthesized sentence text information according to the adaptively transformed first spectral parameter model and the second spectral parameter model to generate corresponding spectral parameters.
  • the fundamental frequency parameter generation module 60 is configured to train on the text information of the sentence to be synthesized according to the first fundamental frequency parameter module and the second fundamental frequency parameter module to generate the corresponding fundamental frequency parameters.
  • the fundamental frequency parameter adjustment module 70 is configured to adjust the fundamental frequency parameters of the first language type and the second language type according to the target timbre.
  • the speech synthesis module 80 can be configured to synthesize the target speech according to the spectral parameter of the first language type, the spectral parameter of the second language type, the adjusted fundamental frequency parameter of the first language type, and the fundamental frequency parameter of the second language type.
  • the adaptive transform module 40 is further configured to adaptively transform the second spectral parameter model according to the target timbre.
  • the spectral parameter generation module 50 is further configured to train on the to-be-synthesized sentences of the first language type in the text information according to the first spectral parameter model to generate the spectral parameters of the first language type, and to train on the to-be-synthesized sentences of the second language type according to the adaptively transformed second spectral parameter model to generate the spectral parameters of the second language type.
  • the fundamental frequency parameter adjustment module 70 is further configured to adjust the fundamental frequency parameters of the second language type according to the target timbre.
  • the speech synthesis module 80 is further configured to synthesize the target speech according to the spectral parameter of the first language type, the spectral parameter of the second language type, the fundamental frequency parameter of the first language type, and the adjusted fundamental frequency parameter of the second language type.
  • the adaptive transform module 40 may include an obtaining unit 41 and an adaptive transform unit 42.
  • the acquiring unit 41 is configured to acquire training voice data of the training speaker corresponding to the first basic model for the second language type.
  • the adaptive transform unit 42 is configured to adaptively transform the second spectral parameter model according to the training speech data of the training speaker corresponding to the first basic model for the second language type.
  • the fundamental frequency parameter adjustment module 70 may include an acquisition unit 71, a target-speaker fundamental frequency parameter generation unit 72, and a fundamental frequency parameter adjustment unit 73.
  • the acquisition unit 71 is configured to acquire the training speech data of the training speaker corresponding to the first basic model for the second language type.
  • the target-speaker fundamental frequency parameter generation unit 72 is configured to train the second fundamental frequency parameter model according to that training speech data to generate the target-speaker fundamental frequency parameters corresponding to the target timbre.
  • the fundamental frequency parameter adjustment unit 73 is configured to adjust the fundamental frequency parameters of the second language type according to the target-speaker fundamental frequency parameters.
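  • the sketch below mirrors this module structure as a plain Python object, with the call order of modules 10–80 shown in synthesize(); the callables and their signatures are assumptions, not an API defined by the patent.

```python
# Illustrative sketch: the apparatus as a container of callables, one per module.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpeechSynthesisApparatus:
    determine_languages: Callable      # first determining module 10
    select_base_models: Callable       # second determining module 20
    determine_target_timbre: Callable  # third determining module 30
    adapt_spectral_models: Callable    # adaptive transform module 40
    generate_spectra: Callable         # spectral parameter generation module 50
    generate_f0: Callable              # fundamental frequency parameter generation module 60
    adjust_f0: Callable                # fundamental frequency parameter adjustment module 70
    vocode: Callable                   # speech synthesis module 80

    def synthesize(self, text):
        segments = self.determine_languages(text)
        models = self.select_base_models(segments)
        target = self.determine_target_timbre()
        adapted = self.adapt_spectral_models(models, target)
        spectra = self.generate_spectra(segments, adapted)
        f0 = self.adjust_f0(self.generate_f0(segments, models), target)
        return self.vocode(spectra, f0)
```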
  • the speech synthesis apparatus of the embodiment of the present invention determines which language types the text to be synthesized contains, adaptively trains the spectral parameter model of each language type according to the target timbre, generates the corresponding spectral parameters with the adaptively trained spectral parameter models, and adjusts the generated fundamental frequency parameters of each language type according to the target timbre to obtain multilingual speech with a uniform timbre. It can be understood that the language basic models used are mostly models built from monolingual data, which reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
  • the present invention also proposes a terminal comprising: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform the operations of the speech synthesis method described in the above embodiments.
  • the present invention also provides a storage medium for storing an application for executing the speech synthesis method according to any of the above embodiments of the present invention.
  • first and second are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated.
  • features defining “first” or “second” may include at least one of the features, either explicitly or implicitly.
  • the meaning of “multiple” is at least two, for example, two, three, etc., unless specifically defined otherwise.
  • a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, an instruction execution system, apparatus, or device.
  • more specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM).
  • the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
  • portions of the invention may be implemented in hardware, software, firmware or a combination thereof.
  • multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system.
  • for example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
  • each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Machine Translation (AREA)

Abstract

A speech synthesis method and apparatus. The method includes: determining a language type to which text information of a sentence to be synthesized belongs, where the language type includes a first language type and a second language type (S101); determining a first basic model corresponding to the first language type and a second basic model corresponding to the second language type (S102); determining a target timbre, adaptively transforming the first basic model and the second basic model respectively according to the target timbre, and training on the text information of the sentence to be synthesized according to the adaptively transformed first and second basic models to generate corresponding spectral parameters and fundamental frequency parameters (S103); adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre (S104); and synthesizing target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type (S105).

Description

Speech Synthesis Method and Speech Synthesis Apparatus
Cross-Reference to Related Application
This application claims priority to Chinese Patent Application No. 201610329738.1, entitled "Speech synthesis method and speech synthesis device", filed by Baidu Online Network Technology (Beijing) Co., Ltd. on May 18, 2016.
Technical Field
The present invention relates to the field of speech synthesis technology, and in particular to a speech synthesis method and a speech synthesis apparatus.
Background
With the development of speech synthesis technology and the spread of its applications, speech synthesis services are increasingly accepted and used by users. A large proportion of speech synthesis users are bilingual or multilingual, and speech synthesis is increasingly applied to multilingual content. This has created a demand for multilingual speech synthesis, of which mixed Chinese-English reading is the most common case. Users' usual requirements for multilingual speech synthesis are, first, intelligibility, followed by accurate pronunciation, naturalness, and a uniform timbre. Now that current speech synthesis technology has largely solved intelligibility, how to synthesize natural, accurate multilingual speech with a uniform timbre has become a technical challenge for speech synthesis.
In the related art, where multilingual synthesis is involved, different languages are usually modelled with data from different speakers, or data is sought from a speaker whose pronunciation is fairly standard in every language.
The current problems are: (1) using data from different native speakers for different languages leads to a non-uniform synthesized timbre, which affects the naturalness of the synthesis and the user experience; (2) with the approach of using one multilingual speaker's data, most speakers are not authentic in languages other than their native one and have an accent, differing considerably from native speakers, which degrades the user experience; speech synthesized from such data has non-standard pronunciation in every language except the speaker's native one, while speakers whose pronunciation is standard in several languages are usually professionals, so the cost of data collection is high.
Therefore, how to synthesize text in multiple languages into natural, accurate multilingual speech with a uniform timbre at low cost and high efficiency has become an urgent problem to be solved.
Summary of the Invention
The object of the present invention is to solve at least one of the above technical problems to some extent.
To this end, a first object of the present invention is to propose a speech synthesis method. The method can reduce the data cost and implementation difficulty of building the language basic models, reduce the dependence of multilingual synthesis on professional multilingual speaker data, and effectively synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
A second object of the present invention is to propose a speech synthesis apparatus.
A third object of the present invention is to propose a terminal.
A fourth object of the present invention is to propose a storage medium.
To achieve the above objects, a speech synthesis method according to an embodiment of the first aspect of the present invention includes: determining a language type to which text information of a sentence to be synthesized belongs, wherein the language type includes a first language type and a second language type; determining a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, wherein the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module; determining a target timbre, adaptively transforming the first spectral parameter model and the second spectral parameter model respectively according to the target timbre, and training on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate corresponding spectral parameters; training on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate corresponding fundamental frequency parameters, and adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre; and synthesizing target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
The speech synthesis method of the embodiments of the present invention determines which language types the text to be synthesized contains, adaptively trains the spectral parameter model of each language type according to the target timbre, generates the corresponding spectral parameters with the adaptively trained spectral parameter models, and adjusts the generated fundamental frequency parameters of each language type according to the target timbre to obtain multilingual speech with a uniform timbre. It can be understood that the language basic models used are mostly models built from monolingual data, which reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
To achieve the above objects, a speech synthesis apparatus according to an embodiment of the second aspect of the present invention includes: a first determining module configured to determine a language type to which text information of a sentence to be synthesized belongs, wherein the language type includes a first language type and a second language type; a second determining module configured to determine a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, wherein the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module; a third determining module configured to determine a target timbre; an adaptive transform module configured to adaptively transform the first spectral parameter model and the second spectral parameter model respectively according to the target timbre; a spectral parameter generation module configured to train on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate corresponding spectral parameters; a fundamental frequency parameter generation module configured to train on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate corresponding fundamental frequency parameters; a fundamental frequency parameter adjustment module configured to adjust the fundamental frequency parameters of the first language type and the second language type according to the target timbre; and a speech synthesis module configured to synthesize target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
The speech synthesis apparatus of the embodiments of the present invention determines which language types the text to be synthesized contains, adaptively trains the spectral parameter model of each language type according to the target timbre, generates the corresponding spectral parameters with the adaptively trained spectral parameter models, and adjusts the generated fundamental frequency parameters of each language type according to the target timbre to obtain multilingual speech with a uniform timbre. It can be understood that the language basic models used are mostly models built from monolingual data, which reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
To achieve the above objects, a terminal according to an embodiment of the third aspect of the present invention includes: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform the following operations: determining a language type to which text information of a sentence to be synthesized belongs, wherein the language type includes a first language type and a second language type; determining a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, wherein the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module; determining a target timbre, adaptively transforming the first spectral parameter model and the second spectral parameter model respectively according to the target timbre, and training on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate corresponding spectral parameters; training on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate corresponding fundamental frequency parameters, and adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre; and synthesizing target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
To achieve the above objects, a storage medium according to an embodiment of the fourth aspect of the present invention is configured to store an application program for executing the speech synthesis method according to the embodiment of the first aspect of the present invention.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speech synthesis method according to a specific embodiment of the present invention;
FIG. 3 is a diagram of an example of a speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a structural block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of a speech synthesis apparatus according to a specific embodiment of the present invention;
FIG. 6 is a structural block diagram of a speech synthesis apparatus according to another specific embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention, and are not to be construed as limiting the present invention.
It can be understood that multilingual speech synthesis applications have gradually become needed in daily life. Taking a news application on a mobile terminal as an example, when a user listens to news through the application's speech synthesis function, the news content, especially technology news, is mixed with a large amount of English in addition to Chinese; this application is therefore a typical case of multilingual speech synthesis, and the naturalness and accuracy of the synthesized speech and whether its timbre is uniform all affect the user experience. To this end, the present invention proposes a speech synthesis method and apparatus to effectively solve the problems of pronunciation accuracy and timbre uniformity. Specifically, a speech synthesis method and a speech synthesis apparatus according to embodiments of the present invention are described below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a speech synthesis method according to an embodiment of the present invention. It should be noted that the speech synthesis method of the embodiments of the present invention can be applied to electronic devices having a speech synthesis function, such as mobile terminals (e.g., mobile phones, tablet computers, personal digital assistants) and terminals (e.g., PCs). In addition, the speech synthesis method of the embodiments of the present invention is applicable to scenarios in which text in multiple languages is synthesized into multilingual speech.
As shown in FIG. 1, the speech synthesis method may include:
S101: Determine the language type to which the text information of the sentence to be synthesized belongs, where the language type includes a first language type and a second language type.
Specifically, the text information of the sentence to be synthesized may be acquired first; it can be understood as the textual content of the sentence text to be synthesized. Language identification may then be performed on this text information to determine the language type to which the sentence text in it belongs.
As an example, the textual content in the text information of the sentence to be synthesized may be segmented into clauses according to the characters and context of each language, and the language type of each clause segment determined. The language type may include a first language type and a second language type, where there may be one or more second language types; that is, the sentence text in the text to be synthesized may belong to two languages, or to three or more language types.
S102: Determine a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, where the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module.
Specifically, after determining which language types the text information of the sentence to be synthesized belongs to, the language basic models corresponding to those language types can be determined. For example, if the text of the sentence to be synthesized includes mixed Chinese-English sentence text, it can be determined that its language types include the Chinese language type and the English language type, and the Chinese basic model corresponding to Chinese and the English basic model corresponding to English can then be determined.
It can be understood that each language basic model may include context-dependent HMMs (Hidden Markov Models) and a state-clustering decision tree corresponding to the HMMs, where each state of the HMM is represented as a Gaussian model, and the role of the decision tree is to cluster the training data so that every state obtains sufficient training data.
It should be noted that the first basic model can be understood as a model built from the speech data of a training speaker whose native language is the first language type; this training speaker may be able to speak the second language, but no requirement is placed on how standard the speaker's second-language pronunciation is. The second basic model can be understood as a model built from the speech data of a training speaker whose native language is the second language type; this training speaker may be able to speak the first language, but no requirement is placed on how standard the speaker's first-language pronunciation is.
In other words, when training the multilingual speech synthesis models, it is not necessary for any single speaker to have highly standard bilingual pronunciation; it suffices for one language to be standard, and the basic models of other languages can be trained with data from speakers whose pronunciation of those languages is standard. This reduces the dependence on professional multilingual speaker data in multilingual synthesis, allows more monolingual data to be used, and lowers data cost and implementation difficulty.
S103: Determine a target timbre, adaptively transform the first spectral parameter model and the second spectral parameter model respectively according to the target timbre, and train on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate the corresponding spectral parameters.
It can be understood that in the embodiments of the present invention the target timbre can be determined in many ways: for example, by determining the category of the user's native language, by checking which language is selected in the language settings of the electronic device (such as a mobile terminal or terminal) used by the user, or by other means not enumerated here.
As an example, determining the target timbre may proceed as follows: acquire the user information of the user (such as a user name or account name), and determine the category of the user's native language according to the user information, where the category of the native language is included among the language types; then take the timbre of the training speaker corresponding to the basic model of the category of the user's native language as the target timbre. For example, the user information of user A is acquired, and it is determined from this information that user A's native language is Chinese; the timbre of the training speaker corresponding to the basic model of user A's native language, Chinese (i.e., the Chinese basic model), can then be used as the target timbre.
After the target timbre is determined, the first spectral parameter model and the second spectral parameter model may each be adaptively transformed according to it, so that the transformed first and second spectral parameter models can generate spectral parameters with the same or similar timbre. That is, after the target timbre is determined, the first and second basic models may be adaptively trained according to it, so that the spectral parameters generated with the adaptively trained first and second basic models are the same or similar. For the specific implementation of the adaptive transform, see the description of the subsequent embodiments.
S104: Train on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate the corresponding fundamental frequency parameters, and adjust the fundamental frequency parameters of the first language type and the second language type according to the target timbre.
Specifically, the first and second fundamental frequency parameter modules may be used to train on the sentence text to be synthesized for each language type in the text information, generating the fundamental frequency parameters corresponding to the first-language sentence text and those corresponding to the second-language sentence text. After the fundamental frequency parameters are generated, they may be adjusted according to the above target timbre; for example, the global mean and variance of the fundamental frequency curves in the first-language and second-language fundamental frequency parameters may be uniformly adjusted to equal the global mean and variance of the fundamental frequency curve in the parameters corresponding to the target timbre, so that the speech timbre obtained with the first basic model and that obtained with the second basic model are unified into the target timbre, ensuring that the multilingual text is synthesized into multilingual speech with a uniform timbre.
S105: Synthesize the target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
As an example, the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type may be synthesized into the target speech by a vocoder. It can be understood that the target speech is multilingual speech.
According to the speech synthesis method of the embodiments of the present invention, the language type to which the text information of the sentence to be synthesized belongs is first determined, where the language type includes a first language type and a second language type; the first basic model corresponding to the first language type and the second basic model corresponding to the second language type are then determined; a target timbre is determined, the first and second basic models are adaptively transformed according to it, and the text information of the sentence to be synthesized is trained on according to the adaptively transformed first and second basic models to generate the corresponding spectral and fundamental frequency parameters; the fundamental frequency parameters of the first and second language types are then adjusted according to the target timbre; finally, the target speech is synthesized from the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type. That is, it is first determined which language types the text to be synthesized contains; the spectral parameter model of each language type is then adaptively trained according to the target timbre, the corresponding spectral parameters are generated with the adaptively trained spectral parameter models, and the generated fundamental frequency parameters of each language type are adjusted according to the target timbre, so as to obtain multilingual speech with a uniform timbre. It can be understood that the language basic models used are mostly models built from monolingual data, which reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
FIG. 2 is a flowchart of a speech synthesis method according to a specific embodiment of the present invention.
It can be understood that the target timbre may be the speaker timbre towards which the user would prefer the synthesized speech to lean; for example, it may be the timbre of the training speaker corresponding to the first basic model, or that of the training speaker corresponding to the second basic model.
In the embodiments of the present invention, the target timbre is taken to be the timbre of the training speaker corresponding to the first basic model as an example. As shown in FIG. 2, when the target timbre is the timbre of the training speaker corresponding to the first basic model, the speech synthesis method may include:
S201: Determine the language type to which the text information of the sentence to be synthesized belongs, where the language type includes a first language type and a second language type.
S202: Determine a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, where the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module.
S203: Determine a target timbre, and adaptively transform the second spectral parameter model according to the target timbre.
It can be understood that when the target timbre is determined to be the timbre of the training speaker corresponding to the first basic model, the parameters generated by the second basic model are adjusted to match that speaker's timbre, while the first basic model can be used directly for parameter generation without adaptive training.
As an example, when the target timbre is determined to be the timbre of the training speaker corresponding to the first basic model, the training speech data of that training speaker for the second language type may be acquired, and the second spectral parameter model may be adaptively transformed according to this data. It can be understood that the adaptive transformation of the spectral parameter model is done before parameter generation.
Specifically, when the training speech data of the training speaker corresponding to the first basic model for the second language type (e.g., including second-language training sentences and their annotations) has been acquired, it may be taken as input and clustered through the decision tree of the second spectral parameter model to obtain the training data of each state, and the per-state training data is used to estimate a transformation matrix for the HMM state of each spectral parameter, so that the state Gaussian models, once the transformation matrices are applied, can subsequently generate spectral parameters close to those of the first-language training speaker.
S204: Train on the to-be-synthesized sentences of the first language type in the text information according to the first spectral parameter model to generate the spectral parameters of the first language type, and train on the to-be-synthesized sentences of the second language type according to the adaptively transformed second spectral parameter model to generate the spectral parameters of the second language type.
Specifically, when the target timbre is determined to be the timbre of the training speaker corresponding to the first basic model, the to-be-synthesized sentences of the first language type can be trained on directly with the first spectral parameter model to generate the first-language spectral parameters, while the to-be-synthesized sentences of the second language type are trained on with the adaptively transformed second spectral parameter model to generate the second-language spectral parameters.
S205: Train on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate the corresponding fundamental frequency parameters, and adjust the fundamental frequency parameters of the second language type according to the target timbre.
Specifically, the first and second fundamental frequency parameter modules may be used to train on the to-be-synthesized sentences of each language type in the text information, generating the fundamental frequency parameters corresponding to each language type, i.e., the fundamental frequency parameters of the first language type and those of the second language type.
It can be understood that when the target timbre is determined to be the timbre of the training speaker corresponding to the first basic model, the fundamental frequency parameters of the first language type need not be adjusted, while those of the second language type do need to be adjusted.
As an example, adjusting the fundamental frequency parameters of the second language type according to the target timbre may proceed as follows: first acquire the training speech data of the training speaker corresponding to the first basic model for the second language type; then train the second fundamental frequency parameter model according to this training speech data to generate the target-speaker fundamental frequency parameters corresponding to the target timbre; finally adjust the second-language fundamental frequency parameters according to the target-speaker fundamental frequency parameters.
It can be understood that the adjustment of the fundamental frequency parameters is completed after parameter generation. When the target timbre is determined to be the timbre of the training speaker corresponding to the first basic model, that speaker's training speech data for the second language type (e.g., including second-language training sentences and their annotations) may be acquired first and used as input; it is clustered through the decision tree of the second fundamental frequency parameter model to obtain the training data of each state, and the per-state data is used to train the HMM state of each fundamental frequency, yielding the Gaussian parameters of the HMM states; the result is called the target-speaker fundamental frequency model. At synthesis time, parameters are first generated with the target-speaker fundamental frequency model, and the global mean and variance of the generated fundamental frequency curve are computed and saved. Then the fundamental frequency parameters are generated with the second basic model, and the generated fundamental frequency curve is linearly transformed so that its mean and variance become equal to the saved global mean and variance of the curve generated with the target-speaker fundamental frequency model, completing the adjustment of the fundamental frequency curve.
S206: Synthesize the target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the fundamental frequency parameters of the first language type, and the adjusted fundamental frequency parameters of the second language type.
It can be understood that the timbre of the second-language speech is converted into the timbre of the training speaker corresponding to the first basic model by means of adaptation and fundamental frequency parameter adjustment, while the original duration and intonation information of the second-language speech is retained, so that the second-language speech, which the training speaker of the first basic model does not pronounce authentically, becomes close to the pronunciation of the training speaker corresponding to the second basic model.
To make the present invention clearer to those skilled in the art, the method of the present invention is described below taking the mixed synthesis of Chinese and English, unified to the timbre of a Chinese speaker, as an example.
For example, suppose the first basic model is a Chinese basic model and the second basic model is an English basic model; suppose the Chinese basic model is built from the bilingual speech data of a Chinese-English bilingual speaker whose native language is Chinese, and the English basic model is built from the English speech data of a native English speaker, where no requirement is placed on how standard the Chinese native speaker's English pronunciation is.
As shown in FIG. 3, after the text information of the sentence to be synthesized is acquired, the textual content of the text to be synthesized may be segmented into clauses according to the characters and context of each language, and the language of each sentence segment determined (S301). Since the target timbre needs to be adjusted to the Chinese speaker's pronunciation, the Chinese basic model is used directly for parameter generation, while the English basic model needs to be converted. That is, before parameter generation, taking the conversion of the English model to the Chinese speaker's timbre as an example, the Chinese speaker's English training sentence data (such as English sentences and their annotations) may be acquired as input and clustered through the decision tree of the English spectral parameter model in the English basic model to obtain the training data of each state; this data is used to estimate a transformation matrix for the HMM state of each spectral parameter, so that the state Gaussian models after applying the transformation matrices can generate spectral parameters close to the Chinese speaker's, which are then used for parameter generation (S302). After parameter generation, the Chinese speaker's English training sentence data (such as English sentences and their annotations) may be taken as input and clustered through the decision tree of the English fundamental frequency parameter model in the English basic model to obtain the training data of each state, and this data is used to train the HMM state of each fundamental frequency, yielding the Gaussian parameters of the HMM states, called the target-speaker fundamental frequency model. At synthesis time, parameters are first generated with the target-speaker fundamental frequency model, and the global mean and variance of the generated fundamental frequency curve are computed and saved. Then the fundamental frequency parameters are generated with the English basic model, and the generated fundamental frequency curve is linearly transformed so that its mean and variance become equal to the global mean and variance of the curve generated with the target-speaker fundamental frequency model, completing the conversion of the fundamental frequency curve (S303). Finally, the spectral and fundamental frequency parameters generated for the Chinese sentence text, the spectral parameters for the English sentence text obtained after adaptation, and the adjusted fundamental frequency parameters for the English sentence text are passed through the vocoder for speech synthesis to obtain the mixed Chinese-English speech (S304).
In summary, the speech synthesis method of the embodiments of the present invention synthesizes speech with a uniform timbre and standard pronunciation without relying on data from a single speaker who is standard in multiple languages, which reduces the dependence on professional multilingual speaker data in multilingual synthesis, allows more monolingual data to be used, and lowers data cost and implementation difficulty.
According to the speech synthesis method of the embodiments of the present invention, it is determined which language types the text to be synthesized contains; when the target timbre is determined to be the timbre of the training speaker corresponding to the first basic model, the second spectral parameter model is adaptively transformed according to the target timbre, while the first basic model can be used directly for parameter generation without adaptive training; the corresponding second-language spectral parameters are generated with the adaptively transformed second spectral parameter model, and the generated second-language fundamental frequency parameters are adjusted according to the target timbre, so that the second-language speaker's timbre is adjusted to be the same as or similar to the first-language speaker's timbre. This reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
Corresponding to the speech synthesis methods provided by the above embodiments, an embodiment of the present invention further provides a speech synthesis apparatus. Since the speech synthesis apparatus provided by the embodiment of the present invention corresponds to the speech synthesis methods provided by the above embodiments, the implementations of the foregoing speech synthesis method are also applicable to the speech synthesis apparatus provided in this embodiment and are not described again in detail here. FIG. 4 is a structural block diagram of a speech synthesis apparatus according to an embodiment of the present invention. As shown in FIG. 4, the speech synthesis apparatus may include: a first determining module 10, a second determining module 20, a third determining module 30, an adaptive transform module 40, a spectral parameter generation module 50, a fundamental frequency parameter generation module 60, a fundamental frequency parameter adjustment module 70, and a speech synthesis module 80.
Specifically, the first determining module 10 may be configured to determine the language type to which the text information of the sentence to be synthesized belongs, where the language type includes a first language type and a second language type.
The second determining module 20 may be configured to determine a first basic model corresponding to the first language type and a second basic model corresponding to the second language type, where the first basic model includes a first spectral parameter model and a first fundamental frequency parameter module, and the second basic model includes a second spectral parameter model and a second fundamental frequency parameter module.
The third determining module 30 may be configured to determine a target timbre. Specifically, in an embodiment of the present invention, as shown in FIG. 5, the third determining module 30 may include a first determining unit 31 and a second determining unit 32, where the first determining unit 31 is configured to acquire the user information of the user and determine the category of the user's native language according to the user information, the category of the native language being included among the language types, and the second determining unit 32 is configured to take the timbre of the training speaker corresponding to the basic model of the category of the user's native language as the target timbre.
The adaptive transform module 40 may be configured to adaptively transform the first spectral parameter model and the second spectral parameter model respectively according to the target timbre.
The spectral parameter generation module 50 may be configured to train on the text information of the sentence to be synthesized according to the adaptively transformed first and second spectral parameter models to generate the corresponding spectral parameters.
The fundamental frequency parameter generation module 60 may be configured to train on the text information of the sentence to be synthesized according to the first and second fundamental frequency parameter modules to generate the corresponding fundamental frequency parameters.
The fundamental frequency parameter adjustment module 70 may be configured to adjust the fundamental frequency parameters of the first language type and the second language type according to the target timbre.
The speech synthesis module 80 may be configured to synthesize the target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the adjusted fundamental frequency parameters of the first language type, and the fundamental frequency parameters of the second language type.
As an example, when the target timbre is the timbre of the training speaker corresponding to the first basic model, the adaptive transform module 40 is further configured to adaptively transform the second spectral parameter model according to the target timbre; the spectral parameter generation module 50 is further configured to train on the to-be-synthesized sentences of the first language type in the text information according to the first spectral parameter model to generate the first-language spectral parameters, and to train on the to-be-synthesized sentences of the second language type according to the adaptively transformed second spectral parameter model to generate the second-language spectral parameters; the fundamental frequency parameter adjustment module 70 is further configured to adjust the fundamental frequency parameters of the second language type according to the target timbre; and the speech synthesis module 80 is further configured to synthesize the target speech according to the spectral parameters of the first language type, the spectral parameters of the second language type, the fundamental frequency parameters of the first language type, and the adjusted fundamental frequency parameters of the second language type.
In this embodiment, as shown in FIG. 6, the adaptive transform module 40 may include an acquisition unit 41 and an adaptive transform unit 42, where the acquisition unit 41 is configured to acquire the training speech data of the training speaker corresponding to the first basic model for the second language type, and the adaptive transform unit 42 is configured to adaptively transform the second spectral parameter model according to that training speech data.
In this embodiment, as shown in FIG. 6, the fundamental frequency parameter adjustment module 70 may include an acquisition unit 71, a target-speaker fundamental frequency parameter generation unit 72, and a fundamental frequency parameter adjustment unit 73, where the acquisition unit 71 is configured to acquire the training speech data of the training speaker corresponding to the first basic model for the second language type, the target-speaker fundamental frequency parameter generation unit 72 is configured to train the second fundamental frequency parameter model according to that training speech data to generate the target-speaker fundamental frequency parameters corresponding to the target timbre, and the fundamental frequency parameter adjustment unit 73 is configured to adjust the fundamental frequency parameters of the second language type according to the target-speaker fundamental frequency parameters.
The speech synthesis apparatus of the embodiments of the present invention determines which language types the text to be synthesized contains, adaptively trains the spectral parameter model of each language type according to the target timbre, generates the corresponding spectral parameters with the adaptively trained spectral parameter models, and adjusts the generated fundamental frequency parameters of each language type according to the target timbre to obtain multilingual speech with a uniform timbre. It can be understood that the language basic models used are mostly models built from monolingual data, which reduces the data cost and implementation difficulty of model building, reduces the dependence of multilingual synthesis on professional multilingual speaker data, and makes it possible to synthesize multilingual text to be synthesized into natural, accurate multilingual speech with a uniform timbre, improving the user experience.
To implement the above embodiments, the present invention further provides a terminal, including: one or more processors; a memory; and one or more programs, the one or more programs being stored in the memory and, when executed by the one or more processors, performing the following operations:
S101': determining the language types to which the text information of the sentences to be synthesized belongs, where the language types include a first language type and a second language type.
S102': determining a first base model corresponding to the first language type and a second base model corresponding to the second language type, where the first base model includes a first spectral parameter model and a first fundamental frequency parameter model, and the second base model includes a second spectral parameter model and a second fundamental frequency parameter model.
S103': determining a target timbre, adaptively transforming the first spectral parameter model and the second spectral parameter model respectively according to the target timbre, and training the text information of the sentences to be synthesized according to the adaptively transformed first and second spectral parameter models, so as to generate the corresponding spectral parameters.
S104': training the text information of the sentences to be synthesized according to the first fundamental frequency parameter model and the second fundamental frequency parameter model, so as to generate the corresponding fundamental frequency parameters, and adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre.
S105': synthesizing the target speech on the basis of the spectral parameters of the first language type, the spectral parameters of the second language type, and the adjusted fundamental frequency parameters of the first language type and the second language type.
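Taken together, steps S101' to S105' can be sketched as the following end-to-end pipeline. This is only an illustrative outline reusing the helper functions from the earlier sketches (split_by_language, adjust_f0_curve); the model registry, the adapt()/generate() methods, the target_speaker_f0_stats() helper, and the vocoder interface are assumptions, not the patent's reference implementation.

    def synthesize(text, base_models, target_timbre, vocoder):
        spectral_params, f0_params = [], []
        for lang, fragment in split_by_language(text):                 # S101'
            model = base_models[lang]                                  # S102'
            spectral_model = model.spectral_model
            if model.training_speaker_timbre != target_timbre:
                spectral_model = spectral_model.adapt(target_timbre)   # S103': adaptive transform
            spectral_params.append(spectral_model.generate(fragment))
            f0 = model.f0_model.generate(fragment)                     # S104': F0 generation
            if model.training_speaker_timbre != target_timbre:
                stats = model.target_speaker_f0_stats(target_timbre)
                f0 = adjust_f0_curve(f0, stats["mean"], stats["var"])  # S104': F0 adjustment
            f0_params.append(f0)
        return vocoder.synthesize(spectral_params, f0_params)          # S105'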
To implement the above embodiments, the present invention further provides a storage medium for storing an application program, the application program being configured to execute the speech synthesis method described in any of the above embodiments of the present invention.
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features concerned. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless otherwise specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification, as well as the features of those different embodiments or examples.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved; this should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) having one or more wirings, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or, if necessary, processing it in another suitable manner, and then stored in a computer memory.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any of the following technologies known in the art, or a combination thereof, may be used: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by instructing relevant hardware through a program; the program may be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (12)

  1. A speech synthesis method, comprising the following steps:
    determining the language types to which text information of sentences to be synthesized belongs, wherein the language types comprise a first language type and a second language type;
    determining a first base model corresponding to the first language type and determining a second base model corresponding to the second language type, wherein the first base model comprises a first spectral parameter model and a first fundamental frequency parameter model, and the second base model comprises a second spectral parameter model and a second fundamental frequency parameter model;
    determining a target timbre, adaptively transforming the first spectral parameter model and the second spectral parameter model respectively according to the target timbre, and training the text information of the sentences to be synthesized according to the adaptively transformed first spectral parameter model and second spectral parameter model, so as to generate corresponding spectral parameters;
    training the text information of the sentences to be synthesized according to the first fundamental frequency parameter model and the second fundamental frequency parameter model, so as to generate corresponding fundamental frequency parameters, and adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre;
    synthesizing target speech on the basis of the spectral parameters of the first language type, the spectral parameters of the second language type, and the adjusted fundamental frequency parameters of the first language type and the second language type.
  2. The speech synthesis method according to claim 1, wherein determining the target timbre comprises:
    obtaining user information of a user and determining, according to the user information, the native-language type of the user, wherein the native-language type is one of the language types;
    taking, as the target timbre, the timbre of a training speaker corresponding to the base model of the user's native-language type.
  3. The speech synthesis method according to claim 1 or 2, wherein, when the target timbre is the timbre of a training speaker corresponding to the first base model,
    the adaptively transforming the first spectral parameter model and the second spectral parameter model respectively according to the target timbre comprises:
    adaptively transforming the second spectral parameter model according to the target timbre;
    the training the text information of the sentences to be synthesized according to the adaptively transformed first spectral parameter model and second spectral parameter model, so as to generate corresponding spectral parameters, comprises:
    training, according to the first spectral parameter model, the sentences to be synthesized that correspond to the first language type in the text information, so as to generate the spectral parameters of the first language type, and training, according to the adaptively transformed second spectral parameter model, the sentences to be synthesized that correspond to the second language type in the text information, so as to generate the spectral parameters of the second language type;
    the adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre comprises:
    adjusting the fundamental frequency parameters of the second language type according to the target timbre;
    the synthesizing target speech on the basis of the spectral parameters of the first language type, the spectral parameters of the second language type, and the adjusted fundamental frequency parameters of the first language type and the second language type comprises:
    synthesizing the target speech on the basis of the spectral parameters of the first language type, the spectral parameters of the second language type, the fundamental frequency parameters of the first language type, and the adjusted fundamental frequency parameters of the second language type.
  4. The speech synthesis method according to claim 3, wherein adaptively transforming the second spectral parameter model according to the target timbre comprises:
    obtaining training speech data of the training speaker corresponding to the first base model for the second language type;
    adaptively transforming the second spectral parameter model according to the training speech data of the training speaker corresponding to the first base model for the second language type.
  5. The speech synthesis method according to claim 3, wherein adjusting the fundamental frequency parameters of the second language type according to the target timbre comprises:
    obtaining training speech data of the training speaker corresponding to the first base model for the second language type;
    training the second fundamental frequency parameter model according to the training speech data of the training speaker corresponding to the first base model for the second language type, so as to generate target-speaker fundamental frequency parameters corresponding to the target timbre;
    adjusting the fundamental frequency parameters of the second language type according to the target-speaker fundamental frequency parameters.
  6. A speech synthesis device, comprising:
    a first determining module, configured to determine the language types to which text information of sentences to be synthesized belongs, wherein the language types comprise a first language type and a second language type;
    a second determining module, configured to determine a first base model corresponding to the first language type and a second base model corresponding to the second language type, wherein the first base model comprises a first spectral parameter model and a first fundamental frequency parameter model, and the second base model comprises a second spectral parameter model and a second fundamental frequency parameter model;
    a third determining module, configured to determine a target timbre;
    an adaptive transformation module, configured to adaptively transform the first spectral parameter model and the second spectral parameter model respectively according to the target timbre;
    a spectral parameter generation module, configured to train the text information of the sentences to be synthesized according to the adaptively transformed first spectral parameter model and second spectral parameter model, so as to generate corresponding spectral parameters;
    a fundamental frequency parameter generation module, configured to train the text information of the sentences to be synthesized according to the first fundamental frequency parameter model and the second fundamental frequency parameter model, so as to generate corresponding fundamental frequency parameters;
    a fundamental frequency parameter adjustment module, configured to adjust the fundamental frequency parameters of the first language type and the second language type according to the target timbre;
    a speech synthesis module, configured to synthesize target speech on the basis of the spectral parameters of the first language type, the spectral parameters of the second language type, and the adjusted fundamental frequency parameters of the first language type and the second language type.
  7. The speech synthesis device according to claim 6, wherein the third determining module comprises:
    a first determining unit, configured to obtain user information of a user and determine, according to the user information, the native-language type of the user, wherein the native-language type is one of the language types;
    a second determining unit, configured to take, as the target timbre, the timbre of a training speaker corresponding to the base model of the user's native-language type.
  8. The speech synthesis device according to claim 6 or 7, wherein, when the target timbre is the timbre of a training speaker corresponding to the first base model,
    the adaptive transformation module is further configured to adaptively transform the second spectral parameter model according to the target timbre;
    the spectral parameter generation module is further configured to train, according to the first spectral parameter model, the sentences to be synthesized that correspond to the first language type in the text information, so as to generate the spectral parameters of the first language type, and to train, according to the adaptively transformed second spectral parameter model, the sentences to be synthesized that correspond to the second language type in the text information, so as to generate the spectral parameters of the second language type;
    the fundamental frequency parameter adjustment module is further configured to adjust the fundamental frequency parameters of the second language type according to the target timbre;
    the speech synthesis module is further configured to synthesize the target speech on the basis of the spectral parameters of the first language type, the spectral parameters of the second language type, the fundamental frequency parameters of the first language type, and the adjusted fundamental frequency parameters of the second language type.
  9. The speech synthesis device according to claim 8, wherein the adaptive transformation module comprises:
    an obtaining unit, configured to obtain training speech data of the training speaker corresponding to the first base model for the second language type;
    an adaptive transformation unit, configured to adaptively transform the second spectral parameter model according to the training speech data of the training speaker corresponding to the first base model for the second language type.
  10. The speech synthesis device according to claim 8, wherein the fundamental frequency parameter adjustment module comprises:
    an obtaining unit, configured to obtain training speech data of the training speaker corresponding to the first base model for the second language type;
    a target-speaker fundamental frequency parameter generation unit, configured to train the second fundamental frequency parameter model according to the training speech data of the training speaker corresponding to the first base model for the second language type, so as to generate target-speaker fundamental frequency parameters corresponding to the target timbre;
    a fundamental frequency parameter adjustment unit, configured to adjust the fundamental frequency parameters of the second language type according to the target-speaker fundamental frequency parameters.
  11. A terminal, comprising:
    one or more processors;
    a memory;
    one or more programs, the one or more programs being stored in the memory and, when executed by the one or more processors, performing the following operations:
    determining the language types to which text information of sentences to be synthesized belongs, wherein the language types comprise a first language type and a second language type;
    determining a first base model corresponding to the first language type and a second base model corresponding to the second language type, wherein the first base model comprises a first spectral parameter model and a first fundamental frequency parameter model, and the second base model comprises a second spectral parameter model and a second fundamental frequency parameter model;
    determining a target timbre, adaptively transforming the first spectral parameter model and the second spectral parameter model respectively according to the target timbre, and training the text information of the sentences to be synthesized according to the adaptively transformed first spectral parameter model and second spectral parameter model, so as to generate corresponding spectral parameters;
    training the text information of the sentences to be synthesized according to the first fundamental frequency parameter model and the second fundamental frequency parameter model, so as to generate corresponding fundamental frequency parameters, and adjusting the fundamental frequency parameters of the first language type and the second language type according to the target timbre;
    synthesizing target speech on the basis of the spectral parameters of the first language type, the spectral parameters of the second language type, and the adjusted fundamental frequency parameters of the first language type and the second language type.
  12. A storage medium for storing an application program, the application program being configured to execute the speech synthesis method according to any one of claims 1 to 5.
PCT/CN2016/098126 2016-05-18 2016-09-05 语音合成方法和语音合成装置 WO2017197809A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/099,257 US10789938B2 (en) 2016-05-18 2016-09-05 Speech synthesis method terminal and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610329738.1A CN105845125B (zh) 2016-05-18 2016-05-18 语音合成方法和语音合成装置
CN201610329738.1 2016-05-18

Publications (1)

Publication Number Publication Date
WO2017197809A1 true WO2017197809A1 (zh) 2017-11-23

Family

ID=56592862

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/098126 WO2017197809A1 (zh) 2016-05-18 2016-09-05 语音合成方法和语音合成装置

Country Status (3)

Country Link
US (1) US10789938B2 (zh)
CN (1) CN105845125B (zh)
WO (1) WO2017197809A1 (zh)


Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105845125B (zh) 2016-05-18 2019-05-03 百度在线网络技术(北京)有限公司 语音合成方法和语音合成装置
CN106856091A (zh) * 2016-12-21 2017-06-16 北京智能管家科技有限公司 一种多语言文本的自动播报方法及***
CN107452369B (zh) * 2017-09-28 2021-03-19 百度在线网络技术(北京)有限公司 语音合成模型生成方法和装置
CN110164445B (zh) * 2018-02-13 2023-06-16 阿里巴巴集团控股有限公司 语音识别方法、装置、设备及计算机存储介质
CN110399547B (zh) * 2018-04-17 2022-03-04 百度在线网络技术(北京)有限公司 用于更新模型参数的方法、装置、设备和存储介质
CN108831437B (zh) * 2018-06-15 2020-09-01 百度在线网络技术(北京)有限公司 一种歌声生成方法、装置、终端和存储介质
WO2020060151A1 (en) * 2018-09-19 2020-03-26 Samsung Electronics Co., Ltd. System and method for providing voice assistant service
CN110459201B (zh) * 2019-08-22 2022-01-07 云知声智能科技股份有限公司 一种产生新音色的语音合成方法
CN112767910B (zh) * 2020-05-13 2024-06-18 腾讯科技(深圳)有限公司 音频信息合成方法、装置、计算机可读介质及电子设备
CN111667814B (zh) * 2020-05-26 2023-09-12 北京声智科技有限公司 一种多语种的语音合成方法及装置
CN112164407B (zh) * 2020-09-22 2024-06-18 腾讯音乐娱乐科技(深圳)有限公司 音色转换方法及装置
CN112581933B (zh) * 2020-11-18 2022-05-03 北京百度网讯科技有限公司 语音合成模型获取方法、装置、电子设备及存储介质
CN112530406A (zh) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 一种语音合成方法、语音合成装置及智能设备
CN112652318B (zh) * 2020-12-21 2024-03-29 北京捷通华声科技股份有限公司 音色转换方法、装置及电子设备
CN112652294B (zh) * 2020-12-25 2023-10-24 深圳追一科技有限公司 语音合成方法、装置、计算机设备和存储介质
CN112667865A (zh) * 2020-12-29 2021-04-16 西安掌上盛唐网络信息有限公司 中英混合语音合成技术在汉语言教学中的应用的方法及***
CN112992117B (zh) * 2021-02-26 2023-05-26 平安科技(深圳)有限公司 多语言语音模型生成方法、装置、计算机设备及存储介质
CN115910021A (zh) * 2021-09-22 2023-04-04 脸萌有限公司 语音合成方法、装置、电子设备及可读存储介质


Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4517045B2 (ja) * 2005-04-01 2010-08-04 独立行政法人産業技術総合研究所 音高推定方法及び装置並びに音高推定用プラグラム
CN1835074B (zh) * 2006-04-07 2010-05-12 安徽中科大讯飞信息科技有限公司 一种结合高层描述信息和模型自适应的说话人转换方法
US9798653B1 (en) * 2010-05-05 2017-10-24 Nuance Communications, Inc. Methods, apparatus and data structure for cross-language speech adaptation
CN102513069B (zh) * 2011-12-16 2013-07-17 浙江农林大学 分等级结构的孔配位聚合物吸附材料的生产方法
US9786296B2 (en) * 2013-07-08 2017-10-10 Qualcomm Incorporated Method and apparatus for assigning keyword model to voice operated function
US10137902B2 (en) * 2015-02-12 2018-11-27 Harman International Industries, Incorporated Adaptive interactive voice system
US9865251B2 (en) * 2015-07-21 2018-01-09 Asustek Computer Inc. Text-to-speech method and multi-lingual speech synthesizer using the method
CN105529023B (zh) * 2016-01-25 2019-09-03 百度在线网络技术(北京)有限公司 语音合成方法和装置
US10319365B1 (en) * 2016-06-27 2019-06-11 Amazon Technologies, Inc. Text-to-speech processing with emphasized output audio
CN116319631A (zh) * 2017-04-07 2023-06-23 微软技术许可有限责任公司 自动聊天中的语音转发
TW202009924A (zh) * 2018-08-16 2020-03-01 國立臺灣科技大學 音色可選之人聲播放系統、其播放方法及電腦可讀取記錄媒體

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008139631A (ja) * 2006-12-04 2008-06-19 Nippon Telegr & Teleph Corp <Ntt> 音声合成方法、装置、プログラム
WO2009014465A2 (en) * 2007-07-25 2009-01-29 Slobodan Jovicic System and method for multilingual translation of communicative speech
CN102360543A (zh) * 2007-08-20 2012-02-22 微软公司 基于hmm的双语(普通话-英语)tts技术
JP2011237886A (ja) * 2010-05-06 2011-11-24 Canon Inc 情報処理装置およびその制御方法
CN102543069A (zh) * 2010-12-30 2012-07-04 财团法人工业技术研究院 多语言的文字转语音合成***与方法
CN105845125A (zh) * 2016-05-18 2016-08-10 百度在线网络技术(北京)有限公司 语音合成方法和语音合成装置

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110164414A (zh) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 语音处理方法、装置及智能设备
CN110164414B (zh) * 2018-11-30 2023-02-14 腾讯科技(深圳)有限公司 语音处理方法、装置及智能设备
CN110610720A (zh) * 2019-09-19 2019-12-24 北京搜狗科技发展有限公司 一种数据处理方法、装置和用于数据处理的装置
CN110610720B (zh) * 2019-09-19 2022-02-25 北京搜狗科技发展有限公司 一种数据处理方法、装置和用于数据处理的装置
CN113539233A (zh) * 2020-04-16 2021-10-22 北京搜狗科技发展有限公司 一种语音处理方法、装置和电子设备
CN111986646A (zh) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 一种基于小语料库的方言合成方法及***
CN111986646B (zh) * 2020-08-17 2023-12-15 云知声智能科技股份有限公司 一种基于小语料库的方言合成方法及***
CN113327575A (zh) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 一种语音合成方法、装置、计算机设备和存储介质
CN113327575B (zh) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 一种语音合成方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN105845125A (zh) 2016-08-10
US10789938B2 (en) 2020-09-29
CN105845125B (zh) 2019-05-03
US20190213995A1 (en) 2019-07-11

Similar Documents

Publication Publication Date Title
WO2017197809A1 (zh) 语音合成方法和语音合成装置
JP7280386B2 (ja) 多言語音声合成およびクロスランゲージボイスクローニング
US6556972B1 (en) Method and apparatus for time-synchronized translation and synthesis of natural-language speech
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
WO2017067206A1 (zh) 个性化多声学模型的训练方法、语音合成方法及装置
CN108831437B (zh) 一种歌声生成方法、装置、终端和存储介质
US6859778B1 (en) Method and apparatus for translating natural-language speech using multiple output phrases
US20230206897A1 (en) Electronic apparatus and method for controlling thereof
CN105957515B (zh) 声音合成方法、声音合成装置和存储声音合成程序的介质
US7415413B2 (en) Methods for conveying synthetic speech style from a text-to-speech system
US20090177473A1 (en) Applying vocal characteristics from a target speaker to a source speaker for synthetic speech
US10224021B2 (en) Method, apparatus and program capable of outputting response perceivable to a user as natural-sounding
CN113870833A (zh) 语音合成相关***、方法、装置及设备
WO2023197206A1 (en) Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models
CN113421571B (zh) 一种语音转换方法、装置、电子设备和存储介质
CN114333758A (zh) 语音合成方法、装置、计算机设备、存储介质和产品
Wutiwiwatchai et al. Accent level adjustment in bilingual Thai-English text-to-speech synthesis
CN115700871A (zh) 模型训练和语音合成方法、装置、设备及介质
CN114446304A (zh) 语音交互方法、数据处理方法、装置和电子设备
TWI725608B (zh) 語音合成系統、方法及非暫態電腦可讀取媒體
WO2018179209A1 (ja) 電子機器、音声制御方法、およびプログラム
KR20180103273A (ko) 음성 합성 장치 및 음성 합성 방법
US11335321B2 (en) Building a text-to-speech system from a small amount of speech data
Yoon et al. Enhancing Multilingual TTS with Voice Conversion Based Data Augmentation and Posterior Embedding
KR20230138710A (ko) 발화자의 음성 특징을 반영한 음성 번역 방법 및 시스템

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16902189

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16902189

Country of ref document: EP

Kind code of ref document: A1