WO2017067206A1 - Training method for personalized multiple acoustic models, speech synthesis method, and apparatus - Google Patents

Training method for personalized multiple acoustic models, speech synthesis method, and apparatus

Info

Publication number
WO2017067206A1
WO2017067206A1 (PCT/CN2016/087321)
Authority
WO
WIPO (PCT)
Prior art keywords
text
synthesized
acoustic
target user
data
Prior art date
Application number
PCT/CN2016/087321
Other languages
English (en)
French (fr)
Inventor
李秀林
Original Assignee
百度在线网络技术(北京)有限公司
Priority date
Filing date
Publication date
Application filed by 百度在线网络技术(北京)有限公司 filed Critical 百度在线网络技术(北京)有限公司
Priority to US 15/758,280 (US10410621B2)
Publication of WO2017067206A1

Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/1807 Speech classification or search using natural language modelling using prosody or stress
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L2015/0631 Creating reference templates; Clustering

Definitions

  • The present invention relates to the field of speech technology, and in particular to a training method for personalized multiple acoustic models for speech synthesis, a speech synthesis method, and corresponding apparatus.
  • Speech synthesis, also known as Text-to-Speech (TTS), is a technique that converts textual information into speech and reads it aloud. It involves many disciplines such as acoustics, linguistics, digital signal processing, and computer science, and is a cutting-edge technology in the field of Chinese information processing. The main problem it solves is how to convert text information into audible sound information.
  • The process of converting text information into sound information is as follows: first, the input text is processed, including preprocessing, word segmentation, part-of-speech tagging, polyphonic word prediction, prosodic level prediction, and so on; then an acoustic model predicts the acoustic features corresponding to each unit; finally, the sound is synthesized directly by a vocoder using the acoustic parameters, or units are selected from a recording corpus and spliced together to generate the sound information corresponding to the text.
  • The acoustic model is one of the foundations of the entire speech synthesis system, and it is usually obtained by training on large-scale speech data.
  • The process of training the acoustic model is as follows: first, a certain amount of recorded text corpus is designed to meet requirements such as phone coverage and prosody coverage; second, a suitable speaker is selected, and the speaker records the corresponding speech data; next, the text, pinyin, prosody, and unit boundaries are annotated, and the annotated data is used for model training and voice library generation. It can be seen that the process of training an acoustic model is relatively complicated and the cycle is relatively long, and the training is based on the speech data of a fixed speaker; therefore, when speech is synthesized through that acoustic model, the timbre of the synthesized speech is fixed.
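  • To make the annotation step above concrete, the sketch below shows one possible in-memory representation of a single labelled training utterance (text, pinyin, prosodic levels, unit boundaries, and the extracted acoustic features). The field names, the example sentence, and its labels are illustrative assumptions, not a format prescribed by this document.

```python
# A minimal sketch of one labelled training utterance; all field names and the
# example labels are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class LabelledUtterance:
    text: str                                   # recorded text
    pinyin: List[str]                           # pronunciation of each syllable
    prosody_levels: List[int]                   # prosodic break level after each syllable
    unit_boundaries: List[Tuple[float, float]]  # start/end time (seconds) of each unit
    duration: Optional[np.ndarray] = None       # per-unit duration features
    spectrum: Optional[np.ndarray] = None       # per-frame spectral features
    f0: Optional[np.ndarray] = None             # per-frame fundamental frequency

utt = LabelledUtterance(
    text="我们一家人去上海",                      # "Our family goes to Shanghai"
    pinyin=["wo3", "men5", "yi4", "jia1", "ren2", "qu4", "shang4", "hai3"],
    prosody_levels=[0, 1, 0, 0, 2, 0, 0, 3],
    unit_boundaries=[(0.00, 0.18), (0.18, 0.32), (0.32, 0.50), (0.50, 0.66),
                     (0.66, 0.90), (0.90, 1.05), (1.05, 1.24), (1.24, 1.50)],
)
print(utt.text, utt.pinyin)
```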
  • the ways to obtain a personalized acoustic model in the related art mainly include the following two methods:
  • In the first way, parallel or non-parallel corpora are used to train, at the acoustic parameter level, the personalized acoustic model required by the user.
  • the second way is to use the mapping between models to achieve the conversion between the reference acoustic model and the personalized acoustic model.
  • Specifically, HMM-GMM (Hidden Markov Model and Gaussian Mixture Model) modeling is used, and a mapping between decision trees is performed to generate a personalized acoustic model.
  • However, the decision tree is a shallow model with limited descriptive ability; especially when the amount of the user's voice data is relatively small, the accuracy of the generated personalized acoustic model is not high. The predicted parameters may therefore be inconsistent, which can cause the synthesized speech to jump and the timbre to be unstable, making the speech unnatural.
  • the present invention aims to solve at least one of the technical problems in the related art to some extent.
  • A first object of the present invention is to propose a training method for personalized multiple acoustic models for speech synthesis that reduces the required amount of the target user's speech data when training the target user's acoustic models, so that multiple personalized acoustic models containing the voice characteristics of the target user can be trained, thereby meeting personalized speech requirements and improving the user experience.
  • a second object of the present invention is to propose a speech synthesis method.
  • a third object of the present invention is to propose a speech synthesis method.
  • a fourth object of the present invention is to propose a training apparatus for a personalized multi-acoustic model for speech synthesis.
  • a fifth object of the present invention is to provide a speech synthesis apparatus.
  • a sixth object of the present invention is to provide a speech synthesis apparatus.
  • A first aspect of the present invention provides a training method for personalized multiple acoustic models for speech synthesis, comprising: training a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data; acquiring speech data of a target user; training a first target user acoustic model according to the reference acoustic model and the speech data; generating second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and training a second target user acoustic model based on the first text annotation data and the second acoustic feature data.
  • In this training method, the reference acoustic model is first trained based on the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data; the voice data of the target user is then acquired, and the first target user acoustic model is trained based on the reference acoustic model and that voice data; second acoustic feature data of the first text annotation data is then generated according to the first target user acoustic model and the first text annotation data; and finally the second target user acoustic model is trained based on the first text annotation data and the second acoustic feature data. In the process of training the target user's acoustic models, the required scale of the target user's voice data is therefore reduced, for example from several thousand sentences to a few hundred or even dozens of sentences; that is, a small amount of user voice data is enough to train multiple personalized acoustic models containing the voice characteristics of the target user.
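  • As a compact, end-to-end illustration of the five training steps summarized above, the sketch below uses ordinary least-squares linear models and random arrays as stand-ins for the neural-network reference model, the HMM-based second model, and the real corpora; only the data flow between the steps is meant to match the method, and every model and number is an assumption made for illustration.

```python
# Sketch of the five-step training flow with linear least-squares models and
# random data standing in for the real neural-network/HMM models and corpora.
import numpy as np

rng = np.random.default_rng(0)
TEXT_DIM, ACOUSTIC_DIM = 40, 60                    # feature sizes (illustrative)

def fit_linear(X, Y):
    """Least-squares mapping from text-annotation features X to acoustic features Y."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Step 1: reference acoustic model from large-scale non-target-user data.
X_ref = rng.normal(size=(5000, TEXT_DIM))          # first text annotation data
Y_ref = rng.normal(size=(5000, ACOUSTIC_DIM))      # first acoustic feature data
W_ref = fit_linear(X_ref, Y_ref)

# Step 2: a small amount of target-user speech (a few dozen sentences).
X_usr = rng.normal(size=(50, TEXT_DIM))            # second text annotation data
Y_usr = rng.normal(size=(50, ACOUSTIC_DIM))        # third acoustic feature data

# Step 3: adapt the reference model toward the target user (a crude
# interpolation here; the document describes LSTM-based adaptive updating).
W_target1 = 0.8 * W_ref + 0.2 * fit_linear(X_usr, Y_usr)   # first target user model

# Step 4: run the large annotation set through the adapted model, producing
# large-scale acoustic features that carry the target user's characteristics.
Y_second = X_ref @ W_target1                       # second acoustic feature data

# Step 5: train the second target user acoustic model on (X_ref, Y_second);
# a linear fit stands in for the HMM-based model used in the document.
W_target2 = fit_linear(X_ref, Y_second)
print(W_target1.shape, W_target2.shape)
```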
  • A second aspect of the present invention provides a method for performing speech synthesis using the first target user acoustic model according to the first aspect of the present invention, comprising: acquiring a text to be synthesized, and performing word segmentation on the text to be synthesized; performing part-of-speech tagging on the segmented text, and performing prosody prediction on the tagged text through a prosody prediction model to generate prosodic features of the text to be synthesized; performing phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized; inputting the phonetic result, the prosodic features, and the context features of the text to be synthesized into the first target user acoustic model, and performing acoustic prediction on the text through the first target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  • In the speech synthesis method of this embodiment, the text to be synthesized is first acquired and segmented into words; part-of-speech tagging is then performed on the segmented text, and prosody prediction is performed on the tagged text through a prosody prediction model to generate the prosodic features of the text to be synthesized; the text is then phonetically transcribed according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the phonetic result, the prosodic features, and the context features of the text to be synthesized are input into the first target user acoustic model, which performs acoustic prediction to generate the acoustic parameter sequence of the text, and finally the speech synthesis result is generated from that sequence. As a result, the speech synthesis result produced by the speech synthesis system contains the voice characteristics of the target user, which satisfies the user's need to generate personalized speech and improves the user experience.
  • A third aspect of the present invention provides a method for performing speech synthesis using the second target user acoustic model according to the first aspect of the present invention, comprising: acquiring a text to be synthesized, and performing word segmentation on the text to be synthesized; performing part-of-speech tagging on the segmented text, and performing prosody prediction on the tagged text through a prosody prediction model to generate prosodic features of the text to be synthesized; performing phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized; inputting the phonetic result, the prosodic features, and the context features of the text to be synthesized into the second target user acoustic model, and performing acoustic prediction on the text through the second target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  • In the speech synthesis method of this embodiment, the text to be synthesized is first acquired and segmented into words; part-of-speech tagging is then performed on the segmented text, and prosody prediction is performed on the tagged text through a prosody prediction model to generate the prosodic features of the text to be synthesized; the text is then phonetically transcribed according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the phonetic result, the prosodic features, and the context features of the text to be synthesized are input into the second target user acoustic model, which performs acoustic prediction to generate the acoustic parameter sequence of the text, and finally the speech synthesis result of the text is generated from that sequence. As a result, the speech synthesis result produced by the speech synthesis system contains the voice characteristics of the target user, which satisfies the user's need to generate personalized speech and improves the user experience.
  • A fourth aspect of the present invention provides a training apparatus for personalized multiple acoustic models for speech synthesis, comprising: a first model training module, configured to train a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data; an acquisition module, configured to acquire speech data of a target user; a second model training module, configured to train a first target user acoustic model according to the reference acoustic model and the speech data; a generating module, configured to generate second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and a third model training module, configured to train a second target user acoustic model based on the first text annotation data and the second acoustic feature data.
  • In this training apparatus, the first model training module trains the reference acoustic model based on the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data; the acquisition module acquires the voice data of the target user; the second model training module trains the first target user acoustic model according to the reference acoustic model and the voice data; the generating module then generates second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and the third model training module trains the second target user acoustic model based on the first text annotation data and the second acoustic feature data. In the process of training the target user's acoustic models, the required scale of the target user's voice data is therefore reduced, for example from several thousand sentences to a few hundred or even dozens of sentences; that is, multiple personalized acoustic models containing the voice characteristics of the target user can be trained with a small amount of user voice data.
  • An embodiment of the fifth aspect of the present invention provides an apparatus for performing speech synthesis using the first target user acoustic model according to the embodiment of the fourth aspect of the present invention, comprising: an acquisition module, configured to acquire a text to be synthesized; a word segmentation module, configured to perform word segmentation on the text to be synthesized; a part-of-speech tagging module, configured to perform part-of-speech tagging on the segmented text; a prosody prediction module, configured to perform prosody prediction on the tagged text through a prosody prediction model to generate prosodic features of the text to be synthesized; a phonetic module, configured to perform phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized; an acoustic prediction module, configured to input the phonetic result, the prosodic features, and the context features of the text to be synthesized into the first target user acoustic model, and perform acoustic prediction on the text through the first target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and a speech synthesis module, configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  • In this apparatus, the acquisition module first acquires the text to be synthesized; the word segmentation module then segments the text; the part-of-speech tagging module tags the segmented text; the prosody prediction module performs prosody prediction on the tagged text through the prosody prediction model to generate the prosodic features of the text to be synthesized; the phonetic module then transcribes the text according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the acoustic prediction module inputs the phonetic result, the prosodic features, and the context features of the text to be synthesized into the first target user acoustic model and performs acoustic prediction through that model to generate the acoustic parameter sequence of the text; and finally the speech synthesis result of the text to be synthesized is generated according to the acoustic parameter sequence. The speech synthesis result produced by the speech synthesis system therefore contains the voice characteristics of the target user and satisfies the user's need to generate personalized speech.
  • A sixth aspect of the present invention provides an apparatus for performing speech synthesis using the second target user acoustic model according to the fourth aspect of the present invention, comprising: an acquisition module, configured to acquire a text to be synthesized; a word segmentation module, configured to perform word segmentation on the text to be synthesized; a part-of-speech tagging module, configured to perform part-of-speech tagging on the segmented text; a prosody prediction module, configured to perform prosody prediction on the tagged text through a prosody prediction model to generate prosodic features of the text to be synthesized; a phonetic module, configured to perform phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized; an acoustic prediction module, configured to input the phonetic result, the prosodic features, and the context features of the text to be synthesized into the second target user acoustic model, and perform acoustic prediction on the text through the second target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and a speech synthesis module, configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  • In this apparatus, the acquisition module first acquires the text to be synthesized; the word segmentation module then segments the text; the part-of-speech tagging module tags the segmented text; the prosody prediction module performs prosody prediction on the tagged text through the prosody prediction model to generate the prosodic features of the text to be synthesized; the phonetic module then transcribes the text according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the acoustic prediction module inputs the phonetic result, the prosodic features, and the context features of the text to be synthesized into the second target user acoustic model and performs acoustic prediction through that model to generate the acoustic parameter sequence of the text; and finally the speech synthesis result of the text to be synthesized is generated according to the acoustic parameter sequence. The speech synthesis result produced by the speech synthesis system therefore contains the voice characteristics of the target user and satisfies the user's need to generate personalized speech.
  • FIG. 1 is a flow chart of a training method for a personalized multi-acoustic model for speech synthesis, in accordance with one embodiment of the present invention.
  • FIG. 2 is a detailed flowchart of step S13.
  • FIG. 3 is a flow chart of a speech synthesis method in accordance with one embodiment of the present invention.
  • FIG. 4 is a flow chart of a speech synthesis method in accordance with another embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a training apparatus for a personalized multi-acoustic model for speech synthesis according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a training apparatus for a personalized multi-acoustic model for speech synthesis according to another embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a speech synthesis apparatus according to another embodiment of the present invention.
  • FIG. 1 is a flow chart of a training method for a personalized multi-acoustic model for speech synthesis, in accordance with one embodiment of the present invention.
  • the training method for the personalized multi-acoustic model for speech synthesis includes:
  • A certain amount of recorded text corpus can be designed first, and a suitable speaker is then selected, so as to obtain large-scale training voice data from non-target users; the first acoustic feature data of the training voice data is extracted, and the recorded text corpus corresponding to the training voice data is annotated to obtain the first text annotation data of the training voice data.
  • the first acoustic feature data includes acoustic features such as duration, spectrum, and fundamental frequency.
  • The first text annotation data includes text features such as pinyin and prosodic level annotation.
  • After the first acoustic feature data of the training voice data and the corresponding first text annotation data are obtained, the first acoustic feature data and the first text annotation data may be trained through a neural network, and the reference acoustic model is generated according to the training result.
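  • A minimal sketch of such a reference-model training step is shown below in PyTorch; the bidirectional LSTM structure is suggested by the adaptation technique mentioned later in this document, while the layer sizes, training loop, and the random tensors standing in for the annotated corpus are purely illustrative assumptions.

```python
# Sketch: train a reference acoustic model mapping text-annotation features
# (pinyin/prosody context, here random tensors) to acoustic features
# (duration/spectrum/F0). Sizes and data are illustrative assumptions.
import torch
import torch.nn as nn

TEXT_DIM, ACOUSTIC_DIM, HIDDEN = 128, 75, 256

class AcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(TEXT_DIM, HIDDEN, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * HIDDEN, ACOUSTIC_DIM)

    def forward(self, x):                  # x: (batch, frames, TEXT_DIM)
        h, _ = self.lstm(x)
        return self.out(h)                 # (batch, frames, ACOUSTIC_DIM)

# Stand-ins for the first text annotation data and first acoustic feature data.
text_feats = torch.randn(32, 200, TEXT_DIM)
acoustic_feats = torch.randn(32, 200, ACOUSTIC_DIM)

model = AcousticModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for step in range(5):                      # the real corpus would need far more training
    loss = loss_fn(model(text_feats), acoustic_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "reference_acoustic_model.pt")
```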
  • the voice data includes the voice characteristics of the target user.
  • The manner of acquiring the target user's voice data may be selected according to requirements; for example, the voice data of the target user may be obtained by live recording, or the target user's existing voice data may be used directly.
  • Generally, the recording text is pre-designed according to phone coverage, prosody coverage, and similar indicators, and is provided to the target user to read aloud in order to obtain the target user's voice data.
  • When designing the recording text, for example for Chinese, it is preferable that the text contain all the initials and finals, so as to improve the accuracy of the subsequent model training.
  • In the present invention, the user equipment may directly record the voice data of the target user and perform the subsequent operations, or the user equipment may record the voice data of the target user and then send it to a network device, which performs the subsequent operations.
  • the foregoing user equipment may be a hardware device having various operating systems, such as a computer, a smart phone, and a tablet computer.
  • The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud, based on cloud computing, composed of a large number of computers or network servers.
  • Preferably, the voice data of the target user is saved in real time; if the target user cannot finish recording all of the voice data at one time, the currently recorded voice data may be kept, and the remaining unrecorded voice data can be completed at the next recording session.
  • On the basis of the reference acoustic model, the target user's voice data can be used, through an adaptive technique such as an LSTM (Long Short-Term Memory) neural network structure or a bidirectional LSTM neural network structure, to train the first target user acoustic model, so that the reference acoustic model is adaptively updated to the first target user acoustic model.
  • The reference acoustic model is trained on large-scale training speech data through a neural network structure, so it has good phone coverage and prosody coverage and can describe a variety of speech phenomena; moreover, the reference acoustic model has already established the framework of the model, such as the multi-layer neural network structure and the neuron connection relationships. Therefore, when training the first target user acoustic model, only a small amount of the target user's voice data needs to be acquired.
  • the first target user acoustic model can be obtained by adaptive training update, so that the first target user acoustic model not only has the general information in the reference acoustic model, but also has the voice characteristics of the target user.
  • the process of training the first target user acoustic model according to the reference acoustic model and the voice data, as shown in FIG. 2 may include:
  • Before acoustic feature extraction is performed on the voice data, the method further includes: performing data denoising, data detection, data filtering, and segmentation on the voice data of the target user, for example filtering out blank data segments, so as to improve the accuracy of the voice data used to train the first target user acoustic model.
  • acoustic characteristics such as duration, spectrum, and fundamental frequency can be extracted from the voice data of the target user.
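  • One possible toolkit for this extraction step is the WORLD vocoder through the pyworld package; the choice of toolkit and the synthetic test tone standing in for a recorded sentence are assumptions made only for illustration.

```python
# Sketch: extract fundamental frequency (F0), spectral envelope and
# aperiodicity from speech using the WORLD analysis functions in pyworld
# (one possible tool choice, not mandated by this document).
import numpy as np
import pyworld as pw

fs = 16000
# Stand-in for one recorded sentence from the target user: a 1-second tone.
t = np.arange(fs) / fs
speech = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float64)

f0, time_axis = pw.harvest(speech, fs)                 # F0 contour
spectrum = pw.cheaptrick(speech, f0, time_axis, fs)    # spectral envelope
aperiodicity = pw.d4c(speech, f0, time_axis, fs)       # aperiodicity

# These per-frame features, plus unit durations from the labelled boundaries,
# make up the "third acoustic feature data" of the target user.
print(f0.shape, spectrum.shape, aperiodicity.shape)
```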
  • S132 Perform voice annotation on the voice data to obtain second text annotation data of the voice data.
  • the voice data may be voice-marked by an automatic identification method or a manual labeling method to obtain second text annotation data of the voice data.
  • The second text annotation data includes text feature data such as pinyin and prosodic level annotation.
  • the neural network structure of the reference acoustic model is obtained first, and then the first target user acoustic model is trained according to the third acoustic feature data, the second text annotation data, and the neural network structure of the reference acoustic model.
  • Specifically, a neural network adaptive technique is used to perform iterative operations that update parameters such as the connection weights of the neurons in the neural network structure of the reference acoustic model, so as to obtain a first target user acoustic model having the voice characteristics of the target user.
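  • A minimal sketch of this adaptive update is given below: the reference weights are reloaded into a network of the same structure and then refined for a few low-learning-rate iterations on the target user's small data set. The network definition, sizes, and random tensors are the same illustrative assumptions as in the earlier reference-model sketch.

```python
# Sketch: adaptively update the reference model into the first target user
# acoustic model by fine-tuning on a small amount of target-user data.
import torch
import torch.nn as nn

TEXT_DIM, ACOUSTIC_DIM, HIDDEN = 128, 75, 256

class AcousticModel(nn.Module):            # same structure as the reference model
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(TEXT_DIM, HIDDEN, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * HIDDEN, ACOUSTIC_DIM)
    def forward(self, x):
        h, _ = self.lstm(x)
        return self.out(h)

model = AcousticModel()
# Weights saved by the reference-model sketch above.
model.load_state_dict(torch.load("reference_acoustic_model.pt"))

# Stand-ins for the target user's second text annotation data and
# third acoustic feature data (only a few dozen sentences).
user_text = torch.randn(40, 200, TEXT_DIM)
user_acoustic = torch.randn(40, 200, ACOUSTIC_DIM)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # small step size
loss_fn = nn.MSELoss()
for step in range(5):                       # a short adaptation pass
    loss = loss_fn(model(user_text), user_acoustic)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "first_target_user_model.pt")
```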
  • The first text annotation data used for the reference acoustic model may also be input into the first target user acoustic model, and second acoustic feature data corresponding to the first text annotation data is generated. Large-scale acoustic feature data having the speech characteristics of the target user is thereby obtained.
  • the second acoustic feature data includes acoustic features such as duration, spectrum, and fundamental frequency.
  • The first text annotation data and the second acoustic feature data are then used for training, and a second target user acoustic model is established according to the training result.
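  • The sketch below illustrates these last two steps: the first text annotation data is run through the adapted model to produce the "second acoustic feature data", and a much lighter model is then fitted on the resulting pairs. A linear layer stands in for the adapted LSTM network, and per-context Gaussian statistics stand in for the HMM-based model; both are simplifying assumptions.

```python
# Sketch: generate second acoustic feature data with the adapted model, then
# fit a simplified second target user acoustic model on the generated pairs.
import torch
import torch.nn as nn

TEXT_DIM, ACOUSTIC_DIM = 128, 75
# Stand-in for the adapted first target user acoustic model (in practice the
# fine-tuned network from the previous sketch would be loaded here).
first_target_model = nn.Linear(TEXT_DIM, ACOUSTIC_DIM)
first_target_model.eval()

# Stand-in for the large first text annotation data, grouped into 30 context
# classes (e.g. phone-in-context) with 100 frames each.
contexts = torch.randn(30, 100, TEXT_DIM)
with torch.no_grad():
    second_acoustic = first_target_model(contexts)   # "second acoustic feature data"

# Simplified second target user acoustic model: one Gaussian (mean/variance)
# per context class, standing in for HMM state emission distributions.
feats = second_acoustic.numpy()
second_model = {c: (feats[c].mean(axis=0), feats[c].var(axis=0) + 1e-6)
                for c in range(feats.shape[0])}
print(len(second_model), second_model[0][0].shape)
```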
  • the second target user acoustic model obtained by the training can better describe the personalized sound characteristics in different contexts.
  • The second target user acoustic model can cover a wider range of linguistic phenomena than an HMM acoustic model obtained by training directly on the target user's speech data.
  • In addition, since the amount of computation it requires when applied in a speech synthesis system is much smaller than prediction based on an LSTM or bidirectional LSTM neural network, it is very suitable for devices with lower computing power.
  • In the training method for personalized multiple acoustic models for speech synthesis, the reference acoustic model is first trained based on the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data; the voice data of the target user is then obtained, and the first target user acoustic model is trained according to the reference acoustic model and that voice data; second acoustic feature data of the first text annotation data is then generated according to the first target user acoustic model and the first text annotation data; and finally the second target user acoustic model is trained based on the first text annotation data and the second acoustic feature data. The requirement on the size of the target user's voice data during training is thereby reduced, for example from thousands of sentences to hundreds or even dozens of sentences, and a small amount of user voice data suffices to train multiple personalized acoustic models containing the voice characteristics of the target user.
  • The plurality of acoustic models obtained by the above training can be applied to a speech synthesis system.
  • In the speech synthesis system, acoustic models are one of the foundations of the entire system. Therefore, after the plurality of acoustic models of the target user are generated by the training method of the embodiments of the present invention, they can be applied to the speech synthesis system; the user can then select, according to the condition of the device or the user's own preference, the personalized acoustic model to be used in the speech synthesis system, and the system performs speech synthesis based on the acoustic model selected by the user. To this end, the present invention also proposes a speech synthesis method.
  • FIG. 3 is a flow chart of a speech synthesis method in accordance with one embodiment of the present invention.
  • the user selects speech synthesis using the first target user acoustic model.
  • the first target user acoustic model used in the speech synthesis method of the embodiment of the present invention is generated by the training method of the personalized multi-acoustic model for speech synthesis of the foregoing embodiment.
  • the speech synthesis method may include:
  • S301: Acquire a text to be synthesized, and perform word segmentation on the text to be synthesized.
  • S303: Perform phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized.
  • For example, suppose the text to be synthesized is "Our family goes to Shanghai". The literal features and part-of-speech features of the text to be synthesized are extracted, and a phonetic dictionary then processes the text according to the prosodic, literal, and part-of-speech features to generate the phonetic result of the text to be synthesized.
  • the phonetic result, the prosody feature, and the context information of the text to be synthesized may be input into the acoustic prediction model, so that the synthesized text is acoustically predicted to generate a sequence of acoustic parameters such as a corresponding duration, spectrum, and fundamental frequency.
  • the vocoder can be used to synthesize a speech signal based on the acoustic parameter sequence to generate a final speech synthesis result.
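  • Putting the synthesis steps above together, the sketch below walks the example sentence through a toy front end (hand-written segmentation, part-of-speech, prosody and pinyin tables), a stand-in acoustic model, and a comment marking where the vocoder would run. The tables, the Chinese rendering of the example sentence, and all dimensions are illustrative assumptions.

```python
# Sketch of the synthesis flow: toy front end -> acoustic prediction -> vocoder.
import torch
import torch.nn as nn

text = "我们一家人去上海"                    # "Our family goes to Shanghai" (assumed rendering)
# Toy front-end results; in a real system these come from trained models.
words = ["我们", "一家人", "去", "上海"]     # word segmentation
pos_tags = ["r", "n", "v", "ns"]            # part-of-speech tagging
prosody = [1, 2, 0, 3]                      # prosodic break level after each word
lexicon = {"我们": ["wo3", "men5"], "一家人": ["yi4", "jia1", "ren2"],
           "去": ["qu4"], "上海": ["shang4", "hai3"]}
phonetics = [syl for w in words for syl in lexicon[w]]   # phonetic (pinyin) result

# Encode phonetic result + prosody + context into per-frame feature vectors
# (random frames stand in for the real feature encoding).
TEXT_DIM, ACOUSTIC_DIM = 128, 75
frames = torch.randn(1, 20 * len(phonetics), TEXT_DIM)

# Stand-in for the selected target user acoustic model.
acoustic_model = nn.Linear(TEXT_DIM, ACOUSTIC_DIM)
with torch.no_grad():
    acoustic_params = acoustic_model(frames)             # duration/spectrum/F0 sequence

# Final stage: a parametric vocoder (e.g. WORLD) would turn the predicted F0,
# spectrum and aperiodicity into a waveform, roughly:
#     waveform = pyworld.synthesize(f0, spectrum, aperiodicity, fs)
print(len(phonetics), acoustic_params.shape)
```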
  • In the speech synthesis method of this embodiment, the text to be synthesized is first acquired and segmented into words; part-of-speech tagging is then performed on the segmented text, and prosody prediction is performed on the tagged text through the prosody prediction model to generate the prosodic features of the text to be synthesized; the text is then phonetically transcribed according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the phonetic result, the prosodic features, and the context features of the text to be synthesized are input into the first target user acoustic model, which performs acoustic prediction to generate the acoustic parameter sequence of the text, and finally the speech synthesis result is generated from that sequence. The speech synthesis result produced by the speech synthesis system therefore contains the voice characteristics of the target user, which satisfies the user's need to generate personalized speech and improves the user experience.
  • FIG. 4 is a flow chart of a speech synthesis method in accordance with another embodiment of the present invention.
  • the user selects speech synthesis using the second target user acoustic model.
  • It should be noted that the second target user acoustic model used in the speech synthesis method of this embodiment of the present invention is generated by the training method for personalized multiple acoustic models for speech synthesis of the foregoing embodiment.
  • the speech synthesis method may include:
  • S401: Acquire a text to be synthesized, and perform word segmentation on the text to be synthesized.
  • S403: Perform phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized.
  • For example, suppose the text to be synthesized is "Our family goes to Shanghai". The literal features and part-of-speech features of the text to be synthesized are extracted, and a phonetic dictionary then processes the text according to the prosodic, literal, and part-of-speech features to generate the phonetic result of the text to be synthesized.
  • the phonetic result, the prosody feature, and the context information of the text to be synthesized may be input into the acoustic prediction model, so that the synthesized text is acoustically predicted to generate a sequence of acoustic parameters such as a corresponding duration, spectrum, and fundamental frequency.
  • the vocoder can be used to synthesize a speech signal based on the acoustic parameter sequence to generate a final speech synthesis result.
  • the second target user acoustic model can cover a wider range of linguistic phenomena, and therefore, the speech data synthesized by the second target user acoustic model is more accurate.
  • In the speech synthesis method of this embodiment, the text to be synthesized is first acquired and segmented into words; part-of-speech tagging is then performed on the segmented text, and prosody prediction is performed on the tagged text through the prosody prediction model to generate the prosodic features of the text to be synthesized; the text is then phonetically transcribed according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the phonetic result, the prosodic features, and the context features of the text to be synthesized are input into the second target user acoustic model, which performs acoustic prediction to generate the acoustic parameter sequence of the text, and finally the speech synthesis result is generated from that sequence. The speech synthesis result produced by the speech synthesis system therefore contains the voice characteristics of the target user, which satisfies the user's need to generate personalized speech and improves the user experience.
  • the present invention also proposes a training apparatus for a personalized multi-acoustic model for speech synthesis.
  • FIG. 5 is a schematic structural diagram of a training apparatus for a personalized multi-acoustic model for speech synthesis according to an embodiment of the present invention.
  • As shown in FIG. 5, the training apparatus for personalized multiple acoustic models for speech synthesis includes a first model training module 110, an acquisition module 120, a second model training module 130, a generating module 140, and a third model training module 150.
  • the first model training module 110 is configured to train the reference acoustic model based on the first acoustic feature data of the training voice data and the first text annotation data corresponding to the training voice data.
  • the first acoustic feature data includes acoustic features such as duration, spectrum, and fundamental frequency.
  • The first text annotation data includes text features such as pinyin and prosodic level annotation.
  • the obtaining module 120 is configured to acquire voice data of the target user.
  • the acquiring module 120 obtains the voice data of the target user in multiple manners.
  • For example, the acquiring module 120 may obtain the voice data of the target user by live recording, or may directly use the target user's existing voice data.
  • the second model training module 130 is configured to train the first target user acoustic model based on the reference acoustic model and the voice data.
  • Specifically, the second model training module 130 may, on the basis of the reference acoustic model, use the voice data of the target user and adopt an adaptive technique, for example an LSTM (Long Short-Term Memory) neural network structure or a bidirectional LSTM neural network structure, to train the first target user acoustic model, adaptively updating the reference acoustic model to the first target user acoustic model.
  • the second model training module 130 may include an extracting unit 131, a voice labeling module 132, and a model training unit 133, where:
  • the extracting unit 131 is configured to perform acoustic feature extraction on the voice data to acquire third acoustic feature data of the voice data.
  • the voice annotation module 132 is configured to perform voice annotation on the voice data to obtain second text annotation data of the voice data.
  • the model training unit 133 is configured to train the first target user acoustic model based on the reference acoustic model, the third acoustic feature data, and the second text annotation data.
  • the third acoustic feature data includes acoustic characteristics such as duration, spectrum, and fundamental frequency. That is, the extracting unit 131 can extract acoustic characteristics such as the duration, the spectrum, and the fundamental frequency from the voice data.
  • The second text annotation data includes text feature data such as pinyin and prosodic level annotation.
  • the model training unit 133 is specifically configured to: acquire a neural network structure of the reference acoustic model, and train the first target user acoustic model according to the third acoustic feature data, the second text annotation data, and the neural network structure of the reference acoustic model.
  • the generating module 140 is configured to generate second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data.
  • the second acoustic feature data includes acoustic features such as duration, spectrum, and fundamental frequency.
  • the third model training module 150 is configured to train the second target user acoustic model based on the first text annotation data and the second acoustic feature data.
  • the third model training module 150 may train the first text annotation data and the second acoustic feature data based on the hidden Markov model, and establish a second target user acoustic model according to the training result.
  • In the training apparatus for personalized multiple acoustic models for speech synthesis, the first model training module trains the reference acoustic model based on the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data; the acquisition module acquires the voice data of the target user; the second model training module trains the first target user acoustic model according to the reference acoustic model and the voice data; the generating module then generates second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and the third model training module trains the second target user acoustic model based on the first text annotation data and the second acoustic feature data. In the process of training the target user's acoustic models, the requirement on the size of the target user's voice data is thereby reduced, for example from thousands of sentences to a few hundred or even dozens of sentences; that is, multiple personalized acoustic models containing the voice characteristics of the target user can be trained with a small amount of user voice data.
  • the present invention also proposes a speech synthesis apparatus.
  • FIG. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention. It should be noted that the first target user acoustic model used by the speech synthesis apparatus of the embodiment of the present invention is generated by the training apparatus of the personalized multi-acoustic model for speech synthesis of any of the above embodiments.
  • the speech synthesis apparatus may include an acquisition module 210, a word segmentation module 220, a part-of-speech tagging module 230, a prosody prediction module 240, a phonetic module 250, an acoustic prediction module 260, and a speech synthesis module 270, where:
  • The obtaining module 210 is configured to acquire the text to be synthesized; the word segmentation module 220 is configured to perform word segmentation on the text to be synthesized.
  • the part-of-speech tagging module 230 is configured to perform part-of-speech tagging on the text to be synthesized after the segmentation.
  • the prosody prediction module 240 is configured to perform prosody prediction on the to-be-composed text after the part-of-speech tagging by the prosody prediction model to generate a prosodic feature of the text to be synthesized.
  • The phonetic module 250 is configured to perform phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized.
  • the acoustic prediction module 260 is configured to input the phonetic result, the prosodic feature, and the context feature of the text to be synthesized to the first target user acoustic model, and perform acoustic prediction on the synthesized text by the first target user acoustic model to generate an acoustic parameter of the text to be synthesized. sequence.
  • the speech synthesis module 270 is configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  • In the speech synthesis apparatus of this embodiment of the present invention, the acquisition module first acquires the text to be synthesized; the word segmentation module then segments the text, and the part-of-speech tagging module performs part-of-speech tagging on the segmented text; the prosody prediction module performs prosody prediction on the tagged text through the prosody prediction model to generate the prosodic features of the text to be synthesized; the phonetic module then transcribes the text according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the acoustic prediction module inputs the phonetic result, the prosodic features, and the context features of the text to be synthesized into the first target user acoustic model, which performs acoustic prediction to generate the acoustic parameter sequence of the text; and finally the speech synthesis result of the text is generated according to the acoustic parameter sequence. The synthesized speech therefore contains the voice characteristics of the target user, which satisfies the user's need to generate personalized speech and improves the user experience.
  • the present invention also proposes a speech synthesis apparatus.
  • FIG. 8 is a schematic structural diagram of a speech synthesis apparatus according to another embodiment of the present invention. It should be noted that the second target user acoustic model used by the speech synthesis apparatus of the embodiment of the present invention is generated by the training apparatus of the personalized multi-acoustic model for speech synthesis of any of the above embodiments.
  • the speech synthesis apparatus may include an acquisition module 310, a word segmentation module 320, a part-of-speech tagging module 330, a prosody prediction module 340, a phonetic module 350, an acoustic prediction module 360, and a speech synthesis module 370, where:
  • The obtaining module 310 is configured to acquire the text to be synthesized; the word segmentation module 320 is configured to perform word segmentation on the text to be synthesized.
  • the part of speech tagging module 330 is configured to perform part-of-speech tagging on the text to be synthesized after the segmentation.
  • the prosody prediction module 340 is configured to perform prosody prediction on the to-be-synthesized text after the part-of-speech tagging by the prosody prediction model to generate a prosodic feature of the text to be synthesized.
  • The phonetic module 350 is configured to perform phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized.
  • the acoustic prediction module 360 is configured to input the phonetic result, the prosodic feature, and the context feature of the text to be synthesized to the second target user acoustic model, and perform acoustic prediction on the synthesized text by the second target user acoustic model to generate an acoustic parameter of the text to be synthesized. sequence.
  • the speech synthesis module 370 is configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  • In this speech synthesis apparatus, the acquisition module first acquires the text to be synthesized; the word segmentation module then segments the text, and the part-of-speech tagging module performs part-of-speech tagging on the segmented text; the prosody prediction module performs prosody prediction on the tagged text through the prosody prediction model to generate the prosodic features of the text to be synthesized; the phonetic module then transcribes the text according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the acoustic prediction module inputs the phonetic result, the prosodic features, and the context features of the text to be synthesized into the second target user acoustic model, which performs acoustic prediction to generate the acoustic parameter sequence of the text; and finally the speech synthesis result of the text is generated according to the acoustic parameter sequence. The synthesized speech therefore contains the voice characteristics of the target user and satisfies the user's need to generate personalized speech.
  • first and second are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated.
  • features defining “first” or “second” may include at least one of the features, either explicitly or implicitly.
  • the meaning of "a plurality” is at least two, such as two, three, etc., unless specifically defined otherwise.
  • a "computer-readable medium” can be any apparatus that can contain, store, communicate, propagate, or transport a program for use in an instruction execution system, apparatus, or device, or in conjunction with the instruction execution system, apparatus, or device.
  • Computer readable media include the following: electrical connections (electronic devices) having one or more wires, portable computer disk cartridges (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM).
  • The computer readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
  • portions of the invention may be implemented in hardware, software, firmware or a combination thereof.
  • For example, a plurality of steps or methods may be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system.
  • It can also be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and the like.
  • each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A training method for personalized multiple acoustic models for speech synthesis, a speech synthesis method, and apparatus, wherein the method includes: training a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data (S11); acquiring speech data of a target user (S12); training a first target user acoustic model according to the reference acoustic model and the speech data (S13); generating second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data (S14); and training a second target user acoustic model based on the first text annotation data and the second acoustic feature data (S15). During the training of the target user's acoustic models, this model training method reduces the required scale of the target user's speech data, so that multiple personalized acoustic models containing the target user's voice characteristics can be trained with only a small amount of user speech data.

Description

Training method for personalized multiple acoustic models, speech synthesis method, and apparatus
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 201510684475.1, entitled "Training method for personalized multiple acoustic models, speech synthesis method, and apparatus", filed on October 20, 2015 by 百度在线网络技术(北京)有限公司 (Baidu Online Network Technology (Beijing) Co., Ltd.).
Technical Field
The present invention relates to the field of speech technology, and in particular to a training method for personalized multiple acoustic models for speech synthesis, a speech synthesis method, and apparatus.
Background
Speech synthesis, also known as Text-to-Speech (TTS), is a technology that can convert textual information into speech and read it aloud. It involves multiple disciplines such as acoustics, linguistics, digital signal processing, and computer science, and is a cutting-edge technology in the field of Chinese information processing; the main problem it solves is how to convert textual information into audible sound information.
In a speech synthesis system, the process of converting text information into sound information is as follows: first, the input text is processed, including preprocessing, word segmentation, part-of-speech tagging, polyphonic word prediction, prosodic level prediction, and so on; then an acoustic model predicts the acoustic features corresponding to each unit; finally, the sound is synthesized directly by a vocoder using the acoustic parameters, or units are selected from a recording corpus and spliced together, so as to generate the sound information corresponding to the text.
The acoustic model is one of the foundations of the entire speech synthesis system, and it is usually obtained by training on large-scale speech data. The process of training an acoustic model is as follows: first, a certain amount of recording text corpus is designed to meet requirements such as phone coverage and prosody coverage; next, a suitable speaker is selected, and the speaker records speech data accordingly; then the text, pinyin, prosody, and unit boundaries are annotated, and the annotated data is used for model training and voice library generation. It can be seen that the process of training an acoustic model is relatively complicated and its cycle is relatively long, and the training is based on the speech data of a fixed speaker; therefore, when speech is synthesized through that acoustic model, the timbre of the synthesized speech is fixed.
However, in many cases users wish to perform speech synthesis with their own voice, the voice of a family member or friend, or the voice of a celebrity; that is, users wish the speech synthesized by the speech synthesis system to have personalized voice characteristics. To meet the demand for personalized voices, the ways of obtaining a personalized acoustic model in the related art mainly include the following two:
In the first way, parallel or non-parallel corpora are used to train, at the acoustic parameter level, the personalized acoustic model required by the user.
In the second way, a mapping between models is used to achieve conversion between a reference acoustic model and a personalized acoustic model. Specifically, HMM-GMM (Hidden Markov Model and Gaussian Mixture Model) modeling is used, and a mapping between decision trees is performed to generate the personalized acoustic model.
However, in the process of implementing the present invention, the inventor found that the related art has at least the following problems:
For the first way: (1) Training a personalized acoustic model at the acoustic parameter level with a parallel corpus requires two speakers to produce the original speech from the same text, which is sometimes not very realistic. Moreover, with a parallel corpus, the required corpus scale may be relatively high, the time required is relatively long, and the amount of processing is relatively large, so it is difficult to obtain a personalized acoustic model quickly. (2) When non-parallel speech is used to train a personalized acoustic model at the acoustic parameter level, the two speakers produce the original speech from different texts, and the pronunciation of the same syllable clearly differs in different sentence environments; therefore, mapping an identical phone taken from different sentences of different speakers easily makes the trained personalized acoustic model inaccurate, so that the synthesized speech is not natural enough.
For the second way, since the decision tree is a shallow model with limited descriptive ability, especially when the amount of the user's speech data is relatively small, the accuracy of the generated personalized acoustic model is not high; as a result, the predicted parameters may be inconsistent, which in turn makes the synthesized speech exhibit jumps, unstable timbre, and similar phenomena, causing the speech to sound unnatural.
Summary
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to propose a training method for personalized multiple acoustic models for speech synthesis. During the training of the target user's acoustic models, the method reduces the required scale of the target user's speech data, so that multiple personalized acoustic models containing the target user's voice characteristics can be trained with only a small amount of user speech data, thereby meeting personalized speech requirements and improving the user experience.
A second object of the present invention is to propose a speech synthesis method.
A third object of the present invention is to propose a speech synthesis method.
A fourth object of the present invention is to propose a training apparatus for personalized multiple acoustic models for speech synthesis.
A fifth object of the present invention is to propose a speech synthesis apparatus.
A sixth object of the present invention is to propose a speech synthesis apparatus.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a training method for personalized multiple acoustic models for speech synthesis, including: training a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data; acquiring speech data of a target user; training a first target user acoustic model according to the reference acoustic model and the speech data; generating second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and training a second target user acoustic model based on the first text annotation data and the second acoustic feature data.
In the training method for personalized multiple acoustic models for speech synthesis of the embodiment of the present invention, a reference acoustic model is first trained based on the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data; the speech data of the target user is then acquired, and a first target user acoustic model is trained according to the reference acoustic model and the speech data; second acoustic feature data of the first text annotation data is then generated according to the first target user acoustic model and the first text annotation data; and finally a second target user acoustic model is trained based on the first text annotation data and the second acoustic feature data. In this way, the required scale of the target user's speech data during the training of the target user's acoustic models is reduced, for example from several thousand sentences to several hundred or even dozens of sentences; that is, multiple personalized acoustic models containing the target user's voice characteristics can be trained with only a small amount of user speech data, thereby meeting personalized speech requirements and improving the user experience.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes a method for performing speech synthesis using the first target user acoustic model according to the embodiment of the first aspect of the present invention, including: acquiring a text to be synthesized, and performing word segmentation on the text to be synthesized; performing part-of-speech tagging on the segmented text to be synthesized, and performing prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, to generate prosodic features of the text to be synthesized; performing phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized; inputting the phonetic result, the prosodic features, and context features of the text to be synthesized into the first target user acoustic model, and performing acoustic prediction on the text to be synthesized through the first target user acoustic model, to generate an acoustic parameter sequence of the text to be synthesized; and generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
In the speech synthesis method of the embodiment of the present invention, the text to be synthesized is first acquired and segmented into words; part-of-speech tagging is then performed on the segmented text, and prosody prediction is performed on the tagged text through a prosody prediction model to generate the prosodic features of the text to be synthesized; the text to be synthesized is then phonetically transcribed according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the phonetic result, the prosodic features, and the context features of the text to be synthesized are input into the first target user acoustic model, through which acoustic prediction is performed on the text to be synthesized to generate its acoustic parameter sequence; and finally the speech synthesis result of the text to be synthesized is generated according to the acoustic parameter sequence. As a result, the speech synthesis result produced in the speech synthesis system contains the voice characteristics of the target user, which satisfies the user's need to generate personalized speech and improves the user experience.
To achieve the above objects, an embodiment of the third aspect of the present invention proposes a method for performing speech synthesis using the second target user acoustic model according to the embodiment of the first aspect of the present invention, including: acquiring a text to be synthesized, and performing word segmentation on the text to be synthesized; performing part-of-speech tagging on the segmented text to be synthesized, and performing prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, to generate prosodic features of the text to be synthesized; performing phonetic transcription on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, to generate a phonetic result of the text to be synthesized; inputting the phonetic result, the prosodic features, and context features of the text to be synthesized into the second target user acoustic model, and performing acoustic prediction on the text to be synthesized through the second target user acoustic model, to generate an acoustic parameter sequence of the text to be synthesized; and generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
In the speech synthesis method of the embodiment of the present invention, the text to be synthesized is first acquired and segmented into words; part-of-speech tagging is then performed on the segmented text, and prosody prediction is performed on the tagged text through a prosody prediction model to generate the prosodic features of the text to be synthesized; the text to be synthesized is then phonetically transcribed according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate its phonetic result; the phonetic result, the prosodic features, and the context features of the text to be synthesized are input into the second target user acoustic model, through which acoustic prediction is performed on the text to be synthesized to generate its acoustic parameter sequence; and finally the speech synthesis result of the text to be synthesized is generated according to the acoustic parameter sequence. As a result, the speech synthesis result produced in the speech synthesis system contains the voice characteristics of the target user, which satisfies the user's need to generate personalized speech and improves the user experience.
To achieve the above objects, an embodiment of the fourth aspect of the present invention provides a training apparatus for personalized multiple acoustic models for speech synthesis, including: a first model training module configured to train a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data; an acquisition module configured to acquire speech data of a target user; a second model training module configured to train a first target user acoustic model according to the reference acoustic model and the speech data; a generation module configured to generate second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and a third model training module configured to train a second target user acoustic model based on the first text annotation data and the second acoustic feature data.
With the training apparatus for personalized multiple acoustic models for speech synthesis according to embodiments of the present invention, the first model training module trains a reference acoustic model based on the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data; the acquisition module acquires the speech data of the target user; the second model training module trains a first target user acoustic model according to the reference acoustic model and the speech data; the generation module then generates second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and finally the third model training module trains a second target user acoustic model based on the first text annotation data and the second acoustic feature data. Thus, during training of the target user acoustic models, the requirement on the amount of the target user's speech data is reduced, for example from several thousand sentences to several hundred or even several dozen sentences; that is, multiple personalized acoustic models containing the target user's voice characteristics can be trained from only a small amount of user speech data, thereby meeting personalized speech needs and improving user experience.
To achieve the above objects, an embodiment of the fifth aspect of the present invention provides an apparatus for performing speech synthesis using the first target user acoustic model according to the embodiment of the fourth aspect of the present invention, including: an acquisition module configured to acquire a text to be synthesized; a word segmentation module configured to perform word segmentation on the text to be synthesized; a part-of-speech tagging module configured to perform part-of-speech tagging on the segmented text to be synthesized; a prosody prediction module configured to perform prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized; a phonetic annotation module configured to perform phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized; an acoustic prediction module configured to input the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the first target user acoustic model, and to perform acoustic prediction on the text to be synthesized through the first target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized; and a speech synthesis module configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
With the speech synthesis apparatus according to embodiments of the present invention, the acquisition module first acquires the text to be synthesized; the word segmentation module then performs word segmentation on the text to be synthesized; the part-of-speech tagging module performs part-of-speech tagging on the segmented text to be synthesized; the prosody prediction module performs prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model to generate prosodic features of the text to be synthesized; the phonetic annotation module then performs phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate a phonetic annotation result of the text to be synthesized; the acoustic prediction module inputs the phonetic annotation result, the prosodic features, and the context features of the text to be synthesized into the first target user acoustic model and performs acoustic prediction on the text to be synthesized through the first target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and finally the speech synthesis module generates a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence. In this way, the speech synthesis result produced by the speech synthesis system contains the voice characteristics of the target user, which meets the user's need to generate personalized speech and improves user experience.
To achieve the above objects, an embodiment of the sixth aspect of the present invention provides an apparatus for performing speech synthesis using the second target user acoustic model according to the embodiment of the fourth aspect of the present invention, including: an acquisition module configured to acquire a text to be synthesized; a word segmentation module configured to perform word segmentation on the text to be synthesized; a part-of-speech tagging module configured to perform part-of-speech tagging on the segmented text to be synthesized; a prosody prediction module configured to perform prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized; a phonetic annotation module configured to perform phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized; an acoustic prediction module configured to input the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the second target user acoustic model, and to perform acoustic prediction on the text to be synthesized through the second target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized; and a speech synthesis module configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
With the speech synthesis apparatus according to embodiments of the present invention, the acquisition module first acquires the text to be synthesized; the word segmentation module then performs word segmentation on the text to be synthesized; the part-of-speech tagging module performs part-of-speech tagging on the segmented text to be synthesized; the prosody prediction module performs prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model to generate prosodic features of the text to be synthesized; the phonetic annotation module then performs phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate a phonetic annotation result of the text to be synthesized; the acoustic prediction module inputs the phonetic annotation result, the prosodic features, and the context features of the text to be synthesized into the second target user acoustic model and performs acoustic prediction on the text to be synthesized through the second target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and finally the speech synthesis module generates a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence. In this way, the speech synthesis result produced by the speech synthesis system contains the voice characteristics of the target user, which meets the user's need to generate personalized speech and improves user experience.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and will in part become apparent from the following description or be learned through practice of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a flowchart of a training method for personalized multiple acoustic models for speech synthesis according to an embodiment of the present invention.
Fig. 2 is a detailed flowchart of step S13.
Fig. 3 is a flowchart of a speech synthesis method according to an embodiment of the present invention.
Fig. 4 is a flowchart of a speech synthesis method according to another embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a training apparatus for personalized multiple acoustic models for speech synthesis according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a training apparatus for personalized multiple acoustic models for speech synthesis according to another embodiment of the present invention.
Fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of a speech synthesis apparatus according to another embodiment of the present invention.
DETAILED DESCRIPTION
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and are intended to explain the present invention; they should not be construed as limiting the present invention.
The training method for personalized multiple acoustic models for speech synthesis, the speech synthesis methods, and the apparatuses according to embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a training method for personalized multiple acoustic models for speech synthesis according to an embodiment of the present invention.
As shown in Fig. 1, the training method for personalized multiple acoustic models for speech synthesis includes:
S11: training a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data.
Specifically, in order that the trained reference acoustic model has good phone coverage and prosody coverage capabilities and can describe a variety of speech phenomena, a certain amount of recording text corpus may first be designed and suitable speakers may be selected, so as to acquire large-scale training speech data from non-target users; the first acoustic feature data of the training speech data is extracted, and the recording text corpus corresponding to the training speech data is annotated, so as to obtain the first text annotation data of the training speech data.
The first acoustic feature data contains acoustic features such as duration, spectrum, and fundamental frequency.
The first text annotation data contains text features such as pinyin and prosodic hierarchy annotations.
After the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data are obtained, the first acoustic feature data and the first text annotation data can be trained through a neural network, and the reference acoustic model is generated according to the training result.
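Purely as an illustration and not as part of the original disclosure, the following Python sketch (using the PyTorch library; the class name, feature dimensions, network depth, and hyper-parameters are assumptions introduced here for demonstration only) shows one possible way such a neural-network reference acoustic model could map frame-level text annotation features to acoustic features and be trained on large-scale data:
    # Illustrative sketch only: a reference acoustic model mapping frame-level
    # linguistic (text annotation) features to acoustic features such as
    # duration, spectrum and fundamental frequency.
    import torch
    import torch.nn as nn

    class ReferenceAcousticModel(nn.Module):
        def __init__(self, linguistic_dim=300, acoustic_dim=187, hidden_dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(linguistic_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, acoustic_dim),
            )

        def forward(self, linguistic_features):
            return self.net(linguistic_features)

    def train_reference_model(model, loader, epochs=10, lr=1e-3):
        # loader yields (first_text_annotation_features, first_acoustic_features) batches
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            for linguistic, acoustic in loader:
                optimizer.zero_grad()
                loss_fn(model(linguistic), acoustic).backward()
                optimizer.step()
        return model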
S12: acquiring speech data of a target user.
The speech data contains the voice characteristics of the target user.
Specifically, the speech data of the target user can be acquired in a number of ways, which may be chosen as needed in practical applications; for example, the speech data of the target user may be acquired by on-site recording, or existing speech data of the target user may be used directly.
The process of acquiring the speech data of the target user by on-site recording is described in detail below.
Generally, a recording text is first designed in advance according to indicators such as phone coverage and prosody coverage, and is provided to the target user to read aloud, so as to obtain the speech data of the target user.
When the recording text is designed, for a Chinese text for example, the recording text preferably contains all the initials and finals, so as to improve the accuracy of the subsequent model training.
It should be noted that, in the present invention, the speech data of the target user may be recorded directly by a user device, which then performs the subsequent operations; alternatively, the user device may record the speech data of the target user and send it to a network device, which performs the subsequent operations.
It should also be noted that the above user device may be a hardware device with any of various operating systems, such as a computer, a smart phone, or a tablet computer, and that the above network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers on the basis of cloud computing.
Further, when the speech data of the target user is recorded, the target user's speech data is preferably saved in real time; if the target user cannot finish recording all of the speech data at one time, the speech data recorded so far can be retained, and the remaining unrecorded speech data can be completed in the next recording session.
S13: training a first target user acoustic model according to the reference acoustic model and the speech data.
After the speech data of the target user is acquired, the first target user acoustic model can be trained on the basis of the reference acoustic model by using the target user's speech data through an adaptation technique, for example through an LSTM (Long Short-Term Memory) neural network structure or a bidirectional LSTM neural network structure, so that the reference acoustic model is adaptively updated into the first target user acoustic model.
From the above description it can be seen that the reference acoustic model is obtained by training on large-scale training speech data through a neural network structure, has good phone coverage and prosody coverage capabilities, and can describe many speech phenomena. The reference acoustic model has therefore already established the framework of the model, such as the multi-layer neural network structure and the neuron connection relationships; consequently, when the first target user acoustic model is trained, only a small amount of the target user's speech data needs to be acquired, and adaptive training and updating on the basis of the reference acoustic model is sufficient to obtain the first target user acoustic model, so that the first target user acoustic model possesses not only the general information of the reference acoustic model but also the voice characteristics of the target user.
Specifically, in an embodiment of the present invention, the process of training the first target user acoustic model according to the reference acoustic model and the speech data, as shown in Fig. 2, may include:
S131: performing acoustic feature extraction on the speech data to obtain third acoustic feature data of the speech data.
Optionally, before the acoustic feature extraction is performed on the speech data, preprocessing such as noise reduction, data detection, data screening, and segmentation may also be performed on the target user's speech data, for example filtering out blank data segments in the target user's speech data, so as to improve the accuracy of the speech data used for training the first target user acoustic model.
Specifically, acoustic features such as duration, spectrum, and fundamental frequency can be extracted from the speech data of the target user.
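As a purely illustrative sketch that is not part of the original disclosure, the spectrum and fundamental frequency of a target user's recording could be extracted with the open-source librosa library as follows; the sampling rate, mel-band count, and pitch range are assumptions, and duration features, which would normally come from forced alignment, are omitted:
    import librosa
    import numpy as np

    def extract_third_acoustic_features(wav_path, sr=16000, n_mels=80):
        # load the target user's recording at the assumed sampling rate
        y, sr = librosa.load(wav_path, sr=sr)
        # fundamental frequency (F0) track via probabilistic YIN
        f0, voiced_flag, voiced_prob = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
        # log-mel spectrogram as the spectral representation
        log_mel = librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
        return {"f0": np.nan_to_num(f0), "spectrum": log_mel.T}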
S132: performing speech annotation on the speech data to obtain second text annotation data of the speech data.
Specifically, after the speech data is obtained, the speech data can be annotated through an automatic recognition method or a manual annotation method, so as to obtain the second text annotation data of the speech data.
The second text annotation data contains text feature data such as pinyin and prosodic hierarchy annotations.
S133: training the first target user acoustic model according to the reference acoustic model, the third acoustic feature data, and the second text annotation data.
Specifically, after the third acoustic feature data and the second text annotation data of the target user's speech data are obtained, the neural network structure of the reference acoustic model can first be acquired, and the first target user acoustic model can then be trained according to the third acoustic feature data, the second text annotation data, and the neural network structure of the reference acoustic model.
More specifically, according to the third acoustic feature data, the second text annotation data, and the neural network structure of the reference acoustic model, iterative computation is performed through a neural network adaptation technique to update parameters such as the neuron connection weights in the neural network structure of the reference acoustic model, so as to obtain the first target user acoustic model having the target user's voice characteristics.
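A minimal sketch of such adaptation, assuming the reference model is a PyTorch module like the one sketched after step S11 (the disclosure mentions LSTM or bidirectional LSTM structures; the learning rate, epoch count, and function names here are assumptions for illustration only):
    import copy
    import torch
    import torch.nn as nn

    def adapt_to_target_user(reference_model, target_loader, epochs=5, lr=1e-4):
        # start from the reference model's structure and connection weights
        target_model = copy.deepcopy(reference_model)
        optimizer = torch.optim.Adam(target_model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            # target_loader yields (second_text_annotation, third_acoustic_feature) pairs
            for linguistic, acoustic in target_loader:
                optimizer.zero_grad()
                loss_fn(target_model(linguistic), acoustic).backward()
                optimizer.step()
        return target_model  # the first target user acoustic model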
S14: generating second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data.
Specifically, in order to generate acoustic models of multiple complexities and meet the usage requirements of different terminal devices, after the first target user acoustic model is obtained, the first text annotation data used to construct the reference acoustic model can further be input into the first target user acoustic model to generate the second acoustic feature data corresponding to the first text annotation data. A relatively large-scale set of acoustic feature data carrying the target user's voice characteristics is thereby obtained.
The second acoustic feature data contains acoustic features such as duration, spectrum, and fundamental frequency.
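Continuing the same hypothetical PyTorch sketch (again not part of the original disclosure), generating the second acoustic feature data amounts to running the first text annotation data through the adapted model in inference mode:
    import torch

    def generate_second_acoustic_features(first_target_model, first_text_annotations):
        # first_text_annotations: iterable of linguistic feature tensors
        first_target_model.eval()
        with torch.no_grad():
            return [first_target_model(x) for x in first_text_annotations]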
S15: training a second target user acoustic model based on the first text annotation data and the second acoustic feature data.
Specifically, the first text annotation data and the second acoustic feature data are trained based on Hidden Markov Models (HMM), and the second target user acoustic model is established according to the training result. Since the second acoustic feature data already carries the voice characteristics of the target user, the second target user acoustic model obtained by the training can describe the personalized voice characteristics under different contexts quite well. Compared with an HMM acoustic model trained directly on the target user's speech data, the second target user acoustic model can cover a wider range of linguistic phenomena. Moreover, because the amount of computation required when it is applied in a speech synthesis system is far smaller than that of prediction based on an LSTM or bidirectional LSTM neural network, it is very suitable for devices with relatively low computing power.
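As a rough, non-authoritative illustration of this step, the hmmlearn library can fit a Gaussian HMM to the second acoustic feature data; a production system would instead train context-dependent models per phone with decision-tree clustering, which this sketch does not attempt, and the state count and iteration count below are assumptions:
    import numpy as np
    from hmmlearn import hmm

    def train_second_target_model(second_acoustic_features):
        # second_acoustic_features: list of (frames, feature_dim) numpy arrays
        X = np.concatenate(second_acoustic_features, axis=0)
        lengths = [seq.shape[0] for seq in second_acoustic_features]
        model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        return model  # stand-in for the second target user acoustic model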
With the training method for personalized multiple acoustic models for speech synthesis according to embodiments of the present invention, a reference acoustic model is first trained based on the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data; the speech data of the target user is then acquired, and a first target user acoustic model is trained according to the reference acoustic model and the speech data; second acoustic feature data of the first text annotation data is further generated according to the first target user acoustic model and the first text annotation data; and finally a second target user acoustic model is trained based on the first text annotation data and the second acoustic feature data. Thus, during training of the target user acoustic models, the requirement on the amount of the target user's speech data is reduced, for example from several thousand sentences to several hundred or even several dozen sentences; that is, multiple personalized acoustic models containing the target user's voice characteristics can be trained from only a small amount of user speech data, thereby meeting personalized speech needs and improving user experience.
It can be understood that the multiple acoustic models obtained by the above training are applied in a speech synthesis system, in which the acoustic model is one of the foundations of the entire system. Therefore, after multiple acoustic models of the target user are generated by the training method of the embodiments of the present invention, these acoustic models can be applied to the speech synthesis system; the user can then select, according to his or her device conditions or preferences, which personalized acoustic model the speech synthesis system uses, and the speech synthesis system performs speech synthesis according to the acoustic model selected by the user. To this end, the present invention further provides a speech synthesis method.
Fig. 3 is a flowchart of a speech synthesis method according to an embodiment of the present invention. In this embodiment, it is assumed that the user selects the first target user acoustic model for speech synthesis. It should also be noted that the first target user acoustic model used in the speech synthesis method of this embodiment is generated by the training method for personalized multiple acoustic models for speech synthesis of the foregoing embodiments.
As shown in Fig. 3, the speech synthesis method may include:
S301: acquiring a text to be synthesized, and performing word segmentation on the text to be synthesized.
S302: performing part-of-speech tagging on the segmented text to be synthesized, and performing prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized.
S303: performing phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized.
For example, if the text to be synthesized is "我们一家人去上海" ("our whole family is going to Shanghai"), the literal features and part-of-speech features of the text to be synthesized can be extracted, and a phonetic annotation dictionary then annotates the pronunciation of the text to be synthesized according to the prosodic features, the literal features, and the part-of-speech features, so as to generate the phonetic annotation result of the text to be synthesized.
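For illustration only (the libraries below are not named in the original disclosure), the open-source jieba and pypinyin packages can stand in for the word segmentation, part-of-speech tagging, and phonetic annotation steps of this front end, with prosody prediction reduced to a trivial placeholder:
    import jieba.posseg as pseg
    from pypinyin import lazy_pinyin, Style

    def front_end(text):
        words, pos_tags = [], []
        for pair in pseg.cut(text):                 # word segmentation + POS tagging
            words.append(pair.word)
            pos_tags.append(pair.flag)
        prosody = ["#1"] * len(words)               # placeholder prosodic boundary labels
        pinyin = lazy_pinyin(text, style=Style.TONE3)  # phonetic annotation with tone numbers
        return {"words": words, "pos": pos_tags, "prosody": prosody, "pinyin": pinyin}

    print(front_end("我们一家人去上海"))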
S304: inputting the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the first target user acoustic model, and performing acoustic prediction on the text to be synthesized through the first target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized.
Specifically, the phonetic annotation result, the prosodic features, and the context information of the text to be synthesized can be input into the acoustic prediction model, so that acoustic prediction is performed on the text to be synthesized and the corresponding acoustic parameter sequence of duration, spectrum, fundamental frequency, and the like is generated.
S305: generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
Specifically, a vocoder can be used to synthesize a speech signal according to the acoustic parameter sequence, so as to generate the final speech synthesis result.
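As a hedged example of this final step (not specified in the original disclosure), the WORLD vocoder via the pyworld package can turn a predicted parameter sequence of fundamental frequency, spectral envelope, and aperiodicity into a waveform; the sampling rate, output path, and parameter shapes are assumptions:
    import numpy as np
    import pyworld
    import soundfile as sf

    def synthesize_waveform(f0, spectral_envelope, aperiodicity, fs=16000,
                            out_path="tts_output.wav"):
        # pyworld expects float64 arrays: f0 of shape (frames,),
        # spectral envelope and aperiodicity of shape (frames, fft_size // 2 + 1)
        waveform = pyworld.synthesize(
            np.ascontiguousarray(f0, dtype=np.float64),
            np.ascontiguousarray(spectral_envelope, dtype=np.float64),
            np.ascontiguousarray(aperiodicity, dtype=np.float64),
            fs)
        sf.write(out_path, waveform, fs)
        return out_path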
With the speech synthesis method according to this embodiment of the present invention, a text to be synthesized is first acquired and segmented into words; part-of-speech tagging is then performed on the segmented text to be synthesized, and prosody prediction is performed on the part-of-speech-tagged text to be synthesized through a prosody prediction model to generate prosodic features of the text to be synthesized; phonetic annotation is further performed on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate a phonetic annotation result of the text to be synthesized; the phonetic annotation result, the prosodic features, and the context features of the text to be synthesized are input into the first target user acoustic model, and acoustic prediction is performed on the text to be synthesized through the first target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and finally a speech synthesis result of the text to be synthesized is generated according to the acoustic parameter sequence. In this way, the speech synthesis result produced by the speech synthesis system contains the voice characteristics of the target user, which meets the user's need to generate personalized speech and improves user experience.
Fig. 4 is a flowchart of a speech synthesis method according to another embodiment of the present invention. In this embodiment, it is assumed that the user selects the second target user acoustic model for speech synthesis. It should also be noted that the second target user acoustic model used in the speech synthesis method of this embodiment is generated by the training method for personalized multiple acoustic models for speech synthesis of the foregoing embodiments.
As shown in Fig. 4, the speech synthesis method may include:
S401: acquiring a text to be synthesized, and performing word segmentation on the text to be synthesized.
S402: performing part-of-speech tagging on the segmented text to be synthesized, and performing prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized.
S403: performing phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized.
For example, if the text to be synthesized is "我们一家人去上海" ("our whole family is going to Shanghai"), the literal features and part-of-speech features of the text to be synthesized can be extracted, and a phonetic annotation dictionary then annotates the pronunciation of the text to be synthesized according to the prosodic features, the literal features, and the part-of-speech features, so as to generate the phonetic annotation result of the text to be synthesized.
S404: inputting the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the second target user acoustic model, and performing acoustic prediction on the text to be synthesized through the second target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized.
Specifically, the phonetic annotation result, the prosodic features, and the context information of the text to be synthesized can be input into the acoustic prediction model, so that acoustic prediction is performed on the text to be synthesized and the corresponding acoustic parameter sequence of duration, spectrum, fundamental frequency, and the like is generated.
S405: generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
Specifically, a vocoder can be used to synthesize a speech signal according to the acoustic parameter sequence, so as to generate the final speech synthesis result.
It should be noted that the second target user acoustic model can cover a wider range of linguistic phenomena; therefore, the speech synthesized through the second target user acoustic model is more accurate.
With the speech synthesis method according to this embodiment of the present invention, a text to be synthesized is first acquired and segmented into words; part-of-speech tagging is then performed on the segmented text to be synthesized, and prosody prediction is performed on the part-of-speech-tagged text to be synthesized through a prosody prediction model to generate prosodic features of the text to be synthesized; phonetic annotation is further performed on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate a phonetic annotation result of the text to be synthesized; the phonetic annotation result, the prosodic features, and the context features of the text to be synthesized are input into the second target user acoustic model, and acoustic prediction is performed on the text to be synthesized through the second target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and finally a speech synthesis result of the text to be synthesized is generated according to the acoustic parameter sequence. In this way, the speech synthesis result produced by the speech synthesis system contains the voice characteristics of the target user, which meets the user's need to generate personalized speech and improves user experience.
To implement the above embodiments, the present invention further provides a training apparatus for personalized multiple acoustic models for speech synthesis.
Fig. 5 is a schematic structural diagram of a training apparatus for personalized multiple acoustic models for speech synthesis according to an embodiment of the present invention.
As shown in Fig. 5, the training apparatus for personalized multiple acoustic models for speech synthesis includes a first model training module 110, an acquisition module 120, a second model training module 130, a generation module 140, and a third model training module 150.
Specifically, the first model training module 110 is configured to train a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data.
The first acoustic feature data contains acoustic features such as duration, spectrum, and fundamental frequency.
The first text annotation data contains text features such as pinyin and prosodic hierarchy annotations.
The acquisition module 120 is configured to acquire speech data of a target user.
Specifically, the acquisition module 120 can acquire the speech data of the target user in a number of ways, which may be chosen as needed in practical applications; for example, the acquisition module 120 may acquire the speech data of the target user by on-site recording, or may directly use existing speech data of the target user.
The second model training module 130 is configured to train a first target user acoustic model according to the reference acoustic model and the speech data.
Specifically, after the acquisition module 120 acquires the speech data of the target user, the second model training module 130 can train the first target user acoustic model on the basis of the reference acoustic model by using the target user's speech data through an adaptation technique, for example through an LSTM (Long Short-Term Memory) neural network structure or a bidirectional LSTM neural network structure, so that the reference acoustic model is adaptively updated into the first target user acoustic model.
As shown in Fig. 6, the second model training module 130 may include an extraction unit 131, a speech annotation module 132, and a model training unit 133, wherein:
The extraction unit 131 is configured to perform acoustic feature extraction on the speech data to obtain third acoustic feature data of the speech data.
The speech annotation module 132 is configured to perform speech annotation on the speech data to obtain second text annotation data of the speech data.
The model training unit 133 is configured to train the first target user acoustic model according to the reference acoustic model, the third acoustic feature data, and the second text annotation data.
The third acoustic feature data contains acoustic features such as duration, spectrum, and fundamental frequency; that is, the extraction unit 131 can extract acoustic features such as duration, spectrum, and fundamental frequency from the speech data.
The second text annotation data contains text feature data such as pinyin and prosodic hierarchy annotations.
The model training unit 133 is specifically configured to: acquire the neural network structure of the reference acoustic model, and train the first target user acoustic model according to the third acoustic feature data, the second text annotation data, and the neural network structure of the reference acoustic model.
The generation module 140 is configured to generate second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data.
The second acoustic feature data contains acoustic features such as duration, spectrum, and fundamental frequency.
The third model training module 150 is configured to train a second target user acoustic model based on the first text annotation data and the second acoustic feature data.
Specifically, the third model training module 150 can train the first text annotation data and the second acoustic feature data based on Hidden Markov Models, and establish the second target user acoustic model according to the training result.
It should be noted that the foregoing explanation of the embodiments of the training method for personalized multiple acoustic models for speech synthesis also applies to the training apparatus for personalized multiple acoustic models for speech synthesis of this embodiment, and is not repeated here.
With the training apparatus for personalized multiple acoustic models for speech synthesis according to embodiments of the present invention, the first model training module trains a reference acoustic model based on the first acoustic feature data of the training speech data and the first text annotation data corresponding to the training speech data; the acquisition module acquires the speech data of the target user; the second model training module trains a first target user acoustic model according to the reference acoustic model and the speech data; the generation module then generates second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and finally the third model training module trains a second target user acoustic model based on the first text annotation data and the second acoustic feature data. Thus, during training of the target user acoustic models, the requirement on the amount of the target user's speech data is reduced, for example from several thousand sentences to several hundred or even several dozen sentences; that is, multiple personalized acoustic models containing the target user's voice characteristics can be trained from only a small amount of user speech data, thereby meeting personalized speech needs and improving user experience.
To implement the above embodiments, the present invention further provides a speech synthesis apparatus.
Fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention. It should be noted that the first target user acoustic model used by the speech synthesis apparatus of this embodiment is generated by the training apparatus for personalized multiple acoustic models for speech synthesis of any one of the above embodiments.
As shown in Fig. 7, the speech synthesis apparatus may include an acquisition module 210, a word segmentation module 220, a part-of-speech tagging module 230, a prosody prediction module 240, a phonetic annotation module 250, an acoustic prediction module 260, and a speech synthesis module 270, wherein:
The acquisition module 210 is configured to acquire a text to be synthesized; the word segmentation module 220 is configured to perform word segmentation on the text to be synthesized.
The part-of-speech tagging module 230 is configured to perform part-of-speech tagging on the segmented text to be synthesized.
The prosody prediction module 240 is configured to perform prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized.
The phonetic annotation module 250 is configured to perform phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized.
The acoustic prediction module 260 is configured to input the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the first target user acoustic model, and to perform acoustic prediction on the text to be synthesized through the first target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized.
The speech synthesis module 270 is configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
It should be noted that the foregoing explanation of the speech synthesis method embodiments also applies to the speech synthesis apparatus of this embodiment, and is not repeated here.
With the speech synthesis apparatus according to this embodiment of the present invention, the acquisition module first acquires the text to be synthesized; the word segmentation module then performs word segmentation on the text to be synthesized; the part-of-speech tagging module performs part-of-speech tagging on the segmented text to be synthesized; the prosody prediction module performs prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model to generate prosodic features of the text to be synthesized; the phonetic annotation module then performs phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate a phonetic annotation result of the text to be synthesized; the acoustic prediction module inputs the phonetic annotation result, the prosodic features, and the context features of the text to be synthesized into the first target user acoustic model and performs acoustic prediction on the text to be synthesized through the first target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and finally the speech synthesis module generates a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence. In this way, the speech synthesis result produced by the speech synthesis system contains the voice characteristics of the target user, which meets the user's need to generate personalized speech and improves user experience.
To implement the above embodiments, the present invention further provides another speech synthesis apparatus.
Fig. 8 is a schematic structural diagram of a speech synthesis apparatus according to another embodiment of the present invention. It should be noted that the second target user acoustic model used by the speech synthesis apparatus of this embodiment is generated by the training apparatus for personalized multiple acoustic models for speech synthesis of any one of the above embodiments.
As shown in Fig. 8, the speech synthesis apparatus may include an acquisition module 310, a word segmentation module 320, a part-of-speech tagging module 330, a prosody prediction module 340, a phonetic annotation module 350, an acoustic prediction module 360, and a speech synthesis module 370, wherein:
The acquisition module 310 is configured to acquire a text to be synthesized; the word segmentation module 320 is configured to perform word segmentation on the text to be synthesized.
The part-of-speech tagging module 330 is configured to perform part-of-speech tagging on the segmented text to be synthesized.
The prosody prediction module 340 is configured to perform prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized.
The phonetic annotation module 350 is configured to perform phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized.
The acoustic prediction module 360 is configured to input the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the second target user acoustic model, and to perform acoustic prediction on the text to be synthesized through the second target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized.
The speech synthesis module 370 is configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
It should be noted that the foregoing explanation of the speech synthesis method embodiments also applies to the speech synthesis apparatus of this embodiment, and is not repeated here.
With the speech synthesis apparatus according to this embodiment of the present invention, the acquisition module first acquires the text to be synthesized; the word segmentation module then performs word segmentation on the text to be synthesized; the part-of-speech tagging module performs part-of-speech tagging on the segmented text to be synthesized; the prosody prediction module performs prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model to generate prosodic features of the text to be synthesized; the phonetic annotation module then performs phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features to generate a phonetic annotation result of the text to be synthesized; the acoustic prediction module inputs the phonetic annotation result, the prosodic features, and the context features of the text to be synthesized into the second target user acoustic model and performs acoustic prediction on the text to be synthesized through the second target user acoustic model to generate an acoustic parameter sequence of the text to be synthesized; and finally the speech synthesis module generates a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence. In this way, the speech synthesis result produced by the speech synthesis system contains the voice characteristics of the target user, which meets the user's need to generate personalized speech and improves user experience.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples", and the like mean that specific features, structures, materials, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples and the features of the different embodiments or examples described in this specification, provided they do not contradict each other.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise specifically defined.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present invention includes other implementations in which the functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, which should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example, may be regarded as an ordered list of executable instructions for implementing logical functions and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute the instructions). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that the various parts of the present invention may be implemented by hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented by hardware, as in another embodiment, they may be implemented by any one of the following technologies known in the art or a combination thereof: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combined logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), and the like.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by instructing relevant hardware through a program, and the program may be stored in a computer-readable storage medium; when executed, the program performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and those of ordinary skill in the art can make changes, modifications, replacements, and variations to the above embodiments within the scope of the present invention.

Claims (12)

  1. A training method for personalized multiple acoustic models for speech synthesis, comprising the following steps:
    training a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data;
    acquiring speech data of a target user;
    training a first target user acoustic model according to the reference acoustic model and the speech data;
    generating second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and
    training a second target user acoustic model based on the first text annotation data and the second acoustic feature data.
  2. The method according to claim 1, wherein the training a first target user acoustic model according to the reference acoustic model and the speech data specifically comprises:
    performing acoustic feature extraction on the speech data to obtain third acoustic feature data of the speech data;
    performing speech annotation on the speech data to obtain second text annotation data of the speech data;
    training the first target user acoustic model according to the reference acoustic model, the third acoustic feature data, and the second text annotation data.
  3. The method according to claim 1 or 2, wherein the training the first target user acoustic model according to the reference acoustic model, the third acoustic feature data, and the second text annotation data specifically comprises:
    acquiring a neural network structure of the reference acoustic model;
    training the first target user acoustic model according to the third acoustic feature data, the second text annotation data, and the neural network structure of the reference acoustic model.
  4. The method according to any one of claims 1-3, wherein the training a second target user acoustic model based on the first text annotation data and the second acoustic feature data specifically comprises:
    training the first text annotation data and the second acoustic feature data based on a Hidden Markov Model, and establishing the second target user acoustic model according to the training result.
  5. A method for performing speech synthesis using the first target user acoustic model according to any one of claims 1 to 4, comprising:
    acquiring a text to be synthesized, and performing word segmentation on the text to be synthesized;
    performing part-of-speech tagging on the segmented text to be synthesized, and performing prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized;
    performing phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized;
    inputting the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the first target user acoustic model, and performing acoustic prediction on the text to be synthesized through the first target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized; and
    generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  6. A method for performing speech synthesis using the second target user acoustic model according to any one of claims 1 to 4, comprising:
    acquiring a text to be synthesized, and performing word segmentation on the text to be synthesized;
    performing part-of-speech tagging on the segmented text to be synthesized, and performing prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized;
    performing phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized;
    inputting the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the second target user acoustic model, and performing acoustic prediction on the text to be synthesized through the second target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized; and
    generating a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  7. A training apparatus for personalized multiple acoustic models for speech synthesis, comprising:
    a first model training module configured to train a reference acoustic model based on first acoustic feature data of training speech data and first text annotation data corresponding to the training speech data;
    an acquisition module configured to acquire speech data of a target user;
    a second model training module configured to train a first target user acoustic model according to the reference acoustic model and the speech data;
    a generation module configured to generate second acoustic feature data of the first text annotation data according to the first target user acoustic model and the first text annotation data; and
    a third model training module configured to train a second target user acoustic model based on the first text annotation data and the second acoustic feature data.
  8. The apparatus according to claim 7, wherein the second model training module specifically comprises:
    an extraction unit configured to perform acoustic feature extraction on the speech data to obtain third acoustic feature data of the speech data;
    a speech annotation module configured to perform speech annotation on the speech data to obtain second text annotation data of the speech data;
    a model training unit configured to train the first target user acoustic model according to the reference acoustic model, the third acoustic feature data, and the second text annotation data.
  9. The apparatus according to claim 7 or 8, wherein the model training unit is specifically configured to:
    acquire a neural network structure of the reference acoustic model, and train the first target user acoustic model according to the third acoustic feature data, the second text annotation data, and the neural network structure of the reference acoustic model.
  10. The apparatus according to any one of claims 7-9, wherein the third model training module is specifically configured to:
    train the first text annotation data and the second acoustic feature data based on a Hidden Markov Model, and establish the second target user acoustic model according to the training result.
  11. An apparatus for performing speech synthesis using the first target user acoustic model according to any one of claims 7 to 10, comprising:
    an acquisition module configured to acquire a text to be synthesized;
    a word segmentation module configured to perform word segmentation on the text to be synthesized;
    a part-of-speech tagging module configured to perform part-of-speech tagging on the segmented text to be synthesized;
    a prosody prediction module configured to perform prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized;
    a phonetic annotation module configured to perform phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized;
    an acoustic prediction module configured to input the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the first target user acoustic model, and to perform acoustic prediction on the text to be synthesized through the first target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized; and
    a speech synthesis module configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
  12. An apparatus for performing speech synthesis using the second target user acoustic model according to any one of claims 7 to 10, comprising:
    an acquisition module configured to acquire a text to be synthesized;
    a word segmentation module configured to perform word segmentation on the text to be synthesized;
    a part-of-speech tagging module configured to perform part-of-speech tagging on the segmented text to be synthesized;
    a prosody prediction module configured to perform prosody prediction on the part-of-speech-tagged text to be synthesized through a prosody prediction model, so as to generate prosodic features of the text to be synthesized;
    a phonetic annotation module configured to perform phonetic annotation on the text to be synthesized according to the word segmentation result, the part-of-speech tagging result, and the prosodic features, so as to generate a phonetic annotation result of the text to be synthesized;
    an acoustic prediction module configured to input the phonetic annotation result, the prosodic features, and context features of the text to be synthesized into the second target user acoustic model, and to perform acoustic prediction on the text to be synthesized through the second target user acoustic model, so as to generate an acoustic parameter sequence of the text to be synthesized; and
    a speech synthesis module configured to generate a speech synthesis result of the text to be synthesized according to the acoustic parameter sequence.
PCT/CN2016/087321 2015-10-20 2016-06-27 Training method for personalized multiple acoustic models, speech synthesis method and apparatus WO2017067206A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/758,280 US10410621B2 (en) 2015-10-20 2016-06-27 Training method for multiple personalized acoustic models, and voice synthesis method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510684475.1A CN105185372B (zh) 2015-10-20 2015-10-20 Training method for personalized multiple acoustic models, speech synthesis method and apparatus
CN201510684475.1 2015-10-20

Publications (1)

Publication Number Publication Date
WO2017067206A1 true WO2017067206A1 (zh) 2017-04-27

Family

ID=54907400

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/087321 WO2017067206A1 (zh) 2015-10-20 2016-06-27 个性化多声学模型的训练方法、语音合成方法及装置

Country Status (3)

Country Link
US (1) US10410621B2 (zh)
CN (1) CN105185372B (zh)
WO (1) WO2017067206A1 (zh)

Also Published As

Publication number Publication date
US10410621B2 (en) 2019-09-10
CN105185372A (zh) 2015-12-23
US20180254034A1 (en) 2018-09-06
CN105185372B (zh) 2017-03-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 16856646
    Country of ref document: EP
    Kind code of ref document: A1
WWE Wipo information: entry into national phase
    Ref document number: 15758280
    Country of ref document: US
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 16856646
    Country of ref document: EP
    Kind code of ref document: A1