WO2022151931A1 - Speech synthesis method and apparatus, synthesis model training method and apparatus, medium, and device


Info

Publication number
WO2022151931A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
flow
text
speech synthesis
synthesized
Prior art date
Application number
PCT/CN2021/139988
Other languages
French (fr)
Chinese (zh)
Inventor
殷翔
Original Assignee
北京有竹居网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2022151931A1 publication Critical patent/WO2022151931A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present disclosure relates to the technical field of speech synthesis, and in particular, to a speech synthesis method, a synthesis model training method, an apparatus, a medium, and a device.
  • in the related art, the acoustic features (e.g., Mel spectrum, linear spectrum, fundamental frequency, etc.) corresponding to the text are extracted by the acoustic sub-model;
  • the vocoder is used to generate the corresponding audio according to the acoustic features.
  • the present disclosure provides a speech synthesis method, comprising: acquiring speech feature information corresponding to text to be synthesized; inputting the speech feature information into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder; and performing μ-law expansion on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized.
  • the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generation model Glow;
  • the encoding network is configured to generate the representation sequence corresponding to the text to be synthesized, wherein the representation sequence is formed by arranging the codes of the phonemes in the text to be synthesized according to the order of the corresponding phonemes in the text to be synthesized;
  • the duration sub-model is used to obtain, according to the speech feature information, the duration feature information corresponding to the text to be synthesized, wherein the duration feature information includes the number of speech frames corresponding to each phoneme in the text to be synthesized;
  • the Gaussian sampling module is used to generate, according to the representation sequence and the duration feature information, a fixed-length semantic representation corresponding to the text to be synthesized;
  • the linear processing module is used to perform linear transformation on the semantic representation to obtain the first Mel spectrum information corresponding to the text to be synthesized; the flow-based reversible generation model Glow is used to generate the second Mel spectrum information according to the standard normal distribution;
  • the vocoder is used to generate, according to the first Mel spectrum information and the second Mel spectrum information, the predicted waveform point information corresponding to the text to be synthesized.
  • the vocoder is a flow-based generation model Flow;
  • the speech synthesis model is obtained by training in the following way: acquiring the labeled speech feature information, labeled waveform point information, and labeled Mel spectrum information corresponding to text training samples; and directly jointly training the acoustic sub-model and the vocoder by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the labeled waveform point information as the target output of the flow-based generation model Flow, using the labeled Mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow, to obtain the speech synthesis model.
  • the method further includes synthesizing the audio information with background music.
  • the present disclosure provides a speech synthesis model training method
  • the speech synthesis model includes an acoustic sub-model and a vocoder
  • the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and A flow-based reversible generative model Glow
  • the vocoder is a flow-based generative model Flow
  • the method includes: acquiring labeled speech feature information, labeled waveform point information, and labeled Mel spectrum information corresponding to text training samples; and directly jointly training the acoustic sub-model and the vocoder by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the labeled waveform point information as the target output of the flow-based generation model Flow, using the labeled Mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow, to obtain the speech synthesis model.
  • the present disclosure provides a speech synthesis device, comprising: a first acquisition module for acquiring speech feature information corresponding to text to be synthesized; a speech synthesis module for inputting the speech feature information into the speech synthesis model to obtain the predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder; and an expansion module for performing μ-law expansion on the predicted waveform point information obtained by the speech synthesis module to obtain the audio information corresponding to the text to be synthesized.
  • the present disclosure provides a speech synthesis model training device, the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and A flow-based reversible generation model Glow, and the vocoder is a flow-based generation model Flow;
  • the device includes: a second acquisition module, configured to acquire labeled speech feature information, labeled waveform point information, and labeled Mel spectrum information corresponding to text training samples; and a training module, configured to directly jointly train the acoustic sub-model and the vocoder by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the labeled waveform point information as the target output of the flow-based generation model Flow, using the labeled Mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow, so as to obtain the speech synthesis model.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the method provided in the first aspect or the second aspect of the present disclosure.
  • the present disclosure provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device, so as to implement the method provided in the first aspect of the present disclosure.
  • the present disclosure provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device, so as to implement the method provided in the second aspect of the present disclosure.
  • the present disclosure provides a computer program, comprising: instructions that, when executed by a processor, cause the processor to perform the method provided by the first aspect or the second aspect of the present disclosure.
  • the present disclosure provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform the method provided by the first or second aspect of the present disclosure.
  • Fig. 1 is a flow chart of a speech synthesis method according to an exemplary embodiment.
  • Fig. 2 is a schematic diagram of a speech synthesis process according to an exemplary embodiment.
  • Fig. 3 is a flowchart of a method for training a speech synthesis model according to an exemplary embodiment.
  • Fig. 4 is a schematic diagram illustrating a training process of a speech synthesis model according to an exemplary embodiment.
  • Fig. 5 is a flow chart of a speech synthesis method according to another exemplary embodiment.
  • Fig. 6 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment.
  • Fig. 7 is a block diagram of an apparatus for training a speech synthesis model according to an exemplary embodiment.
  • Fig. 8 is a block diagram of an electronic device according to an exemplary embodiment.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • the inventors of the present disclosure found that, in the related art, when the acoustic sub-model and the vocoder cooperate to perform speech synthesis, the speech synthesis speed is slow, and the phenomenon of error accumulation is prone to occur, thereby affecting the accuracy of speech synthesis.
  • the acoustic features extracted by the acoustic sub-model may not have universality, so that the audio information generated based on the acoustic features cannot be adapted to special pronunciation requirements, for example, a high-pitched female voice or a low-pitched male voice.
  • the present disclosure provides a speech synthesis method.
  • the speech synthesis method will be described in detail below with reference to the accompanying drawings.
  • Fig. 1 is a flow chart of a speech synthesis method according to an exemplary embodiment. Wherein, as shown in FIG. 1 , the method includes steps S101 to S103.
  • step S101 the speech feature information corresponding to the text to be synthesized is acquired.
  • the text to be synthesized may be a tonal language such as Chinese, Vietnamese, Uyghur, Thai, and Qiang.
  • the speech feature information can be used to represent phoneme, intonation, pause and other related information of the text to be synthesized.
  • the text to be synthesized may be various types of texts such as novels and lyrics.
  • step S102 the speech feature information is input into the speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized.
  • the speech synthesis model includes an acoustic sub-model and a vocoder, wherein the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder.
  • in step S103, μ-law expansion is performed on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized.
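  • As an illustration of the μ-law expansion in step S103, the following is a minimal numpy sketch; the 8-bit quantization (256 waveform point levels, mu=255) is an assumption, since the bit depth is not stated in the text.

```python
import numpy as np

def mu_law_expand(quantized, mu=255):
    """Expand mu-law companded waveform point indices back to a linear waveform.

    `quantized` holds integer waveform point indices in [0, mu]; the result is a
    float waveform in [-1, 1]. The 8-bit setting (mu=255) is an assumption.
    """
    # Map integer indices to the companded range [-1, 1].
    y = 2.0 * np.asarray(quantized, dtype=np.float64) / mu - 1.0
    # Invert the mu-law compression curve.
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

# Example: expand predicted waveform point indices into audio samples.
audio = mu_law_expand(np.array([0, 64, 128, 192, 255]))
```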
  • through the speech synthesis model, the predicted waveform point information can be obtained directly from the speech feature information corresponding to the text to be synthesized, and the audio information corresponding to the text to be synthesized can then be obtained by simple μ-law expansion of the predicted waveform point information, without the cooperation of a separately trained acoustic sub-model and vocoder. This improves the efficiency of speech synthesis, and effectively reduces the error accumulation caused in the related art by training the acoustic sub-model and the vocoder separately, thereby improving the accuracy of speech synthesis.
  • the predicted waveform point information can be directly generated according to the speech feature information corresponding to the text to be synthesized, without involving the acoustic features, it can avoid the fact that the generated audio information cannot meet the special pronunciation requirements due to the lack of universality of the acoustic features. problem, thereby improving the effect of speech synthesis.
  • the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder, so the model training period can be shortened and the prosody fidelity of the speech synthesis model is better; joint training also guarantees the matching degree between the acoustic sub-model and the vocoder within the model, which avoids the problem that the speech synthesis result is of low accuracy even when the acoustic sub-model and the vocoder are each highly accurate on their own, and thus further improves the accuracy of speech synthesis.
  • the speech feature information may include phonemes, tones, participles, and prosodic boundaries.
  • a phoneme is the smallest phonetic unit divided according to the natural properties of speech; it is analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme. Phonemes are divided into two categories: vowels and consonants.
  • phonemes include initials (consonants that are used in front of finals and, together with a final, form a complete syllable) and finals (i.e., vowels).
  • Tone is the change in the pitch of a sound.
  • Prosodic boundaries are used to indicate where to pause when reading text. For example, the prosodic boundary is divided into four pause levels "#1", "#2", “#3" and "#4", and the degrees of pauses increase in sequence.
  • the above-mentioned voice feature information can be acquired in various ways.
  • the voice feature information corresponding to the text to be synthesized can be marked in advance by the user and stored in a corresponding storage module, so that the voice feature information corresponding to the text to be synthesized can be obtained by accessing the storage module.
  • the text to be synthesized can be input into the information extraction model, and the speech feature information corresponding to the text to be synthesized can be obtained, which is convenient and quick, and does not require manual participation, thus saving manpower.
  • the information extraction model may include a Text Normalization (TN) model, a Grapheme-to-Phoneme (G2P) model, a word segmentation model, and a prosody model.
  • the numbers, symbols, abbreviations, etc. in the text to be synthesized can be converted into language characters through the TN model;
  • the phonemes in the text to be synthesized can be obtained through the G2P model;
  • the text to be synthesized can be segmented into words through the word segmentation model;
  • the prosodic boundaries and tones of the text to be synthesized can be obtained through the prosody model.
  • the G2P model can use a Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) to realize the conversion from graphemes to phonemes.
  • the word segmentation model can be an n-gram model, a hidden Markov model, a naive Bayes classification model, and the like.
  • the prosody model may be, for example, a pre-trained language model such as BERT (Bidirectional Encoder Representation from Transformers), a bidirectional LSTM-CRF (Conditional Random Field) model, etc.
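  • As a sketch of how the four sub-models named above could be composed into a single information extraction step; the callable interfaces and the toy placeholders are assumptions made for illustration, since the text only names the model types.

```python
def extract_speech_features(text, tn_model, g2p_model, segmentation_model, prosody_model):
    """Compose TN, G2P, word segmentation, and prosody models (hypothetical interfaces)."""
    normalized = tn_model(text)                   # numbers, symbols, abbreviations -> language characters
    phonemes = g2p_model(normalized)              # graphemes -> phonemes
    participles = segmentation_model(normalized)  # word segmentation
    prosodic_boundaries, tones = prosody_model(normalized)
    return {
        "phonemes": phonemes,
        "tones": tones,
        "participles": participles,
        "prosodic_boundaries": prosodic_boundaries,
    }

# Toy usage with placeholder callables standing in for the real models.
features = extract_speech_features(
    "On Jan. 1 we met.",
    tn_model=lambda t: t.replace("Jan. 1", "January first"),
    g2p_model=lambda t: t.lower().split(),
    segmentation_model=lambda t: t.split(),
    prosody_model=lambda t: (["#1", "#3"], ["neutral"]),
)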
  • in this way, the audio information corresponding to the text to be synthesized can be paused according to the text content and word segmentation of the text to be synthesized, which improves the accuracy and intelligibility of the audio information and allows the user to quickly understand the text content corresponding to the audio information.
  • in addition, since the speech can be paused at natural prosodic boundaries during speech synthesis, the naturalness and fluency of the audio information corresponding to the text to be synthesized are improved.
  • the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling (Gaussian sampling) module, a linear processing module, and a flow-based reversible generative model Glow.
  • the encoding network is used to generate a representation sequence corresponding to the text to be synthesized according to the speech feature information corresponding to the text to be synthesized;
  • the duration sub-model is used to obtain the duration feature information corresponding to the text to be synthesized according to the speech feature information corresponding to the text to be synthesized;
  • the Gaussian sampling module is used to generate a fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration feature information;
  • the linear processing module is used to perform linear transformation on the fixed-length semantic representation to obtain the first Mel spectrum information corresponding to the text to be synthesized; the flow-based reversible generation model Glow is used to generate the second Mel spectrum information according to the standard normal distribution; the vocoder is used to generate the predicted waveform point information corresponding to the text to be synthesized according to the first Mel spectrum information and the second Mel spectrum information.
  • the duration feature information includes the number of speech frames corresponding to each phoneme in the text to be synthesized.
  • the representation sequence is formed by arranging the codes of each phoneme in the text to be synthesized according to the sequence of the corresponding phonemes in the text to be synthesized.
  • for example, if the phoneme sequence corresponding to the text to be synthesized is "ab", wherein the phoneme "a" is encoded as "A" and the phoneme "b" is encoded as "B", then the representation sequence corresponding to the text to be synthesized is "AB".
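  • The text does not spell out the internal sampling procedure of the Gaussian sampling module. Purely as an illustrative sketch, the snippet below builds a frame-level, fixed-length representation by repeating each phoneme encoding for its number of speech frames from the duration sub-model and perturbing each frame with a Gaussian sample; the perturbation and all parameter values are assumptions.

```python
import numpy as np

def frame_level_representation(phoneme_encodings, frame_counts, noise_std=1.0, seed=0):
    """Expand phoneme-level encodings to a frame-level semantic representation.

    phoneme_encodings: (num_phonemes, dim) array, i.e., the representation sequence.
    frame_counts: speech frames per phoneme, as predicted by the duration sub-model.
    The Gaussian perturbation is an assumption used purely for illustration.
    """
    rng = np.random.default_rng(seed)
    # Repeat each phoneme encoding according to its duration in frames.
    expanded = np.repeat(phoneme_encodings, frame_counts, axis=0)
    # Draw each frame from a Gaussian centered on its phoneme encoding.
    return expanded + rng.normal(0.0, noise_std, size=expanded.shape)

# "ab" -> encodings "A" and "B", lasting 3 and 2 speech frames respectively.
encodings = np.array([[0.1, 0.2], [0.3, 0.4]])
semantic = frame_level_representation(encodings, frame_counts=[3, 2])
print(semantic.shape)  # (5, 2): one row per speech frame
```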
  • the encoding network may include a pre-processing network (Pre-net) sub-model and a Transformer sub-model.
  • the speech feature information obtained after nonlinear transformation is used to obtain the representation sequence corresponding to the text to be synthesized.
  • the duration sub-model can be, for example, a CBHG model, a Long Short Term Memory Network (LSTM) model, an LSTM-RNN (Recurrent Neural Network, Recurrent Neural Network) model, a Deep Neural Network (Deep Neural Networks, DNN) model, Transformer model, flow-based generative model Flow, etc.
  • the duration sub-model can adopt the flow-based generative model Flow suitable for modeling uncertain information in speech, so as to further improve the accuracy of speech synthesis.
  • the above-mentioned duration sub-model can determine the number of speech frames corresponding to each phoneme in the text to be synthesized through the following steps: (1) obtain the pronunciation duration of each phoneme in the text to be synthesized; (2) determine, according to the pronunciation duration of each phoneme, the number of speech frames corresponding to that phoneme.
  • for example, if the pronunciation duration of a phoneme is 200 ms and the duration of a speech frame is 5 ms, then the number of speech frames corresponding to the phoneme is 40.
  • for another example, if the pronunciation duration of a phoneme is 203 ms and the duration of a speech frame is 5 ms, then the number of speech frames corresponding to the phoneme is 41; that is, if the last slice is less than 5 ms, it is still processed as one frame.
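  • The arithmetic in the two examples above is simply a ceiling division over the frame length; a one-function sketch (the 5 ms frame length is taken from the examples):

```python
import math

def num_speech_frames(phoneme_duration_ms, frame_duration_ms=5):
    """Speech frames covering a phoneme; a final slice shorter than one frame
    still counts as a whole frame (ceiling division)."""
    return math.ceil(phoneme_duration_ms / frame_duration_ms)

assert num_speech_frames(200) == 40  # exactly forty 5 ms frames
assert num_speech_frames(203) == 41  # the trailing 3 ms is processed as one frame
```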
  • the flow-based reversible generative model Glow includes a compression layer, an activation normalization (Activation Normalization, ActNorm) layer, a reversible 1*1 convolutional layer, and an affine coupling layer.
  • the ActNorm layer is used to normalize the data, using the scaling and bias parameters of each channel for activation, so that the small batch of data has zero mean and unit variance after activation, which is equivalent to preprocessing the data to avoid the model performance degradation.
  • the reversible 1*1 convolution layer uses matrix multiplication to shuffle the data of each dimension, so that the information is mixed more fully.
  • the affine coupling layer realizes the reversible transformation of the data through the reversible function.
  • the standard normal distribution is input into the affine coupling layer to obtain the inversely transformed data; after that, the inversely transformed data is input into the reversible 1*1 convolutional layer to scramble it; next, the scrambled data is input into the ActNorm layer for data normalization, and it is then decompressed by the compression layer to obtain the second Mel spectrum information, which is output to the vocoder.
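  • For intuition only, the toy sketch below mirrors the inverse pass just described on random data: an affine coupling inverse, a fixed permutation standing in for the reversible 1*1 convolution, and an ActNorm inverse. The real Glow layers have learned parameters and operate on compressed Mel spectrum channels, so every constant and transform here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy constants standing in for learned Glow parameters (assumptions for illustration).
ACTNORM_SCALE, ACTNORM_BIAS = 1.5, 0.2
PERM = np.array([1, 0, 3, 2])          # stand-in for the reversible 1*1 convolution
INV_PERM = np.argsort(PERM)

def coupling_inverse(z):
    """Invert a toy affine coupling: the untouched half conditions an affine map on the other half."""
    z_a, z_b = np.split(z, 2, axis=-1)
    log_s, t = np.tanh(z_a), 0.5 * z_a  # toy stand-in for the coupling network
    x_b = (z_b - t) * np.exp(-log_s)
    return np.concatenate([z_a, x_b], axis=-1)

def glow_inverse(z):
    """Inverse pass: affine coupling -> permutation ("1*1 conv") -> ActNorm inverse."""
    h = coupling_inverse(z)
    h = h[..., INV_PERM]
    return (h - ACTNORM_BIAS) / ACTNORM_SCALE  # undo ActNorm's scale and bias

# Sample the standard normal distribution and map it toward toy "Mel spectrum" data.
z = rng.standard_normal((3, 4))
mel_like = glow_inverse(z)
```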
  • the vocoder can be a flow-based generative model Flow.
  • the training method can be implemented through steps S301 and S302 shown in FIG. 3 .
  • step S301 the labeled speech feature information, the labeled waveform point information, and the labeled mel spectrum information corresponding to the text training samples are obtained.
  • the labeled waveform point information corresponding to the text training sample can be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
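  • A minimal sketch of that μ-law compression, mirroring the expansion shown earlier and again assuming 8-bit quantization:

```python
import numpy as np

def mu_law_compress(audio, mu=255):
    """Compress a waveform in [-1, 1] into integer waveform point labels in [0, mu]."""
    audio = np.asarray(audio, dtype=np.float64)
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Quantize the companded signal to integer waveform point indices.
    return np.round((companded + 1.0) / 2.0 * mu).astype(np.int64)
```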
  • in step S302, the acoustic sub-model and the vocoder are directly jointly trained by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the labeled waveform point information as the target output of the flow-based generation model Flow, using the labeled Mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow, so as to obtain the speech synthesis model.
  • the labeled speech feature information corresponding to the text training samples can be input into the encoding network to obtain the representation sequence corresponding to the text training samples, and at the same time the labeled speech feature information corresponding to the text training samples can be input into the duration sub-model to obtain the duration feature information corresponding to the text training samples; then, the representation sequence and the duration feature information corresponding to the text training samples are input into the Gaussian sampling module to obtain the fixed-length semantic representation corresponding to the text training samples; the semantic representation is input into the linear processing module to perform linear transformation and generate the predicted Mel spectrum information corresponding to the text training samples; then, the predicted Mel spectrum information is input into the vocoder (i.e., the flow-based generation model Flow) to obtain the predicted waveform point information corresponding to the text training samples; finally, according to the comparison result between the predicted waveform point information corresponding to the text training samples and the labeled waveform point information, the model parameters of each module in the acoustic sub-model other than the flow-based reversible generation model Glow are updated.
  • meanwhile, the labeled Mel spectrum information corresponding to the text training samples is input into the compression layer of the flow-based reversible generation model Glow to compress the labeled Mel spectrum information; after that, the compressed data is input into the ActNorm layer for data normalization; next, the normalized data is input into the reversible 1*1 convolutional layer to scramble it; then, the scrambled data is input into the affine coupling layer for data inverse transformation to obtain a simulated normal distribution; finally, the model parameters of the flow-based reversible generation model Glow can be updated according to the comparison result between the simulated normal distribution and the standard normal distribution.
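  • The text does not name the loss functions behind these two comparisons. One plausible reading, shown below purely as an assumption, treats the waveform point comparison as a cross-entropy over quantized waveform point classes and the Glow comparison as a standard-normal negative log-likelihood (omitting the flow's log-determinant term for brevity), with both terms minimized jointly.

```python
import numpy as np

def waveform_point_loss(predicted_logits, labeled_points):
    """Cross-entropy between predicted waveform point distributions and the labels
    (an assumed formulation; the text only says the two are compared)."""
    shifted = predicted_logits - predicted_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labeled_points)), labeled_points])

def glow_normal_loss(simulated_z):
    """Negative log-likelihood of Glow's output under N(0, 1), pushing the simulated
    distribution toward the standard normal (log-determinant term omitted)."""
    return np.mean(0.5 * simulated_z ** 2 + 0.5 * np.log(2.0 * np.pi))

def joint_loss(predicted_logits, labeled_points, simulated_z, glow_weight=1.0):
    # Joint training: both comparisons contribute to one objective.
    return (waveform_point_loss(predicted_logits, labeled_points)
            + glow_weight * glow_normal_loss(simulated_z))
```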
  • the above method may further include step S104.
  • step S104 the audio information and the background music are synthesized.
  • the above background music may be preset music, that is, any music set by the user, or default music.
  • the usage scene information corresponding to the text to be synthesized may be determined according to the text information and/or speech feature information of the text to be synthesized, wherein the usage scene information includes but is not limited to news broadcasts, military introductions, fairy tales, campus broadcasts, etc.; then, according to the usage scene information, the background music that matches the usage scene information is determined.
  • the above-mentioned usage scenario information may be determined in various ways.
  • the usage scene information corresponding to the text to be synthesized may be determined according to the text information of the text to be synthesized, wherein the above text information may be a keyword.
  • automatic keyword recognition can be performed on the text to be synthesized, so as to intelligently predict the usage scene information of the text to be synthesized according to the keywords.
  • the usage scene information corresponding to the text to be synthesized may be determined according to the speech feature information of the text to be synthesized.
  • specifically, scene description words can be identified from the word segmentations in the speech feature information determined in the above step S101, for example by matching each word segmentation against a pre-stored scene description word table, and the usage scene information of the text to be synthesized can then be determined according to the scene description words.
  • the usage scene information corresponding to the text to be synthesized may also be determined according to both the text information and the speech feature information of the text to be synthesized. Specifically, automatic keyword recognition can be performed on the text to be synthesized, scene description words can be identified from the word segmentations in the speech feature information determined in the above step S101, and the usage scene information of the text to be synthesized can then be jointly determined according to the keywords and the scene description words. In this way, the determination accuracy of the usage scene information can be improved.
  • the background music that matches the usage scene information corresponding to the text to be synthesized can be determined according to the usage scene information by using a pre-stored correspondence between usage scene information and background music. For example, if the usage scene information is a military introduction, the corresponding background music may be exciting music; if the usage scene information is fairy tales, the corresponding background music may be lively music.
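  • The pre-stored correspondence between usage scene information and background music can be as simple as a lookup table; the scene keys and file names below are hypothetical placeholders.

```python
# Hypothetical pre-stored correspondence between usage scenes and background music files.
SCENE_TO_MUSIC = {
    "news_broadcast": "calm_news_bed.wav",
    "military_introduction": "exciting_march.wav",
    "fairy_tale": "lively_tune.wav",
    "campus_broadcast": "light_pop.wav",
}

def pick_background_music(scene, default="default_music.wav"):
    """Return the background music matching the usage scene information."""
    return SCENE_TO_MUSIC.get(scene, default)
```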
  • the present disclosure also provides a speech synthesis model training method, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generation model Glow, and the vocoder is a flow-based generation model Flow.
  • the speech synthesis model can be trained through steps S301 and S302 shown in FIG. 3 .
  • step S301 the labeled speech feature information, the labeled waveform point information, and the labeled mel spectrum information corresponding to the text training samples are obtained.
  • the labeled waveform point information corresponding to the text training sample can be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
  • in step S302, the acoustic sub-model and the vocoder are directly jointly trained by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the labeled waveform point information as the target output of the flow-based generation model Flow, using the labeled Mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow, so as to obtain the speech synthesis model.
  • the labeled speech feature information corresponding to the text training samples can be input into the encoding network to obtain the representation sequence corresponding to the text training samples, and at the same time the labeled speech feature information corresponding to the text training samples can be input into the duration sub-model to obtain the duration feature information corresponding to the text training samples; then, the representation sequence and the duration feature information corresponding to the text training samples are input into the Gaussian sampling module to obtain the fixed-length semantic representation corresponding to the text training samples; the semantic representation is input into the linear processing module to perform linear transformation and generate the predicted Mel spectrum information corresponding to the text training samples; then, the predicted Mel spectrum information is input into the vocoder (i.e., the flow-based generation model Flow) to obtain the predicted waveform point information corresponding to the text training samples; finally, according to the comparison result between the predicted waveform point information corresponding to the text training samples and the labeled waveform point information, the model parameters of each module in the acoustic sub-model other than the flow-based reversible generation model Glow are updated.
  • meanwhile, the labeled Mel spectrum information corresponding to the text training samples is input into the compression layer of the flow-based reversible generation model Glow to compress the labeled Mel spectrum information; after that, the compressed data is input into the ActNorm layer for data normalization; next, the normalized data is input into the reversible 1*1 convolutional layer to scramble it; then, the scrambled data is input into the affine coupling layer for data inverse transformation to obtain a simulated normal distribution; finally, the model parameters of the flow-based reversible generation model Glow can be updated according to the comparison result between the simulated normal distribution and the standard normal distribution.
  • Fig. 6 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment.
  • the device 600 includes: a first acquisition module 601 for acquiring speech feature information corresponding to the text to be synthesized; a speech synthesis module 602 for inputting the speech feature information acquired by the first acquisition module 601 into the speech synthesis model to obtain the predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder; and an expansion module 603 for performing μ-law expansion on the predicted waveform point information obtained by the speech synthesis module 602 to obtain the audio information corresponding to the text to be synthesized.
  • the text to be synthesized may be a tonal language such as Chinese, Vietnamese, Uyghur, Thai, and Qiang.
  • the speech feature information can be used to represent phoneme, intonation, pause and other related information of the text to be synthesized.
  • the text to be synthesized may be various types of texts such as novels and lyrics.
  • through the speech synthesis model, the predicted waveform point information can be obtained directly from the speech feature information corresponding to the text to be synthesized, and the audio information corresponding to the text to be synthesized can then be obtained by simple μ-law expansion of the predicted waveform point information, without the cooperation of a separately trained acoustic sub-model and vocoder. This improves the efficiency of speech synthesis, and effectively reduces the error accumulation caused in the related art by training the acoustic sub-model and the vocoder separately, thereby improving the accuracy of speech synthesis.
  • the predicted waveform point information can be directly generated according to the speech feature information corresponding to the text to be synthesized, without involving the acoustic features, it can avoid the fact that the generated audio information cannot meet the special pronunciation requirements due to the lack of universality of the acoustic features. problem, thereby improving the effect of speech synthesis.
  • the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder, so the model training period can be shortened and the prosody fidelity of the speech synthesis model is better; joint training also guarantees the matching degree between the acoustic sub-model and the vocoder within the model, which avoids the problem that the speech synthesis result is of low accuracy even when the acoustic sub-model and the vocoder are each highly accurate on their own, and thus further improves the accuracy of speech synthesis.
  • the acoustic submodel includes an encoding network, a duration submodel, a Gaussian sampling module, a linear processing module, and a flow-based reversible generative model Glow.
  • the encoding network is configured to generate a representation sequence corresponding to the text to be synthesized according to the speech feature information, wherein the representation sequence is formed by arranging the codes of the phonemes in the text to be synthesized according to the order of the corresponding phonemes in the text to be synthesized.
  • the duration sub-model is used to obtain duration characteristic information corresponding to the text to be synthesized according to the speech characteristic information, wherein the duration characteristic information includes the number of speech frames corresponding to each phoneme in the to-be-synthesized text.
  • the Gaussian sampling module is configured to generate a fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration feature information.
  • the linear processing module is configured to perform linear transformation on the semantic representation to obtain first mel spectrum information corresponding to the text to be synthesized.
  • the flow-based reversible generation model Glow is used to generate the second mel spectrum information according to a standard normal distribution.
  • the vocoder is configured to generate predicted waveform point information corresponding to the text to be synthesized according to the first mel spectrum information and the second mel spectrum information.
  • the vocoder is a flow-based generative model Flow.
  • the speech synthesis model is obtained through training by a speech synthesis model training apparatus.
  • the speech synthesis model training apparatus 700 includes: a second acquisition module 701 for acquiring labeled speech feature information, labeled waveform point information, and labeled Mel spectrum information corresponding to the text training samples; and a training module 702 for directly jointly training the acoustic sub-model and the vocoder by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the labeled waveform point information as the target output of the flow-based generation model Flow, using the labeled Mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow, so as to obtain the speech synthesis model.
  • the speech feature information includes phonemes, tones, participles, and prosodic boundaries.
  • the first acquisition module is used to input the text to be synthesized into an information extraction model to obtain the speech feature information corresponding to the text to be synthesized.
  • the apparatus 600 further includes: a background music synthesis module, configured to synthesize the audio information obtained by the expansion module 603 and the background music.
  • the present disclosure also provides an apparatus for training a speech synthesis model, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generation model Glow, and the vocoder is a flow-based generation model Flow. As shown in FIG. 7, the apparatus 700 includes: a second acquisition module 701 for acquiring labeled speech feature information, labeled waveform point information, and labeled Mel spectrum information corresponding to the text training samples; and a training module 702 for directly jointly training the acoustic sub-model and the vocoder by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the labeled waveform point information as the target output of the flow-based generation model Flow, using the labeled Mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow, so as to obtain the speech synthesis model.
  • the labeled waveform point information corresponding to the text training sample can be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
  • the labeled speech feature information corresponding to the text training samples can be input into the encoding network to obtain the representation sequence corresponding to the text training samples, and at the same time the labeled speech feature information corresponding to the text training samples can be input into the duration sub-model to obtain the duration feature information corresponding to the text training samples; then, the representation sequence and the duration feature information corresponding to the text training samples are input into the Gaussian sampling module to obtain the fixed-length semantic representation corresponding to the text training samples; the semantic representation is input into the linear processing module to perform linear transformation and generate the predicted Mel spectrum information corresponding to the text training samples; then, the predicted Mel spectrum information is input into the vocoder (i.e., the flow-based generation model Flow) to obtain the predicted waveform point information corresponding to the text training samples; finally, according to the comparison result between the predicted waveform point information corresponding to the text training samples and the labeled waveform point information, the model parameters of each module in the acoustic sub-model other than the flow-based reversible generation model Glow are updated.
  • meanwhile, the labeled Mel spectrum information corresponding to the text training samples is input into the compression layer of the flow-based reversible generation model Glow to compress the labeled Mel spectrum information; after that, the compressed data is input into the ActNorm layer for data normalization; next, the normalized data is input into the reversible 1*1 convolutional layer to scramble it; then, the scrambled data is input into the affine coupling layer for data inverse transformation to obtain a simulated normal distribution; finally, the model parameters of the flow-based reversible generation model Glow can be updated according to the comparison result between the simulated normal distribution and the standard normal distribution.
  • the above-mentioned speech synthesis model training apparatus 700 may be integrated into the speech synthesis apparatus 600, or may be independent of the speech synthesis apparatus 600, which is not specifically limited in the present disclosure.
  • the specific manner in which each module performs operations has been described in detail in the embodiments of the method, and will not be described in detail here.
  • the present disclosure also provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, implements the steps of the above-mentioned speech synthesis method or the steps of the speech synthesis model training method provided by the present disclosure.
  • the computer-readable medium is a non-transitory computer-readable medium.
  • referring to FIG. 8, it shows a schematic structural diagram of an electronic device (e.g., a terminal device or a server) 800 suitable for implementing an embodiment of the present disclosure.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), vehicle-mounted terminals (eg, mobile terminals such as in-vehicle navigation terminals), etc., and stationary terminals such as digital TVs, desktop computers, and the like.
  • the electronic device shown in FIG. 8 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • an electronic device 800 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 801, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored.
  • the processing device 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804.
  • An input/output (I/O) interface 805 is also connected to bus 804 .
  • the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 807 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 808 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 809.
  • Communication means 809 may allow electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 8 shows an electronic device 800 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 809, or from the storage device 808, or from the ROM 802.
  • the processing device 801 the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the client and the server can communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: acquires the speech feature information corresponding to the text to be synthesized; inputs the speech feature information into the speech synthesis model to obtain the predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder; and performs μ-law expansion on the predicted waveform point information to obtain the audio information corresponding to the text to be synthesized.
  • alternatively, the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device trains the speech synthesis model, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generation model Glow, and the vocoder is a flow-based generation model Flow; the electronic device acquires labeled speech feature information, labeled waveform point information, and labeled Mel spectrum information corresponding to text training samples, and directly jointly trains the acoustic sub-model and the vocoder by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the labeled waveform point information as the target output of the flow-based generation model Flow, using the labeled Mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow, so as to obtain the speech synthesis model.
  • computer program code for performing the operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented in dedicated hardware-based systems that perform the specified functions or operations , or can be implemented in a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware. Wherein, the name of the module does not constitute a limitation of the module itself under certain circumstances, for example, the first acquisition module may also be described as "a module for acquiring speech feature information corresponding to the text to be synthesized".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • RAM random access memory
  • ROM read only memory
  • EPROM or flash memory erasable programmable read only memory
  • CD-ROM compact disk read only memory
  • Example 1 provides a speech synthesis method, including: obtaining speech feature information corresponding to the text to be synthesized; inputting the speech feature information into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder; and performing μ-law expansion on the predicted waveform point information to obtain the audio information corresponding to the text to be synthesized.
  • Example 2 provides the method of Example 1, the acoustic sub-model including an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generative model Glow;
  • the encoding network is used to generate a representation sequence corresponding to the text to be synthesized according to the speech feature information, wherein the representation sequence is formed by arranging the encoding of each phoneme in the text to be synthesized according to the order of the corresponding phonemes in the text to be synthesized.
  • the duration sub-model is used to obtain the duration feature information corresponding to the text to be synthesized according to the speech feature information, wherein the duration feature information includes the number of speech frames corresponding to each phoneme in the text to be synthesized.
  • Example 3 provides the method of Example 2, where the vocoder is a flow-based generation model Flow; the speech synthesis model is obtained by training in the following manner: acquiring the marked speech feature information, marked waveform point information, and marked mel spectrum information corresponding to text training samples; and directly jointly training the acoustic sub-model and the vocoder, to obtain the speech synthesis model, by using the marked speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the marked waveform point information as the target output of the flow-based generation model Flow, using the marked mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow.
  • Example 4 provides the method of any one of Examples 1-3, wherein the speech feature information includes phonemes, tones, word segmentation, and prosodic boundaries; the acquiring speech corresponding to the text to be synthesized
  • the feature information includes: inputting the text to be synthesized into an information extraction model to obtain speech feature information corresponding to the to-be-synthesized text.
  • Example 5 provides the method of any one of Examples 1-3, the method further comprising: synthesizing the audio information with background music.
  • Example 6 provides a speech synthesis model training method, where the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generation model Glow, and the vocoder is a flow-based generation model Flow; the method includes: acquiring marked speech feature information, marked waveform point information, and marked mel spectrum information corresponding to text training samples; and directly jointly training the acoustic sub-model and the vocoder, to obtain the speech synthesis model, by using the marked speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the marked waveform point information as the target output of the flow-based generation model Flow, using the marked mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow.
  • Example 7 provides a speech synthesis apparatus, including: a first acquisition module for acquiring speech feature information corresponding to the text to be synthesized; a speech synthesis module for inputting the speech feature information acquired by the first acquisition module into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder; and an expansion module for performing μ-law expansion on the predicted waveform point information obtained by the speech synthesis module to obtain the audio information corresponding to the text to be synthesized.
  • Example 8 provides an apparatus for training a speech synthesis model, where the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generation model Glow, and the vocoder is a flow-based generation model Flow; the apparatus includes: a second acquisition module for acquiring marked speech feature information, marked waveform point information, and marked mel spectrum information corresponding to text training samples; and a training module for directly jointly training the acoustic sub-model and the vocoder, to obtain the speech synthesis model, by using the marked speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generation model Flow, using the marked waveform point information as the target output of the flow-based generation model Flow, using the marked mel spectrum information as the input of the flow-based reversible generation model Glow, and using the standard normal distribution as the target output of the flow-based reversible generation model Glow.
  • Example 9 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the method of any one of Examples 1-6.
  • Example 10 provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to Implement the method of any one of Examples 1-5.
  • Example 11 provides an electronic device, including: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to Implement the method described in Example 6.
  • Example 12 provides a computer program comprising instructions that, when executed by a processor, cause the processor to perform the method of any one of Examples 1-6 .
  • Example 13 provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform the method of any of Examples 1-6 .

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech synthesis method and apparatus (600), a synthesis model training method and apparatus (700), a medium, and a device (800). The speech synthesis method comprises: obtaining speech feature information corresponding to text to be synthesized (S101); inputting the speech feature information into a speech synthesis model to obtain predicted waveform point information corresponding to said text (S102), the speech synthesis model comprising an acoustic submodel and a vocoder, and the speech synthesis model being obtained by directly performing joint training on the acoustic submodel and the vocoder; and performing μ-law expansion on the predicted waveform point information to obtain audio information (S103). In this way, the efficiency of speech synthesis can be improved, the error accumulation produced in the related art by training an acoustic submodel and a vocoder separately is effectively reduced, and the accuracy of speech synthesis is improved. In addition, the problem that the generated audio information cannot adapt to special pronunciation requirements because acoustic features lack universality can also be avoided, and the speech synthesis effect is improved. Additionally, the model training period is short, and the prosody fidelity of the model is better.

Description

语音合成方法、合成模型训练方法、装置、介质及设备Speech synthesis method, synthesis model training method, apparatus, medium and equipment
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本申请是以申请号为202110042176.3,申请日为2021年1月13日的中国申请为基础,并主张其优先权,该中国申请的公开内容在此作为整体引入本申请中。This application is based on the Chinese application with the application number of 202110042176.3 and the filing date of January 13, 2021, and claims its priority. The disclosure of the Chinese application is hereby incorporated into this application as a whole.
技术领域technical field
本公开涉及语音合成技术领域,具体地,涉及一种语音合成方法、合成模型训练方法、装置、介质及设备。The present disclosure relates to the technical field of speech synthesis, and in particular, to a speech synthesis method, a synthesis model training method, an apparatus, a medium, and a device.
背景技术Background technique
在相关技术中,在进行语音合成时,通常先通过声学子模型提取待合成文本对应的声学特征(例如,梅尔谱、线性谱、基频等),之后,再利用声码器根据声学特征生成待合成文本对应的音频信息。In the related art, when performing speech synthesis, the acoustic features (eg, Mel spectrum, linear spectrum, fundamental frequency, etc.) corresponding to the text to be synthesized are usually extracted first through the acoustic sub-model, and then the vocoder is used to extract the corresponding acoustic features according to the acoustic features. Generate audio information corresponding to the text to be synthesized.
发明内容SUMMARY OF THE INVENTION
提供该发明内容部分以便以简要的形式介绍构思,这些构思将在后面的具体实施方式部分被详细描述。该发明内容部分并不旨在标识要求保护的技术方案的关键特征或必要特征,也不旨在用于限制所要求的保护的技术方案的范围。This Summary is provided to introduce concepts in a simplified form that are described in detail in the Detailed Description section that follows. This summary section is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides a speech synthesis method, comprising: acquiring speech feature information corresponding to text to be synthesized; inputting the speech feature information into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder; and performing μ-law expansion on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized.
In some embodiments, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generative model Glow; the encoding network is configured to generate, according to the speech feature information, a representation sequence corresponding to the text to be synthesized, wherein the representation sequence is formed by arranging the encoding of each phoneme in the text to be synthesized according to the order of the corresponding phonemes in the text to be synthesized; the duration sub-model is configured to obtain, according to the speech feature information, duration feature information corresponding to the text to be synthesized, wherein the duration feature information includes the number of speech frames corresponding to each phoneme in the text to be synthesized; the Gaussian sampling module is configured to generate, according to the representation sequence and the duration feature information, a fixed-length semantic representation corresponding to the text to be synthesized; the linear processing module is configured to linearly transform the semantic representation to obtain first mel spectrum information corresponding to the text to be synthesized; the flow-based reversible generative model Glow is configured to generate second mel spectrum information according to a standard normal distribution; and the vocoder is configured to generate, according to the first mel spectrum information and the second mel spectrum information, the predicted waveform point information corresponding to the text to be synthesized.
In some embodiments, the vocoder is a flow-based generative model Flow; the speech synthesis model is obtained by training in the following manner: acquiring labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to text training samples; and directly jointly training the acoustic sub-model and the vocoder, to obtain the speech synthesis model, by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generative model Flow, using the labeled waveform point information as the target output of the flow-based generative model Flow, using the labeled mel spectrum information as the input of the flow-based reversible generative model Glow, and using the standard normal distribution as the target output of the flow-based reversible generative model Glow.
在一些实施例中,所述语音特征信息包括音素、声调、分词及韵律边界;所述获取待合成文本对应的语音特征信息,包括:将所述待合成文本输入信息提取模型,得到与所述待合成文本对应的语音特征信息。In some embodiments, the speech feature information includes phonemes, tones, word segmentation, and prosodic boundaries; the acquiring speech feature information corresponding to the text to be synthesized includes: inputting the text to be synthesized into an information extraction model, and obtaining the Speech feature information corresponding to the text to be synthesized.
在一些实施例中,所述方法还包括:将所述音频信息与背景音乐进行合成。In some embodiments, the method further includes synthesizing the audio information with background music.
In a second aspect, the present disclosure provides a speech synthesis model training method, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generative model Glow, and the vocoder is a flow-based generative model Flow; the method includes: acquiring labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to text training samples; and directly jointly training the acoustic sub-model and the vocoder, to obtain the speech synthesis model, by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generative model Flow, using the labeled waveform point information as the target output of the flow-based generative model Flow, using the labeled mel spectrum information as the input of the flow-based reversible generative model Glow, and using the standard normal distribution as the target output of the flow-based reversible generative model Glow.
In a third aspect, the present disclosure provides a speech synthesis apparatus, comprising: a first acquisition module, configured to acquire speech feature information corresponding to text to be synthesized; a speech synthesis module, configured to input the speech feature information acquired by the first acquisition module into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder; and an expansion module, configured to perform μ-law expansion on the predicted waveform point information obtained by the speech synthesis module to obtain audio information corresponding to the text to be synthesized.
In a fourth aspect, the present disclosure provides a speech synthesis model training apparatus, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based reversible generative model Glow, and the vocoder is a flow-based generative model Flow; the apparatus includes: a second acquisition module, configured to acquire labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to text training samples; and a training module, configured to directly jointly train the acoustic sub-model and the vocoder, to obtain the speech synthesis model, by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generative model Flow, using the labeled waveform point information as the target output of the flow-based generative model Flow, using the labeled mel spectrum information as the input of the flow-based reversible generative model Glow, and using the standard normal distribution as the target output of the flow-based reversible generative model Glow.
第五方面,本公开提供一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现本公开第一方面或第二方面提供的所述方法。In a fifth aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the method provided in the first aspect or the second aspect of the present disclosure.
In a sixth aspect, the present disclosure provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device, configured to execute the computer program in the storage device to implement the method provided in the first aspect of the present disclosure.
In a seventh aspect, the present disclosure provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device, configured to execute the computer program in the storage device to implement the method provided in the second aspect of the present disclosure.
第八方面,本公开提供一种计算机程序,包括:指令,所述指令当由处理器执行时使所述处理器执行本公开第一方面或第二方面提供的所述方法。In an eighth aspect, the present disclosure provides a computer program, comprising: instructions that, when executed by a processor, cause the processor to perform the method provided by the first aspect or the second aspect of the present disclosure.
第九方面,本公开提供一种计算机程序产品,包括指令,所述指令当由处理器执 行时使所述处理器执行本公开第一方面或第二方面提供的所述方法。In a ninth aspect, the present disclosure provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform the method provided by the first or second aspect of the present disclosure.
本公开的其他特征和优点将在随后的具体实施方式部分予以详细说明。Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.
附图说明Description of drawings
结合附图并参考以下具体实施方式,本公开各实施例的上述和其他特征、优点及方面将变得更加明显。贯穿附图中,相同或相似的附图标记表示相同或相似的元素。应当理解附图是示意性的,原件和元素不一定按照比例绘制。在附图中:The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent when taken in conjunction with the accompanying drawings and with reference to the following detailed description. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that the originals and elements are not necessarily drawn to scale. In the attached image:
图1是根据一示例性实施例示出的一种语音合成方法的流程图。Fig. 1 is a flow chart of a speech synthesis method according to an exemplary embodiment.
图2是根据一示例性实施例示出的一种语音合成过程的示意图。Fig. 2 is a schematic diagram of a speech synthesis process according to an exemplary embodiment.
图3是根据一示例性实施例示出的一种语音合成模型训练方法的流程图。Fig. 3 is a flowchart of a method for training a speech synthesis model according to an exemplary embodiment.
图4是根据一示例性实施例示出的一种语音合成模型训练过程的示意图。Fig. 4 is a schematic diagram illustrating a training process of a speech synthesis model according to an exemplary embodiment.
图5是根据另一示例性实施例示出的一种语音合成方法的流程图。Fig. 5 is a flow chart of a speech synthesis method according to another exemplary embodiment.
图6是根据一示例性实施例示出的一种语音合成装置的框图。Fig. 6 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment.
图7是根据一示例性实施例示出的一种语音合成模型训练装置的框图。Fig. 7 is a block diagram of an apparatus for training a speech synthesis model according to an exemplary embodiment.
图8是根据一示例性实施例示出的一种电子设备的框图。Fig. 8 is a block diagram of an electronic device according to an exemplary embodiment.
具体实施方式Detailed ways
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are only for exemplary purposes, and are not intended to limit the protection scope of the present disclosure.
应当理解,本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行,和/或并行执行。此外,方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。It should be understood that the various steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
本文使用的术语“包括”及其变形是开放性包括,即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表示“至少一个另外的实施例”;术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。As used herein, the term "including" and variations thereof are open-ended inclusions, ie, "including but not limited to". The term "based on" is "based at least in part on." The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
需要注意,本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分,并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or interdependence.
It should be noted that the modifications of "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
本公开实施方式中的多个装置之间所交互的消息或者信息的名称仅用于说明性的目的,而并不是用于对这些消息或信息的范围进行限制。The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are only for illustrative purposes, and are not intended to limit the scope of these messages or information.
本公开的发明人发现,在相关技术中,声学子模型和声码器在协作进行语音合成时,语音合成的速度较慢,且易出现误差累积的现象,从而影响语音合成的准确度。另外,通过声学子模型提取到的声学特征可能不具备普适性,使得基于该声学特征生成的音频信息不能适配特殊的发音需求,例如,高尖女声或者低沉男声。The inventors of the present disclosure found that, in the related art, when the acoustic sub-model and the vocoder cooperate to perform speech synthesis, the speech synthesis speed is slow, and the phenomenon of error accumulation is prone to occur, thereby affecting the accuracy of speech synthesis. In addition, the acoustic features extracted by the acoustic sub-model may not have universality, so that the audio information generated based on the acoustic features cannot be adapted to special pronunciation requirements, for example, a high-pitched female voice or a low-pitched male voice.
鉴于此,本公开提供了一种语音合成方法。下面结合附图详细描述该语音合成方法。In view of this, the present disclosure provides a speech synthesis method. The speech synthesis method will be described in detail below with reference to the accompanying drawings.
图1是根据一示例性实施例示出的一种语音合成方法的流程图。其中,如图1所示,该方法包括步骤S101~S103。Fig. 1 is a flow chart of a speech synthesis method according to an exemplary embodiment. Wherein, as shown in FIG. 1 , the method includes steps S101 to S103.
在步骤S101中,获取待合成文本对应的语音特征信息。In step S101, the speech feature information corresponding to the text to be synthesized is acquired.
例如,待合成文本可以为诸如中文、藏文、维文、泰文、羌文等有声调的语种。语音特征信息可以用于表征待合成文本的音素、语调、停顿等相关信息。另外,待合成文本可以为小说、歌词等各种类型的文本。For example, the text to be synthesized may be a tonal language such as Chinese, Tibetan, Uyghur, Thai, and Qiang. The speech feature information can be used to represent phoneme, intonation, pause and other related information of the text to be synthesized. In addition, the text to be synthesized may be various types of texts such as novels and lyrics.
在步骤S102中,将语音特征信息输入语音合成模型中,得到与待合成文本对应的预测波形点信息。In step S102, the speech feature information is input into the speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized.
在本公开中,语音合成模型包括声学子模型和声码器,其中,语音合成模型是通过对声学子模型和声码器直接进行联合训练得到的。In the present disclosure, the speech synthesis model includes an acoustic sub-model and a vocoder, wherein the speech synthesis model is obtained by directly jointly training the acoustic sub-model and the vocoder.
在步骤S103中,对预测波形点信息进行μ律扩展,得到待合成文本对应的音频信息。In step S103, μ-law expansion is performed on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized.
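For illustration only, the μ-law expansion of step S103 can be sketched as below. This is a minimal sketch, assuming 8-bit quantized waveform points and the common companding constant μ = 255; the concrete quantization used by the model is not specified by this application.

```python
import numpy as np

def mu_law_expand(points: np.ndarray, mu: int = 255) -> np.ndarray:
    """Map quantized waveform-point indices in [0, mu] back to samples in [-1, 1]."""
    y = 2.0 * points.astype(np.float64) / mu - 1.0             # back to the companded range [-1, 1]
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu   # invert the mu-law curve

# Example: expand a few predicted waveform points into audio samples.
audio = mu_law_expand(np.array([0, 64, 128, 192, 255]))
```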
In the above technical solution, the predicted waveform point information can be obtained by the speech synthesis model directly from the speech feature information corresponding to the text to be synthesized, and the audio information corresponding to the text to be synthesized can then be obtained by simply performing μ-law expansion on the predicted waveform point information, without requiring an acoustic sub-model and a vocoder to cooperate. This improves the efficiency of speech synthesis, effectively reduces the error accumulation produced in the related art by training the acoustic sub-model and the vocoder separately, and improves the accuracy of speech synthesis. In addition, since the predicted waveform point information can be generated directly from the speech feature information corresponding to the text to be synthesized, without involving acoustic features, the problem that the generated audio information cannot adapt to special pronunciation requirements because acoustic features lack universality can be avoided, thereby improving the effect of speech synthesis. Furthermore, the speech synthesis model can be obtained by directly jointly training the acoustic sub-model and the vocoder, which shortens the model training period and gives the speech synthesis model better prosody fidelity; it also guarantees the matching degree between the acoustic sub-model and the vocoder in the speech synthesis model, avoiding the problem that the speech synthesis result is inaccurate even when the acoustic sub-model and the vocoder are each individually accurate, and further improving the accuracy of speech synthesis.
下面针对上述步骤S101中的获取待合成文本对应的语音特征信息的具体实施方式进行详细说明。The specific implementation manner of acquiring the speech feature information corresponding to the text to be synthesized in the above step S101 will be described in detail below.
在本公开中,语音特征信息可以包括音素、声调、分词以及韵律边界。其中,音素是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素;音素分为元音与辅音两大类。示例地,对于中文来说,音素包括声母(声母,是使用在韵母前面的辅音,跟韵母一齐构成的一个完整的音节)和韵母(即元音)。声调是指声音的高低升降的变化。示例地,中文中有四个声调:阴平、阳平、上声和去声。韵律边界用于指示在阅读文本时应该在哪些地方进行停顿。示例地,韵律边界分为“#1”、“#2”、“#3”和“#4”四个停顿等级,其停顿程度依次增大。In the present disclosure, the speech feature information may include phonemes, tones, participles, and prosodic boundaries. Among them, a phoneme is the smallest phonetic unit divided according to the natural properties of speech. It is analyzed according to the pronunciation action in the syllable. An action constitutes a phoneme; phonemes are divided into two categories: vowels and consonants. For example, for Chinese, phonemes include initials (initials, which are consonants used in front of finals, and form a complete syllable together with the finals) and finals (ie, vowels). Tone is the change in the pitch of a sound. For example, there are four tones in Chinese: Yinping, Yangping, Shangsheng and Qusheng. Prosodic boundaries are used to indicate where to pause when reading text. For example, the prosodic boundary is divided into four pause levels "#1", "#2", "#3" and "#4", and the degrees of pauses increase in sequence.
具体来说,可以通过多种方式来获取上述语音特征信息。在一种实施方式中,待合成文本对应的语音特征信息可以由用户预先标注好并存储在相应的存储模块,这样,通过访问该存储模块即可获取到待合成文本对应的语音特征信息。Specifically, the above-mentioned voice feature information can be acquired in various ways. In one embodiment, the voice feature information corresponding to the text to be synthesized can be marked in advance by the user and stored in a corresponding storage module, so that the voice feature information corresponding to the text to be synthesized can be obtained by accessing the storage module.
在另一种实施方式中,可以将待合成文本输入信息提取模型,得到与待合成文本对应的语音特征信息,方便快捷,且无需人工参与,节省了人力。In another embodiment, the text to be synthesized can be input into the information extraction model, and the speech feature information corresponding to the text to be synthesized can be obtained, which is convenient and quick, and does not require manual participation, thus saving manpower.
In the present disclosure, the information extraction model may include a Text Normalization (TN) model, a Grapheme-to-Phoneme (G2P) model, a word segmentation model, and a prosody model. The numbers, symbols, abbreviations, and the like in the text to be synthesized can be converted into written words through the TN model, the phonemes in the text to be synthesized can be obtained through the G2P model, the text to be synthesized can be segmented through the word segmentation model, and the prosodic boundaries and tones of the text to be synthesized can be obtained through the prosody model.
示例地,G2P模型可以采用循环神经网络(Recurrent Neural Network,RNN)和长短期记忆网络(Long Short-Term Memory,LSTM)来实现从字素到音素的转化。For example, the G2P model can use a Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) to realize the conversion from graphemes to phonemes.
分词模型可以为n-gram模型、隐马尔可夫模型、朴素贝叶斯分类模型等。The word segmentation model can be an n-gram model, a hidden Markov model, a naive Bayes classification model, and the like.
韵律模型为预训练语言模型BERT(Bidirectional Encoder Representation from Transformers)、双向LSTM-CRF(Conditional Random Field,条件随机场)模型等。The prosody model is a pre-trained language model BERT (Bidirectional Encoder Representation from Transformers), a bidirectional LSTM-CRF (Conditional Random Field) model, etc.
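For illustration only, the sketch below shows one way such an information extraction model could be composed from the components listed above. All class and method names (normalize, to_phonemes, segment, predict) are placeholders assumed for the example, not interfaces defined by this application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeechFeatures:
    phonemes: List[str]             # phoneme sequence of the text
    tones: List[int]                # tone of each syllable
    words: List[str]                # word segmentation result
    prosodic_boundaries: List[str]  # pause levels such as "#1".."#4"

def extract_speech_features(text, tn_model, g2p_model, seg_model, prosody_model) -> SpeechFeatures:
    """Hypothetical pipeline: text normalization, then G2P, word segmentation and prosody."""
    normalized = tn_model.normalize(text)                   # digits/symbols/abbreviations -> words
    phonemes = g2p_model.to_phonemes(normalized)            # graphemes -> phonemes
    words = seg_model.segment(normalized)                   # word segmentation
    tones, boundaries = prosody_model.predict(normalized)   # tones and prosodic boundaries
    return SpeechFeatures(phonemes, tones, words, boundaries)
```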
在上述实施方式中,通过提取待合成文本的音素、声调、分词和韵律边界这些语音特征信息,并基于语音特征信息对待合成文本进行语音合成,从而可以更加关注待合成文本的文本内容。这样,可以使得到的待合成文本对应的音频信息能够根据待合成文本的文本内容以及分词进行停顿,提高了音频信息的准确度和可理解性,便于用户快速理解音频信息对应的文本内容。另外,由于语音合成时能够在自然的韵律边界处进行停顿,因此可以 提升待合成文本对应的音频信息的自然度和流畅性。In the above embodiment, by extracting the phoneme, tone, word segmentation and prosodic boundary of the text to be synthesized, and performing speech synthesis on the to-be-synthesized text based on the speech feature information, it is possible to pay more attention to the text content of the to-be-synthesized text. In this way, the audio information corresponding to the text to be synthesized can be paused according to the text content and word segmentation of the text to be synthesized, the accuracy and intelligibility of the audio information are improved, and the user can quickly understand the text content corresponding to the audio information. In addition, since the speech can be paused at the natural prosodic boundary during speech synthesis, the naturalness and fluency of the audio information corresponding to the text to be synthesized can be improved.
下面结合语音合成模型的结构详细说明通过该语音合成模型进行语音合成的过程。如图2所示,声学子模型包括编码网络、时长子模型、高斯采样(Gaussian sampling)模块、线性处理模块以及基于流的可逆生成模型Glow。The following describes the process of speech synthesis by the speech synthesis model in detail with reference to the structure of the speech synthesis model. As shown in Figure 2, the acoustic sub-model includes an encoding network, a duration sub-model, a Gaussian sampling (Gaussian sampling) module, a linear processing module, and a flow-based reversible generative model Glow.
In the present disclosure, the encoding network is used to generate, according to the speech feature information corresponding to the text to be synthesized, a representation sequence corresponding to the text to be synthesized; the duration sub-model is used to obtain, according to the speech feature information corresponding to the text to be synthesized, duration feature information corresponding to the text to be synthesized; the Gaussian sampling module is used to generate, according to the representation sequence and the duration feature information, a fixed-length semantic representation corresponding to the text to be synthesized; the linear processing module is used to linearly transform the fixed-length semantic representation to obtain first mel spectrum information corresponding to the text to be synthesized; the flow-based reversible generative model Glow is used to generate second mel spectrum information according to a standard normal distribution; and the vocoder is used to generate, according to the first mel spectrum information and the second mel spectrum information, the predicted waveform point information corresponding to the text to be synthesized.
其中,时长特征信息包括待合成文本中每一音素对应的语音帧的数量。表示序列由待合成文本中每一音素的编码按照相应音素在待合成文本中的先后顺序排列而成。The duration feature information includes the number of speech frames corresponding to each phoneme in the text to be synthesized. The representation sequence is formed by arranging the codes of each phoneme in the text to be synthesized according to the sequence of the corresponding phonemes in the text to be synthesized.
示例地,待合成文本对应的音素序列为“ab”,其中,音素“a”的编码为“A”,音素“b”的编码为“B”,则待合成文本对应的表示序列为“AB”。For example, the phoneme sequence corresponding to the text to be synthesized is "ab", wherein the phoneme "a" is encoded as "A", and the phoneme "b" is encoded as "B", then the representation sequence corresponding to the text to be synthesized is "AB" ".
As shown in Figure 2, the encoding network may include a pre-processing network (Pre-net) sub-model and a Transformer sub-model. First, the speech feature information corresponding to the text to be synthesized is input into the Pre-net sub-model to perform a nonlinear transformation on the speech feature information, thereby improving the convergence and generalization capabilities of the speech synthesis model. Then, the Transformer sub-model obtains the representation sequence corresponding to the text to be synthesized from the nonlinearly transformed speech feature information.
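For illustration only, a minimal PyTorch-style sketch of such an encoding network is given below. The layer sizes, dropout rate, and the use of nn.TransformerEncoder are assumptions made for the example and are not parameters disclosed by this application.

```python
import torch
import torch.nn as nn

class EncodingNetwork(nn.Module):
    """Pre-net (nonlinear transform) followed by a Transformer encoder."""

    def __init__(self, feat_dim: int = 256, hidden_dim: int = 256,
                 n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        # Pre-net: nonlinear transform of the speech feature embeddings.
        self.pre_net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        # speech_features: (batch, phoneme_count, feat_dim)
        x = self.pre_net(speech_features)
        return self.transformer(x)  # representation sequence, one vector per phoneme
```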
时长子模型可以例如是CBHG模型、长短时记忆网络(Long Short Term Memory Network,LSTM)模型、LSTM-RNN(Recurrent Neural Network,循环神经网络)模型、深度神经网络(Deep Neural Networks,DNN)模型、Transformer模型、基于流的生成模型Flow等。例如,时长子模型可以采用适合建模语音中的不确定信息的基于流的生成模型Flow,从而进一步提升语音合成的准确度。The duration sub-model can be, for example, a CBHG model, a Long Short Term Memory Network (LSTM) model, an LSTM-RNN (Recurrent Neural Network, Recurrent Neural Network) model, a Deep Neural Network (Deep Neural Networks, DNN) model, Transformer model, flow-based generative model Flow, etc. For example, the duration sub-model can adopt the flow-based generative model Flow suitable for modeling uncertain information in speech, so as to further improve the accuracy of speech synthesis.
另外,上述时长子模型可以通过以下步骤来确定待合成文本中各音素对应的语音帧的数量:(1)获取待合成文本中各音素的发音时长;(2)根据每一音素的发音时长,确定该音素对应的语音帧的数量。In addition, the above-mentioned duration sub-model can determine the number of speech frames corresponding to each phoneme in the text to be synthesized through the following steps: (1) obtain the pronunciation duration of each phoneme in the text to be synthesized; (2) according to the pronunciation duration of each phoneme, Determine the number of speech frames corresponding to this phoneme.
示例地,一音素的发音时长为200ms,一个语音帧的时间长度为5ms,则该音素对应的语音帧的数量为40。For example, the pronunciation duration of a phoneme is 200ms, and the duration of a speech frame is 5ms, then the number of speech frames corresponding to the phoneme is 40.
In another example, the pronunciation duration of a phoneme is 203 ms and the duration of one speech frame is 5 ms, so the number of speech frames corresponding to the phoneme is ⌈203 / 5⌉ = 41; that is, the trailing segment of less than 5 ms is treated as one frame.
如图2所示,基于流的可逆生成模型Glow包括压缩层、激活标准化(Activation Normalization,ActNorm)层、可逆1*1卷积层、仿射耦合层。As shown in Figure 2, the flow-based reversible generative model Glow includes a compression layer, an activation normalization (Activation Normalization, ActNorm) layer, a reversible 1*1 convolutional layer, and an affine coupling layer.
其中,ActNorm层用于对数据进行规范化处理,采用每个通道的缩放和偏差参数进行激活,使得小批量的数据在激活后具有零均值和单位方差,相当于对数据进行了预处理,避免模型性能的下降。可逆1*1卷积层通过采用矩阵乘法实现对各个维度数据的打乱,使信息混合得更加充分。仿射耦合层通过可逆函数实现数据的可逆转换。Among them, the ActNorm layer is used to normalize the data, using the scaling and bias parameters of each channel for activation, so that the small batch of data has zero mean and unit variance after activation, which is equivalent to preprocessing the data to avoid the model performance degradation. The reversible 1*1 convolution layer uses matrix multiplication to shuffle the data of each dimension, so that the information is mixed more fully. The affine coupling layer realizes the reversible transformation of the data through the reversible function.
在语音合成阶段,将标准正态分布输入仿射耦合层,得到逆转换后的数据;之后,将逆转换后的数据输入可逆1*1卷积层,以对其进行打乱;接下来,将打乱后的数据输入ActNorm层,以进行数据规范化处理,之后经过压缩层解压后得到第二梅尔谱信息,并输出至声码器。In the speech synthesis stage, the standard normal distribution is input into the affine coupling layer to obtain the inversely transformed data; after that, the inversely transformed data is input into the reversible 1*1 convolutional layer to scramble it; next, The scrambled data is input into the ActNorm layer for data normalization, and then decompressed by the compression layer to obtain the second mel spectrum information, which is output to the vocoder.
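For illustration only, the synthesis-direction pass through Glow described above (standard normal, affine coupling, invertible 1*1 convolution, ActNorm, decompression) can be sketched as a single simplified flow step over flat feature vectors. The dimensions and the coupling network below are made up for the example, and the real model stacks many such steps and operates on compressed mel-spectrogram frames.

```python
import torch
import torch.nn as nn

class GlowStep(nn.Module):
    """One simplified flow step: ActNorm, invertible 1x1 transform, affine coupling."""

    def __init__(self, dim: int = 80):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))   # ActNorm scale (log domain)
        self.bias = nn.Parameter(torch.zeros(dim))        # ActNorm bias
        q, _ = torch.linalg.qr(torch.randn(dim, dim))     # orthogonal init, hence invertible
        self.weight = nn.Parameter(q)                     # stands in for the 1x1 convolution
        self.coupling_net = nn.Sequential(                # predicts shift and log-scale
            nn.Linear(dim // 2, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def inverse(self, z: torch.Tensor) -> torch.Tensor:
        """Synthesis direction: latent vectors back towards (compressed) mel features."""
        # Inverse affine coupling: the second half was transformed conditioned on the first.
        z_a, z_b = z.chunk(2, dim=-1)
        shift, log_s = self.coupling_net(z_a).chunk(2, dim=-1)
        x_b = (z_b - shift) * torch.exp(-log_s)
        x = torch.cat([z_a, x_b], dim=-1)
        # Inverse 1x1 transform, then inverse ActNorm.
        x = x @ torch.inverse(self.weight)
        return (x - self.bias) * torch.exp(-self.log_scale)

# Synthesis direction: sample from a standard normal and run the flow in reverse;
# the final decompression back to full mel frames is omitted here.
step = GlowStep(dim=80)
z = torch.randn(4, 80)
mel_like = step.inverse(z)
```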
下面针对上述语音合成模型的训练方法进行详细说明。为了提升语音合成模型的训练效果,声码器可以为基于流的生成模型Flow。具体来说,该训练方法可以通过图3中所示的步骤S301和S302来实现。The training method of the above speech synthesis model will be described in detail below. In order to improve the training effect of the speech synthesis model, the vocoder can be a flow-based generative model Flow. Specifically, the training method can be implemented through steps S301 and S302 shown in FIG. 3 .
在步骤S301中,获取文本训练样本对应的标注语音特征信息、标注波形点信息以及标注梅尔谱信息。In step S301, the labeled speech feature information, the labeled waveform point information, and the labeled mel spectrum information corresponding to the text training samples are obtained.
在本公开中,可以通过对文本训练样本对应的音频信息进行μ律压缩,得到该文本训练样本对应的标注波形点信息。In the present disclosure, the labeled waveform point information corresponding to the text training sample can be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
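For illustration only, such μ-law compression (the inverse of the expansion sketched after step S103) can be written as below, again assuming 8-bit quantization with μ = 255.

```python
import numpy as np

def mu_law_compress(audio: np.ndarray, mu: int = 255) -> np.ndarray:
    """Compress samples in [-1, 1] and quantize them to integer waveform points in [0, mu]."""
    y = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return np.round((y + 1.0) / 2.0 * mu).astype(np.int64)
```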
In step S302, the acoustic sub-model and the vocoder are directly jointly trained, to obtain the speech synthesis model, by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generative model Flow, using the labeled waveform point information as the target output of the flow-based generative model Flow, using the labeled mel spectrum information as the input of the flow-based reversible generative model Glow, and using the standard normal distribution as the target output of the flow-based reversible generative model Glow.
Specifically, as shown in Figure 4, the labeled speech feature information corresponding to a text training sample can be input into the encoding network to obtain the representation sequence corresponding to the text training sample, and the labeled speech feature information corresponding to the text training sample can be input into the duration sub-model to obtain the duration feature information corresponding to the text training sample; then, the representation sequence and the duration feature information corresponding to the text training sample are input into the Gaussian sampling module to obtain the fixed-length semantic representation corresponding to the text training sample; next, the fixed-length semantic representation is input into the linear processing module for linear transformation, generating predicted mel spectrum information corresponding to the text training sample; then, the predicted mel spectrum information is input into the vocoder (namely, the flow-based generative model Flow) to obtain predicted waveform point information corresponding to the text training sample; after that, according to the comparison result between the predicted waveform point information and the labeled waveform point information corresponding to the text training sample, the model parameters of each module in the acoustic sub-model, other than the flow-based reversible generative model Glow, are updated.
Furthermore, the labeled mel spectrum information corresponding to the text training sample is input into the compression layer of the flow-based reversible generative model Glow to compress the labeled mel spectrum information; after that, the compressed data is input into the ActNorm layer for data normalization; next, the normalized data is input into the invertible 1*1 convolutional layer to be shuffled; then, the shuffled data is input into the affine coupling layer for an inverse data transformation to obtain a simulated normal distribution; after that, the model parameters of the flow-based reversible generative model Glow can be updated according to the comparison result between the simulated normal distribution and the standard normal distribution.
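For illustration only, one joint training step covering the two branches described in the preceding two paragraphs might look roughly like the sketch below. All module attributes (encoder, duration_model, gaussian_sampling, linear_proj, flow_vocoder, glow), the batch keys, and the loss terms are placeholders assumed for the example; the application does not specify concrete loss functions, so a simple waveform reconstruction term plus a term pushing the Glow latent towards a standard normal is used here purely for illustration.

```python
import torch

def joint_training_step(batch, model, optimizer):
    """One hypothetical joint update of the acoustic sub-model and the Flow vocoder."""
    feats = batch["speech_features"]
    target_points = batch["waveform_points"]
    target_mel = batch["mel"]

    # Branch 1: speech features -> waveform points (encoder + duration + Flow vocoder).
    repr_seq = model.encoder(feats)
    durations = model.duration_model(feats)
    semantic = model.gaussian_sampling(repr_seq, durations)
    pred_mel = model.linear_proj(semantic)
    pred_points = model.flow_vocoder(pred_mel)
    waveform_loss = torch.nn.functional.mse_loss(pred_points, target_points.float())

    # Branch 2: labeled mel -> latent that should match a standard normal (Glow).
    z = model.glow(target_mel)
    glow_loss = 0.5 * (z ** 2).mean()   # stand-in for the normal-distribution objective

    loss = waveform_loss + glow_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```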
由此,可以得到上述语音合成模型。Thus, the above-mentioned speech synthesis model can be obtained.
另外,为了提升用户体验,在上述步骤103获得待合成文本对应的音频信息后,还可以为音频信息添加背景音乐,这样,用户根据背景音乐和音频信息,更容易理解相应的文本内容。具体来说,如图5所示,上述方法还可以包括步骤S104。In addition, in order to improve user experience, after obtaining the audio information corresponding to the text to be synthesized in the above step 103, background music may be added to the audio information, so that the user can more easily understand the corresponding text content according to the background music and audio information. Specifically, as shown in FIG. 5 , the above method may further include step S104.
在步骤S104中,将音频信息与背景音乐进行合成。In step S104, the audio information and the background music are synthesized.
在一种实施方式中,上述背景音乐可以为预设音乐,即可以是用户设定的任一音乐,也可以是默认的音乐。In an implementation manner, the above background music may be preset music, that is, any music set by the user, or default music.
In another embodiment, before the audio information and the background music are synthesized, usage scene information corresponding to the text to be synthesized may first be determined according to the text information and/or the speech feature information of the text to be synthesized, where the usage scene information includes, but is not limited to, news broadcasts, military and weaponry introductions, fairy tales, campus broadcasts, and so on; then, according to the usage scene information, background music matching the usage scene information is determined.
在本公开中,可以通过多种方式来确定上述使用场景信息。在一种实施方式中,可以根据待合成文本的文本信息,确定待合成文本对应的使用场景信息,其中,上述文本信息可以为关键词。示例地,可以通过对待合成文本进行关键字自动识别,以根据关键词智能地预判该待合成文本的使用场景信息。In the present disclosure, the above-mentioned usage scenario information may be determined in various ways. In one embodiment, the usage scene information corresponding to the text to be synthesized may be determined according to the text information of the text to be synthesized, wherein the above text information may be a keyword. For example, automatic keyword recognition can be performed on the text to be synthesized, so as to intelligently predict the usage scene information of the text to be synthesized according to the keywords.
In another embodiment, the usage scene information corresponding to the text to be synthesized may be determined according to the speech feature information of the text to be synthesized. Specifically, scene description words can be identified from the segmented words in the speech feature information determined in the above step 101, where the scene description words can be identified by matching each segmented word against a pre-stored table of scene description words; the usage scene information of the text to be synthesized is then determined according to the scene description words.
In yet another embodiment, the usage scene information corresponding to the text to be synthesized may be determined according to both the text information and the speech feature information of the text to be synthesized. Specifically, automatic keyword recognition can be performed on the text to be synthesized, and scene description words can be identified from the segmented words in the speech feature information determined in the above step 101; then, the usage scene information of the text to be synthesized is jointly determined according to the keywords and the scene description words. In this way, the determination accuracy of the usage scene information can be improved.
After the usage scene information corresponding to the text to be synthesized is determined, the background music matching that usage scene information can be determined according to the usage scene information, using a pre-stored correspondence between usage scene information and background music. For example, if the usage scene information is a military and weaponry introduction, the corresponding background music may be stirring music; if the usage scene information is a fairy tale, the corresponding background music may be light and lively music.
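For illustration only, the word matching and the stored scene-to-background-music correspondence described above can be sketched as a pair of lookup tables; the entries below are made-up examples, not data from the application.

```python
SCENE_WORDS = {"tank": "military", "missile": "military",
               "princess": "fairy tale", "dragon": "fairy tale"}
SCENE_TO_BGM = {"military": "stirring_theme.wav", "fairy tale": "light_playful.wav"}

def pick_background_music(segmented_words, default_bgm="neutral.wav"):
    """Match segmented words against the scene table, then look up the matching music."""
    for word in segmented_words:
        scene = SCENE_WORDS.get(word)
        if scene is not None:
            return SCENE_TO_BGM[scene]
    return default_bgm
```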
本公开还提供一种语音合成模型训练方法,其中,该语音合成模型包括声学子模型和声码器,声学子模型包括编码网络、时长子模型、高斯采样模块、线性处理模块以及基于流的可逆生成模型Glow,声码器为基于流的生成模型Flow。具体来说,可以通过图3中所示的步骤S301和S302来训练语音合成模型。The present disclosure also provides a speech synthesis model training method, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, and the acoustic sub-model includes a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible The generative model Glow, the vocoder is the flow-based generative model Flow. Specifically, the speech synthesis model can be trained through steps S301 and S302 shown in FIG. 3 .
在步骤S301中,获取文本训练样本对应的标注语音特征信息、标注波形点信息以及标注梅尔谱信息。In step S301, the labeled speech feature information, the labeled waveform point information, and the labeled mel spectrum information corresponding to the text training samples are obtained.
在本公开中,可以通过对文本训练样本对应的音频信息进行μ律压缩,得到该文本训练样本对应的标注波形点信息。In the present disclosure, the labeled waveform point information corresponding to the text training sample can be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
In step S302, the acoustic sub-model and the vocoder are directly jointly trained, to obtain the speech synthesis model, by using the labeled speech feature information as the input of the encoding network and of the duration sub-model respectively, using the output of the encoding network and the output of the duration sub-model as the input of the Gaussian sampling module, using the output of the Gaussian sampling module as the input of the linear processing module, using the output of the linear processing module as the input of the flow-based generative model Flow, using the labeled waveform point information as the target output of the flow-based generative model Flow, using the labeled mel spectrum information as the input of the flow-based reversible generative model Glow, and using the standard normal distribution as the target output of the flow-based reversible generative model Glow.
Specifically, as shown in Figure 4, the labeled speech feature information corresponding to a text training sample can be input into the encoding network to obtain the representation sequence corresponding to the text training sample, and the labeled speech feature information corresponding to the text training sample can be input into the duration sub-model to obtain the duration feature information corresponding to the text training sample; then, the representation sequence and the duration feature information corresponding to the text training sample are input into the Gaussian sampling module to obtain the fixed-length semantic representation corresponding to the text training sample; next, the fixed-length semantic representation is input into the linear processing module for linear transformation, generating predicted mel spectrum information corresponding to the text training sample; then, the predicted mel spectrum information is input into the vocoder (namely, the flow-based generative model Flow) to obtain predicted waveform point information corresponding to the text training sample; after that, according to the comparison result between the predicted waveform point information and the labeled waveform point information corresponding to the text training sample, the model parameters of each module in the acoustic sub-model, other than the flow-based reversible generative model Glow, are updated.
Furthermore, the labeled mel spectrum information corresponding to the text training sample is input into the compression layer in the flow-based invertible generative model Glow to compress the labeled mel spectrum information; the compressed data is then input into an ActNorm layer for data normalization; next, the normalized data is input into an invertible 1*1 convolutional layer to shuffle it; the shuffled data is then input into an affine coupling layer to perform an inverse data transformation, so as to obtain a simulated normal distribution; after that, the model parameters of the flow-based invertible generative model Glow may be updated according to a comparison result between the simulated normal distribution and the standard normal distribution.
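For illustration, one flow step of the kind described (ActNorm, invertible 1*1 convolution, affine coupling) and the negative log-likelihood against a standard normal prior could look as follows in PyTorch; the channel layout is hypothetical, the channel count is assumed even, and the compression layer and any multi-scale structure are omitted:

    import math
    import torch
    import torch.nn as nn

    class GlowStep(nn.Module):
        # One flow step: ActNorm -> invertible 1x1 convolution -> affine coupling.
        def __init__(self, channels):
            super().__init__()
            self.logs = nn.Parameter(torch.zeros(1, channels, 1))    # ActNorm log-scale
            self.bias = nn.Parameter(torch.zeros(1, channels, 1))    # ActNorm bias
            w = torch.linalg.qr(torch.randn(channels, channels))[0]  # random orthogonal init
            self.weight = nn.Parameter(w)                            # invertible 1x1 convolution
            self.net = nn.Conv1d(channels // 2, channels, kernel_size=3, padding=1)

        def forward(self, x):
            # x: [batch, channels, time]; returns the transformed tensor and the log-determinant
            x = (x + self.bias) * torch.exp(self.logs)                       # ActNorm
            logdet = self.logs.sum() * x.size(-1)
            x = torch.einsum("ij,bjt->bit", self.weight, x)                  # 1x1 channel shuffle
            logdet = logdet + torch.slogdet(self.weight)[1] * x.size(-1)
            xa, xb = x.chunk(2, dim=1)                                       # affine coupling
            log_s, t = self.net(xa).chunk(2, dim=1)
            xb = xb * torch.exp(log_s) + t
            logdet = logdet + log_s.sum(dim=(1, 2))
            return torch.cat([xa, xb], dim=1), logdet

    def glow_nll(z, logdet):
        # negative log-likelihood of z under a standard normal prior, plus the flow term
        prior = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=(1, 2))
        return -(prior + logdet).mean()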
The above speech synthesis model can thus be obtained.
FIG. 6 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment. As shown in FIG. 6, the apparatus 600 includes: a first acquisition module 601, configured to acquire speech feature information corresponding to text to be synthesized; a speech synthesis module 602, configured to input the speech feature information acquired by the first acquisition module 601 into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder and is obtained by directly and jointly training the acoustic sub-model and the vocoder; and an expansion module 603, configured to perform μ-law expansion on the predicted waveform point information obtained by the speech synthesis module 602 to obtain audio information corresponding to the text to be synthesized.
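A minimal sketch of the μ-law expansion performed by the expansion module 603, assuming 8-bit waveform-point labels with μ = 255 (the inverse of the compression sketched earlier):

    import numpy as np

    def mu_law_expand(points, mu=255):
        # points: integer waveform-point labels in [0, mu]; returns a float waveform in [-1.0, 1.0]
        companded = 2.0 * points.astype(np.float64) / mu - 1.0
        return np.sign(companded) * ((1.0 + mu) ** np.abs(companded) - 1.0) / mu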
For example, the text to be synthesized may be in a tonal language such as Chinese, Tibetan, Uyghur, Thai, or Qiang. The speech feature information may be used to represent phonemes, intonation, pauses, and other related information of the text to be synthesized. In addition, the text to be synthesized may be of various types, such as a novel or lyrics.
In the above technical solution, the speech synthesis model can obtain the predicted waveform point information directly from the speech feature information corresponding to the text to be synthesized, and the audio information corresponding to the text to be synthesized can then be obtained by a simple μ-law expansion of the predicted waveform point information, without requiring cooperation between a separately operated acoustic sub-model and vocoder. This improves the efficiency of speech synthesis and effectively reduces the error accumulation caused in the related art by training the acoustic sub-model and the vocoder separately, thereby improving the accuracy of speech synthesis. In addition, since the predicted waveform point information can be generated directly from the speech feature information corresponding to the text to be synthesized without involving acoustic features, the problem that the generated audio information cannot meet special pronunciation requirements because acoustic features lack universality can be avoided, thereby improving the effect of speech synthesis. Moreover, the speech synthesis model can be obtained simply by directly and jointly training the acoustic sub-model and the vocoder, which shortens the model training period and gives the speech synthesis model better prosody fidelity. It also guarantees the matching degree between the acoustic sub-model and the vocoder within the speech synthesis model, thereby avoiding the problem that the resulting synthesis is inaccurate even when the acoustic sub-model and the vocoder are each individually accurate, which further improves the accuracy of speech synthesis.
In some embodiments, the acoustic sub-model includes a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow.
The coding network is configured to generate, according to the speech feature information, a representation sequence corresponding to the text to be synthesized, wherein the representation sequence is formed by arranging the encoding of each phoneme in the text to be synthesized according to the order of the corresponding phonemes in the text to be synthesized.
The duration sub-model is configured to obtain, according to the speech feature information, duration feature information corresponding to the text to be synthesized, wherein the duration feature information includes the number of speech frames corresponding to each phoneme in the text to be synthesized.
The Gaussian sampling module is configured to generate a fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration feature information.
The linear processing module is configured to perform a linear transformation on the semantic representation to obtain first mel spectrum information corresponding to the text to be synthesized.
The flow-based invertible generative model Glow is configured to generate second mel spectrum information according to a standard normal distribution.
The vocoder is configured to generate the predicted waveform point information corresponding to the text to be synthesized according to the first mel spectrum information and the second mel spectrum information.
In some embodiments, the vocoder is a flow-based generative model Flow.
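Taken together, the modules above form the following inference path. The sketch below (PyTorch) is illustrative only: the module objects and the .reverse() and .infer() method names are assumptions, since the disclosure does not fix these interfaces:

    import torch

    @torch.no_grad()
    def synthesize_waveform_points(feats, encoder, duration_model, gaussian_sampler,
                                   linear_proj, glow, flow_vocoder):
        hidden = encoder(feats)                           # representation sequence
        durations = duration_model(feats)                 # frames per phoneme
        semantic = gaussian_sampler(hidden, durations)    # fixed-length semantic representation
        first_mel = linear_proj(semantic)                 # first mel spectrum information
        noise = torch.randn_like(first_mel)               # samples from the standard normal distribution
        second_mel = glow.reverse(noise)                  # second mel spectrum information
        return flow_vocoder.infer(first_mel, second_mel)  # predicted waveform point information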
The speech synthesis model is obtained through training by a speech synthesis model training apparatus. As shown in FIG. 7, the speech synthesis model training apparatus 700 includes: a second acquisition module 701, configured to acquire labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample; and a training module 702, configured to directly and jointly train the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
In some embodiments, the speech feature information includes phonemes, tones, word segmentation, and prosodic boundaries.
The first acquisition module is configured to input the text to be synthesized into an information extraction model to obtain the speech feature information corresponding to the text to be synthesized.
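For illustration only, the speech feature information extracted for a short Chinese sentence might be organized as follows; the field names and the boundary-level notation are assumptions and not part of the disclosure:

    # Hypothetical output of the information extraction model for "今天天气很好"
    speech_features = {
        "phonemes": ["j", "in", "t", "ian", "t", "ian", "q", "i", "h", "en", "h", "ao"],
        "tones": [1, 1, 1, 4, 3, 3],                      # one tone per syllable
        "word_segmentation": ["今天", "天气", "很", "好"],
        "prosodic_boundaries": ["#1", "#2", "#1", "#4"],  # boundary level after each word
    }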
In some embodiments, the apparatus 600 further includes a background music synthesis module, configured to synthesize the audio information obtained by the expansion module 603 with background music.
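A minimal sketch of such a mixing step, assuming both signals are float waveforms at the same sampling rate and that the background music is looped or truncated to the speech length with an arbitrary attenuation:

    import numpy as np

    def mix_with_background(speech, music, music_gain=0.3):
        music = np.resize(music, len(speech))  # loop or truncate to the speech length
        mixed = speech + music_gain * music
        return np.clip(mixed, -1.0, 1.0)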
The present disclosure further provides a speech synthesis model training apparatus, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow, and the vocoder is a flow-based generative model Flow. As shown in FIG. 7, the apparatus 700 includes: a second acquisition module 701, configured to acquire labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample; and a training module 702, configured to directly and jointly train the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
In the present disclosure, the labeled waveform point information corresponding to the text training sample may be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
Specifically, as shown in FIG. 4, the labeled speech feature information corresponding to the text training sample may be input into the coding network to obtain a representation sequence corresponding to the text training sample, and at the same time the labeled speech feature information corresponding to the text training sample may be input into the duration sub-model to obtain duration feature information corresponding to the text training sample. The representation sequence and the duration feature information corresponding to the text training sample are then input into the Gaussian sampling module to obtain a fixed-length semantic representation corresponding to the text training sample. Next, the fixed-length semantic representation is input into the linear processing module for linear transformation, so as to generate predicted mel spectrum information corresponding to the text training sample. The predicted mel spectrum information is then input into the vocoder (i.e., the flow-based generative model Flow) to obtain predicted waveform point information corresponding to the text training sample. After that, the model parameters of each module in the acoustic sub-model other than the flow-based invertible generative model Glow are updated according to a comparison result between the predicted waveform point information and the labeled waveform point information corresponding to the text training sample.
Furthermore, the labeled mel spectrum information corresponding to the text training sample is input into the compression layer in the flow-based invertible generative model Glow to compress the labeled mel spectrum information; the compressed data is then input into an ActNorm layer for data normalization; next, the normalized data is input into an invertible 1*1 convolutional layer to shuffle it; the shuffled data is then input into an affine coupling layer to perform an inverse data transformation, so as to obtain a simulated normal distribution; after that, the model parameters of the flow-based invertible generative model Glow may be updated according to a comparison result between the simulated normal distribution and the standard normal distribution.
The above speech synthesis model can thus be obtained.
In addition, it should be noted that the speech synthesis model training apparatus 700 may be integrated into the speech synthesis apparatus 600 or may be independent of the speech synthesis apparatus 600, which is not specifically limited in the present disclosure. With respect to the apparatuses in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the method and will not be elaborated here.
The present disclosure further provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the steps of the speech synthesis method or the steps of the speech synthesis model training method provided by the present disclosure. For example, the computer-readable medium is a non-transitory computer-readable medium.
Referring now to FIG. 8, it shows a schematic structural diagram of an electronic device (e.g., a terminal device or a server) 800 suitable for implementing an embodiment of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal), as well as fixed terminals such as a digital TV and a desktop computer. The electronic device shown in FIG. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 8, the electronic device 800 may include a processing apparatus (e.g., a central processing unit, a graphics processing unit, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage apparatus 808 into a random access memory (RAM) 803. Various programs and data required for the operation of the electronic device 800 are also stored in the RAM 803. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Generally, the following apparatuses may be connected to the I/O interface 805: an input apparatus 806 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output apparatus 807 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage apparatus 808 including, for example, a magnetic tape and a hard disk; and a communication apparatus 809. The communication apparatus 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 8 shows the electronic device 800 having various apparatuses, it should be understood that not all of the illustrated apparatuses are required to be implemented or provided; more or fewer apparatuses may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication apparatus 809, installed from the storage apparatus 808, or installed from the ROM 802. When the computer program is executed by the processing apparatus 801, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable signal medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to a wire, an optical cable, RF (radio frequency), or any suitable combination of the foregoing.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The computer-readable medium may be included in the electronic device described above, or it may exist alone without being assembled into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire speech feature information corresponding to text to be synthesized; input the speech feature information into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder and is obtained by directly and jointly training the acoustic sub-model and the vocoder; and perform μ-law expansion on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized.
Alternatively, the computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquire labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow, and the vocoder is a flow-based generative model Flow; and directly and jointly train the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
The computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules involved in the embodiments described in the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, constitute a limitation on the module itself; for example, the first acquisition module may also be described as "a module for acquiring speech feature information corresponding to text to be synthesized".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), and the like.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides a speech synthesis method, including: acquiring speech feature information corresponding to text to be synthesized; inputting the speech feature information into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder and is obtained by directly and jointly training the acoustic sub-model and the vocoder; and performing μ-law expansion on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the acoustic sub-model includes a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow; the coding network is configured to generate, according to the speech feature information, a representation sequence corresponding to the text to be synthesized, wherein the representation sequence is formed by arranging the encoding of each phoneme in the text to be synthesized according to the order of the corresponding phonemes in the text to be synthesized; the duration sub-model is configured to obtain, according to the speech feature information, duration feature information corresponding to the text to be synthesized, wherein the duration feature information includes the number of speech frames corresponding to each phoneme in the text to be synthesized; the Gaussian sampling module is configured to generate a fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration feature information; the linear processing module is configured to perform a linear transformation on the semantic representation to obtain first mel spectrum information corresponding to the text to be synthesized; the flow-based invertible generative model Glow is configured to generate second mel spectrum information according to a standard normal distribution; and the vocoder is configured to generate the predicted waveform point information corresponding to the text to be synthesized according to the first mel spectrum information and the second mel spectrum information.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein the vocoder is a flow-based generative model Flow, and the speech synthesis model is obtained through training in the following manner: acquiring labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample; and directly and jointly training the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
According to one or more embodiments of the present disclosure, Example 4 provides the method of any one of Examples 1-3, wherein the speech feature information includes phonemes, tones, word segmentation, and prosodic boundaries, and the acquiring of speech feature information corresponding to the text to be synthesized includes: inputting the text to be synthesized into an information extraction model to obtain the speech feature information corresponding to the text to be synthesized.
According to one or more embodiments of the present disclosure, Example 5 provides the method of any one of Examples 1-3, the method further including: synthesizing the audio information with background music.
According to one or more embodiments of the present disclosure, Example 6 provides a speech synthesis model training method, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow, and the vocoder is a flow-based generative model Flow; the method includes: acquiring labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample; and directly and jointly training the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
According to one or more embodiments of the present disclosure, Example 7 provides a speech synthesis apparatus, including: a first acquisition module, configured to acquire speech feature information corresponding to text to be synthesized; a speech synthesis module, configured to input the speech feature information acquired by the first acquisition module into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model includes an acoustic sub-model and a vocoder and is obtained by directly and jointly training the acoustic sub-model and the vocoder; and an expansion module, configured to perform μ-law expansion on the predicted waveform point information obtained by the speech synthesis module to obtain audio information corresponding to the text to be synthesized.
According to one or more embodiments of the present disclosure, Example 8 provides a speech synthesis model training apparatus, wherein the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow, and the vocoder is a flow-based generative model Flow; the apparatus includes: a second acquisition module, configured to acquire labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample; and a training module, configured to directly and jointly train the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
According to one or more embodiments of the present disclosure, Example 9 provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the method of any one of Examples 1-6.
According to one or more embodiments of the present disclosure, Example 10 provides an electronic device, including: a storage apparatus on which a computer program is stored; and a processing apparatus, configured to execute the computer program in the storage apparatus to implement the method of any one of Examples 1-5.
According to one or more embodiments of the present disclosure, Example 11 provides an electronic device, including: a storage apparatus on which a computer program is stored; and a processing apparatus, configured to execute the computer program in the storage apparatus to implement the method of Example 6.
According to one or more embodiments of the present disclosure, Example 12 provides a computer program, including instructions that, when executed by a processor, cause the processor to perform the method of any one of Examples 1-6.
According to one or more embodiments of the present disclosure, Example 13 provides a computer program product, including instructions that, when executed by a processor, cause the processor to perform the method of any one of Examples 1-6.
The above description is only a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features with similar functions disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in a sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or method logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims. With respect to the apparatuses in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the method and will not be elaborated here.

Claims (13)

  1. A speech synthesis method, comprising:
    acquiring speech feature information corresponding to text to be synthesized;
    inputting the speech feature information into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model comprises an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly and jointly training the acoustic sub-model and the vocoder; and
    performing μ-law expansion on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized.
  2. The method according to claim 1, wherein the acoustic sub-model comprises a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow;
    the coding network is configured to generate, according to the speech feature information, a representation sequence corresponding to the text to be synthesized, wherein the representation sequence is formed by arranging the encoding of each phoneme in the text to be synthesized according to the order of the corresponding phonemes in the text to be synthesized;
    the duration sub-model is configured to obtain, according to the speech feature information, duration feature information corresponding to the text to be synthesized, wherein the duration feature information comprises the number of speech frames corresponding to each phoneme in the text to be synthesized;
    the Gaussian sampling module is configured to generate a fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration feature information;
    the linear processing module is configured to perform a linear transformation on the semantic representation to obtain first mel spectrum information corresponding to the text to be synthesized;
    the flow-based invertible generative model Glow is configured to generate second mel spectrum information according to a standard normal distribution; and
    the vocoder is configured to generate the predicted waveform point information corresponding to the text to be synthesized according to the first mel spectrum information and the second mel spectrum information.
  3. The method according to claim 2, wherein the vocoder is a flow-based generative model Flow; and
    the speech synthesis model is obtained through training in the following manner:
    acquiring labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample; and
    directly and jointly training the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
  4. The method according to any one of claims 1-3, wherein the speech feature information comprises phonemes, tones, word segmentation, and prosodic boundaries; and
    the acquiring of speech feature information corresponding to the text to be synthesized comprises:
    inputting the text to be synthesized into an information extraction model to obtain the speech feature information corresponding to the text to be synthesized.
  5. The method according to any one of claims 1-3, further comprising:
    synthesizing the audio information with background music.
  6. A speech synthesis model training method, wherein the speech synthesis model comprises an acoustic sub-model and a vocoder, the acoustic sub-model comprises a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow, and the vocoder is a flow-based generative model Flow;
    the method comprising:
    acquiring labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample; and
    directly and jointly training the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
  7. A speech synthesis apparatus, comprising:
    a first acquisition module, configured to acquire speech feature information corresponding to text to be synthesized;
    a speech synthesis module, configured to input the speech feature information acquired by the first acquisition module into a speech synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the speech synthesis model comprises an acoustic sub-model and a vocoder, and the speech synthesis model is obtained by directly and jointly training the acoustic sub-model and the vocoder; and
    an expansion module, configured to perform μ-law expansion on the predicted waveform point information obtained by the speech synthesis module to obtain audio information corresponding to the text to be synthesized.
  8. A speech synthesis model training apparatus, wherein the speech synthesis model comprises an acoustic sub-model and a vocoder, the acoustic sub-model comprises a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module, and a flow-based invertible generative model Glow, and the vocoder is a flow-based generative model Flow;
    the apparatus comprising:
    a second acquisition module, configured to acquire labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to a text training sample; and
    a training module, configured to directly and jointly train the acoustic sub-model and the vocoder to obtain the speech synthesis model, by taking the labeled speech feature information as the input of the coding network and of the duration sub-model respectively, taking the output of the coding network and the output of the duration sub-model as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the flow-based generative model Flow, taking the labeled waveform point information as the target output of the flow-based generative model Flow, taking the labeled mel spectrum information as the input of the flow-based invertible generative model Glow, and taking the standard normal distribution as the target output of the flow-based invertible generative model Glow.
  9. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the method according to any one of claims 1-6.
  10. An electronic device, comprising:
    a storage apparatus on which a computer program is stored; and
    a processing apparatus, configured to execute the computer program in the storage apparatus to implement the method according to any one of claims 1-5.
  11. An electronic device, comprising:
    a storage device having a computer program stored thereon; and
    a processing device, configured to execute the computer program in the storage device to implement the method of claim 6.
  12. A computer program, comprising:
    instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-6.
  13. A computer program product, comprising instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-6.
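
Illustrative sketch for claim 7: the synthesis apparatus feeds speech feature information to the jointly trained model and applies μ-law expansion to the predicted waveform points. The Python sketch below is a minimal illustration only, assuming the trained model is available as a callable that returns μ-law-companded sample values in [-1, 1]; the function names, the mu=255 default, and the callable interface are illustrative assumptions, not taken from the disclosure.

    # Minimal sketch of the claim-7 data flow; all names here are hypothetical.
    import numpy as np

    def mu_law_expand(companded: np.ndarray, mu: float = 255.0) -> np.ndarray:
        # Standard mu-law expansion: invert mu-law companding back to linear amplitude.
        return np.sign(companded) * ((1.0 + mu) ** np.abs(companded) - 1.0) / mu

    def synthesize(speech_features: np.ndarray, synthesis_model) -> np.ndarray:
        # First acquisition module -> speech synthesis module -> expansion module.
        predicted_waveform_points = synthesis_model(speech_features)  # assumed in [-1, 1]
        return mu_law_expand(predicted_waveform_points)

μ-law expansion is simply the inverse of the μ-law companding commonly applied to waveform samples before modeling, so the expansion module restores the linear-amplitude audio signal from the predicted waveform points.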
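
Illustrative sketch for claims 6 and 8: the joint training wires the labeled speech features through the encoding network and the duration sub-model into the Gaussian sampling and linear processing modules, conditions the Flow vocoder on the result with the labeled waveform points as target output, and trains Glow to map the labeled Mel spectrum toward a standard normal distribution. The PyTorch-style sketch below only mirrors that wiring; every class and method name is a hypothetical placeholder, and the use of negative log-likelihood objectives for the two flow models is an assumption (a common choice for flow-based generative models), not something the claims specify.

    # Hypothetical wiring of the jointly trained model; component modules are placeholders.
    import torch.nn as nn

    class JointSpeechSynthesisModel(nn.Module):
        def __init__(self, encoder, duration_model, gaussian_sampling, linear, glow, flow_vocoder):
            super().__init__()
            self.encoder = encoder                    # encoding network
            self.duration_model = duration_model      # duration sub-model
            self.gaussian_sampling = gaussian_sampling
            self.linear = linear                      # linear processing module
            self.glow = glow                          # flow-based reversible generation model Glow
            self.flow_vocoder = flow_vocoder          # flow-based generation model Flow (vocoder)

        def training_loss(self, labeled_features, labeled_waveform_points, labeled_mel):
            # Labeled speech features feed both the encoding network and the duration sub-model.
            enc_out = self.encoder(labeled_features)
            dur_out = self.duration_model(labeled_features)
            # Their outputs feed the Gaussian sampling module, then the linear processing module.
            cond = self.linear(self.gaussian_sampling(enc_out, dur_out))
            # The Flow vocoder is trained toward the labeled waveform points (assumed NLL objective).
            waveform_nll = -self.flow_vocoder.log_prob(labeled_waveform_points, cond)
            # Glow maps labeled Mel spectra toward a standard normal (assumed NLL objective).
            mel_nll = -self.glow.log_prob(labeled_mel)
            return waveform_nll.mean() + mel_nll.mean()

Because both objectives are optimized together, gradients from the vocoder's waveform target propagate back through the linear processing, Gaussian sampling, duration, and encoding components, which is what distinguishes this direct joint training from training the acoustic sub-model and the vocoder separately.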
PCT/CN2021/139988 2021-01-13 2021-12-21 Speech synthesis method and apparatus, synthesis model training method and apparatus, medium, and device WO2022151931A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110042176.3A CN112786006B (en) 2021-01-13 2021-01-13 Speech synthesis method, synthesis model training method, device, medium and equipment
CN202110042176.3 2021-01-13

Publications (1)

Publication Number Publication Date
WO2022151931A1

Family

ID=75755742

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/139988 WO2022151931A1 (en) 2021-01-13 2021-12-21 Speech synthesis method and apparatus, synthesis model training method and apparatus, medium, and device

Country Status (2)

Country Link
CN (1) CN112786006B (en)
WO (1) WO2022151931A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786006B (en) * 2021-01-13 2024-05-17 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, device, medium and equipment
CN113299270B (en) * 2021-05-20 2024-05-31 平安科技(深圳)有限公司 Method, device, equipment and storage medium for generating voice synthesis system
CN113450761B (en) * 2021-06-17 2023-09-22 清华大学深圳国际研究生院 Parallel voice synthesis method and device based on variation self-encoder
CN113823257B (en) * 2021-06-18 2024-02-09 腾讯科技(深圳)有限公司 Speech synthesizer construction method, speech synthesis method and device
CN113421550A (en) * 2021-06-25 2021-09-21 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113571047B (en) * 2021-07-20 2024-07-23 杭州海康威视数字技术股份有限公司 Audio data processing method, device and equipment
CN113593527B (en) * 2021-08-02 2024-02-20 北京有竹居网络技术有限公司 Method and device for generating acoustic features, training voice model and recognizing voice
CN113838452B (en) * 2021-08-17 2022-08-23 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
CN114330328B (en) * 2021-12-13 2023-10-10 电子科技大学 Tibetan word segmentation method based on Transformer-CRF
CN115905819B * 2023-03-09 2023-05-12 中国民用航空飞行学院 rPPG signal generation method and device based on generative adversarial network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369971B (en) * 2020-03-11 2023-08-04 北京字节跳动网络技术有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN111402856B (en) * 2020-03-23 2023-04-14 北京字节跳动网络技术有限公司 Voice processing method and device, readable medium and electronic equipment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3151239A1 (en) * 2015-09-29 2017-04-05 Yandex Europe AG Method and system for text-to-speech synthesis
CN111954903A (en) * 2018-12-11 2020-11-17 微软技术许可有限责任公司 Multi-speaker neural text-to-speech synthesis
WO2020190050A1 (en) * 2019-03-19 2020-09-24 휴멜로 주식회사 Speech synthesis apparatus and method therefor
CN111292719A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111402855A (en) * 2020-03-06 2020-07-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111583903A (en) * 2020-04-28 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, vocoder training method, device, medium, and electronic device
CN111627418A (en) * 2020-05-27 2020-09-04 携程计算机技术(上海)有限公司 Training method, synthesizing method, system, device and medium for speech synthesis model
CN111739508A (en) * 2020-08-07 2020-10-02 浙江大学 End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
CN112786006A (en) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium, and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292672A (en) * 2023-11-27 2023-12-26 厦门大学 High-quality speech synthesis method based on correction flow model
CN117292672B (en) * 2023-11-27 2024-01-30 厦门大学 High-quality speech synthesis method based on correction flow model
CN117672182A (en) * 2024-02-02 2024-03-08 江西拓世智能科技股份有限公司 Sound cloning method and system based on artificial intelligence
CN117672182B (en) * 2024-02-02 2024-06-07 江西拓世智能科技股份有限公司 Sound cloning method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN112786006B (en) 2024-05-17
CN112786006A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2022151931A1 (en) Speech synthesis method and apparatus, synthesis model training method and apparatus, medium, and device
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
WO2022151930A1 (en) Speech synthesis method and apparatus, synthesis model training method and apparatus, and medium and device
CN111899719B (en) Method, apparatus, device and medium for generating audio
CN111292720B (en) Speech synthesis method, device, computer readable medium and electronic equipment
WO2022156544A1 (en) Speech synthesis method and apparatus, and readable medium and electronic device
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
CN111369967B (en) Virtual character-based voice synthesis method, device, medium and equipment
US20230317055A1 (en) Method, apparatus, storage medium and electronic device for speech synthesis
WO2022156464A1 (en) Speech synthesis method and apparatus, readable medium, and electronic device
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
WO2022105553A1 (en) Speech synthesis method and apparatus, readable medium, and electronic device
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
WO2022156413A1 (en) Speech style migration method and apparatus, readable medium and electronic device
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN111489735B (en) Voice recognition model training method and device
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
WO2022237665A1 (en) Speech synthesis method and apparatus, electronic device, and storage medium
WO2023160553A1 (en) Speech synthesis method and apparatus, and computer-readable medium and electronic device
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN111161695A (en) Song generation method and device
CN111369968A (en) Sound reproduction method, device, readable medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21919118

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21919118

Country of ref document: EP

Kind code of ref document: A1