WO2021127821A1 - Training method and apparatus for a speech synthesis model, computer device and storage medium - Google Patents

Training method and apparatus for a speech synthesis model, computer device and storage medium

Info

Publication number
WO2021127821A1
WO2021127821A1 PCT/CN2019/127339 CN2019127339W WO2021127821A1 WO 2021127821 A1 WO2021127821 A1 WO 2021127821A1 CN 2019127339 W CN2019127339 W CN 2019127339W WO 2021127821 A1 WO2021127821 A1 WO 2021127821A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
feature
speech synthesis
text data
synthesis model
Prior art date
Application number
PCT/CN2019/127339
Other languages
English (en)
French (fr)
Inventor
钱程浩
黄东延
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to CN201980003169.3A priority Critical patent/CN111133506A/zh
Priority to PCT/CN2019/127339 priority patent/WO2021127821A1/zh
Publication of WO2021127821A1 publication Critical patent/WO2021127821A1/zh

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present invention relates to the field of computer processing, in particular to a training method, device, computer equipment and storage medium of a speech synthesis model.
  • a speech synthesis model is a system that processes text input and generates human speech.
  • with the maturity of deep learning technology and the improvement of computer performance, deep neural network technology is widely used in the training of speech synthesis models. Since training a neural-network-based speech synthesis model requires a large amount of text data, and such data sets are usually difficult to obtain, the neural network is not sufficiently trained when the data set is limited, and the quality of the synthesized speech is poor.
  • an embodiment of the present invention provides a method for training a speech synthesis model, and the method includes:
  • the training text data and the training phoneme data are used as the input of the speech synthesis model, and the training speech feature corresponding to the training text data is used as the expected output of the speech synthesis model to train the speech synthesis model to obtain the target speech synthesis model.
  • an embodiment of the present invention provides an apparatus for training a speech synthesis model, the apparatus including:
  • the text acquisition module is used to acquire training text data and training speech features corresponding to the training text data; the conversion module is used to obtain training phoneme data corresponding to the training text data from the training text data;
  • the training module is configured to use the training text data and the training phoneme data as the input of the speech synthesis model, and use the training speech feature corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model.
  • an embodiment of the present invention provides a computer device, including a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the training text data and the training phoneme data are used as the input of the speech synthesis model, and the training speech feature corresponding to the training text data is used as the expected output of the speech synthesis model to train the speech synthesis model to obtain the target speech synthesis model.
  • an embodiment of the present invention provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • the training text data and the training phoneme data are used as the input of the speech synthesis model, and the training speech feature corresponding to the training text data is used as the expected output of the speech synthesis model to train the speech synthesis model to obtain the target speech synthesis model.
  • the training method of the speech synthesis model described above enriches the training data set for training the speech synthesis model by using the training text data and the training phoneme data as inputs of the speech synthesis model at the same time. This alleviates the lack of training data and helps improve the quality of the synthesized speech; in addition, introducing phoneme information can eliminate mispronunciations that may occur in speech synthesis, thereby improving the accuracy of speech synthesis.
  • Fig. 1 is a flowchart of a method for training a speech synthesis model in an embodiment
  • Figure 2 is a schematic structural diagram of a speech synthesis model in an embodiment
  • Fig. 3 is a schematic diagram of a process of training a speech synthesis model in an embodiment
  • Fig. 4 is a schematic diagram of a prediction process of a target speech synthesis model in an embodiment
  • Figure 5 is a structural block diagram of a training device for a speech synthesis model in an embodiment
  • Fig. 6 is a structural block diagram of a training device for a speech synthesis model in another embodiment
  • Fig. 7 is an internal structure diagram of a computer device in an embodiment.
  • the training method of the speech synthesis model can be applied to a terminal or a server.
  • the application to the terminal is taken as an example.
  • the training method of the model specifically includes the following steps:
  • Step 102 Obtain training text data and training voice features corresponding to the training text data.
  • the training text data refers to the text data used to train the speech synthesis model.
  • Voice features are features used to express voice.
  • the voice feature can be converted into voice by using a vocoder.
  • the training speech feature refers to the labeled speech feature corresponding to the training text data.
  • Mel-spectrogram features can be used as the training speech features.
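  • as a hedged illustration (not taken from the patent), the snippet below shows one common way to compute Mel-spectrogram training targets from a recorded waveform using librosa; the sample rate, FFT size, hop length and 80 Mel bands are assumptions chosen for illustration only.
```python
# Illustrative sketch: extract a log-Mel-spectrogram training target from audio.
# Frame parameters (sr, n_fft, hop_length, n_mels) are assumptions, not values
# specified by the patent.
import librosa
import numpy as np

def mel_spectrogram(wav_path: str, n_mels: int = 80) -> np.ndarray:
    wav, sr = librosa.load(wav_path, sr=22050)               # load and resample audio
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)              # log-scaled Mel spectrogram
```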
  • Step 104 Obtain training phoneme data corresponding to the training text data according to the training text data.
  • a phoneme is a distinct unit of sound that distinguishes one word (or word element) from another in a given language. A phoneme sequence can generally be regarded as an abstract representation of a word's basic pronunciation.
  • in order to enrich the training data set of the speech synthesis model, training phoneme data is introduced here.
  • the training phoneme data is used as the supplementary input of the speech synthesis model, which is beneficial to improve the synthesized speech quality output by the speech synthesis model.
  • a phoneme converter may be used to convert training text data into training phoneme data, that is, a text sequence is converted into a corresponding phoneme sequence.
  • Step 106 The training text data and the training phoneme data are used as the input of the speech synthesis model, and the training speech feature corresponding to the training text data is used as the expected output of the speech synthesis model to train the speech synthesis model to obtain the target speech synthesis model.
  • the training text data and the training phoneme data are used as the training input of the speech synthesis model at the same time, and the training speech feature corresponding to the training text data is used as the desired output to train the speech synthesis model.
  • in order to train the speech synthesis model in a supervised manner, a training sample set is obtained; the training sample set includes multiple training samples, and each training sample includes training text data and the corresponding training speech features.
  • the process of training the speech synthesis model is a process of continuously updating the weight parameters in the speech synthesis model.
  • by using the training text data and the training phoneme data as the input of the speech synthesis model, the actual output of the speech synthesis model is obtained; the actual output and the expected output are fed into a preset loss function to obtain a loss value, and the weight parameters in the speech synthesis model are updated according to the loss value.
  • the updated speech synthesis model then continues to be trained until the calculated loss value reaches the convergence condition, at which point the updating stops and the finally trained speech synthesis model is taken as the target speech synthesis model.
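  • a minimal sketch of the supervised training loop described above, written in PyTorch as an assumption; the model class, the data loader, and the use of MSE as the preset loss function are placeholders rather than the patent's actual implementation.
```python
# Hedged sketch of the training loop: forward pass, loss against the training
# Mel features, backpropagation, weight update, stop on (crude) convergence.
import torch
import torch.nn as nn

def train(model: nn.Module, loader, max_epochs: int = 100, tol: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()                       # preset loss function (assumed MSE)
    prev_loss = float("inf")
    for _ in range(max_epochs):                    # maximum number of training cycles
        total = 0.0
        for inputs, target_mel in loader:          # inputs: text or phoneme id sequences
            optimizer.zero_grad()
            predicted_mel = model(inputs)          # actual output of the model
            loss = criterion(predicted_mel, target_mel)
            loss.backward()                        # gradients for every weight parameter
            optimizer.step()                       # update the weight parameters
            total += loss.item()
        if abs(prev_loss - total) < tol:           # convergence condition reached
            break
        prev_loss = total
    return model                                   # finally trained model
```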
  • the training method of the speech synthesis model described above enriches the training data set for training the speech synthesis model by using the training text data and the training phoneme data as inputs of the speech synthesis model at the same time. This alleviates the lack of training data and helps improve the quality of the synthesized speech; in addition, introducing phoneme information can eliminate mispronunciations that may occur in speech synthesis, thereby improving the accuracy of speech synthesis.
  • as shown in FIG. 2, the speech synthesis model includes: an encoder 202, a decoder 204, and an attention mechanism 206 connecting the encoder and the decoder; the encoder 202 is used to obtain encoding features from the training text data and the training phoneme data;
  • the decoder 204 is used to obtain the decoding feature according to the predicted speech feature in the previous time step;
  • the attention mechanism 206 is used to obtain the fixed-length vector according to the coding feature and the decoding feature, and the fixed-length vector is used as the input of the decoder;
  • the decoder 204 is also used to obtain the actual output speech features from the fixed-length vector.
  • the training text data and training phoneme data are used as the input of the speech synthesis model, and the training speech feature corresponding to the training text data is used as the expected output of the speech synthesis model to train the speech synthesis model to obtain the target speech synthesis model, including: according to the training speech feature Calculate the loss value with the actual voice feature, and update the weight parameter in the speech synthesis model according to the loss value.
  • the speech synthesis model can be trained using a deep neural network model (DNN).
  • the speech synthesis model can be divided into three parts, the encoder 202, the decoder 204, and the attention mechanism 206.
  • the function of the encoder is to perform a series of coding processing on the input training text data or training phoneme data to obtain coding features.
  • the encoding feature can be understood as the encoding feature vector obtained after encoding.
  • Part of the function of the decoder 204 is to perform a series of decoding processing on the input voice features predicted in the previous time step to obtain the decoded features.
  • the function of the attention mechanism 206 is to obtain the fixed-length context vector (that is, the fixed-length vector) required by the decoder according to the input encoding feature and decoding feature.
  • the decoder 204 is also used to predict the voice features according to the fixed-length vector, and output the actual voice features.
  • the encoder 202, the decoder 204, and the attention mechanism 206 are all implemented using neural networks.
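  • the sketch below illustrates the attention idea only: given the decoding feature (query) and the encoding features (memory), alignment weights are computed and a fixed-length context vector is returned. Plain dot-product attention is used here as an assumption; the patent does not commit to a particular attention formula.
```python
# Hedged sketch: fixed-length context vector from encoder outputs and a decoder state.
import torch
import torch.nn.functional as F

def attention_context(decoding_feature: torch.Tensor,    # (batch, dim)
                      encoding_features: torch.Tensor    # (batch, time, dim)
                      ) -> torch.Tensor:
    scores = torch.bmm(encoding_features,
                       decoding_feature.unsqueeze(-1)).squeeze(-1)   # (batch, time)
    weights = F.softmax(scores, dim=-1)                   # alignment over encoder steps
    context = torch.bmm(weights.unsqueeze(1),
                        encoding_features).squeeze(1)     # fixed-length vector (batch, dim)
    return context
```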
  • the encoder includes an embedding layer, a convolutional layer, and an encoding LSTM layer.
  • the encoder is used to obtain encoding features from the training text data and the training phoneme data as follows: the embedding layer converts the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, and one of the text feature vector and the phoneme feature vector is randomly selected as the input of the convolutional layer; the convolutional layer performs a convolution operation on the input text feature vector or phoneme feature vector to obtain convolution features, which are used as the input of the encoding LSTM layer; the encoding LSTM layer computes the encoding features from the convolution features.
  • specifically, the embedding layer converts the training text data and the training phoneme data into vector representations, for example 512-dimensional feature vectors, yielding the text feature vector and the phoneme feature vector; either of the two converted vectors is then used as the input of the convolutional layer, which performs convolution on the input vector to extract the convolution features.
  • the convolutional layer can be one layer or multiple layers (for example, 3 layers).
  • for the sake of distinction, the LSTM layer in the encoder is called the "encoding LSTM layer"; LSTM (Long Short-Term Memory) is a long short-term memory network, a type of recurrent neural network.
  • the encoding LSTM layer is used to process the input convolution features to obtain the encoding features, and the encoding LSTM layer can adopt a bidirectional LSTM.
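  • a minimal PyTorch sketch of the encoder just described (an assumption, not the patent's code): an embedding layer, a small stack of 1-D convolutions, and a bidirectional LSTM, with either the text sequence or the phoneme sequence chosen at random as input; a single shared id vocabulary for characters and phoneme symbols is assumed for simplicity.
```python
# Hedged encoder sketch: embedding -> convolutions -> bidirectional (encoding) LSTM.
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 512, n_convs: int = 3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)            # embedding layer
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=5, padding=2) for _ in range(n_convs))
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True,
                            bidirectional=True)                   # encoding LSTM layer

    def forward(self, text_ids: torch.Tensor, phoneme_ids: torch.Tensor):
        ids = random.choice([text_ids, phoneme_ids])              # random input selection
        x = self.embedding(ids).transpose(1, 2)                   # (batch, dim, time)
        for conv in self.convs:
            x = torch.relu(conv(x))                               # convolution features
        encoding, _ = self.lstm(x.transpose(1, 2))                # back to (batch, time, dim)
        return encoding                                           # encoding features
```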
  • the decoder includes: a pre-network layer and a decoded LSTM layer.
  • the decoder is used to obtain the decoding feature from the speech feature predicted in the previous time step as follows: the pre-net layer performs a non-linear mapping on the input speech feature predicted in the previous time step to obtain the mapped speech feature, which is used as the input of the decoding LSTM layer; the decoding LSTM layer computes the decoding feature from the mapped speech feature.
  • the input of the pre-net layer is the speech feature predicted in the previous time step (for example, the Mel spectrum).
  • the pre-net layer is used to perform non-linear mapping on the voice features predicted in the previous time step to obtain the mapped voice features.
  • in one embodiment, the pre-net layer is composed of ReLU units; ReLU is a nonlinear activation function used to perform the nonlinear mapping.
  • the decoding LSTM layer is used to process the input mapped speech features to obtain the decoding features.
  • in one embodiment, the pre-net layer is composed of 256 fully connected small pre-nets, and each small pre-net is composed of ReLU units.
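  • a short sketch of the pre-net idea as described (layer count and sizes are assumptions): fully connected layers with ReLU nonlinearities that map the previous time step's predicted Mel frame to the mapped speech feature.
```python
# Hedged pre-net sketch: non-linear mapping of the previously predicted Mel frame.
import torch.nn as nn

class PreNet(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),    # non-linear mapping, stage 1
            nn.Linear(hidden, hidden), nn.ReLU(),    # non-linear mapping, stage 2
        )

    def forward(self, prev_mel_frame):
        return self.net(prev_mel_frame)              # mapped speech feature
```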
  • the decoder further includes a post-net layer; the decoder is also used to obtain the actual output speech features from the fixed-length vector as follows: the mapped speech feature output by the pre-net layer is obtained and spliced with the fixed-length vector to obtain a feature vector; the feature vector is used as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer; the predicted speech feature is used as the input of the post-net layer, and the actual speech feature is obtained from the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
  • the output of the pre-net layer and the output of the attention mechanism are used as the input of the decoder.
  • the mapped voice feature output by the pre-net layer and the fixed-length vector output by the attention mechanism are spliced to obtain a feature vector, and the feature vector is used as the input of the decoded LSTM layer to obtain the predicted voice feature output by the decoded LSTM layer.
  • in one embodiment, the speech feature refers to the Mel spectrogram; the decoding LSTM layer mainly performs a linear projection on the feature vector and predicts the Mel spectrogram frame by frame.
  • a post-net layer is added after the decoded LSTM layer.
  • the post-net layer is used to process the predicted speech features output by the decoding LSTM layer to obtain a prediction residual, and the actual output speech feature is then obtained from the prediction residual and the predicted speech feature.
  • in one embodiment, the post-net layer is composed of 5 convolutional layers; passing the predicted speech features (for example, the Mel spectrogram) through the 5 convolutional layers enhances the prediction ability of the network.
  • in one embodiment, the decoding LSTM layer adopts unidirectional LSTM layers; the decoding LSTM layer may include one unidirectional LSTM layer or multiple unidirectional LSTM layers, and each unidirectional LSTM layer can include 1024 units.
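  • the sketch below is a hedged, single-frame illustration of one decoder step as described above: the mapped speech feature from the pre-net is spliced with the fixed-length context vector, passed through a unidirectional LSTM cell and a linear projection to predict a Mel frame, and a 5-layer convolutional post-net produces a residual that is added to give the actual output. All sizes are assumptions, and a real post-net would run over the whole predicted Mel sequence rather than one frame.
```python
# Hedged decoder-step sketch: splice pre-net output with context, decode, add post-net residual.
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, prenet_dim=256, context_dim=512, lstm_dim=1024, n_mels=80):
        super().__init__()
        self.lstm_cell = nn.LSTMCell(prenet_dim + context_dim, lstm_dim)  # decoding LSTM
        self.projection = nn.Linear(lstm_dim, n_mels)          # frame-wise linear projection
        self.postnet = nn.Sequential(                          # 5 convolutional layers
            *[nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2) for _ in range(5)])

    def forward(self, mapped_feature, context, state=None):
        x = torch.cat([mapped_feature, context], dim=-1)       # splice pre-net output and context
        h, c = self.lstm_cell(x, state)
        predicted_mel = self.projection(h)                     # predicted speech feature
        residual = self.postnet(predicted_mel.unsqueeze(-1)).squeeze(-1)  # prediction residual
        return predicted_mel + residual, (h, c)                # actual speech feature, new state
```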
  • obtaining the training phoneme data corresponding to the training text data from the training text data includes: inputting the training text data into a phoneme converter, where the phoneme converter normalizes the training text data into multiple normalized words and looks up the phoneme corresponding to each normalized word to obtain the training phoneme data corresponding to the training text data.
  • the phoneme converter is used to convert training text data into training phoneme data.
  • the input text is first standardized.
  • the standardization includes uniformly converting uppercase letters into lowercase letters, converting abbreviations into complete words, expanding numbers into text words, and so on. For example, "Mr." is converted to "mr.", "mr." is converted to "mister", and "20" is converted to "twenty". That is, all words in the text are converted into normalized word forms.
  • a machine-readable pronunciation dictionary is then used as a lookup table, and the phoneme corresponding to each normalized word is found according to the lookup table, so as to obtain the training phoneme data corresponding to the training text data.
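  • a minimal sketch of the phoneme-converter idea: lowercase and normalize the words, expand a few abbreviations and numbers, then look each normalized word up in a pronunciation dictionary. The tiny ABBREVIATIONS, NUMBERS, and LEXICON tables are hypothetical stand-ins for a full normalizer and a machine-readable dictionary such as CMUdict.
```python
# Hedged text-to-phoneme sketch: normalization followed by dictionary lookup.
import re

ABBREVIATIONS = {"mr.": "mister", "dr.": "doctor"}
NUMBERS = {"20": "twenty"}
LEXICON = {                          # hypothetical excerpt of a pronunciation dictionary
    "mister": ["M", "IH1", "S", "T", "ER0"],
    "twenty": ["T", "W", "EH1", "N", "T", "IY0"],
}

def text_to_phonemes(text: str) -> list:
    words = re.findall(r"[\w'.]+", text.lower())               # lowercase and tokenize
    phonemes = []
    for word in words:
        word = ABBREVIATIONS.get(word, word)                   # expand abbreviations
        word = NUMBERS.get(word, word)                         # expand numbers into words
        phonemes.extend(LEXICON.get(word.rstrip("."), ["<unk>"]))  # dictionary lookup
    return phonemes

print(text_to_phonemes("Mr. 20"))   # ['M', 'IH1', 'S', 'T', 'ER0', 'T', 'W', 'EH1', 'N', 'T', 'IY0']
```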
  • the method further includes: obtaining speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized; using the speech data to be synthesized as the input of the target speech synthesis model to obtain the target speech features output by the target speech synthesis model; and using a vocoder to convert the target speech features into the target speech.
  • the target speech synthesis model is the trained speech synthesis model. Since the target speech synthesis model uses both text data and phoneme data as input during training, when the target speech synthesis model is used for prediction, the input speech data to be synthesized can be either text data to be synthesized or phoneme data to be synthesized. After the target speech synthesis model outputs the target speech features, a vocoder is used to convert the target speech features into the target speech.
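  • a hedged end-to-end usage sketch of the prediction stage: text (or phoneme) data is fed to the trained target model to obtain the target speech features, which a vocoder then turns into a waveform. `target_model`, `vocoder`, and `encode_text` are placeholders for whatever concrete implementations are used; only the data flow follows the text above.
```python
# Hedged inference sketch: target model -> target speech features -> vocoder -> waveform.
import soundfile as sf

def synthesize(text, target_model, vocoder, encode_text, out_path="out.wav"):
    ids = encode_text(text)                           # text (or phonemes) to an id sequence
    target_features = target_model(ids)               # target Mel-spectrogram features
    waveform, sample_rate = vocoder(target_features)  # vocoder converts features to audio
    sf.write(out_path, waveform, sample_rate)         # save the target speech
    return out_path
```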
  • FIG. 3 is a schematic diagram of the process of training the speech synthesis model in an embodiment.
  • first, the training text data is obtained and copied into two copies: one copy is input as-is to the embedding layer of the encoder, and the other is converted into training phoneme data by the phoneme converter, and the training phoneme data is also input to the embedding layer of the encoder.
  • the embedding layer converts the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively. One of the text feature vector and the phoneme feature vector is then randomly selected as the input of the convolutional layer in the encoder, the output of the convolutional layer is used as the input of the encoding LSTM layer, and the encoding feature output by the encoding LSTM layer is used as the input of the attention mechanism.
  • at the other end, the speech feature (Mel spectrogram) predicted in the previous time step is used as the input of the pre-net layer in the decoder, the output of the pre-net layer is used as the input of the decoding LSTM layer, and the decoding feature output by the decoding LSTM layer is used as the input of the attention mechanism. The attention mechanism computes the fixed-length vector from the decoding feature and the encoding feature, and the fixed-length vector is in turn used as an input of the decoding LSTM layer.
  • the decoding LSTM layer splices the output of the pre-net layer (the mapped speech feature mentioned above) with the fixed-length vector and processes the spliced feature vector to obtain the predicted speech feature; the predicted speech feature is used as the input of the post-net layer, the prediction residual output by the post-net layer is obtained, and the actual output speech feature is then obtained from the predicted speech feature and the prediction residual.
  • after that, the loss value is calculated from the training speech feature and the actual speech feature, and the weight parameters in the speech synthesis model are updated from back to front by gradient descent according to the loss value.
  • the training is repeated in this way until the obtained loss value reaches the convergence condition, or until a maximum number of training iterations set at the start of training is reached; the training then stops, and the trained speech synthesis model is finally obtained.
  • FIG. 4 is a schematic diagram of the prediction process of the target speech synthesis model in an embodiment.
  • first, the speech data to be synthesized is obtained, and the speech data to be synthesized (text data to be synthesized or phoneme data to be synthesized) is used as the input of the embedding layer of the encoder to obtain the text feature vector or phoneme feature vector of the speech data to be synthesized; the text feature vector or phoneme feature vector is then used as the input of the convolutional layer in the encoder, the output of the convolutional layer is used as the input of the encoding LSTM layer, the encoding feature output by the encoding LSTM layer is used as the input of the attention mechanism, and the attention mechanism computes the fixed-length vector from the encoding feature.
  • the speech feature (Mel spectrogram) predicted in the previous time step is used as the input of the pre-net layer in the decoder, the output of the pre-net layer is used as the input of the decoding LSTM layer, and the decoding feature output by the decoding LSTM layer is used as the input of the attention mechanism; the attention mechanism computes the fixed-length vector from the decoding feature and the encoding feature, and the fixed-length vector is in turn used as an input of the decoding LSTM layer.
  • the decoding LSTM layer splices the output of the pre-net layer (the mapped speech feature mentioned above) with the fixed-length vector and processes the spliced feature vector to obtain the predicted speech feature; the predicted speech feature is used as the input of the post-net layer, the output of the post-net layer is obtained, and the output target speech feature is obtained from the output of the post-net layer and the predicted speech feature. After that, the target speech feature is used as the input of the vocoder, and the target speech output by the vocoder is obtained.
  • a training device for a speech synthesis model includes:
  • the training acquisition module 502 is configured to acquire training text data and training voice features corresponding to the training text data;
  • the phoneme conversion module 504 is configured to obtain training phoneme data corresponding to the training text data according to the training text data;
  • the training module 506 is configured to use the training text data and the training phoneme data as the input of the speech synthesis model, and use the training speech feature corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model.
  • the speech synthesis model includes: an encoder, a decoder, and an attention mechanism connecting the encoder and the decoder;
  • the encoder is used to obtain encoding features from the training text data and the training phoneme data;
  • the decoder is used to obtain a decoding feature according to the training speech feature corresponding to the training text data;
  • the attention mechanism is used to obtain a fixed-length vector from the encoding feature and the decoding feature, and the fixed-length vector is used as the input of the decoder; the decoder is also used to obtain the actual output speech features from the fixed-length vector;
  • the training module is further configured to calculate a loss value according to the training voice feature and the actual voice feature, and update the weight parameter in the speech synthesis model according to the loss value.
  • the encoder includes an embedding layer, a convolutional layer, and an encoding LSTM layer.
  • the encoder being configured to obtain encoding features from the training text data and the training phoneme data includes: the embedding layer is used to convert the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively, and to randomly select one of the text feature vector and the phoneme feature vector as the input of the convolutional layer;
  • the convolutional layer is used to perform a convolution operation on the input text feature vector or phoneme feature vector to obtain convolution features and use them as the input of the encoding LSTM layer; the encoding LSTM layer is used to compute the encoding features from the convolution features.
  • the decoder includes: a pre-net layer and a decoding LSTM layer.
  • the decoder being used to obtain a decoding feature from the speech feature predicted in the previous time step includes: the pre-net layer is used to perform a non-linear mapping on the input speech feature predicted in the previous time step to obtain the mapped speech feature, and the mapped speech feature is used as the input of the decoding LSTM layer; the decoding LSTM layer is used to compute the decoding feature from the mapped speech feature.
  • the decoder further includes a post-net layer; the decoder being further configured to obtain the actual output speech features from the fixed-length vector includes: obtaining the mapped speech feature output by the pre-net layer and splicing it with the fixed-length vector to obtain a feature vector; using the feature vector as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer; using the predicted speech feature as the input of the post-net layer, and obtaining the actual speech feature from the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
  • the conversion module is further used to input the training text data into a phoneme converter; the phoneme converter is used to normalize the training text data into a plurality of normalized words and to look up the phoneme corresponding to each normalized word, obtaining the training phoneme data corresponding to the training text data output by the phoneme converter.
  • the training device for the aforementioned speech synthesis model further includes:
  • the prediction acquisition module 508 is configured to acquire speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized;
  • the prediction module 510 is configured to use the to-be-synthesized speech data as the input of the target speech synthesis model, and obtain the target speech features output by the target speech synthesis model;
  • the voice conversion module 512 is configured to use a vocoder to convert the target voice feature into a target voice.
  • Fig. 7 shows an internal structure diagram of a computer device in an embodiment.
  • the computer equipment can be a terminal or a server.
  • the computer device includes a processor, a memory, and a network interface connected by a model bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operation model and may also store a computer program.
  • the processor can realize the training method of the speech synthesis model.
  • a computer program may also be stored in the internal memory.
  • the processor can execute the training method of the speech synthesis model.
  • the network interface is used to communicate with the outside world.
  • FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the training method of the speech synthesis model provided by the present application can be implemented in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 7.
  • the memory of the computer device can store various program templates that make up the training device of the speech synthesis model. For example, the training acquisition module 502, the phoneme conversion module 504, and the training module 506.
  • a computer device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps: acquiring training text data and training speech features corresponding to the training text data; obtaining training phoneme data corresponding to the training text data from the training text data; and using the training text data and the training phoneme data as the input of the speech synthesis model and the training speech feature corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model.
  • the speech synthesis model includes: an encoder, a decoder, and an attention mechanism connecting the encoder and the decoder;
  • the encoder is used to obtain encoding features from the training text data and the training phoneme data;
  • the decoder is used to obtain decoding features based on the speech features predicted in the previous time step;
  • the attention mechanism is used to obtain a fixed-length vector based on the encoding feature and the decoding feature, and use the fixed-length vector as the input of the decoder;
  • the decoder is also used to obtain the actual output speech features according to the fixed-length vector;
  • using the training text data and the training phoneme data as the input of the speech synthesis model and the training speech feature corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model includes: calculating a loss value according to the training speech feature and the actual speech feature, and updating the weight parameters in the speech synthesis model according to the loss value.
  • the encoder includes an embedding layer, a convolutional layer, and an encoding LSTM layer.
  • the encoder being configured to obtain encoding features according to the training text data and the training phoneme data includes: the embedding layer is used to convert the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively, and to randomly select one of the text feature vector and the phoneme feature vector as the input of the convolutional layer;
  • the convolutional layer is used to perform a convolution operation according to the input text feature vector or phoneme feature vector to obtain a convolution feature and use the convolution feature as the input of the encoding LSTM layer; the encoding LSTM layer is used to calculate the encoding feature according to the convolution feature.
  • the decoder includes: a pre-net layer and a decoding LSTM layer.
  • the decoder being used to obtain a decoding feature according to the speech feature predicted in the previous time step includes: the pre-net layer is used to perform a non-linear mapping on the input speech feature predicted in the previous time step to obtain the mapped speech feature, and the mapped speech feature is used as the input of the decoding LSTM layer; the decoding LSTM layer is used to calculate the decoding feature according to the mapped speech feature.
  • the decoder further includes a post-net layer; the decoder being further configured to obtain the actual output speech features according to the fixed-length vector includes: obtaining the mapped speech feature output by the pre-net layer and splicing it with the fixed-length vector to obtain a feature vector; using the feature vector as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer; using the predicted speech feature as the input of the post-net layer, and obtaining the actual speech feature according to the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
  • the obtaining of training phoneme data corresponding to the training text data according to the training text data includes: inputting the training text data into a phoneme converter, where the phoneme converter is used to normalize the training text data into a plurality of normalized words and to look up the phoneme corresponding to each normalized word, obtaining the training phoneme data corresponding to the training text data.
  • when the computer program is executed by the processor, the processor is also caused to perform the following steps: acquiring speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized;
  • the speech data to be synthesized is used as the input of the target speech synthesis model, and the target speech feature output by the target speech synthesis model is obtained; the target speech feature is converted into the target speech by using a vocoder.
  • a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • obtaining training text data and training speech features corresponding to the training text data; obtaining training phoneme data corresponding to the training text data according to the training text data; using the training text data and the training phoneme data as the input of the speech synthesis model, and using the training speech feature corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model.
  • the speech synthesis model includes: an encoder, a decoder, and an attention mechanism connecting the encoder and the decoder; the encoder is used to obtain encoding features from the training text data and the training phoneme data; the decoder is used to obtain decoding features based on the speech features predicted in the previous time step; the attention mechanism is used to obtain a fixed-length vector based on the encoding feature and the decoding feature and use the fixed-length vector as the input of the decoder; the decoder is also used to obtain the actual output speech features according to the fixed-length vector;
  • using the training text data and the training phoneme data as the input of the speech synthesis model and the training speech feature corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model includes: calculating a loss value according to the training speech feature and the actual speech feature, and updating the weight parameters in the speech synthesis model according to the loss value.
  • the encoder includes an embedding layer, a convolutional layer, and an encoding LSTM layer.
  • the encoder being configured to obtain encoding features according to the training text data and the training phoneme data includes: the embedding layer is used to convert the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively, and to randomly select one of the text feature vector and the phoneme feature vector as the input of the convolutional layer;
  • the convolutional layer is used to perform a convolution operation according to the input text feature vector or phoneme feature vector to obtain a convolution feature and use the convolution feature as the input of the encoding LSTM layer; the encoding LSTM layer is used to calculate the encoding feature according to the convolution feature.
  • the decoder includes: a pre-net layer and a decoding LSTM layer.
  • the decoder being used to obtain a decoding feature according to the speech feature predicted in the previous time step includes: the pre-net layer is used to perform a non-linear mapping on the input speech feature predicted in the previous time step to obtain the mapped speech feature, and the mapped speech feature is used as the input of the decoding LSTM layer; the decoding LSTM layer is used to calculate the decoding feature according to the mapped speech feature.
  • the decoder further includes a post-net layer; the decoder being further configured to obtain the actual output speech features according to the fixed-length vector includes: obtaining the mapped speech feature output by the pre-net layer and splicing it with the fixed-length vector to obtain a feature vector; using the feature vector as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer; using the predicted speech feature as the input of the post-net layer, and obtaining the actual speech feature according to the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
  • the obtaining of training phoneme data corresponding to the training text data according to the training text data includes: inputting the training text data into a phoneme converter, where the phoneme converter is used to normalize the training text data into a plurality of normalized words and to look up the phoneme corresponding to each normalized word, obtaining the training phoneme data corresponding to the training text data.
  • when the computer program is executed by the processor, the processor is also caused to perform the following steps: obtaining speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized;
  • the speech data to be synthesized is used as the input of the target speech synthesis model, and the target speech feature output by the target speech synthesis model is obtained; the target speech feature is converted into the target speech by using a vocoder.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Provided are a training method for a speech synthesis model, a training apparatus for a speech synthesis model, a computer device, and a storage medium. The method includes: obtaining training text data and training speech features corresponding to the training text data (102); obtaining training phoneme data corresponding to the training text data according to the training text data (104); and using the training text data and the training phoneme data as the input of a speech synthesis model and the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain a target speech synthesis model (106). By using the training text data and the training phoneme data as the input of the speech synthesis model at the same time, the training data set for training the speech synthesis model is enriched, and the quality and accuracy of the synthesized speech are improved.

Description

Training method and apparatus for a speech synthesis model, computer device and storage medium
Technical Field
The present invention relates to the field of computer processing, and in particular to a training method and apparatus for a speech synthesis model, a computer device, and a storage medium.
Background
A speech synthesis model is a system that processes text input and generates speech resembling human speech. With the maturity of deep learning technology and the improvement of computer performance, deep neural network technology is widely used in the training of speech synthesis models. Since training a neural-network-based speech synthesis model requires a large amount of text data, and such data sets are usually difficult to obtain, the neural network is not sufficiently trained when the data set is limited, and the quality of the synthesized speech is poor.
Technical Problem
Therefore, there is an urgent need for a training method for a speech synthesis model that yields good synthesized speech quality.
Technical Solution
Based on this, in view of the above problem, a training method and apparatus for a speech synthesis model with high accuracy, a computer device, and a storage medium are provided.
In a first aspect, an embodiment of the present invention provides a training method for a speech synthesis model, the method including:
obtaining training text data and training speech features corresponding to the training text data;
obtaining training phoneme data corresponding to the training text data according to the training text data;
using the training text data and the training phoneme data as the input of a speech synthesis model, and using the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain a target speech synthesis model.
In a second aspect, an embodiment of the present invention provides a training apparatus for a speech synthesis model, the apparatus including:
a text acquisition module, configured to obtain training text data and training speech features corresponding to the training text data;
a conversion module, configured to obtain training phoneme data corresponding to the training text data according to the training text data;
a training module, configured to use the training text data and the training phoneme data as the input of a speech synthesis model, and use the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain a target speech synthesis model.
In a third aspect, an embodiment of the present invention provides a computer device including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining training text data and training speech features corresponding to the training text data;
obtaining training phoneme data corresponding to the training text data according to the training text data;
using the training text data and the training phoneme data as the input of a speech synthesis model, and using the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain a target speech synthesis model.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining training text data and training speech features corresponding to the training text data;
obtaining training phoneme data corresponding to the training text data according to the training text data;
using the training text data and the training phoneme data as the input of a speech synthesis model, and using the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain a target speech synthesis model.
Beneficial Effects
In the training method of the speech synthesis model described above, the training data set for training the speech synthesis model is enriched by using the training text data and the training phoneme data as inputs of the speech synthesis model at the same time. This alleviates the lack of training data and helps improve the quality of the synthesized speech; in addition, introducing phoneme information can eliminate mispronunciations that may occur in speech synthesis, thereby improving the accuracy of speech synthesis.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the structures shown in these drawings without creative effort.
FIG. 1 is a flowchart of a training method for a speech synthesis model in an embodiment;
FIG. 2 is a schematic structural diagram of a speech synthesis model in an embodiment;
FIG. 3 is a schematic flowchart of training a speech synthesis model in an embodiment;
FIG. 4 is a schematic diagram of the prediction process of a target speech synthesis model in an embodiment;
FIG. 5 is a structural block diagram of a training apparatus for a speech synthesis model in an embodiment;
FIG. 6 is a structural block diagram of a training apparatus for a speech synthesis model in another embodiment;
FIG. 7 is an internal structure diagram of a computer device in an embodiment.
Embodiments of the Invention
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit the present invention.
As shown in FIG. 1, a training method for a speech synthesis model is proposed. The training method can be applied to a terminal or to a server; in this embodiment, application to a terminal is taken as an example. The training method of the speech synthesis model specifically includes the following steps:
Step 102: obtain training text data and training speech features corresponding to the training text data.
The training text data refers to the text data used to train the speech synthesis model. Speech features are features used to represent speech. When the speech features are known, a vocoder can be used to convert the speech features into speech. The training speech features refer to the labeled speech features corresponding to the training text data. Mel-spectrogram features can be used as the training speech features.
Step 104: obtain training phoneme data corresponding to the training text data according to the training text data.
A phoneme is a distinct unit of sound that distinguishes one word (or word element) from another in a given language. A phoneme sequence can generally be regarded as an abstract representation of a word's basic pronunciation. In order to enrich the training data set of the speech synthesis model, training phoneme data is introduced here. The training phoneme data serves as supplementary input to the speech synthesis model, which helps improve the quality of the synthesized speech output by the speech synthesis model.
In one embodiment, a phoneme converter may be used to convert the training text data into the training phoneme data, that is, to convert a text sequence into the corresponding phoneme sequence.
Step 106: use the training text data and the training phoneme data as the input of the speech synthesis model, and use the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain a target speech synthesis model.
In order for the trained target speech synthesis model to improve the quality of the synthesized speech, when training the speech synthesis model, the training text data and the training phoneme data are used as the training input of the speech synthesis model at the same time, and the training speech features corresponding to the training text data are used as the expected output.
In order to perform supervised training of the speech synthesis model, a training sample set needs to be obtained. The training sample set includes multiple training samples, and each training sample includes training text data and the corresponding training speech features.
The process of training the speech synthesis model is a process of continuously updating the weight parameters in the speech synthesis model. The training text data and the training phoneme data are used as the input of the speech synthesis model, the actual output of the speech synthesis model is obtained, the actual output and the expected output are fed into a preset loss function to obtain a loss value, and the weight parameters in the speech synthesis model are updated according to the loss value; the updated speech synthesis model then continues to be trained until the calculated loss value reaches the convergence condition, at which point the updating stops and the finally trained speech synthesis model is taken as the target speech synthesis model.
In the training method of the speech synthesis model described above, the training data set for training the speech synthesis model is enriched by using the training text data and the training phoneme data as inputs of the speech synthesis model at the same time. This alleviates the lack of training data and helps improve the quality of the synthesized speech; in addition, introducing phoneme information can eliminate mispronunciations that may occur in speech synthesis, thereby improving the accuracy of speech synthesis.
As shown in FIG. 2, in one embodiment the speech synthesis model includes: an encoder 202, a decoder 204, and an attention mechanism 206 connecting the encoder and the decoder. The encoder 202 is used to obtain encoding features from the training text data and the training phoneme data; the decoder 204 is used to obtain decoding features from the speech features predicted in the previous time step; the attention mechanism 206 is used to obtain a fixed-length vector from the encoding features and the decoding features, and the fixed-length vector is used as an input of the decoder; the decoder 204 is also used to obtain the actual output speech features from the fixed-length vector.
Using the training text data and the training phoneme data as the input of the speech synthesis model and the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model includes: calculating a loss value from the training speech features and the actual speech features, and updating the weight parameters in the speech synthesis model according to the loss value.
The speech synthesis model can be obtained by training a deep neural network (DNN) model. The speech synthesis model can be divided into three parts: the encoder 202, the decoder 204, and the attention mechanism 206. The function of the encoder is to perform a series of encoding operations on the input training text data or training phoneme data to obtain the encoding features. The encoding features can be understood as the encoding feature vectors obtained after encoding. Part of the function of the decoder 204 is to perform a series of decoding operations on the input speech features predicted in the previous time step to obtain the decoding features. By using the speech features predicted in the previous time step as an input of the decoder, they can serve as a reference; associating the prediction with the preceding context helps improve the accuracy of subsequent predictions.
The function of the attention mechanism 206 is to obtain the fixed-length context vector (that is, the fixed-length vector) required by the decoder from the input encoding features and decoding features. The decoder 204 is also used to predict the speech features from the fixed-length vector and output the actual speech features. The encoder 202, the decoder 204, and the attention mechanism 206 are all implemented with neural networks.
In one embodiment, the encoder includes an embedding layer, a convolutional layer, and an encoding LSTM layer. The encoder obtaining the encoding features from the training text data and the training phoneme data includes: the embedding layer converts the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively, and one of the text feature vector and the phoneme feature vector is randomly selected as the input of the convolutional layer; the convolutional layer performs a convolution operation on the input text feature vector or phoneme feature vector to obtain convolution features, which are used as the input of the encoding LSTM layer; the encoding LSTM layer computes the encoding features from the convolution features.
The embedding layer is used to convert the training text data and the training phoneme data into vector representations, for example 512-dimensional feature vectors, yielding the text feature vector and the phoneme feature vector; either of the two converted vectors is then used as the input of the convolutional layer, which performs convolution on the input vector to extract the convolution features. The convolutional layer can be one layer or multiple layers (for example, 3 layers). For the sake of distinction, the LSTM layer in the encoder is called the "encoding LSTM layer"; LSTM (Long Short-Term Memory) is a long short-term memory network, a type of recurrent neural network. The encoding LSTM layer is used to process the input convolution features to obtain the encoding features, and the encoding LSTM layer can adopt a bidirectional LSTM.
In one embodiment, the decoder includes a pre-net layer and a decoding LSTM layer. The decoder obtaining the decoding features from the speech features predicted in the previous time step includes: the pre-net layer performs a non-linear mapping on the input speech features predicted in the previous time step to obtain the mapped speech features, which are used as the input of the decoding LSTM layer; the decoding LSTM layer computes the decoding features from the mapped speech features.
The input of the pre-net layer is the speech feature predicted in the previous time step (for example, the Mel spectrogram). The pre-net layer performs a non-linear mapping on the speech feature predicted in the previous time step to obtain the mapped speech feature. In one embodiment, the pre-net layer is composed of ReLU units; ReLU is a nonlinear activation function used to perform the nonlinear mapping. The decoding LSTM layer is used to process the input mapped speech features to obtain the decoding features. In one embodiment, the pre-net layer is composed of 256 fully connected small pre-nets, and each small pre-net is composed of ReLU units.
In one embodiment, the decoder further includes a post-net layer; the decoder obtaining the actual output speech features from the fixed-length vector includes: obtaining the mapped speech feature output by the pre-net layer and splicing it with the fixed-length vector to obtain a feature vector; using the feature vector as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer; using the predicted speech feature as the input of the post-net layer, and obtaining the actual speech feature from the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
In order to improve the quality of speech synthesis, the output of the pre-net layer and the output of the attention mechanism are used together as the input of the decoder. Specifically, the mapped speech feature output by the pre-net layer and the fixed-length vector output by the attention mechanism are spliced to obtain a feature vector, which is used as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer. In one embodiment, the speech feature refers to the Mel spectrogram; the decoding LSTM layer mainly performs a linear projection on the feature vector and predicts the Mel spectrogram frame by frame.
In order to further enhance the prediction ability of the speech synthesis model, a post-net layer is added after the decoding LSTM layer. The post-net layer processes the predicted speech features output by the decoding LSTM layer to obtain a prediction residual, and the actual output speech features are then obtained from the prediction residual and the predicted speech features. In one embodiment, the post-net layer is composed of 5 convolutional layers; passing the predicted speech features (for example, the Mel spectrogram) through the 5 convolutional layers enhances the prediction ability of the network.
In one embodiment, the decoding LSTM layer adopts unidirectional LSTM layers; the decoding LSTM layer may include one unidirectional LSTM layer or multiple unidirectional LSTM layers, and each unidirectional LSTM layer can include 1024 units.
In one embodiment, obtaining the training phoneme data corresponding to the training text data according to the training text data includes: inputting the training text data into a phoneme converter, where the phoneme converter normalizes the training text data into multiple normalized words and looks up the phoneme corresponding to each normalized word to obtain the training phoneme data corresponding to the training text data.
The phoneme converter is used to convert the training text data into the training phoneme data. Specifically, inside the phoneme converter the input text is first normalized; the normalization includes uniformly converting uppercase letters into lowercase letters, converting abbreviations into complete words, expanding numbers into text words, and so on. For example, "Mr." is converted to "mr.", "mr." is converted to "mister", and "20" is converted to "twenty". That is, all words in the text are converted into normalized word forms. A machine-readable pronunciation dictionary is then used as a lookup table, and the phoneme corresponding to each normalized word is found according to the lookup table, so as to obtain the training phoneme data corresponding to the training text data.
In one embodiment, the method further includes: obtaining speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized; using the speech data to be synthesized as the input of the target speech synthesis model to obtain the target speech features output by the target speech synthesis model; and using a vocoder to convert the target speech features into the target speech.
The target speech synthesis model is the trained speech synthesis model. Since the target speech synthesis model uses both text data and phoneme data as input during training, when the target speech synthesis model is used for prediction, the input speech data to be synthesized can be either text data to be synthesized or phoneme data to be synthesized. After the target speech synthesis model outputs the target speech features, a vocoder is used to convert the target speech features into the target speech.
As shown in FIG. 3, which is a schematic flowchart of training the speech synthesis model in an embodiment: first, the training text data is obtained and copied into two copies; one copy is input as-is to the embedding layer of the encoder, and the other is converted into training phoneme data by the phoneme converter, and the training phoneme data is also input to the embedding layer of the encoder. The embedding layer converts the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively. One of the text feature vector and the phoneme feature vector is then randomly selected as the input of the convolutional layer in the encoder, the output of the convolutional layer is used as the input of the encoding LSTM layer, and the encoding feature output by the encoding LSTM layer is used as the input of the attention mechanism. At the other end, the speech feature (Mel spectrogram) predicted in the previous time step is used as the input of the pre-net layer in the decoder, the output of the pre-net layer is used as the input of the decoding LSTM layer, and the decoding feature output by the decoding LSTM layer is used as the input of the attention mechanism; the attention mechanism computes the fixed-length vector from the decoding feature and the encoding feature, and the fixed-length vector is in turn used as an input of the decoding LSTM layer. The decoding LSTM layer splices the output of the pre-net layer (the mapped speech feature mentioned above) with the fixed-length vector and processes the spliced feature vector to obtain the predicted speech feature; the predicted speech feature is used as the input of the post-net layer, the prediction residual output by the post-net layer is obtained, and the actual output speech feature is then obtained from the predicted speech feature and the prediction residual. After that, the loss value is calculated from the training speech features and the actual speech features, and the weight parameters in the speech synthesis model are updated from back to front by gradient descent according to the loss value. The training is repeated in this way until the obtained loss value reaches the convergence condition, or until a maximum number of training iterations set at the start of training is reached; the training then stops, and the trained speech synthesis model is finally obtained.
As shown in FIG. 4, which is a schematic diagram of the prediction process of the target speech synthesis model in an embodiment: first, the speech data to be synthesized is obtained, and the speech data to be synthesized (text data to be synthesized or phoneme data to be synthesized) is used as the input of the embedding layer of the encoder to obtain the text feature vector or phoneme feature vector of the speech data to be synthesized; the text feature vector or phoneme feature vector is then used as the input of the convolutional layer in the encoder, the output of the convolutional layer is used as the input of the encoding LSTM layer, the encoding feature output by the encoding LSTM layer is used as the input of the attention mechanism, and the attention mechanism computes the fixed-length vector from the encoding feature. The speech feature (Mel spectrogram) predicted in the previous time step is used as the input of the pre-net layer in the decoder, the output of the pre-net layer is used as the input of the decoding LSTM layer, and the decoding feature output by the decoding LSTM layer is used as the input of the attention mechanism; the attention mechanism computes the fixed-length vector from the decoding feature and the encoding feature, and the fixed-length vector is in turn used as an input of the decoding LSTM layer. The decoding LSTM layer splices the output of the pre-net layer (the mapped speech feature mentioned above) with the fixed-length vector and processes the spliced feature vector to obtain the predicted speech feature; the predicted speech feature is used as the input of the post-net layer, the output of the post-net layer is obtained, and the output target speech feature is obtained from the output of the post-net layer and the predicted speech feature. After that, the target speech feature is used as the input of the vocoder, and the target speech output by the vocoder is obtained.
As shown in FIG. 5, in one embodiment a training apparatus for a speech synthesis model is proposed, the apparatus including:
a training acquisition module 502, configured to obtain training text data and training speech features corresponding to the training text data;
a phoneme conversion module 504, configured to obtain training phoneme data corresponding to the training text data according to the training text data;
a training module 506, configured to use the training text data and the training phoneme data as the input of the speech synthesis model, and use the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model.
In one embodiment, the speech synthesis model includes: an encoder, a decoder, and an attention mechanism connecting the encoder and the decoder; the encoder is used to obtain encoding features from the training text data and the training phoneme data; the decoder is used to obtain decoding features from the training speech features corresponding to the training text data; the attention mechanism is used to obtain a fixed-length vector from the encoding features and the decoding features, and the fixed-length vector is used as an input of the decoder; the decoder is also used to obtain the actual output speech features from the fixed-length vector;
the training module is further configured to calculate a loss value from the training speech features and the actual speech features, and to update the weight parameters in the speech synthesis model according to the loss value.
In one embodiment, the encoder includes an embedding layer, a convolutional layer, and an encoding LSTM layer. The encoder obtaining the encoding features from the training text data and the training phoneme data includes: the embedding layer is used to convert the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively, and to randomly select one of the text feature vector and the phoneme feature vector as the input of the convolutional layer; the convolutional layer is used to perform a convolution operation on the input text feature vector or phoneme feature vector to obtain convolution features, which are used as the input of the encoding LSTM layer; the encoding LSTM layer is used to compute the encoding features from the convolution features.
In one embodiment, the decoder includes a pre-net layer and a decoding LSTM layer. The decoder obtaining the decoding features from the speech features predicted in the previous time step includes: the pre-net layer is used to perform a non-linear mapping on the input speech features predicted in the previous time step to obtain the mapped speech features, which are used as the input of the decoding LSTM layer; the decoding LSTM layer is used to compute the decoding features from the mapped speech features.
In one embodiment, the decoder further includes a post-net layer; the decoder obtaining the actual output speech features from the fixed-length vector includes: obtaining the mapped speech feature output by the pre-net layer and splicing it with the fixed-length vector to obtain a feature vector; using the feature vector as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer; using the predicted speech feature as the input of the post-net layer, and obtaining the actual speech feature from the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
In one embodiment, the conversion module is further configured to input the training text data into a phoneme converter; the phoneme converter is used to normalize the training text data into a plurality of normalized words and to look up the phoneme corresponding to each normalized word, obtaining the training phoneme data corresponding to the training text data output by the phoneme converter.
As shown in FIG. 6, in one embodiment the training apparatus for the speech synthesis model further includes:
a prediction acquisition module 508, configured to obtain speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized;
a prediction module 510, configured to use the speech data to be synthesized as the input of the target speech synthesis model and obtain the target speech features output by the target speech synthesis model;
a speech conversion module 512, configured to use a vocoder to convert the target speech features into the target speech.
FIG. 7 shows an internal structure diagram of a computer device in an embodiment. The computer device can be a terminal or a server. As shown in FIG. 7, the computer device includes a processor, a memory, and a network interface connected through a model bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operation model and may also store a computer program which, when executed by the processor, causes the processor to implement the training method of the speech synthesis model. A computer program may also be stored in the internal memory which, when executed by the processor, causes the processor to perform the training method of the speech synthesis model. The network interface is used to communicate with the outside. Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the training method of the speech synthesis model provided by the present application can be implemented in the form of a computer program, and the computer program can run on the computer device shown in FIG. 7. The memory of the computer device can store the program modules that make up the training apparatus of the speech synthesis model, for example the training acquisition module 502, the phoneme conversion module 504, and the training module 506.
A computer device includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps: obtaining training text data and training speech features corresponding to the training text data; obtaining training phoneme data corresponding to the training text data according to the training text data; using the training text data and the training phoneme data as the input of the speech synthesis model, and using the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model.
In one embodiment, the speech synthesis model includes: an encoder, a decoder, and an attention mechanism connecting the encoder and the decoder; the encoder is used to obtain encoding features from the training text data and the training phoneme data; the decoder is used to obtain decoding features from the speech features predicted in the previous time step; the attention mechanism is used to obtain a fixed-length vector from the encoding features and the decoding features, and the fixed-length vector is used as an input of the decoder; the decoder is also used to obtain the actual output speech features from the fixed-length vector;
using the training text data and the training phoneme data as the input of the speech synthesis model and the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model includes: calculating a loss value from the training speech features and the actual speech features, and updating the weight parameters in the speech synthesis model according to the loss value.
In one embodiment, the encoder includes an embedding layer, a convolutional layer, and an encoding LSTM layer. The encoder obtaining the encoding features from the training text data and the training phoneme data includes: the embedding layer is used to convert the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively, and to randomly select one of the text feature vector and the phoneme feature vector as the input of the convolutional layer; the convolutional layer is used to perform a convolution operation on the input text feature vector or phoneme feature vector to obtain convolution features, which are used as the input of the encoding LSTM layer; the encoding LSTM layer is used to compute the encoding features from the convolution features.
In one embodiment, the decoder includes a pre-net layer and a decoding LSTM layer. The decoder obtaining the decoding features from the speech features predicted in the previous time step includes: the pre-net layer is used to perform a non-linear mapping on the input speech features predicted in the previous time step to obtain the mapped speech features, which are used as the input of the decoding LSTM layer; the decoding LSTM layer is used to compute the decoding features from the mapped speech features.
In one embodiment, the decoder further includes a post-net layer; the decoder obtaining the actual output speech features from the fixed-length vector includes: obtaining the mapped speech feature output by the pre-net layer and splicing it with the fixed-length vector to obtain a feature vector; using the feature vector as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer; using the predicted speech feature as the input of the post-net layer, and obtaining the actual speech feature from the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
In one embodiment, obtaining the training phoneme data corresponding to the training text data according to the training text data includes: inputting the training text data into a phoneme converter, where the phoneme converter is used to normalize the training text data into a plurality of normalized words and to look up the phoneme corresponding to each normalized word, obtaining the training phoneme data corresponding to the training text data.
In one embodiment, when the computer program is executed by the processor, the processor is also caused to perform the following steps: obtaining speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized; using the speech data to be synthesized as the input of the target speech synthesis model to obtain the target speech features output by the target speech synthesis model; and using a vocoder to convert the target speech features into the target speech.
A computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining training text data and training speech features corresponding to the training text data; obtaining training phoneme data corresponding to the training text data according to the training text data; using the training text data and the training phoneme data as the input of the speech synthesis model, and using the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model.
In one embodiment, the speech synthesis model includes: an encoder, a decoder, and an attention mechanism connecting the encoder and the decoder; the encoder is used to obtain encoding features from the training text data and the training phoneme data; the decoder is used to obtain decoding features from the speech features predicted in the previous time step; the attention mechanism is used to obtain a fixed-length vector from the encoding features and the decoding features, and the fixed-length vector is used as an input of the decoder; the decoder is also used to obtain the actual output speech features from the fixed-length vector;
using the training text data and the training phoneme data as the input of the speech synthesis model and the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model includes: calculating a loss value from the training speech features and the actual speech features, and updating the weight parameters in the speech synthesis model according to the loss value.
In one embodiment, the encoder includes an embedding layer, a convolutional layer, and an encoding LSTM layer. The encoder obtaining the encoding features from the training text data and the training phoneme data includes: the embedding layer is used to convert the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively, and to randomly select one of the text feature vector and the phoneme feature vector as the input of the convolutional layer; the convolutional layer is used to perform a convolution operation on the input text feature vector or phoneme feature vector to obtain convolution features, which are used as the input of the encoding LSTM layer; the encoding LSTM layer is used to compute the encoding features from the convolution features.
In one embodiment, the decoder includes a pre-net layer and a decoding LSTM layer. The decoder obtaining the decoding features from the speech features predicted in the previous time step includes: the pre-net layer is used to perform a non-linear mapping on the input speech features predicted in the previous time step to obtain the mapped speech features, which are used as the input of the decoding LSTM layer; the decoding LSTM layer is used to compute the decoding features from the mapped speech features.
In one embodiment, the decoder further includes a post-net layer; the decoder obtaining the actual output speech features from the fixed-length vector includes: obtaining the mapped speech feature output by the pre-net layer and splicing it with the fixed-length vector to obtain a feature vector; using the feature vector as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer; using the predicted speech feature as the input of the post-net layer, and obtaining the actual speech feature from the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
In one embodiment, obtaining the training phoneme data corresponding to the training text data according to the training text data includes: inputting the training text data into a phoneme converter, where the phoneme converter is used to normalize the training text data into a plurality of normalized words and to look up the phoneme corresponding to each normalized word, obtaining the training phoneme data corresponding to the training text data.
In one embodiment, when the computer program is executed by the processor, the processor is also caused to perform the following steps: obtaining speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized; using the speech data to be synthesized as the input of the target speech synthesis model to obtain the target speech features output by the target speech synthesis model; and using a vocoder to convert the target speech features into the target speech.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be regarded as falling within the scope of this specification.
The above embodiments only express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the patent scope of the present application. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

  1. A training method for a speech synthesis model, characterized in that the method comprises:
    obtaining training text data and training speech features corresponding to the training text data;
    obtaining training phoneme data corresponding to the training text data according to the training text data;
    using the training text data and the training phoneme data as the input of a speech synthesis model, and using the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain a target speech synthesis model.
  2. The method according to claim 1, characterized in that the speech synthesis model comprises: an encoder, a decoder, and an attention mechanism connecting the encoder and the decoder; the encoder is used to obtain encoding features from the training text data and the training phoneme data; the decoder is used to obtain decoding features from the speech features predicted in the previous time step; the attention mechanism is used to obtain a fixed-length vector from the encoding features and the decoding features, and the fixed-length vector is used as an input of the decoder; the decoder is also used to obtain the actual output speech features from the fixed-length vector;
    the using the training text data and the training phoneme data as the input of the speech synthesis model and the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain the target speech synthesis model comprises:
    calculating a loss value from the training speech features and the actual speech features, and updating the weight parameters in the speech synthesis model according to the loss value.
  3. The method according to claim 2, characterized in that the encoder comprises an embedding layer, a convolutional layer, and an encoding LSTM layer, and the encoder being used to obtain the encoding features from the training text data and the training phoneme data comprises:
        the embedding layer is used to convert the training text data and the training phoneme data into a text feature vector and a phoneme feature vector, respectively, and to randomly select one of the text feature vector and the phoneme feature vector as the input of the convolutional layer; the convolutional layer is used to perform a convolution operation on the input text feature vector or phoneme feature vector to obtain convolution features, which are used as the input of the encoding LSTM layer; the encoding LSTM layer is used to compute the encoding features from the convolution features.
  4. The method according to claim 2, characterized in that the decoder comprises a pre-net layer and a decoding LSTM layer, and the decoder being used to obtain the decoding features from the speech features predicted in the previous time step comprises:
    the pre-net layer is used to perform a non-linear mapping on the input speech features predicted in the previous time step to obtain the mapped speech features, which are used as the input of the decoding LSTM layer; the decoding LSTM layer is used to compute the decoding features from the mapped speech features.
  5. The method according to claim 4, characterized in that the decoder further comprises a post-net layer, and the decoder being used to obtain the actual output speech features from the fixed-length vector comprises:
    obtaining the mapped speech feature output by the pre-net layer, and splicing the mapped speech feature with the fixed-length vector to obtain a feature vector;
    using the feature vector as the input of the decoding LSTM layer to obtain the predicted speech feature output by the decoding LSTM layer;
    using the predicted speech feature as the input of the post-net layer, and obtaining the actual speech feature from the output of the post-net layer and the predicted speech feature output by the decoding LSTM layer.
  6. The method according to claim 1, characterized in that the obtaining training phoneme data corresponding to the training text data according to the training text data comprises:
    inputting the training text data into a phoneme converter, where the phoneme converter is used to normalize the training text data into a plurality of normalized words and to look up the phoneme corresponding to each normalized word, obtaining the training phoneme data corresponding to the training text data.
  7. The method according to claim 1, characterized in that the method further comprises:
    obtaining speech data to be synthesized, where the speech data to be synthesized is text data to be synthesized or phoneme data to be synthesized;
    using the speech data to be synthesized as the input of the target speech synthesis model to obtain the target speech features output by the target speech synthesis model;
    using a vocoder to convert the target speech features into the target speech.
  8. A training apparatus for a speech synthesis model, characterized in that the apparatus comprises:
    an acquisition module, configured to obtain training text data and training speech features corresponding to the training text data;
    a conversion module, configured to obtain training phoneme data corresponding to the training text data according to the training text data;
    a training module, configured to use the training text data and the training phoneme data as the input of a speech synthesis model, and use the training speech features corresponding to the training text data as the expected output of the speech synthesis model to train the speech synthesis model and obtain a target speech synthesis model.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
PCT/CN2019/127339 2019-12-23 2019-12-23 语音合成模型的训练方法、装置、计算机设备及存储介质 WO2021127821A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980003169.3A CN111133506A (zh) 2019-12-23 2019-12-23 语音合成模型的训练方法、装置、计算机设备及存储介质
PCT/CN2019/127339 WO2021127821A1 (zh) 2019-12-23 2019-12-23 语音合成模型的训练方法、装置、计算机设备及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127339 WO2021127821A1 (zh) 2019-12-23 2019-12-23 语音合成模型的训练方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021127821A1 true WO2021127821A1 (zh) 2021-07-01

Family

ID=70507764

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127339 WO2021127821A1 (zh) 2019-12-23 2019-12-23 语音合成模型的训练方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN111133506A (zh)
WO (1) WO2021127821A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488021A (zh) * 2021-08-09 2021-10-08 杭州小影创新科技股份有限公司 一种提高语音合成自然度的方法
CN116092474A (zh) * 2023-04-07 2023-05-09 北京边锋信息技术有限公司 一种语音合成方法、装置

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583902B (zh) * 2020-05-14 2023-07-04 携程计算机技术(上海)有限公司 语音合成***、方法、电子设备及介质
CN111667814B (zh) * 2020-05-26 2023-09-12 北京声智科技有限公司 一种多语种的语音合成方法及装置
CN111696517A (zh) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 语音合成方法、装置、计算机设备及计算机可读存储介质
CN111916054B (zh) * 2020-07-08 2024-04-26 标贝(青岛)科技有限公司 基于唇形的语音生成方法、装置和***及存储介质
CN112002305B (zh) * 2020-07-29 2024-06-18 北京大米科技有限公司 语音合成方法、装置、存储介质及电子设备
CN112951200B (zh) * 2021-01-28 2024-03-12 北京达佳互联信息技术有限公司 语音合成模型的训练方法、装置、计算机设备及存储介质
CN113035228A (zh) * 2021-03-23 2021-06-25 广州酷狗计算机科技有限公司 声学特征提取方法、装置、设备及存储介质
CN113327578B (zh) * 2021-06-10 2024-02-02 平安科技(深圳)有限公司 一种声学模型训练方法、装置、终端设备及存储介质
CN113689844B (zh) * 2021-07-22 2022-05-27 北京百度网讯科技有限公司 语音合成模型的确定方法、装置、设备和存储介质
CN117765926B (zh) * 2024-02-19 2024-05-14 上海蜜度科技股份有限公司 语音合成方法、***、电子设备及介质

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118498A (zh) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 语音合成模型的训练方法及装置
CN107452369A (zh) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 语音合成模型生成方法和装置
CN107945786A (zh) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 语音合成方法和装置
CN108763190A (zh) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 基于语音的口型动画合成装置、方法及可读存储介质
US20180336880A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
CN109036377A (zh) * 2018-07-26 2018-12-18 ***股份有限公司 一种语音合成方法及装置
CN109326278A (zh) * 2017-07-31 2019-02-12 科大讯飞股份有限公司 一种声学模型构建方法及装置、电子设备
CN110136692A (zh) * 2019-04-30 2019-08-16 北京小米移动软件有限公司 语音合成方法、装置、设备及存储介质
KR102057926B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 음성 합성 장치 및 그 방법
CN110619867A (zh) * 2019-09-27 2019-12-27 百度在线网络技术(北京)有限公司 语音合成模型的训练方法、装置、电子设备及存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8527276B1 (en) * 2012-10-25 2013-09-03 Google Inc. Speech synthesis using deep neural networks
CN105185372B (zh) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 个性化多声学模型的训练方法、语音合成方法及装置
CN106652995A (zh) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 文本语音播报方法及***
CN108630190B (zh) * 2018-05-18 2019-12-10 百度在线网络技术(北京)有限公司 用于生成语音合成模型的方法和装置
CN108766413B (zh) * 2018-05-25 2020-09-25 北京云知声信息技术有限公司 语音合成方法及***
CN109859736B (zh) * 2019-01-23 2021-05-25 北京光年无限科技有限公司 语音合成方法及***
CN109767752B (zh) * 2019-02-27 2023-05-26 平安科技(深圳)有限公司 一种基于注意力机制的语音合成方法及装置

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118498A (zh) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 语音合成模型的训练方法及装置
US20180336880A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
CN109326278A (zh) * 2017-07-31 2019-02-12 科大讯飞股份有限公司 一种声学模型构建方法及装置、电子设备
CN107452369A (zh) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 语音合成模型生成方法和装置
CN107945786A (zh) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 语音合成方法和装置
CN108763190A (zh) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 基于语音的口型动画合成装置、方法及可读存储介质
CN109036377A (zh) * 2018-07-26 2018-12-18 ***股份有限公司 一种语音合成方法及装置
KR102057926B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 음성 합성 장치 및 그 방법
CN110136692A (zh) * 2019-04-30 2019-08-16 北京小米移动软件有限公司 语音合成方法、装置、设备及存储介质
CN110619867A (zh) * 2019-09-27 2019-12-27 百度在线网络技术(北京)有限公司 语音合成模型的训练方法、装置、电子设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488021A (zh) * 2021-08-09 2021-10-08 杭州小影创新科技股份有限公司 一种提高语音合成自然度的方法
CN116092474A (zh) * 2023-04-07 2023-05-09 北京边锋信息技术有限公司 一种语音合成方法、装置

Also Published As

Publication number Publication date
CN111133506A (zh) 2020-05-08

Similar Documents

Publication Publication Date Title
WO2021127821A1 (zh) 语音合成模型的训练方法、装置、计算机设备及存储介质
CN109271646B (zh) 文本翻译方法、装置、可读存储介质和计算机设备
US11705107B2 (en) Real-time neural text-to-speech
US11289069B2 (en) Statistical parameter model establishing method, speech synthesis method, server and storage medium
CN109446534B (zh) 机器翻译方法及装置
CN108170686B (zh) 文本翻译方法及装置
CN110603583B (zh) 语音识别***和用于语音识别的方法
CN109785824B (zh) 一种语音翻译模型的训练方法及装置
Kastner et al. Representation mixing for tts synthesis
WO2022141678A1 (zh) 语音合成方法、装置、设备及存储介质
CN111222347B (zh) 语句翻译模型的训练方法及装置、语句翻译方法及装置
WO2021127817A1 (zh) 一种多语言文本合成语音方法、装置、设备及存储介质
CN110288972B (zh) 语音合成模型训练方法、语音合成方法及装置
WO2021051765A1 (zh) 一种语音合成方法及装置、存储介质
WO2022141842A1 (zh) 基于深度学习的语音训练方法、装置、设备以及存储介质
CN112382272B (zh) 可控制语音速度的语音合成方法、装置、设备及存储介质
Wu et al. Encoding linear models as weighted finite-state transducers.
WO2021134581A1 (zh) 基于韵律特征预测的语音合成方法、装置、终端及介质
CN113178188B (zh) 语音合成方法、装置、设备及存储介质
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN113450765A (zh) 语音合成方法、装置、设备及存储介质
CN114464162B (zh) 语音合成方法、神经网络模型训练方法、和语音合成模型
WO2022141870A1 (zh) 基于人工智能的语音合成方法、装置、计算机设备和介质
CN113449529A (zh) 一种翻译模型的训练方法及装置、翻译方法及装置
CN114743539A (zh) 语音合成方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19957625

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19957625

Country of ref document: EP

Kind code of ref document: A1