CN113129862A - WORLD-Tacotron-based speech synthesis method, system, and server

Info

Publication number: CN113129862A
Application number: CN202110436317.XA
Authority: CN
Inventors: 卫星; 杜燕妮; 翟琰; 赵冲; 李航; 帅竞贤; 沈奥; 陆阳
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2021-07-16
Anticipated expiration: 2041-04-22
Also published as: CN113129862B

Abstract

The invention relates to the technical field of artificial intelligence and provides a WORLD-Tacotron-based speech synthesis method, system, and server. Building on the existing Tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced: the main task is an improved Tacotron model, which learns acoustic feature parameter prediction based on character-level embedded representations; the auxiliary task is a prosody generation model, namely a prosody generator, which learns prosody prediction based on word-level embeddings. In the training stage, joint training of the two tasks allows the model to learn more explicit prosodic knowledge, thereby improving the quality of the output speech.

Description

WORLD-Tacotron-based speech synthesis method, system, and server

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a WORLD-Tacotron-based speech synthesis method, system, and server.

Background

Speech synthesis, also called text-to-speech, is a technology for converting text to natural speech, and plays an important role in man-machine communication.

End-to-end speech synthesis is the common deep learning approach: it maps text directly to speech, reduces manual intervention in intermediate stages, and lowers the difficulty of speech synthesis research.

Current end-to-end speech synthesis models are Seq2Seq models with an attention mechanism built on an encoder-decoder framework. The Tacotron model, introduced by Google in 2017, was the first truly end-to-end speech synthesis model: it takes text or a phonetic transcription string as input, outputs a linear spectrogram, and converts the linear spectrogram into audio with the Griffin-Lim algorithm. However, the Griffin-Lim algorithm requires high-dimensional acoustic features and synthesizes low-quality speech.

In 2018, Google introduced the Tacotron2 model, an improvement on Tacotron that removes the complex CBHG structure and GRU units and replaces them with LSTM and convolutional layers. The model outputs a Mel spectrogram, which is then converted into audio by WaveNet. Replacing the Griffin-Lim algorithm with WaveNet greatly improves synthesis quality, but WaveNet must be trained on a large amount of data and synthesizes slowly.

In addition, various studies have shown that the above models can generate human-like speech in English and other languages, but for languages such as Chinese they produce serious prosodic phrasing errors. Recent studies have modeled prosody in end-to-end TTS models, for example by using contextual information and syntactic features to improve prosodic phrasing. However, these methods are applied in the text preprocessing stage and are not optimized as part of the synthesis process, so the quality of the final synthesized speech remains poor.

Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present invention is to provide a WORLD-Tacotron-based speech synthesis method, system, and server, which solve the problems of low synthesis quality, slow synthesis speed, and prosodic phrasing errors in the prior art.

A first aspect of the invention provides a WORLD-Tacotron-based speech synthesis method, comprising the following steps:

obtaining a sample text, and respectively converting the sample text into a word sequence and a character sequence;

coding the character sequence to obtain a coded representation;

performing phrase break prediction on the word sequence to obtain a prosody vector;

connecting the prosody vector and the coding representation into a joint vector, and decoding the joint vector to obtain a first acoustic feature sequence;

calculating a prosodic parameter prediction loss according to the prosody vector, calculating an acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;

and inputting a text to be processed into the speech synthesis model for processing, and synthesizing the resulting second acoustic feature sequence into a speech waveform.

In an embodiment of the present invention, the step of converting the sample text into a word sequence and a character sequence respectively includes:

performing word segmentation on the sample text to obtain a word sequence;

and converting the sample text into a pinyin form sequence with tones, and converting the pinyin form sequence into a character sequence.

In an embodiment of the present invention, the step of encoding the character sequence to obtain an encoded representation includes:

inputting the character sequence into an encoder, and obtaining an encoded representation at the output of the encoder; wherein the encoder is an encoder based on the Tacotron model.

In an embodiment of the present invention, the step of performing phrase break prediction on the word sequence to obtain a prosody vector includes:

and inputting the word sequence into a prosody generator, and obtaining a prosody vector at the output end of the prosody generator.

In one embodiment of the present invention, the step of concatenating the prosody vector and the coded representation into a joint vector comprises:

distributing the prosody vector corresponding to each word in the word sequence to all characters of each word to obtain a character-level prosody vector;

and splicing the coded representation and the character-level prosody vector to obtain a joint vector.

In an embodiment of the present invention, the decoding the joint vector to obtain the acoustic feature sequence includes:

inputting the joint vector into a decoder, and obtaining an acoustic feature sequence at the output end of the decoder; wherein the decoder is a decoder with attention mechanism based on a tacotron model.

In an embodiment of the present invention, the step of calculating a prosodic parameter prediction loss according to the prosodic vector, calculating an acoustic feature loss according to the first acoustic feature sequence, and training the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model includes:

calculating the prosodic parameter prediction loss by taking the ground-truth prosody vector as a label, wherein the ground-truth prosody vector is obtained by one-hot encoding the word sequence according to the speech audio corresponding to the sample text;

recording the mean square error of the first acoustic feature sequence and the multi-dimensional acoustic features as acoustic feature loss; wherein the multi-dimensional acoustic features are extracted from the speech audio;

and weighting and summing the prosodic parameter prediction loss and the acoustic characteristic loss to obtain global loss, and performing back propagation training by using the global loss to obtain a trained speech synthesis model.

In an embodiment of the present invention, the step of synthesizing the second acoustic feature sequence into a sound waveform includes:

inputting the second acoustic feature sequence into a vocoder, and obtaining the corresponding sound waveform at the output of the vocoder; wherein the vocoder is a WORLD vocoder.

A second aspect of the present invention provides a WORLD-Tacotron-based speech synthesis system, including:

the input module is used for acquiring a sample text and a text to be processed;

the text preprocessing module is used for converting the sample text or the text to be processed into a word sequence and a character sequence;

the encoder is used for encoding the character sequence to obtain an encoded representation;

the prosody generator is used for performing phrase break prediction on the word sequence to obtain a prosody vector;

the processing module is used for concatenating the prosody vector and the encoded representation into a joint vector, and is further used for calculating a prosodic parameter prediction loss according to the prosody vector, calculating an acoustic feature loss according to a first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;

the decoder is used for decoding the joint vector to obtain a first acoustic characteristic sequence or a second acoustic characteristic sequence;

a vocoder for synthesizing the second acoustic feature sequence into a sound waveform.

A third aspect of the present invention provides a server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the WORLD-Tacotron-based speech synthesis methods of the first aspect of the present invention.

As described above, the WORLD-Tacotron-based speech synthesis method, system, and server of the present invention have the following beneficial effects:

on the basis of the existing Tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced: the main task is an improved Tacotron model, which learns acoustic feature parameter prediction based on character-level embedded representations; the auxiliary task is a prosody generation model, namely a prosody generator, which learns prosody prediction based on word-level embeddings. In the training stage, joint training of the two tasks allows the model to learn more explicit prosodic knowledge, thereby improving the quality of the output speech.

Drawings

Fig. 1 is a flowchart illustrating a WORLD-Tacotron-based speech synthesis method according to a first embodiment of the present invention.

Fig. 2 is a block diagram illustrating a WORLD-Tacotron-based speech synthesis system according to a second embodiment of the present invention.

Fig. 3 is a schematic diagram of a server according to a third embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than the number, shape and size of the components in practical implementation, and the type, quantity and proportion of the components in practical implementation can be changed freely, and the layout of the components can be more complicated.

Referring to Fig. 1, a first embodiment of the present invention relates to a WORLD-Tacotron-based speech synthesis method, which includes the following steps:

Step 101, obtaining a sample text, and converting the sample text into a word sequence and a character sequence respectively.

Specifically, the step of converting the sample text into a word sequence and a character sequence respectively comprises the following steps:

The sample text is taken as training data, and word segmentation is performed on it to obtain a word sequence. The word sequence is labeled manually: the positions of the breaks introduced by the speaker are extracted from the speaker's pausing pattern, and each word in the sample text is labeled as break or non-break depending on whether a break occurs after it. Spaces, punctuation marks, and stop tokens occur naturally in the sample text.

It should be noted that the sample text in this embodiment carries corresponding voice audio.

Further, the sample text is converted into a pinyin form sequence with tones, and the pinyin form sequence is converted into a character sequence.

Specifically, the Chinese text is converted into a pinyin form with tones. For example, the Chinese text "你好的" is converted to the pinyin form "ni3 hao3 de5", where the numerals 1-4 represent the first, second, third, and fourth tones, and the neutral tone is labeled with the numeral 5. The pinyin form is then converted into a character sequence.
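The pinyin-to-character conversion can be illustrated with a minimal sketch. It assumes the tone-numbered pinyin is already available as a space-separated string; the function name `pinyin_to_chars` and the use of a space as the syllable boundary character are illustrative assumptions, not details given in this embodiment.

```python
def pinyin_to_chars(toned_pinyin):
    """Split tone-numbered pinyin syllables into a flat character sequence."""
    chars = []
    for syllable in toned_pinyin.split():
        # Every syllable must end in a tone digit 1-5 (5 = neutral tone).
        if syllable[-1] not in "12345":
            raise ValueError(f"missing tone digit in syllable: {syllable!r}")
        chars.extend(syllable)   # letters followed by the tone digit
        chars.append(" ")        # syllable boundary (assumed convention)
    return chars[:-1]            # drop the trailing boundary


char_sequence = pinyin_to_chars("ni3 hao3 de5")
```

Each element of `char_sequence` is one symbol (a letter, a tone digit, or a boundary), which is the granularity the character embedding layer operates on.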

Optionally, the sample text is normalized before word segmentation. In real use, the sample text contains many non-standard words, such as Arabic numerals, English characters, and various symbols; text normalization converts these non-Chinese characters into the corresponding Chinese characters.

It should be understood that the sample text and the corresponding voice audio can be obtained from the public voice database, and in practical application, the user can select the appropriate database according to the requirement.

Step 102, encoding the character sequence to obtain an encoded representation.

Specifically, the character sequence $C = [c_1, c_2, \ldots, c_m]$ is encoded by an encoder, and a hidden state sequence $H$, i.e., the encoded representation in this embodiment, is obtained at the output of the encoder. The encoded representation is a sequence of high-level character vectors, denoted $CE' = [ce_1', ce_2', \ldots, ce_m']$, and is expressed as $CE' = H = \mathrm{Encoder}(C)$.

Optionally, the encoder used in this embodiment is an encoder based on a tacotron model, and includes a character embedding layer (character embedding), a pre-network (pre-net), and a CBHG module, which are connected in sequence.

Further, the character sequence is used as the input of the character embedding layer, where each character is represented as a one-hot encoded vector. A trainable embedding matrix with initial parameters is set, and it maps each one-hot vector to a character embedding vector, that is, each input character is represented as a 256-dimensional character embedding vector. During model training, the embedding matrix is trained through back propagation like the other network layer parameters, yielding an embedding matrix that can represent every character in the character set.

The character embedding vector undergoes a series of nonlinear transformations in the pre-network. The pre-network comprises two fully connected layers, each followed by a ReLU activation function, and the dropout rate of each fully connected layer is set to 0.5 to enhance the generalization ability of the model. The number of output channels is 256 for the first fully connected layer and 128 for the second.
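The pre-network described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the weight initialization, the inverted-dropout scaling, and the name `prenet` are not specified in this embodiment and are chosen only for the example.

```python
import numpy as np


def prenet(x, w1, b1, w2, b2, rng, drop_rate=0.5, training=True):
    """Two fully connected layers, each followed by ReLU and dropout."""
    for w, b in ((w1, b1), (w2, b2)):
        x = np.maximum(x @ w + b, 0.0)                # dense layer + ReLU
        if training:
            keep = rng.random(x.shape) >= drop_rate   # units that stay active
            x = x * keep / (1.0 - drop_rate)          # inverted dropout (assumed)
    return x


rng = np.random.default_rng(0)
embedded = rng.standard_normal((12, 256))             # 12 characters, 256-dim embeddings
w1, b1 = rng.standard_normal((256, 256)) * 0.05, np.zeros(256)
w2, b2 = rng.standard_normal((256, 128)) * 0.05, np.zeros(128)
prenet_out = prenet(embedded, w1, b1, w2, b2, rng)    # shape (12, 128)
```

The 256-to-128 reduction in the second layer matches the output channel counts stated above.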

The non-linearly transformed character embedding vector is input into the CBHG module, and an encoded representation of the character sequence is obtained at its output. The CBHG module consists of a one-dimensional convolution filter bank, a pooling layer, a highway network, and a recurrent neural network (RNN) of bidirectional gated recurrent units (GRU).

The one-dimensional convolution filter bank can effectively model local and contextual information. The input sequence first passes through a convolutional layer that has K one-dimensional filters of different sizes, with filter sizes 1, 2, 3, …, K; convolution kernels of different sizes extract context information of different lengths. The kernel sizes (kernel_size) of the one-dimensional convolution windows used by the CBHG module are [1, 2, 3, ..., 16], and padding is set to 'same' during the convolution operation. To keep the outputs aligned, the outputs of the K convolution kernels of different sizes are stacked together: each one-dimensional convolution outputs 128 channels, and concatenating the outputs along axis -1 gives 128 × 16 dimensions.

The next layer is a max pooling layer; pooling increases local invariance, with stride set to 1 and width set to 2. After pooling, the sequence passes through two one-dimensional convolutional layers. The first has a filter size of 3, a stride of 1, a ReLU activation function, and 128 output channels; the second has a filter size of 3, a stride of 1, no activation function, and 128 output channels. Batch normalization (BN) is applied between the two one-dimensional convolutional layers.

After the convolutional layers, a residual connection is applied, that is, the output of the convolutional layers is added to the sequence output by the embedding layer, and the sum is input into a multi-layer highway network to extract high-level features. The highway network consists of 4 fully connected layers with ReLU as the activation function and 128-dimensional output features. Finally, the output of the highway network is input into a bidirectional RNN containing 128 GRU units, which extracts sequence features from the forward and backward contexts; concatenating the two directions gives 256 dimensions, which is the final encoded representation of the character sequence.
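As a rough illustration of the filter bank, the sketch below applies 16 one-dimensional convolutions with kernel sizes 1 to 16 and 'same' padding, then concatenates their outputs along the last axis. The helper names `conv1d_same` and `conv_bank`, the random weights, and the asymmetric padding split for even kernel sizes are assumptions made for the example only.

```python
import numpy as np


def conv1d_same(x, kernel):
    """1-D convolution over time with 'same' padding.

    x: (T, C_in); kernel: (k, C_in, C_out); returns (T, C_out).
    """
    k = kernel.shape[0]
    left = (k - 1) // 2
    padded = np.pad(x, ((left, k - 1 - left), (0, 0)))
    # windows[t, i] is the padded input at time t + i, shape (T, k, C_in).
    windows = np.stack([padded[i:i + x.shape[0]] for i in range(k)], axis=1)
    return np.einsum("tkc,kco->to", windows, kernel)


def conv_bank(x, kernels):
    """Concatenate the outputs of all filters along the channel axis."""
    return np.concatenate([conv1d_same(x, w) for w in kernels], axis=-1)


rng = np.random.default_rng(0)
x = rng.standard_normal((20, 128))                     # 20 time steps, 128 channels
kernels = [rng.standard_normal((k, 128, 128)) * 0.01 for k in range(1, 17)]
bank_out = conv_bank(x, kernels)                       # shape (20, 128 * 16)
```

Because every filter uses 'same' padding, all 16 outputs keep the input length of 20 steps, which is what makes the concatenation along axis -1 possible.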
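A single highway layer can be sketched as below. The gating form (a sigmoid transform gate that mixes the ReLU-transformed input with the untouched input) is the standard highway network formulation and is assumed here, since this embodiment only states the layer count, activation, and width; all names and weights are illustrative.

```python
import numpy as np


def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x with H = ReLU and T = sigmoid."""
    h = np.maximum(x @ w_h + b_h, 0.0)                 # candidate transform
    t = 1.0 / (1.0 + np.exp(-(x @ w_t + b_t)))         # transform gate in (0, 1)
    return t * h + (1.0 - t) * x


rng = np.random.default_rng(0)
features = rng.standard_normal((20, 128))
highway_out = features
for _ in range(4):                                     # 4 stacked layers, 128-dim each
    w_h = rng.standard_normal((128, 128)) * 0.05
    w_t = rng.standard_normal((128, 128)) * 0.05
    highway_out = highway_layer(highway_out, w_h, np.zeros(128), w_t, np.full(128, -1.0))
```

A negative gate bias, as used here, makes each layer start close to the identity mapping; that choice is a common convention rather than something stated in the text.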

Step 103, performing phrase break prediction on the word sequence to obtain a prosody vector.

Specifically, the word sequence is input into a prosody generator to predict phrase breaks, and a prosody vector $PE = [pe_1, pe_2, \ldots, pe_T]$ is obtained at the output of the prosody generator.

Optionally, the prosody generator comprises an input layer, a bi-directional LSTM layer, and an output layer.

More specifically, the input layer converts the input word sequence into word vectors as its feature representation: assuming the input word sequence $W = [w_1, \ldots, w_t, \ldots, w_T]$ has length $T$, it is converted into the corresponding feature vectors $V = [v_1, \ldots, v_t, \ldots, v_T]$ by looking up a vocabulary of pre-trained word vectors.

The bidirectional LSTM layer reads the feature vectors $V$ and converts them into a high-level feature representation. A forward LSTM and a backward LSTM each read the feature vector sequence and output hidden state vectors; the forward and backward hidden state vectors are concatenated to obtain the final hidden state $H$, which is sent to the output layer for decoding to obtain the final prosodic label corresponding to each word.

The output layer decodes the hidden state $H$ output by the BiLSTM layer using a Softmax function and outputs the prosody vector $PE = [pe_1, \ldots, pe_t, \ldots, pe_T]$ corresponding to the words, where $pe_t$ is a 5-dimensional vector, $pe_t = [p_t[1], \ldots, p_t[k], \ldots, p_t[5]]$, $t \in [1, T]$; $p_t[k]$ denotes the probability that pattern $k$ occurs, and $k$ represents the target phrase break pattern.

It should be noted that this embodiment defines the PE (prosody embedding) as a vector composed of five elements, namely break, non-break, space (blank), punctuation, and stop token, which represent five phrase break patterns.
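The output layer of the prosody generator can be illustrated with a short sketch that maps BiLSTM hidden states to a 5-way probability vector per word. The hidden size, the random weights, and the function name `prosody_output_layer` are assumptions for the example; the BiLSTM itself is omitted and replaced by random hidden states.

```python
import numpy as np

# The five phrase break patterns named in this embodiment.
PATTERNS = ["break", "non-break", "blank", "punctuation", "stop token"]


def prosody_output_layer(hidden, w, b):
    """Project hidden states to 5 logits per word and apply Softmax."""
    logits = hidden @ w + b
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((3, 256))      # stand-in for BiLSTM output, T = 3 words
w_out = rng.standard_normal((256, len(PATTERNS))) * 0.1
pe = prosody_output_layer(hidden_states, w_out, np.zeros(len(PATTERNS)))
predicted = [PATTERNS[k] for k in pe.argmax(axis=-1)]
```

Each row of `pe` plays the role of one $pe_t$: five non-negative probabilities that sum to 1.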

Step 104, concatenating the prosody vector and the encoded representation into a joint vector, and decoding the joint vector to obtain a first acoustic feature sequence.

Specifically, because the word sequence and the character sequence differ in length, this embodiment upsamples the prosody vector: the prosody vector corresponding to each word in the word sequence is assigned to all characters of that word to obtain a character-level prosody vector $PE'$. The encoded representation $CE'$ output by the encoder is then concatenated with the character-level prosody vector $PE'$ to obtain the final joint vector $JE'$, expressed as:

$JE' = [PE'; CE']$
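The upsampling and concatenation can be sketched with NumPy as follows. The sketch assumes the number of characters belonging to each word is known from preprocessing; the name `build_joint_vector` and the example sizes are hypothetical.

```python
import numpy as np


def build_joint_vector(pe, ce, chars_per_word):
    """Repeat each word-level prosody vector over its characters, then concatenate.

    pe: (T, 5) word-level prosody vectors
    ce: (m, d) character-level encoded representation
    chars_per_word: length-T list whose sum is m
    """
    if sum(chars_per_word) != ce.shape[0]:
        raise ValueError("character counts must add up to the encoder length")
    pe_char = np.repeat(pe, chars_per_word, axis=0)    # PE', shape (m, 5)
    return np.concatenate([pe_char, ce], axis=-1)      # JE' = [PE'; CE']


rng = np.random.default_rng(0)
pe = rng.random((2, 5))                                # 2 words
ce = rng.standard_normal((7, 256))                     # 7 characters in total
je = build_joint_vector(pe, ce, [3, 4])                # shape (7, 261)
```

Here the first three rows of `je` carry the prosody vector of the first word and the remaining four rows carry that of the second word.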

Next, the joint vector $JE'$ is input into the decoder, and an acoustic feature sequence $Y$ is obtained at the output of the decoder, expressed as $Y = \mathrm{Decoder}(JE')$.

Optionally, the decoder is a decoder with an attention mechanism based on the Tacotron model, and includes an attention mechanism, a pre-network, a long short-term memory (LSTM) network, and a linear mapping layer.

More specifically, the decoder is an autoregressive recurrent neural network. In the decoding process, the decoding result of the previous step is taken as input and passed through a pre-network of 2 fully connected layers, each consisting of 256 hidden ReLU units. The pre-network acts as an information bottleneck layer, which is necessary for learning attention.

The output of the pre-network is spliced with the attention context vector calculated by the attention mechanism and input into 2-layer LSTM for decoding, and each LSTM layer contains 1024 units.

Specifically, this embodiment uses the location-sensitive attention mechanism of the Tacotron2 model, which extends the content-based attention mechanism. The location features are obtained by convolution with 32 one-dimensional convolution kernels of length 31; the input sequence and the location features are then projected to a 128-dimensional hidden representation to calculate the attention weights. The specific formulas are as follows:

$e_{i,j} = v_a^{\top} \tanh(W s_i + V h_j + U f_{i,j} + b)$

$f_i = F * c\alpha_{i-1}$

where $s_i$ is the current decoder hidden state, $h_j$ is the current encoder hidden state, $W$, $V$, and $U$ are the weight matrices applied to the decoder state, the encoder state, and the location feature respectively, $v_a$ is the projection vector that maps the result to a scalar energy, and $b$ is the bias, initialized as a zero vector. $f_{i,j}$ is the location feature obtained by convolution from the previous attention weights: $f_i$ is computed by convolving the cumulative attention weights $c\alpha_{i-1}$ of the previous decoder steps with the filters $F$.
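One decoder step of location-sensitive attention can be sketched as below, under the assumption that the energies are normalized with a Softmax over encoder positions and that the cumulative weights are updated by adding the new alignment. Parameter shapes, random initial values, and the function name are illustrative only.

```python
import numpy as np


def location_sensitive_attention(s_i, h, cum_alpha, W, V, U, F, v_a, b):
    """Return the attention weights for one step and the updated cumulative weights.

    s_i: (d_s,) decoder state; h: (m, d_h) encoder states; cum_alpha: (m,)
    F: (n_filters, k) location filters; W, V, U project to d_a dimensions.
    """
    k = F.shape[1]
    left = (k - 1) // 2
    padded = np.pad(cum_alpha, (left, k - 1 - left))
    windows = np.stack([padded[j:j + k] for j in range(len(cum_alpha))])  # (m, k)
    f = windows @ F.T                                   # location features, (m, n_filters)
    e = np.tanh(s_i @ W + h @ V + f @ U + b) @ v_a      # energies e_{i,j}, (m,)
    e = e - e.max()                                     # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()                 # Softmax over encoder positions
    return alpha, cum_alpha + alpha


rng = np.random.default_rng(0)
m, d_s, d_h, d_a, n_filters, k = 10, 1024, 256, 128, 32, 31
W = rng.standard_normal((d_s, d_a)) * 0.01
V = rng.standard_normal((d_h, d_a)) * 0.01
U = rng.standard_normal((n_filters, d_a)) * 0.01
F = rng.standard_normal((n_filters, k)) * 0.01
v_a = rng.standard_normal(d_a)
b = np.zeros(d_a)                                       # bias starts as a zero vector
cum_alpha = np.zeros(m)
alpha, new_cum_alpha = location_sensitive_attention(
    rng.standard_normal(d_s), rng.standard_normal((m, d_h)), cum_alpha, W, V, U, F, v_a, b
)
```

Feeding the cumulative weights back into the next step is what lets the attention know which encoder positions have already been attended to.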

The output of the LSTM layers, which are regularized using zoneout with a probability of 0.1, is again concatenated with the attention context vector and then passed through a linear projection to predict the acoustic feature sequence.
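Zoneout can be illustrated with the following sketch. It assumes the usual formulation, in which each hidden unit keeps its previous value with the zoneout probability during training and the expected value is used at inference; the function name and example values are hypothetical.

```python
import numpy as np


def zoneout(h_prev, h_new, rate, rng, training=True):
    """Randomly preserve previous hidden units instead of updating them."""
    if training:
        keep_prev = rng.random(h_prev.shape) < rate     # units that are "zoned out"
        return np.where(keep_prev, h_prev, h_new)
    # At inference, use the expectation of the stochastic update.
    return rate * h_prev + (1.0 - rate) * h_new


rng = np.random.default_rng(0)
h_prev = np.zeros(1024)                                 # previous LSTM hidden state
h_new = np.ones(1024)                                   # candidate new hidden state
h_train = zoneout(h_prev, h_new, 0.1, rng)
h_eval = zoneout(h_prev, h_new, 0.1, rng, training=False)
```

With a rate of 0.1, roughly one unit in ten carries over its previous value at each training step.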

Step 105, calculating a prosodic parameter prediction loss according to the prosody vector, calculating an acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model.

Specifically, the prosodic parameter prediction loss is calculated with the ground-truth prosody vector as the label. The ground-truth prosody vector is obtained by manually labeling the word sequence according to the multidimensional acoustic features and then one-hot encoding the labels. The multidimensional acoustic features are extracted from the speech audio corresponding to the sample text and are 38-dimensional: a 1-dimensional fundamental frequency parameter, 32-dimensional spectral envelope parameters, and 5-dimensional aperiodicity parameters. The extraction steps are as follows:

1. taking the waveform as input, estimating the fundamental frequency contour (F0 contour) with the DIO algorithm;

2. taking F0 and the waveform as input, estimating a 513-dimensional spectral envelope (SP) with the CheapTrick algorithm;

3. taking F0, SP, and the waveform as input, estimating a 513-dimensional aperiodicity parameter with the D4C algorithm;

4. reducing the dimensionality of the spectral envelope and the aperiodicity parameter with the DCT (discrete cosine transform) algorithm, compressing them to 32 and 5 dimensions, respectively.
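The dimensionality reduction in the last step can be sketched as a truncated DCT. The text only states that a DCT is used, so the details here are assumptions: an orthonormal DCT-II applied to each frame, keeping the lowest-order coefficients, with random arrays standing in for the outputs of DIO, CheapTrick, and D4C.

```python
import numpy as np


def dct_compress(frames, n_coeffs):
    """Keep the first n_coeffs orthonormal DCT-II coefficients of each frame."""
    n = frames.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    basis[0] /= np.sqrt(2.0)                            # orthonormal scaling of the DC term
    return frames @ basis.T


def assemble_features(f0, sp, ap):
    """Stack 1-dim F0, 32-dim spectral envelope, 5-dim aperiodicity per frame."""
    return np.concatenate(
        [f0[:, None], dct_compress(sp, 32), dct_compress(ap, 5)], axis=-1
    )


rng = np.random.default_rng(0)
n_frames = 4
f0 = rng.uniform(80.0, 300.0, n_frames)                 # stand-in for the DIO output
sp = rng.random((n_frames, 513))                        # stand-in for the CheapTrick output
ap = rng.random((n_frames, 513))                        # stand-in for the D4C output
features = assemble_features(f0, sp, ap)                # shape (4, 38)
```

The resulting 1 + 32 + 5 = 38 dimensions per frame match the feature size stated above.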

Next, the cross-entropy loss is used as the loss function for prosodic parameter prediction during training, calculated as follows:

$Loss_{pe} = -\sum_{t=1}^{T} \log p_t[k]$

where $k$ represents the target phrase break pattern.

The mean square error between the first acoustic feature sequence and the multidimensional acoustic features is recorded as the acoustic feature loss, where the multidimensional acoustic features are extracted from the speech audio. The acoustic feature loss is calculated as follows:

$Loss_{wav} = \frac{1}{N} \sum_{t=1}^{N} \lVert y_t - y_t' \rVert^2$

where $Loss_{wav}$ represents the acoustic feature loss of the audio, $y_t$ represents the true acoustic feature, $y_t'$ represents the predicted acoustic feature, and $N$ is the number of frames.

Weighting and summing the prosodic parameter prediction loss and the acoustic characteristic loss to obtain global loss, and performing back propagation training by using the global loss to obtain a trained speech synthesis model; the specific calculation method is as follows:

$Loss_{total} = Loss_{wav} + \omega \cdot Loss_{pe}$

where $Loss_{total}$ represents the global loss, $\omega$ represents the weight parameter, and $Loss_{pe}$ represents the prosodic parameter prediction loss.
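The three losses can be sketched numerically as follows. The sketch assumes the cross-entropy is the negative log-probability of the target pattern summed over words, and that the mean square error averages the squared frame-wise distance over frames; the value of the weight `omega` is not given in the text and is chosen arbitrarily here.

```python
import numpy as np


def prosody_loss(pe_pred, target_idx):
    """Cross-entropy: negative log-probability of the target pattern per word."""
    t = np.arange(len(target_idx))
    return -np.sum(np.log(pe_pred[t, target_idx]))


def acoustic_loss(y_true, y_pred):
    """Mean over frames of the squared distance between true and predicted features."""
    return np.mean(np.sum((y_true - y_pred) ** 2, axis=-1))


def global_loss(loss_wav, loss_pe, omega):
    """Weighted sum used for back propagation."""
    return loss_wav + omega * loss_pe


pe_pred = np.array([[0.5, 0.25, 0.25, 0.0, 0.0],
                    [0.1, 0.8, 0.1, 0.0, 0.0]])
targets = np.array([0, 1])                              # index of the labeled pattern per word
loss_pe = prosody_loss(pe_pred, targets)
loss_wav = acoustic_loss(np.zeros((2, 38)), np.full((2, 38), 0.1))
loss_total = global_loss(loss_wav, loss_pe, omega=0.5)  # omega is a hypothetical value
```

Because both terms enter one scalar, a single backward pass updates the prosody generator and the acoustic model together.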

The batch size (batch_size) is set to 32, and the model is trained using the Adam optimizer with learning rate decay, with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rate starts at $10^{-3}$ and, from step 50k onward, decays dynamically to $10^{-5}$. The acoustic model and the prosody generation model are jointly trained for 200k steps to obtain the final speech synthesis model.

With this scheme, the network parameters are updated by back propagation according to the global loss function, and the two tasks are trained simultaneously within a multi-task learning framework. The prosody generator is trained and updated together with the other Tacotron modules, so prosodic information is integrated into the end-to-end acoustic modeling process and the two models optimize each other during training. With the help of the end-to-end acoustic model, the accuracy of the prosody generation model improves, yielding more accurate prosody vectors and ultimately better speech synthesis.

Step 106, inputting the text to be processed into the speech synthesis model for processing, and synthesizing the resulting second acoustic feature sequence into a sound waveform.
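The learning rate schedule can be sketched as a simple function of the training step. The text describes the decay only as dynamic, so the exponential shape and the choice to reach the final value exactly at step 200k are assumptions made for illustration.

```python
def learning_rate(step, start_lr=1e-3, end_lr=1e-5,
                  decay_start=50_000, total_steps=200_000):
    """Constant until decay_start, then exponential decay towards end_lr."""
    if step <= decay_start:
        return start_lr
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    return start_lr * (end_lr / start_lr) ** progress


schedule = [learning_rate(s) for s in (0, 50_000, 125_000, 200_000)]
```

Under these assumptions the rate stays at the initial value for the first 50k steps and shrinks by the same factor over each equal span of steps afterwards.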

Specifically, the text to be processed may be a sample text or other texts, and the text to be processed is input into the speech synthesis model, and the corresponding second acoustic feature sequence is finally obtained through the processing of the above steps.

The second acoustic feature sequence is input into the vocoder, and the corresponding sound waveform is obtained at the output of the vocoder.

Optionally, the vocoder is a WORLD vocoder.

Therefore, on the basis of the existing Tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced: the main task is an improved Tacotron model, which learns acoustic feature parameter prediction based on character-level embedded representations; the auxiliary task is a prosody generation model, namely a prosody generator, which learns prosody prediction based on word-level embeddings. In the training stage, joint training of the two tasks allows the model to learn more explicit prosodic knowledge, thereby improving the quality of the output speech. Meanwhile, a WORLD vocoder is used at the output, which greatly reduces the size of the model and increases the speed of speech synthesis.

A second embodiment of the invention relates to a WORLD-Tacotron-based speech synthesis system.

Referring to Fig. 2, the system specifically includes:

the input module is used for acquiring a sample text and acquiring a text to be processed; wherein the sample text carries corresponding voice audio.

The text preprocessing module is used for converting the sample text into a word sequence and a character sequence.

The encoder is used for encoding the character sequence to obtain an encoded representation.

Specifically, an input character sequence passes through a character embedding layer, and a character embedding vector with a fixed dimension is output; the pre-network carries out a series of nonlinear transformation on the character embedding vector; the CBHG module converts the non-linearly transformed character embedding vector into an encoded representation of the character embedding vector.

The prosody generator is used for performing phrase break prediction on the word sequence to obtain a prosody vector.

Specifically, the input layer converts the input word sequence into word vectors, which serve as the feature representation; the bidirectional LSTM layer reads the input word vectors and converts them into a high-level feature representation; the output layer decodes the hidden state output by the bidirectional LSTM layer using a Softmax function and outputs the prosody vector corresponding to each word.

The processing module is used for concatenating the prosody vector and the encoded representation into a joint vector; it is also used for calculating a prosodic parameter prediction loss according to the prosody vector, calculating an acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model.

The decoder is used for decoding the joint vector to obtain a first acoustic feature sequence or a second acoustic feature sequence.

Specifically, the decoder is an autoregressive recurrent neural network. The LSTM network takes the output of the pre-network concatenated with the attention context vector calculated by the attention mechanism and outputs a new context vector to the linear mapping layer. The linear mapping layer performs a linear projection on the new context vector and outputs an acoustic feature sequence.

The vocoder is used for synthesizing the second acoustic feature sequence into a sound waveform.

Optionally, the vocoder is a WORLD vocoder.

It should be noted that the processing module is usually the central processing unit (CPU) of the whole system and may be configured with a corresponding operating system and control interface. Specifically, it may be a digital logic processor usable for automatic control, such as a single-chip microcomputer, a DSP (digital signal processor), or an ARM (Advanced RISC Machines) processor, which can load control instructions into memory at any time for storage and execution. It may also have built-in units such as CPU instruction and data memory, an input/output unit, a power module, and a digital-analog unit. The module can be configured according to actual use, and this scheme is not limited thereto.

Therefore, on the basis of the existing Tacotron model, this embodiment combines the prosody generator, the WORLD vocoder, and the optimized Tacotron model. It avoids WaveNet's data-intensive training and long synthesis time and uses lower-dimensional acoustic features, which improves both the synthesis speed and the synthesis quality of the Tacotron model.

Referring to fig. 3, a third embodiment of the present invention relates to a server, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements any one of the methods described in the first embodiment.

The memory and the processor are connected by a bus. The bus may comprise any number of interconnected buses and bridges, which connect together the various circuits of the one or more processors and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore will not be described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna; the antenna also receives data and passes it to the processor.

The processor is responsible for managing the bus and for general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor in performing operations.

In summary, the world-tacontron based speech synthesis method, system and server of the invention integrate prosodic information into an end-to-end acoustic modeling process on the basis of the existing tacotron model and introduce a dual-task learning framework. The main task is an improved tacotron model, which learns acoustic feature parameter prediction based on character-level embedded representation; the auxiliary task is a prosody generation model, namely a prosody generator, which learns prosody prediction based on word-level embedding. In the training stage, the joint training of the two tasks allows the model to learn more explicit prosody knowledge, thereby improving the quality of the output speech. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims

1. A world-tacontron based speech synthesis method, characterized by comprising the following steps:

coding the character sequence to obtain a coded representation;

2. The world-tacontron based speech synthesis method of claim 1, wherein: the step of converting the sample text into a word sequence and a character sequence respectively comprises:

performing word segmentation on the sample text to obtain a word sequence;

3. The world-tacontron based speech synthesis method of claim 1, wherein: the step of encoding the character sequence to obtain an encoded representation comprises:

4. The world-tacontron based speech synthesis method of claim 1, wherein: the step of performing phrase break prediction on the word sequence to obtain a prosody vector includes:

5. The world-tacontron based speech synthesis method of claim 1, wherein: said step of concatenating said prosody vector and said coded representation into a joint vector comprises:

6. The world-tacontron based speech synthesis method of claim 5, wherein: the step of decoding the joint vector to obtain an acoustic feature sequence includes:

7. The world-tacontron based speech synthesis method of claim 1, wherein: the step of calculating the prosodic parameter prediction loss according to the prosodic vector, calculating the acoustic feature loss according to the first acoustic feature sequence, and training the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model comprises the following steps:

8. The world-tacontron based speech synthesis method of claim 1, wherein: the step of synthesizing the second acoustic feature sequence into a sound waveform comprises:

9. A world-tacontron based speech synthesis system, comprising:

10. A server comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the program, implements the world-tacontron based speech synthesis method of any of claims 1-8.