CN113129862A - WORLD-Tacotron-based speech synthesis method, system and server
- Publication number
- CN113129862A (application number CN202110436317.XA)
- Authority
- CN
- China
- Prior art keywords
- sequence
- vector
- prosody
- loss
- acoustic feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the technical field of artificial intelligence and provides a WORLD-Tacotron-based speech synthesis method, system and server. Building on the existing Tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced: the main task is an improved Tacotron model, which learns to predict acoustic feature parameters from character-level embedded representations; the auxiliary task is a prosody generation model, namely a prosody generator, which learns to predict prosody from word-level embeddings. In the training stage, joint training of the two tasks lets the model learn more explicit prosodic knowledge, thereby improving the quality of the output speech.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a WORLD-Tacotron-based speech synthesis method, system and server.
Background
Speech synthesis, also called text-to-speech, is a technology for converting text to natural speech, and plays an important role in man-machine communication.
The common speech synthesis method in deep learning is end-to-end speech synthesis, which directly establishes synthesis from text to speech, simplifies the intervention of human beings on intermediate links, and reduces the difficulty of speech synthesis research.
Current end-to-end speech synthesis models are Seq2Seq models with an attention mechanism, built on an encoder-decoder framework. The Tacotron model, introduced by Google in 2017, was the first truly end-to-end speech synthesis model: it takes text or a phonetic transcription string as input, outputs a linear spectrum, and converts the linear spectrum into audio with the Griffin-Lim algorithm. However, the Griffin-Lim algorithm requires high-dimensional acoustic features and synthesizes low-quality speech.
In 2018, Google introduced the Tacotron2 model, an improvement on Tacotron in which the complex CBHG structure and the GRU units are removed and replaced with LSTM and convolutional layers. The model outputs a Mel spectrum, which is then converted into audio by WaveNet. Replacing the Griffin-Lim algorithm with WaveNet greatly improves synthesis quality, but WaveNet needs a large amount of training data and synthesizes slowly.
In addition, various studies have shown that the above models can generate human-like speech in English and other languages, but for languages such as Chinese they produce serious prosodic phrasing errors. Recent work on prosody modeling for end-to-end TTS models, such as improving prosodic phrasing with contextual information and syntactic features, incorporates prosody only in the text preprocessing stage and does not optimize it as part of the synthesis process, so the quality of the final synthesized speech remains poor.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a WORLD-Tacotron-based speech synthesis method, system and server, which solve the problems of low synthesis quality, slow synthesis speed and prosodic phrasing errors in the prior art.
A first aspect of the present invention provides a WORLD-Tacotron-based speech synthesis method, comprising the following steps:
obtaining a sample text, and respectively converting the sample text into a word sequence and a character sequence;
coding the character sequence to obtain a coded representation;
performing phrase interruption prediction on the word sequence to obtain a prosody vector;
connecting the prosody vector and the coding representation into a joint vector, and decoding the joint vector to obtain a first acoustic feature sequence;
calculating a prosodic parameter prediction loss according to the prosody vector, calculating an acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;
and inputting the text to be processed into the speech synthesis model for processing, and synthesizing the resulting second acoustic feature sequence into a speech waveform.
In an embodiment of the present invention, the step of converting the sample text into a word sequence and a character sequence respectively includes:
performing word segmentation on the sample text to obtain a word sequence;
and converting the sample text into a pinyin form sequence with tones, and converting the pinyin form sequence into a character sequence.
In an embodiment of the present invention, the step of encoding the character sequence to obtain an encoded representation includes:
inputting the character sequence into an encoder, and obtaining an encoded representation at the output end of the encoder; wherein the encoder is an encoder based on the Tacotron model.
In an embodiment of the present invention, the step of performing phrase break prediction on the word sequence to obtain a prosody vector includes:
and inputting the word sequence into a prosody generator, and obtaining a prosody vector at the output end of the prosody generator.
In one embodiment of the present invention, the step of concatenating the prosody vector and the coded representation into a joint vector comprises:
distributing the prosody vector corresponding to each word in the word sequence to all characters of each word to obtain a character-level prosody vector;
and splicing the coded representation and the character-level prosody vector to obtain a joint vector.
In an embodiment of the present invention, the step of decoding the joint vector to obtain the first acoustic feature sequence includes:
inputting the joint vector into a decoder, and obtaining an acoustic feature sequence at the output end of the decoder; wherein the decoder is a decoder with an attention mechanism based on the Tacotron model.
In an embodiment of the present invention, the step of calculating a prosodic parameter prediction loss according to the prosody vector, calculating an acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model includes:
calculating the prosodic parameter prediction loss with the real prosody vector as the label; the real prosody vector is obtained by one-hot encoding the word sequence according to the speech audio corresponding to the sample text;
taking the mean square error between the first acoustic feature sequence and the multi-dimensional acoustic features as the acoustic feature loss; wherein the multi-dimensional acoustic features are extracted from the speech audio;
and computing a weighted sum of the prosodic parameter prediction loss and the acoustic feature loss to obtain a global loss, and performing back propagation training with the global loss to obtain a trained speech synthesis model.
In an embodiment of the present invention, the step of synthesizing the second acoustic feature sequence into the speech waveform includes:
inputting the second acoustic feature sequence into a vocoder, and obtaining the corresponding speech waveform at the output end of the vocoder; wherein the vocoder is a WORLD vocoder.
A second aspect of the present invention provides a WORLD-Tacotron-based speech synthesis system, including:
the input module is used for acquiring a sample text and a text to be processed;
the text preprocessing module is used for converting the sample text or the text to be processed into a word sequence and a character sequence;
the encoder is used for encoding the character sequence to obtain an encoded representation;
the prosody generator is used for performing phrase break prediction on the word sequence to obtain a prosody vector;
a processing module for concatenating the prosody vector and the encoded representation into a joint vector;
the processing module is further used for calculating a prosodic parameter prediction loss according to the prosody vector, calculating an acoustic feature loss according to a first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;
the decoder is used for decoding the joint vector to obtain a first acoustic feature sequence or a second acoustic feature sequence;
a vocoder for synthesizing the second acoustic feature sequence into a speech waveform.
A third aspect of the present invention provides a server, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the WORLD-Tacotron-based speech synthesis methods of the first aspect of the present invention.
As described above, the WORLD-Tacotron-based speech synthesis method, system and server of the present invention have the following beneficial effects:
Building on the existing Tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced: the main task is an improved Tacotron model, which learns to predict acoustic feature parameters from character-level embedded representations; the auxiliary task is a prosody generation model, namely a prosody generator, which learns to predict prosody from word-level embeddings. In the training stage, joint training of the two tasks lets the model learn more explicit prosodic knowledge, thereby improving the quality of the output speech.
Drawings
Fig. 1 is a flowchart illustrating a WORLD-Tacotron-based speech synthesis method according to a first embodiment of the present invention.
Fig. 2 is a block diagram illustrating a WORLD-Tacotron-based speech synthesis system according to a second embodiment of the present invention.
Fig. 3 is a schematic diagram of a server according to a third embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than the number, shape and size of the components in practical implementation, and the type, quantity and proportion of the components in practical implementation can be changed freely, and the layout of the components can be more complicated.
Referring to Fig. 1, a first embodiment of the present invention relates to a WORLD-Tacotron-based speech synthesis method, which includes the following steps:
Step 101, obtaining a sample text, and converting the sample text into a word sequence and a character sequence respectively.
Specifically, the step of converting the sample text into a word sequence and a character sequence respectively comprises the following:
The sample text is used as training data and segmented into words to obtain a word sequence. The word sequence is manually labeled: the positions of the breaks introduced by the speaker are extracted from the speaker's pausing pattern, and each word in the sample text is labeled as break or non-break depending on whether a break appears after it. Spaces, punctuation marks and stop tokens appear naturally in the sample text.
It should be noted that the sample text in this embodiment carries corresponding voice audio.
Further, the sample text is converted into a pinyin form sequence with tones, and the pinyin form sequence is converted into a character sequence.
Specifically, the Chinese text is converted into a pinyin form with tones. For example, "hello" (in the original Chinese) is converted to the pinyin form "ni3hao3de5", where the numerals 1-4 represent the first, second, third and fourth tones, and the neutral tone is labeled with the numeral 5. The pinyin form is then converted into a character sequence.
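The conversion from tone-numbered pinyin to a character sequence can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper name `pinyin_to_chars`, the regular expression, and the choice to emit letters and tone digits as separate symbols without syllable separators are all assumptions.

```python
import re
from typing import List

# Assumption: a syllable is a run of lowercase letters followed by one tone digit 1-5.
SYLLABLE = re.compile(r"[a-z]+[1-5]")


def pinyin_to_chars(pinyin: str) -> List[str]:
    """Split tone-numbered pinyin into a flat sequence of letters and tone digits."""
    # Spaces are dropped first so that "ni3 hao3" and "ni3hao3" behave the same.
    syllables = SYLLABLE.findall(pinyin.lower().replace(" ", ""))
    chars: List[str] = []
    for syllable in syllables:
        chars.extend(syllable)  # each letter and the tone digit become one symbol
    return chars


chars = pinyin_to_chars("ni3hao3de5")
```

Each symbol in `chars` would then be mapped to an integer index before being fed to the character embedding layer.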
Optionally, the sample text is normalized before word segmentation. In real use, the sample text contains many non-standard words, such as Arabic numerals, English characters and various symbols; text normalization converts these non-Chinese characters into the corresponding Chinese characters.
It should be understood that the sample text and the corresponding voice audio can be obtained from the public voice database, and in practical application, the user can select the appropriate database according to the requirement.
Step 102, encoding the character sequence to obtain an encoded representation.
Specifically, the character sequence $C = [c_1, c_2, \ldots, c_m]$ is encoded by an encoder, and a hidden state sequence $H$, i.e., the encoded representation in this embodiment, is obtained at the output end of the encoder. The encoded representation is a sequence of high-level character vectors, denoted $CE' = [ce'_1, ce'_2, \ldots, ce'_m]$, and is given by $CE' = H = \mathrm{Encoder}(C)$.
Optionally, the encoder used in this embodiment is based on the Tacotron model and includes a character embedding layer, a pre-net, and a CBHG module, connected in sequence.
Further, the character sequence is the input of the character embedding layer. Each character is represented as a one-hot vector; a trainable embedding matrix with initial parameters maps the one-hot vector to a character embedding vector, so that each input character is represented as a 256-dimensional character embedding vector. During model training, the embedding matrix is trained by back propagation like the other network parameters, yielding an embedding matrix that can represent every character in the character set.
The character embedding vectors then undergo a series of nonlinear transformations in the pre-net, which comprises two fully connected layers, each followed by a ReLU activation function and with a dropout rate of 0.5 to enhance the generalization ability of the model; the first fully connected layer has 256 output channels and the second has 128.
The nonlinearly transformed character embedding vectors are input into the CBHG module, whose output is the encoded representation of the character sequence. The CBHG module consists of a one-dimensional convolution filter bank, a pooling layer, a highway network and a bidirectional gated recurrent unit (GRU) recurrent neural network (RNN).
The one-dimensional convolution filter bank effectively models local and contextual information. The input sequence first passes through a convolutional layer with K sets of one-dimensional filters of different sizes, where the filter sizes are 1, 2, 3, ..., K; convolution kernels of different sizes extract context information of different lengths. The one-dimensional convolution window lengths (kernel_size) used by the CBHG module are [1, 2, 3, ..., 16], and padding is set to 'same' during the convolution operation so that the outputs stay aligned. The outputs of the K convolution kernels of different sizes are stacked together: concatenating the one-dimensional convolution outputs along axis -1 gives 128 × 16 dimensions.
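To illustrate why 'same' padding keeps the filter-bank outputs aligned, the sketch below applies one filter per width 1..16 to a single-channel sequence and stacks the results. It is a simplified illustration under stated assumptions: the function names are hypothetical, averaging kernels stand in for learned weights, and only one filter per width is used instead of 128.

```python
def conv1d_same(x, kernel):
    """1-D convolution (cross-correlation) with zero padding so len(output) == len(x)."""
    k = len(kernel)
    left = (k - 1) // 2          # for even widths the extra zero goes on the right
    right = k - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(x))
    ]


def conv_bank(x, max_width):
    """Apply one averaging filter per width 1..max_width and stack the outputs."""
    return [conv1d_same(x, [1.0 / k] * k) for k in range(1, max_width + 1)]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
bank = conv_bank(x, max_width=16)  # 16 rows, each as long as the input
```

Because every row has the input length, the rows can be stacked per time step; with 128 filters per width, as in the text, the stacked channel dimension would be 128 × 16 = 2048.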
The next layer is a max pooling layer, whose purpose is to increase local invariance; stride is set to 1 and width to 2. After pooling, the sequence passes through two one-dimensional convolutional layers. The first has a filter size of 3, a stride of 1, a ReLU activation function and 128 output channels; the second has a filter size of 3, a stride of 1, no activation function and 128 output channels. Batch normalization (BN) is applied between the two one-dimensional convolutional layers.
After the convolutional layers, a residual connection is applied, that is, the output of the convolutional layers is added to the sequence from the embedding layer, and the sum is fed into a multi-layer highway network to extract high-level features. The highway network consists of 4 fully connected layers with the rectified linear unit (ReLU) as activation function and a 128-dimensional output. Finally, the output of the highway network is fed into a bidirectional RNN with 128 GRU units to extract sequence features from the forward and backward contexts; concatenating the two directions gives 256 dimensions, which yields the encoded representation of the character sequence.
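A single highway layer can be sketched as below. The text only states that ReLU is the activation function; the sigmoid transform gate and the carry term `(1 - t) * x` follow the usual highway-network formulation and are assumptions here, as are all function and parameter names.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(weights, bias, x):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = t * relu(W_h x + b_h) + (1 - t) * x, with gate t = sigmoid(W_t x + b_t)."""
    h = relu(affine(w_h, b_h, x))      # candidate transformation
    t = sigmoid(affine(w_t, b_t, x))   # transform gate in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With a strongly negative gate bias the layer passes its input through almost unchanged.
y = highway_layer([1.0, -2.0], identity, [0.0, 0.0], zeros, [-50.0, -50.0])
```

The gate lets each layer choose between transforming its input and carrying it forward, which is what makes stacking four such layers easy to train.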
Step 103, performing phrase break prediction on the word sequence to obtain a prosody vector.
Specifically, the word sequence is input into a prosody generator to predict phrase breaks, and a prosody vector $PE = [pe_1, pe_2, \ldots, pe_T]$ is obtained at the output of the prosody generator.
Optionally, the prosody generator comprises an input layer, a bi-directional LSTM layer, and an output layer.
More specifically, the input layer converts the input word sequence into word vectors as its feature representation. Given an input word sequence $W = [w_1, \ldots, w_t, \ldots, w_T]$ of length $T$, each word is looked up in the vocabulary of pre-trained word vectors, giving the corresponding feature vector sequence $V = [v_1, \ldots, v_t, \ldots, v_T]$.
The bidirectional LSTM layer reads the feature vector sequence $V$ and converts it into a high-level feature representation. A forward LSTM and a backward LSTM each read the sequence and output hidden state vectors; the forward and backward hidden state vectors are concatenated to obtain the final hidden state $H$, which is sent to the output layer for decoding to obtain the prosodic label of each word.
The output layer decodes the hidden state $H$ output by the BiLSTM layer with a Softmax function and outputs the prosody vector of each word, $PE = [pe_1, \ldots, pe_t, \ldots, pe_T]$, where $pe_t$ is a 5-dimensional vector, $pe_t = [p_t[1], \ldots, p_t[k], \ldots, p_t[5]]$, $t \in [1, T]$; $p_t[k]$ denotes the probability that pattern $k$ occurs at word $t$, and $k$ indexes the target phrase break pattern.
It should be noted that this embodiment defines the PE (prosody embedding) as a vector of five elements, namely break, non-break, blank (space), punctuation, and stop token, which represent five phrase break patterns.
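The output layer's Softmax over the five phrase break patterns can be sketched as follows. The logits are made-up numbers standing in for the projected BiLSTM hidden state, and the names `PATTERNS` and `prosody_vector` are hypothetical.

```python
import math

# The five phrase break patterns named in the text, in an assumed order.
PATTERNS = ["break", "non-break", "blank", "punctuation", "stop token"]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prosody_vector(logits):
    """Turn five output-layer logits into the 5-dimensional prosody vector pe_t."""
    if len(logits) != len(PATTERNS):
        raise ValueError("expected one logit per phrase break pattern")
    return softmax(logits)


pe_t = prosody_vector([2.0, 0.5, -1.0, 0.0, -2.0])
predicted = PATTERNS[pe_t.index(max(pe_t))]  # most probable pattern for this word
```

Note that the full probability vector `pe_t`, not just the most probable pattern, is what gets concatenated with the character encoding in the next step.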
Step 104, concatenating the prosody vector and the encoded representation into a joint vector, and decoding the joint vector to obtain a first acoustic feature sequence.
Specifically, because the word sequence and the character sequence differ in length, this embodiment upsamples the prosody vector: the prosody vector of each word in the word sequence is assigned to all characters of that word to obtain a character-level prosody vector $PE'$, and the encoded representation $CE'$ output by the encoder is concatenated with $PE'$ to obtain the final joint vector $JE'$:
$JE' = [PE'; CE']$.
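The upsampling and concatenation described above can be sketched as follows. The function names and the toy inputs are illustrative assumptions; the only facts taken from the text are that each word's 5-dimensional prosody vector is copied to all of its characters and placed in front of the 256-dimensional character encoding.

```python
def upsample_prosody(word_pe, chars_per_word):
    """Repeat each word-level prosody vector once per character of that word."""
    char_pe = []
    for pe, n_chars in zip(word_pe, chars_per_word):
        char_pe.extend(list(pe) for _ in range(n_chars))
    return char_pe


def joint_vectors(char_pe, char_ce):
    """Build JE' = [PE'; CE'] by concatenating the two vectors at every character."""
    if len(char_pe) != len(char_ce):
        raise ValueError("prosody and encoder sequences must have equal length")
    return [pe + ce for pe, ce in zip(char_pe, char_ce)]


# Two words with 3 and 2 characters; prosody vectors are 5-dimensional.
word_pe = [[0.7, 0.1, 0.1, 0.05, 0.05], [0.1, 0.8, 0.05, 0.03, 0.02]]
chars_per_word = [3, 2]
char_ce = [[0.0] * 256 for _ in range(5)]  # placeholder encoder outputs

char_pe = upsample_prosody(word_pe, chars_per_word)
je = joint_vectors(char_pe, char_ce)       # 5 characters, 5 + 256 = 261 dims each
```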
Next, the joint vector $JE'$ is input into the decoder, and the acoustic feature sequence $Y$ is obtained at the output end of the decoder: $Y = \mathrm{Decoder}(JE')$.
Optionally, the decoder is a Tacotron-based decoder with an attention mechanism and includes an attention mechanism, a pre-net, a long short-term memory (LSTM) network, and a linear projection layer.
More specifically, the decoder is an autoregressive recurrent neural network. During decoding, the decoding result of the previous step is taken as input and passed through a pre-net of 2 fully connected layers, each with 256 hidden ReLU units. The pre-net acts as an information bottleneck, which is necessary for learning attention.
The output of the pre-network is spliced with the attention context vector calculated by the attention mechanism and input into 2-layer LSTM for decoding, and each LSTM layer contains 1024 units.
Specifically, this embodiment uses the location-sensitive attention mechanism of the Tacotron2 model, which extends the content-based attention mechanism. The location feature is obtained by convolution with 32 one-dimensional convolution kernels of length 31; the input sequence and the location feature are then projected to a 128-dimensional hidden representation to calculate the attention weights. The location feature is computed as:
$f_i = F * c\alpha_{i-1}$
In the attention weight calculation, $s_i$ is the current decoder hidden state, $h_j$ is the current encoder hidden state, $W$, $V$ and $U$ are the weight matrices corresponding to these states, and $b$ is the bias, initialized as a zero vector. $f_{i,j}$ is the location feature obtained by convolving the previous attention weights with the filters $F$, and $c\alpha_{i-1}$ is the cumulative attention weight from which the location feature $f_i$ is computed.
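For reference, location-sensitive attention is commonly written as below. This is the usual textbook form expressed with the symbols defined above, not a formula recovered from the patent text; the projection vector $w$ and the context vector $c_i$ are assumptions that the text does not name.

```latex
\begin{aligned}
c\alpha_{i-1} &= \sum_{k=1}^{i-1} \alpha_k
  && \text{(cumulative attention weights)} \\
f_i &= F * c\alpha_{i-1}
  && \text{(location features from the convolution filters } F\text{)} \\
e_{i,j} &= w^{\top} \tanh\!\left(W s_i + V h_j + U f_{i,j} + b\right)
  && \text{(attention energy)} \\
\alpha_{i,j} &= \frac{\exp(e_{i,j})}{\sum_{j'} \exp(e_{i,j'})}
  && \text{(attention weights)} \\
c_i &= \sum_{j} \alpha_{i,j}\, h_j
  && \text{(attention context vector)}
\end{aligned}
```

The term $U f_{i,j}$ is what makes the mechanism location-sensitive: it tells the decoder where it has already attended, which encourages the alignment to move forward monotonically through the input.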
The output of the LSTM layer, which is regularized using zoneout with a probability of 0.1, is again stitched together with the attention context vector and then subjected to a linear transformation projection to predict the acoustic feature sequence.
Step 105, calculating a prosodic parameter prediction loss according to the prosody vector, calculating an acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model.
Specifically, the real prosody vector is used as the label to calculate the prosodic parameter prediction loss. The real prosody vector is obtained by manually labeling the word sequence according to the multidimensional acoustic features and then one-hot encoding the labels. The multidimensional acoustic features are extracted from the speech audio corresponding to the sample text; they are 38-dimensional and comprise a 1-dimensional fundamental frequency parameter, 32-dimensional spectral envelope parameters and 5-dimensional aperiodicity parameters. The extraction steps are as follows:
1. Input the waveform and estimate the fundamental frequency (F0 contour) with the DIO algorithm;
2. With F0 and the waveform as input, estimate a 513-dimensional spectral envelope (SP) with the CheapTrick algorithm;
3. With F0, SP and the waveform as input, estimate a 513-dimensional aperiodicity parameter (AP) with the D4C algorithm;
4. Reduce the dimensionality of the spectral envelope and the aperiodicity parameters with the DCT (discrete cosine transform) algorithm, compressing them to 32 and 5 dimensions respectively.
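Step 4 and the assembly of the 38-dimensional frame can be sketched as follows. This is a minimal illustration assuming an unnormalized DCT-II truncated to its first coefficients; the text does not say which DCT variant or scaling is used, and the function names and the constant toy inputs are hypothetical. Steps 1 to 3 (DIO, CheapTrick, D4C) are not reproduced here.

```python
import math


def dct2_truncated(x, n_coeffs):
    """First n_coeffs coefficients of the unnormalized DCT-II of x."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n_coeffs)
    ]


def acoustic_frame(f0, sp, ap):
    """Assemble one frame: 1 F0 value + 32 envelope coefficients + 5 aperiodicity coefficients."""
    return [f0] + dct2_truncated(sp, 32) + dct2_truncated(ap, 5)


# Toy frame: constant 513-bin envelope and aperiodicity, F0 of 120 Hz.
frame = acoustic_frame(120.0, [1.0] * 513, [0.5] * 513)
```

Keeping only the low-order coefficients retains the smooth overall shape of the 513-bin envelope while discarding fine detail, which is why the compressed features remain usable for resynthesis.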
Next, the cross-entropy loss between the predicted prosody vector and the real prosody vector is used as the loss function for prosodic parameter prediction, denoted $Loss_{pe}$.
The mean square error between the first acoustic feature sequence and the multidimensional acoustic features is taken as the acoustic feature loss $Loss_{wav}$, where $y_t$ denotes the true acoustic feature and $y'_t$ denotes the predicted acoustic feature; the multidimensional acoustic features are extracted from the speech audio.
The prosodic parameter prediction loss and the acoustic feature loss are combined in a weighted sum to obtain the global loss, and back propagation training is performed with the global loss to obtain the trained speech synthesis model. The global loss is calculated as:
$Loss_{total} = Loss_{wav} + \omega \cdot Loss_{pe}$
where $Loss_{total}$ denotes the global loss, $\omega$ denotes the weight parameter, and $Loss_{pe}$ denotes the prosodic parameter prediction loss.
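The two component losses are described only in words here. Under the usual definitions they would read $Loss_{pe} = -\frac{1}{T}\sum_{t}\sum_{k} y_t[k]\,\log p_t[k]$ (with $y_t$ here standing for the one-hot prosody label) and $Loss_{wav} = \frac{1}{N}\sum_{t}\lVert y_t - y'_t\rVert^2$; the averaging over words and over feature elements is an assumption. A small sketch of the combined loss under those assumptions, with hypothetical function names:

```python
import math


def cross_entropy(pred, target, eps=1e-12):
    """Mean over words of -sum_k target[k] * log(pred[k]); target is one-hot."""
    total = 0.0
    for p_t, y_t in zip(pred, target):
        total -= sum(y * math.log(p + eps) for p, y in zip(p_t, y_t))
    return total / len(pred)


def mean_squared_error(pred, target):
    """Mean squared difference over every element of every acoustic frame."""
    count = sum(len(frame) for frame in pred)
    squared = sum(
        (a - b) ** 2
        for p_frame, t_frame in zip(pred, target)
        for a, b in zip(p_frame, t_frame)
    )
    return squared / count


def global_loss(y_pred, y_true, pe_pred, pe_true, weight):
    """Loss_total = Loss_wav + weight * Loss_pe."""
    return mean_squared_error(y_pred, y_true) + weight * cross_entropy(pe_pred, pe_true)


total = global_loss(
    y_pred=[[1.0, 2.0]], y_true=[[1.0, 4.0]],
    pe_pred=[[0.5, 0.5, 0.0, 0.0, 0.0]], pe_true=[[1, 0, 0, 0, 0]],
    weight=0.5,
)
```

Because both terms flow into one scalar, a single backward pass updates the prosody generator and the Tacotron modules together, which is the point of the dual-task setup.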
The batch size (batch_size) is set to 32, and the model is trained using the Adam optimizer with β1 = 0.9 and β2 = 0.999 and learning rate decay: starting at step 50k, the learning rate decays dynamically from $10^{-3}$ to $10^{-5}$. The acoustic model and the prosody generation model are jointly trained for 200k steps to obtain the final speech synthesis model.
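One possible reading of this schedule is sketched below. The text gives only the starting step and the two endpoint rates, so the exponential shape of the decay and the choice of step 200k as the point where the final rate is reached are assumptions, as are the function and parameter names.

```python
def learning_rate(step, start=50_000, end=200_000, lr_init=1e-3, lr_final=1e-5):
    """Constant lr_init until `start`, then exponential decay to lr_final at `end`."""
    if step <= start:
        return lr_init
    if step >= end:
        return lr_final
    fraction = (step - start) / (end - start)          # progress through the decay phase
    return lr_init * (lr_final / lr_init) ** fraction  # geometric interpolation


schedule = [learning_rate(s) for s in (0, 50_000, 125_000, 200_000)]
```

Under these assumptions the rate is halfway between the endpoints on a log scale at step 125k, i.e. about $10^{-4}$.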
With this scheme, the network parameters are updated by back-propagation according to the global loss function, and the two tasks are trained simultaneously in a multi-task learning framework. The prosody generator is trained and updated together with the other Tacotron modules, so that prosodic information is integrated into the end-to-end acoustic modeling process and the two models optimize each other during training. With the help of the end-to-end acoustic model, the accuracy of the prosody generation model improves, which yields more accurate prosody vectors and ultimately more expressive synthesized speech.
Step 106: the text to be processed is input into the speech synthesis model for processing, and the resulting second acoustic feature sequence is synthesized into a sound waveform.
Specifically, the text to be processed may be the sample text or another text. It is input into the speech synthesis model, and the corresponding second acoustic feature sequence is obtained through the processing steps described above.
The second acoustic feature sequence is input into the vocoder, and the corresponding sound waveform is obtained at the output of the vocoder.
Optionally, the vocoder is a WORLD vocoder.
Therefore, on the basis of the existing Tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced. The main task is an improved Tacotron model, which learns to predict acoustic feature parameters from character-level embedded representations; the auxiliary task is a prosody generation model, namely the prosody generator, which learns prosody prediction from word-level embeddings. In the training stage, joint training of the two tasks lets the model learn more explicit prosodic knowledge, which improves the quality of the output speech. Meanwhile, using a WORLD vocoder at the output greatly reduces the model size and increases the speed of speech synthesis.
A second embodiment of the invention relates to a world-tacontron based speech synthesis system.
Referring to Fig. 2, the system specifically includes:
The input module is used for acquiring a sample text and a text to be processed, wherein the sample text carries corresponding speech audio.
The text preprocessing module is used for converting the sample text into a word sequence and a character sequence.
The encoder is used for encoding the character sequence to obtain an encoded representation.
Optionally, the encoder used in this embodiment is a Tacotron-based encoder comprising a character embedding layer, a pre-net, and a CBHG module connected in sequence.
Specifically, the input character sequence passes through the character embedding layer, which outputs character embedding vectors of a fixed dimension; the pre-net applies a series of nonlinear transformations to the character embedding vectors; and the CBHG module converts the transformed character embedding vectors into the encoded representation.
The prosody generator is used for performing phrase break prediction on the word sequence to obtain a prosody vector.
Optionally, the prosody generator comprises an input layer, a bidirectional LSTM layer, and an output layer.
Specifically, the input layer converts the input word sequence into word vectors, which serve as the feature representation; the bidirectional LSTM layer reads the input word vectors and converts them into a high-level feature representation; and the output layer decodes the hidden states output by the bidirectional LSTM layer with a Softmax function and outputs a prosody vector for each word.
The processing module is used for concatenating the prosody vector and the encoded representation into a joint vector; it is also used for calculating the prosodic parameter prediction loss from the prosody vector, calculating the acoustic feature loss from the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain the speech synthesis model.
The decoder is used for decoding the joint vector to obtain a first acoustic feature sequence or a second acoustic feature sequence.
Optionally, the decoder is a Tacotron-based decoder with an attention mechanism, comprising the attention mechanism, a pre-net (preprocessing network), a long short-term memory (LSTM) neural network, and a linear mapping layer.
Specifically, the decoder is an autoregressive recurrent neural network. The LSTM network concatenates the output of the pre-net with the attention context vector computed by the attention mechanism and outputs a new context vector to the linear mapping layer. The linear mapping layer applies a linear projection to the new context vector and outputs the acoustic feature sequence.
The vocoder is used for synthesizing the second acoustic feature sequence into a sound waveform.
Optionally, the vocoder is a WORLD vocoder.
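The hand-off from the prosody generator to the decoder can be sketched as follows: a Softmax over each word's output scores gives a prosody vector, that vector is copied to every character of the word (the rule stated in the method's claims), and the result is concatenated with the encoder output for each character. This is a toy illustration; the function names, the three prosody classes, the 4-dimensional encoding, and all numeric values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one word's output scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expand_to_characters(word_vectors, chars_per_word):
    """Repeat each word-level prosody vector once per character of that word."""
    expanded = []
    for vec, count in zip(word_vectors, chars_per_word):
        expanded.extend(list(vec) for _ in range(count))
    return expanded

def concat_joint(encoded, char_prosody):
    """Concatenate encoder output and prosody vector for each character."""
    if len(encoded) != len(char_prosody):
        raise ValueError("encoder output and prosody vectors must align per character")
    return [list(e) + list(p) for e, p in zip(encoded, char_prosody)]

# Toy input (assumption): two words with 2 and 3 characters, 3 prosody classes.
word_logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]
word_prosody = [softmax(scores) for scores in word_logits]
chars_per_word = [2, 3]
encoded = [[0.1 * i, 0.2 * i, 0.3 * i, 0.4 * i] for i in range(5)]  # 5 chars, 4 dims

joint = concat_joint(encoded, expand_to_characters(word_prosody, chars_per_word))
```

Each row of `joint` is what the decoder would attend over: the character's encoding followed by the prosody vector of the word that contains it.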
It should be noted that the processing module is usually the central processing unit (CPU) of the whole system and may be configured with a corresponding operating system and control interface. Specifically, it may be a digital logic processor usable for automatic control, such as a single-chip microcomputer, a DSP (digital signal processor), or an ARM (Advanced RISC Machines) processor. It can load control instructions into memory at any time for storage and execution, and may have built-in units such as instruction and data memory, an input/output unit, a power module, and a digital-analog unit. The configuration may be set according to actual use, and this scheme is not limited thereto.
Therefore, on the basis of the existing Tacotron model, this embodiment combines the prosody generator, the WORLD vocoder, and the optimized Tacotron model. It avoids the heavy training and long synthesis time of WaveNet and uses lower-dimensional acoustic features, which improves both the speech synthesis speed and the speech synthesis quality of the Tacotron model.
Referring to fig. 3, a third embodiment of the present invention relates to a server, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the processor implements any one of the methods described in the first embodiment.
Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
In summary, the world-tacontron based speech synthesis method, system, and server integrate prosodic information into the end-to-end acoustic modeling process on the basis of the existing Tacotron model and introduce a dual-task learning framework. The main task is an improved Tacotron model, which learns to predict acoustic feature parameters from character-level embedded representations; the auxiliary task is a prosody generation model, namely the prosody generator, which learns prosody prediction from word-level embeddings. In the training stage, joint training of the two tasks lets the model learn more explicit prosodic knowledge, which improves the quality of the output speech. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.
Claims (10)
1. A method for synthesizing voice based on world-tacontron is characterized by comprising the following steps:
obtaining a sample text, and respectively converting the sample text into a word sequence and a character sequence;
coding the character sequence to obtain a coded representation;
performing phrase interruption prediction on the word sequence to obtain a prosody vector;
connecting the prosody vector and the coding representation into a joint vector, and decoding the joint vector to obtain a first acoustic feature sequence;
calculating prosodic parameter prediction loss according to the prosodic vector, calculating acoustic feature loss according to the first acoustic feature sequence, and training the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;
and inputting the text to be processed into the voice synthesis model for processing, and synthesizing the processed second acoustic characteristic sequence into a voice waveform.
2. The world-tacontron based speech synthesis method of claim 1, wherein: the step of converting the sample text into a word sequence and a character sequence respectively comprises:
performing word segmentation on the sample text to obtain a word sequence;
and converting the sample text into a pinyin form sequence with tones, and converting the pinyin form sequence into a character sequence.
3. The world-tacontron based speech synthesis method of claim 1, wherein: the step of encoding the character sequence to obtain an encoded representation comprises:
inputting the character sequence into an encoder, and obtaining an encoded representation at an output end of the encoder; wherein the encoder is a tacontron model-based encoder.
4. The world-tacontron based speech synthesis method of claim 1, wherein: the step of performing phrase break prediction on the word sequence to obtain a prosody vector includes:
and inputting the word sequence into a prosody generator, and obtaining a prosody vector at the output end of the prosody generator.
5. The world-tacontron based speech synthesis method of claim 1, wherein: said step of concatenating said prosody vector and said coded representation into a joint vector comprises:
distributing the prosody vector corresponding to each word in the word sequence to all characters of each word to obtain a character-level prosody vector;
and splicing the coded representation and the character-level prosody vector to obtain a joint vector.
6. The world-tacontron based speech synthesis method of claim 5, wherein: the step of decoding the joint vector to obtain an acoustic feature sequence includes:
inputting the joint vector into a decoder, and obtaining an acoustic feature sequence at the output end of the decoder; wherein the decoder is a decoder with attention mechanism based on a tacotron model.
7. The world-tacontron based speech synthesis method of claim 1, wherein: the step of calculating the prosodic parameter prediction loss according to the prosodic vector, calculating the acoustic feature loss according to the first acoustic feature sequence, and training the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model comprises the following steps:
calculating the prosodic parameter prediction loss by taking the real prosody vector as a label; the real prosody vector is obtained by one-hot encoding the word sequence according to the speech audio corresponding to the sample text;
recording the mean square error of the first acoustic feature sequence and the multi-dimensional acoustic features as acoustic feature loss; wherein the multi-dimensional acoustic features are extracted from the speech audio;
and weighting and summing the prosodic parameter prediction loss and the acoustic characteristic loss to obtain global loss, and performing back propagation training by using the global loss to obtain a trained speech synthesis model.
8. The world-tacontron based speech synthesis method of claim 1, wherein: the step of synthesizing the second acoustic feature sequence into a sound waveform comprises:
inputting the second acoustic feature sequence into a vocoder, and obtaining a corresponding sound waveform at the output end of the vocoder; wherein the vocoder is a world vocoder.
9. A world-tacontron based speech synthesis system, comprising:
the input module is used for acquiring a sample text and a text to be processed;
the text preprocessing module is used for converting the sample text or the text to be processed into a word sequence and a character sequence;
the coder is used for coding the character sequence to obtain coded representation;
the prosody generator is used for carrying out phrase interruption prediction on the word sequence to obtain a prosody vector;
a processing module for concatenating the prosody vector and the encoded representation into a joint vector;
the processing module is further used for calculating prosodic parameter prediction loss according to the prosody vector, calculating acoustic feature loss according to a first acoustic feature sequence, and training the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;
the decoder is used for decoding the joint vector to obtain a first acoustic characteristic sequence or a second acoustic characteristic sequence;
a vocoder for synthesizing the second acoustic signature sequence into a sound waveform.
10. A server comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the program, implements the world-tacontron based speech synthesis method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110436317.XA CN113129862B (en) | 2021-04-22 | 2021-04-22 | Voice synthesis method, system and server based on world-tacotron |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113129862A (en) | 2021-07-16 |
CN113129862B (en) | 2024-03-12 |
Family
ID=76779099
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110436317.XA Active CN113129862B (en) | 2021-04-22 | 2021-04-22 | Voice synthesis method, system and server based on world-tacotron |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113129862B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115662435A (en) * | 2022-10-24 | 2023-01-31 | 福建网龙计算机网络信息技术有限公司 | Virtual teacher simulation voice generation method and terminal |
US11727915B1 (en) | 2022-10-24 | 2023-08-15 | Fujian TQ Digital Inc. | Method and terminal for generating simulated voice of virtual teacher |
Also Published As
Publication number | Publication date |
---|---|
CN113129862B (en) | 2024-03-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||