CN113129862A - World-tacotron-based voice synthesis method and system and server - Google Patents

World-tacotron-based voice synthesis method and system and server

Info

Publication number
CN113129862A
Authority
CN
China
Prior art keywords
sequence
vector
prosody
loss
acoustic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110436317.XA
Other languages
Chinese (zh)
Other versions
CN113129862B (en)
Inventor
卫星
杜燕妮
翟琰
赵冲
李航
帅竞贤
沈奥
陆阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202110436317.XA
Publication of CN113129862A
Application granted
Publication of CN113129862B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides a world-tacotron-based speech synthesis method, system and server. On the basis of the existing tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced: the main task is an improved tacotron model that learns acoustic feature parameter prediction based on character-level embedded representations, and the auxiliary task is a prosody generation model, i.e. a prosody generator, that learns prosody prediction based on word-level embeddings. In the training stage, the joint training of the two tasks allows more explicit prosodic knowledge to be learned in model training, thereby optimizing the quality of the output speech.

Description

World-tacotron-based voice synthesis method and system and server
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a world-tacotron-based speech synthesis method, system and server.
Background
Speech synthesis, also called text-to-speech, is a technology for converting text into natural speech and plays an important role in human-machine communication.
The common speech synthesis approach in deep learning is end-to-end speech synthesis, which learns the mapping from text to speech directly, reduces manual intervention in the intermediate stages, and lowers the difficulty of speech synthesis research.
Current end-to-end speech synthesis models are Seq2Seq models with an attention mechanism built on the encoder-decoder framework. The Tacotron model released by Google in 2017 was the first truly end-to-end speech synthesis model: it takes text or phonetic transcription strings as input, outputs a linear spectrogram, and converts the linear spectrogram into audio with the Griffin-Lim algorithm. However, the Griffin-Lim algorithm used by Tacotron requires high-dimensional acoustic features and synthesizes low-quality speech.
In 2018, Google introduced the Tacotron2 model, an improvement on Tacotron: the complex CBHG structure and GRU units are removed and replaced with LSTM and convolutional layers, the model outputs a Mel spectrogram, and the Mel spectrogram is then converted into audio by WaveNet. By replacing the Griffin-Lim algorithm with WaveNet, Tacotron2 greatly improves the quality of synthesized speech, but WaveNet requires a large amount of training data and its synthesis speed is slow.
In addition, various studies have shown that the above models can generate speech close to a human voice in English and other languages, but for languages such as Chinese they still suffer from serious prosodic phrase errors. Although there have been recent studies on prosody modeling for end-to-end TTS models, for example improving prosodic phrasing with contextual information and syntactic features, these are incorporated in the text preprocessing stage and are not optimized as part of the synthesis process, so the quality of the final synthesized speech remains poor.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a world-tacotron-based speech synthesis method, system and server, which are used to solve the problems of low speech synthesis quality, slow synthesis speed and prosodic phrase errors in the prior art.
A first aspect of the present invention provides a world-tacotron-based speech synthesis method, comprising the following steps:
obtaining a sample text, and respectively converting the sample text into a word sequence and a character sequence;
coding the character sequence to obtain a coded representation;
performing phrase break prediction on the word sequence to obtain a prosody vector;
connecting the prosody vector and the coding representation into a joint vector, and decoding the joint vector to obtain a first acoustic feature sequence;
calculating prosodic parameter prediction loss according to the prosodic vector, calculating acoustic feature loss according to the first acoustic feature sequence, and training the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;
and inputting the text to be processed into the speech synthesis model for processing, and synthesizing the resulting second acoustic feature sequence into a sound waveform.
In an embodiment of the present invention, the step of converting the sample text into a word sequence and a character sequence respectively includes:
performing word segmentation on the sample text to obtain a word sequence;
and converting the sample text into a pinyin form sequence with tones, and converting the pinyin form sequence into a character sequence.
In an embodiment of the present invention, the step of encoding the character sequence to obtain an encoded representation includes:
inputting the character sequence into an encoder, and obtaining an encoded representation at an output end of the encoder; wherein the encoder is a tacotron-model-based encoder.
In an embodiment of the present invention, the step of performing phrase break prediction on the word sequence to obtain a prosody vector includes:
and inputting the word sequence into a prosody generator, and obtaining a prosody vector at the output end of the prosody generator.
In one embodiment of the present invention, the step of concatenating the prosody vector and the coded representation into a joint vector comprises:
distributing the prosody vector corresponding to each word in the word sequence to all characters of each word to obtain a character-level prosody vector;
and splicing the coded representation and the character-level prosody vector to obtain a joint vector.
In an embodiment of the present invention, the step of decoding the joint vector to obtain the acoustic feature sequence includes:
inputting the joint vector into a decoder, and obtaining an acoustic feature sequence at the output end of the decoder; wherein the decoder is a decoder with attention mechanism based on a tacotron model.
In an embodiment of the present invention, the step of calculating a prosodic parameter prediction loss according to the prosodic vector, calculating an acoustic feature loss according to the first acoustic feature sequence, and training the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model includes:
calculating prosodic parameter prediction loss by taking the real prosody vector as a label; wherein the real prosody is obtained by one-hot encoding the word sequence according to the speech audio corresponding to the sample text;
recording the mean square error of the first acoustic feature sequence and the multi-dimensional acoustic features as acoustic feature loss; wherein the multi-dimensional acoustic features are extracted from the speech audio;
and weighting and summing the prosodic parameter prediction loss and the acoustic characteristic loss to obtain global loss, and performing back propagation training by using the global loss to obtain a trained speech synthesis model.
In an embodiment of the present invention, the step of synthesizing the second acoustic feature sequence into a sound waveform includes:
inputting the second acoustic feature sequence into a vocoder, and obtaining a corresponding sound waveform at the output end of the vocoder; wherein the vocoder is a world vocoder.
A second aspect of the present invention provides a world-tacotron-based speech synthesis system, including:
the input module is used for acquiring a sample text and a text to be processed;
the text preprocessing module is used for converting the sample text or the text to be processed into a word sequence and a character sequence;
the coder is used for coding the character sequence to obtain coded representation;
the prosody generator is used for carrying out phrase break prediction on the word sequence to obtain a prosody vector;
a processing module for concatenating the prosody vector and the encoded representation into a joint vector;
the processing module is further used for calculating prosodic parameter prediction loss according to the prosody vector, calculating acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;
the decoder is used for decoding the joint vector to obtain a first acoustic characteristic sequence or a second acoustic characteristic sequence;
a vocoder for synthesizing the second acoustic feature sequence into a sound waveform.
A third aspect of the present invention provides a server, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the world-tacotron-based speech synthesis methods of the first aspect of the present invention.
As described above, the world-tacotron-based speech synthesis method, system and server of the present invention have the following beneficial effects:
On the basis of the existing tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced: the main task is an improved tacotron model that learns acoustic feature parameter prediction based on character-level embedded representations, and the auxiliary task is a prosody generation model, i.e. a prosody generator, that learns prosody prediction based on word-level embeddings. In the training stage, the joint training of the two tasks allows more explicit prosodic knowledge to be learned in model training, thereby optimizing the quality of the output speech.
Drawings
Fig. 1 is a flowchart illustrating a world-tacotron-based speech synthesis method according to a first embodiment of the present invention.
Fig. 2 is a block diagram illustrating a world-tacotron-based speech synthesis system according to a second embodiment of the present invention.
Fig. 3 is a schematic diagram of a server according to a third embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below by way of specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure of this specification. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various respects without departing from the spirit and scope of the present invention. It should be noted that the features of the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided with the following embodiments only illustrate the basic idea of the present invention in a schematic way; the drawings show only the components related to the present invention and are not drawn according to the number, shape and size of the components in an actual implementation. In practice, the type, quantity and proportion of the components may be changed freely, and the layout of the components may be more complex.
Referring to fig. 1, a first embodiment of the present invention relates to a world-tacotron-based speech synthesis method, which includes the following steps:
Step 101, obtaining a sample text, and converting the sample text into a word sequence and a character sequence respectively.
Specifically, the step of converting the sample text into a word sequence and a character sequence respectively comprises the following steps:
taking the sample text as training data, and performing word segmentation on the sample text to obtain a word sequence; the word sequence is manually annotated: the positions of breaks introduced by the speaker are extracted according to the speaker's pausing pattern, and each word in the sample text is labelled as break or non-break depending on whether a break occurs after that word. Space, punctuation and stop-token labels occur naturally in the sample text.
It should be noted that the sample text in this embodiment carries corresponding voice audio.
Further, the sample text is converted into a pinyin form sequence with tones, and the pinyin form sequence is converted into a character sequence.
Specifically, the Chinese text is converted into a pinyin form with tones. For example, the text "你好的" ("hello") is converted into the pinyin form "ni3 hao3 de5", where the digits 1-4 denote the first to fourth tones and the neutral tone is labelled with the digit 5. The pinyin form is then converted into a character sequence.
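By way of illustration only, the text-to-pinyin step can be sketched as follows; the sketch assumes the third-party pypinyin package, which is not named in the patent, and appends the digit 5 explicitly for neutral-tone syllables:

```python
# Illustrative sketch: Chinese text -> toned pinyin -> character sequence.
# Assumes the third-party package `pypinyin` is installed (not named in the patent).
from pypinyin import lazy_pinyin, Style

def text_to_char_sequence(text: str) -> list:
    # Style.TONE3 places the tone digit at the end of each syllable, e.g. "ni3".
    syllables = lazy_pinyin(text, style=Style.TONE3)
    # Syllables without a trailing digit are neutral tone; label them with "5".
    syllables = [s + "5" if s and s[-1].isalpha() else s for s in syllables]
    pinyin_form = " ".join(syllables)        # e.g. "ni3 hao3 de5"
    return list(pinyin_form)                 # character sequence fed to the encoder

print(text_to_char_sequence("你好的"))
```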
Optionally, the sample text is normalized before word segmentation. In actual use, sample text contains a large number of non-standard words such as Arabic numerals, English characters and various symbols, and text normalization converts these non-Chinese-character tokens into the corresponding Chinese characters.
It should be understood that the sample text and the corresponding voice audio can be obtained from the public voice database, and in practical application, the user can select the appropriate database according to the requirement.
Step 102, coding the character sequence to obtain a coded representation.
Specifically, the character sequence C = [c1, c2, ..., cm] is encoded by the encoder, and a hidden-state sequence H, i.e. the encoded representation in this embodiment, is obtained at the output end of the encoder. The encoded representation is a high-level character vector denoted CE' = [ce1', ce2', ..., cem'], and the formula is expressed as CE' = H = Encoder(C).
Optionally, the encoder used in this embodiment is an encoder based on a tacotron model, and includes a character embedding layer (character embedding), a pre-network (pre-net), and a CBHG module, which are connected in sequence.
Further, the character sequence is used as the input of the character embedding layer, where each character is represented as a one-hot encoding vector; an initial, trainable embedding matrix is set, and the one-hot vector is mapped to a character embedding vector, that is, each input character is represented as a 256-dimensional character embedding vector. During model training, the embedding matrix is trained by back propagation together with the other network-layer parameters, yielding an embedding matrix capable of representing every character in the character set.
The character embedding vector then undergoes a series of nonlinear transformations in the pre-net, which comprises two fully connected layers, each followed by a ReLU activation function; the dropout rate of each fully connected layer is set to 0.5 to enhance the generalization ability of the model. The number of output channels of the first fully connected layer is set to 256 and that of the second fully connected layer to 128.
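By way of illustration, a minimal PyTorch sketch of the pre-net described above (two fully connected layers with ReLU and dropout 0.5, output channels 256 and 128) is given below; the class and argument names are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Two fully connected layers, each followed by ReLU and dropout 0.5."""
    def __init__(self, in_dim: int = 256, hidden: int = 256, out_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.dropout(torch.relu(self.fc1(x)))   # 256-dimensional
        x = self.dropout(torch.relu(self.fc2(x)))   # 128-dimensional
        return x

# Example: a batch of 2 sequences, 50 characters each, 256-dim character embeddings.
out = PreNet()(torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 50, 128])
```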
The nonlinearly transformed character embedding vector is input into the CBHG module, and the encoded representation of the character sequence is obtained at its output. The CBHG module consists of a bank of one-dimensional convolution filters, a pooling layer, a highway network and a bidirectional recurrent neural network (RNN) built from gated recurrent units (GRU).
The one-dimensional convolution filter bank effectively models local and contextual information. The input sequence first passes through a convolutional layer containing K one-dimensional filters of different sizes, where the filter size ranges over 1, 2, 3, ..., K; convolution kernels of different sizes extract context information of different lengths. The kernel sizes of the one-dimensional convolution windows used by the CBHG module are [1, 2, 3, ..., 16], and padding is set to 'same' during the convolution operation. To ensure alignment, the outputs of the K convolution kernels of different sizes are stacked together, and after concatenation along axis -1 the result of the one-dimensional convolution layers is 128 x 16 dimensional.
The next layer is a max-pooling layer, whose purpose is to increase local invariance; stride is set to 1 and width to 2. After pooling, the sequence passes through two one-dimensional convolution layers: the first has a filter size of 3, stride 1, a ReLU activation function and 128 output channels; the second has a filter size of 3, stride 1, no activation function and 128 output channels, with batch normalization (BN) applied between the two one-dimensional convolution layers.
After the convolution layers, a residual connection is applied, i.e. the output of the convolution layers is added to the sequence after the embedding layer, and the result is fed into a multi-layer highway network to extract high-level features. The highway network consists of 4 fully connected layers with ReLU activation, and the dimension of the output features is 128. Finally, the output of the highway network is input into a bidirectional RNN containing 128 GRU units per direction to extract sequence features from the forward and backward contexts; after bidirectional concatenation a 256-dimensional representation is obtained, which is the final encoded representation of the character sequence. A condensed sketch of this module is given below.
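The following condensed PyTorch sketch shows how the CBHG components described above fit together (convolution bank with kernel sizes 1..16, max pooling with stride 1, two one-dimensional convolutions with batch normalization, a residual connection, a 4-layer highway network and a bidirectional GRU with 128 units per direction); it is a simplified reading of the description, not the patented implementation, and the residual connection is taken from the CBHG input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.h = nn.Linear(dim, dim)   # candidate transform
        self.t = nn.Linear(dim, dim)   # transform gate

    def forward(self, x):
        g = torch.sigmoid(self.t(x))
        return g * F.relu(self.h(x)) + (1.0 - g) * x

class CBHG(nn.Module):
    def __init__(self, in_dim: int = 128, K: int = 16):
        super().__init__()
        # Bank of 1-D convolutions with kernel sizes 1..K ("same"-style padding).
        self.bank = nn.ModuleList(
            nn.Conv1d(in_dim, in_dim, k, padding=k // 2) for k in range(1, K + 1)
        )
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
        self.proj1 = nn.Conv1d(K * in_dim, in_dim, 3, padding=1)  # ReLU after this layer
        self.bn = nn.BatchNorm1d(in_dim)                          # BN between the two convs
        self.proj2 = nn.Conv1d(in_dim, in_dim, 3, padding=1)      # no activation
        self.highway = nn.Sequential(*[Highway(in_dim) for _ in range(4)])
        self.gru = nn.GRU(in_dim, 128, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, time, 128) pre-net output
        T = x.size(1)
        y = x.transpose(1, 2)                   # (batch, 128, time) for Conv1d
        y = torch.cat([conv(y)[:, :, :T] for conv in self.bank], dim=1)  # 128*16 channels
        y = self.pool(y)[:, :, :T]              # max pooling, stride 1
        y = self.proj2(self.bn(F.relu(self.proj1(y))))
        y = y.transpose(1, 2) + x               # residual connection
        y = self.highway(y)                     # 4-layer highway network
        out, _ = self.gru(y)                    # (batch, time, 256) encoded representation
        return out

print(CBHG()(torch.randn(2, 50, 128)).shape)    # torch.Size([2, 50, 256])
```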
Step 103, performing phrase break prediction on the word sequence to obtain a prosody vector.
Specifically, the word sequence is input into a prosody generator to predict phrase breaks, and a prosody vector PE = [pe1, pe2, ..., peT] is obtained at the output end of the prosody generator.
Optionally, the prosody generator comprises an input layer, a bi-directional LSTM layer, and an output layer.
To further illustrate, the input layer converts the input word sequence into word vectors as its feature representation. Assume the input word sequence W = [w1, ..., wt, ..., wT] has length T; it is converted into the corresponding feature vector sequence V = [v1, ..., vt, ..., vT] by looking up each word in a vocabulary of pre-trained word vectors.
The bidirectional LSTM layer reads the feature vectors V and converts them into a high-level feature representation: a forward LSTM and a backward LSTM read the feature vector sequence and output hidden-state vectors, the forward and backward hidden-state vectors are concatenated to obtain the final hidden state H, and H is sent to the output layer for decoding to obtain the final prosodic label corresponding to each word.
The output layer decodes the hidden state H output by the BiLSTM layer using a Softmax function and outputs a prosody vector PE = [pe1, ..., pet, ..., peT] for the word sequence, where pet is a 5-dimensional vector, pet = [pt[1], ..., pt[k], ..., pt[5]], t ∈ [1, T]; pt[k] denotes the probability that pattern k occurs, and k indexes the target phrase break patterns.
It should be noted that the present embodiment defines PE (prosody embedding) as a vector composed of five elements, namely break, non-break, space (blank), punctuation, and stop token, which represent five phrase break patterns.
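A minimal sketch of the prosody generator described above (embedding lookup, bidirectional LSTM, Softmax over the five phrase break patterns) is given below; the embedding and hidden dimensions are illustrative assumptions, and in practice the embedding would be initialized from the pre-trained word vectors:

```python
import torch
import torch.nn as nn

class ProsodyGenerator(nn.Module):
    """Word sequence -> per-word 5-dimensional prosody vector
    (break, non-break, space, punctuation, stop token)."""
    def __init__(self, vocab_size: int, word_dim: int = 200, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)   # load pre-trained vectors here
        self.bilstm = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 5)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        v = self.embed(word_ids)                 # (batch, T, word_dim)
        h, _ = self.bilstm(v)                    # (batch, T, 2*hidden)
        logits = self.out(h)                     # unnormalized scores per break pattern
        return torch.softmax(logits, dim=-1)     # PE: (batch, T, 5)

pe = ProsodyGenerator(vocab_size=10000)(torch.randint(0, 10000, (2, 7)))
print(pe.shape)  # torch.Size([2, 7, 5])
```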
Step 104, concatenating the prosody vector and the encoded representation into a joint vector, and decoding the joint vector to obtain a first acoustic feature sequence.
Specifically, because the word sequence and the character sequence are not of equal length, this embodiment upsamples the prosody vector: the prosody vector corresponding to each word in the word sequence is assigned to all characters of that word to obtain a character-level prosody vector PE', and the encoded representation CE' output by the encoder and the character-level prosody vector PE' are concatenated to obtain the final joint vector JE', with the expression:
JE' = [PE'; CE'].
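The up-sampling and concatenation step can be sketched as follows; the helper input chars_per_word (the number of characters belonging to each word) is an assumption of this sketch:

```python
import torch

def join_vectors(ce: torch.Tensor, pe: torch.Tensor, chars_per_word: list) -> torch.Tensor:
    """Upsample word-level prosody vectors to character level and concatenate.

    ce: (num_chars, enc_dim)  encoded representation CE' from the encoder
    pe: (num_words, 5)        word-level prosody vectors PE
    chars_per_word: number of characters belonging to each word
    """
    # Repeat each word's prosody vector once per character of that word -> PE'.
    pe_char = pe.repeat_interleave(torch.tensor(chars_per_word), dim=0)
    assert pe_char.size(0) == ce.size(0)
    # JE' = [PE'; CE'] : concatenate along the feature dimension.
    return torch.cat([pe_char, ce], dim=-1)

je = join_vectors(torch.randn(10, 256), torch.randn(4, 5), [3, 2, 4, 1])
print(je.shape)  # torch.Size([10, 261])
```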
Continuing, the joint vector JE' is input into the decoder, and an acoustic feature sequence Y is obtained at the output end of the decoder; the expression is Y = Decoder(JE').
Optionally, the decoder is a tacotron-model-based decoder with an attention mechanism, comprising the attention mechanism, a pre-net, a long short-term memory (LSTM) network and a linear projection layer.
Further, the decoder is an autoregressive recurrent neural network. In the decoding process, the decoding result of the previous step is taken as input and passes through a pre-net consisting of 2 fully connected layers, each with 256 hidden ReLU units. The pre-net acts as an information bottleneck, which is essential for learning attention.
The output of the pre-net is concatenated with the attention context vector computed by the attention mechanism and input into a 2-layer LSTM for decoding; each LSTM layer contains 1024 units.
Specifically, this embodiment uses the location-sensitive attention mechanism of the Tacotron2 model, which extends content-based attention: location features are obtained by convolving the cumulative attention weights with 32 one-dimensional convolution kernels of length 31, and the inputs are projected to a 128-dimensional hidden feature space to compute the attention weights. The formulas are as follows:

e_{i,j} = v^T tanh(W s_i + V h_j + U f_{i,j} + b)

f_i = F * cα_{i-1}

α_i = softmax(e_i),  cα_i = Σ_{k=1..i} α_k

where s_i is the current decoder hidden state, h_j is the current encoder hidden state, W, V and U are the corresponding weight matrices, b is a bias value that is initially a zero vector, f_{i,j} is the location feature obtained by convolving the previous attention weights, and the location feature f_i is computed from the cumulative attention weights cα_{i-1} (F denotes the convolution filters and * denotes convolution).
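A simplified PyTorch sketch of the location-sensitive attention described above (32 convolution filters of length 31, a 128-dimensional attention space) follows; it mirrors the Tacotron2 formulation, and the tensor dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LocationSensitiveAttention(nn.Module):
    def __init__(self, query_dim=1024, enc_dim=256, attn_dim=128, n_filters=32, kernel=31):
        super().__init__()
        self.W = nn.Linear(query_dim, attn_dim, bias=False)   # projects decoder state s_i
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)     # projects encoder states h_j
        self.U = nn.Linear(n_filters, attn_dim, bias=False)   # projects location features
        self.F = nn.Conv1d(1, n_filters, kernel, padding=kernel // 2, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)
        self.b = nn.Parameter(torch.zeros(attn_dim))          # bias, initialized to zero

    def forward(self, s, h, cum_alpha):
        # s: (batch, query_dim); h: (batch, T, enc_dim); cum_alpha: (batch, T)
        f = self.F(cum_alpha.unsqueeze(1)).transpose(1, 2)     # location features (batch, T, n_filters)
        e = self.v(torch.tanh(self.W(s).unsqueeze(1) + self.V(h) + self.U(f) + self.b))
        alpha = torch.softmax(e.squeeze(-1), dim=-1)           # attention weights
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # attention context vector
        return context, alpha, cum_alpha + alpha               # accumulate weights for next step

attn = LocationSensitiveAttention()
ctx, a, cum = attn(torch.randn(2, 1024), torch.randn(2, 40, 256), torch.zeros(2, 40))
print(ctx.shape, a.shape)  # torch.Size([2, 256]) torch.Size([2, 40])
```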
The output of the LSTM layers, which is regularized using zoneout with a probability of 0.1, is again concatenated with the attention context vector and then passed through a linear projection to predict the acoustic feature sequence.
Step 105, calculating prosodic parameter prediction loss according to the prosody vector, calculating acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model.
Specifically, the real prosody vector is used as the label to calculate the prosodic parameter prediction loss; the real prosody is obtained by manually annotating the word sequence according to the multidimensional acoustic features and then applying one-hot encoding. The multidimensional acoustic features are extracted from the speech audio corresponding to the sample text; specifically, they are 38-dimensional and include 1-dimensional fundamental frequency parameters, 32-dimensional spectral envelope parameters and 5-dimensional aperiodic parameters. The extraction steps are as follows (a code sketch follows the list):
1. input the waveform and estimate the fundamental frequency (F0 contour) with the DIO algorithm;
2. take F0 and the waveform as input and estimate the 513-dimensional spectral envelope (SP) with the CheapTrick algorithm;
3. input F0, SP and the waveform, and estimate the 513-dimensional aperiodic parameters (AP) with the D4C algorithm;
4. apply the DCT (discrete cosine transform) algorithm to reduce the dimensionality of the spectral envelope and the aperiodic parameters, compressing them to 32 and 5 dimensions respectively.
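Assuming the WORLD analysis is performed with the pyworld Python bindings and the DCT from scipy (neither package is named in the patent), the extraction of the 38-dimensional features could look like the following sketch; the log compression before the DCT is an additional assumption:

```python
# Sketch of 38-dimensional WORLD feature extraction (1 F0 + 32 SP + 5 AP).
# Assumes the third-party packages `pyworld`, `scipy` and `soundfile` are available.
import numpy as np
import pyworld as pw
import soundfile as sf
from scipy.fftpack import dct

def extract_world_features(wav_path: str) -> np.ndarray:
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)
    f0, t = pw.dio(x, fs)                  # 1. fundamental frequency contour (DIO)
    sp = pw.cheaptrick(x, f0, t, fs)       # 2. spectral envelope (CheapTrick)
    ap = pw.d4c(x, f0, t, fs)              # 3. aperiodic parameters (D4C)
    # 4. DCT-based dimensionality reduction: keep the first 32 / 5 coefficients.
    sp_32 = dct(np.log(sp), axis=1, norm="ortho")[:, :32]
    ap_5 = dct(ap, axis=1, norm="ortho")[:, :5]
    return np.concatenate([f0[:, None], sp_32, ap_5], axis=1)   # (frames, 38)
```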
Continuing, the cross-entropy loss during training is used as the loss function for prosodic parameter prediction; with the one-hot labels y_t and the predicted probabilities p_t defined above, it takes the standard form:

Loss_pe = -Σ_{t=1..T} Σ_{k=1..5} y_t[k] log(p_t[k])

where k indexes the target phrase break patterns.
The mean square error between the first acoustic feature sequence and the multidimensional acoustic features is taken as the acoustic feature loss, where the multidimensional acoustic features are extracted from the speech audio. The acoustic feature loss is calculated as the standard mean squared error:

Loss_wav = (1/N) Σ_t (y_t - y_t')^2

where Loss_wav represents the acoustic feature loss of the audio, y_t represents the true acoustic features, y_t' denotes the predicted acoustic features, and N is the number of frames.
The prosodic parameter prediction loss and the acoustic feature loss are weighted and summed to obtain the global loss, and back-propagation training is performed with the global loss to obtain the trained speech synthesis model. The specific calculation is:

Loss_total = Loss_wav + w * Loss_pe

where Loss_total represents the global loss, w represents a weight parameter, and Loss_pe represents the prosodic parameter prediction loss.
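A short sketch of the dual-task loss combination, with symbols following the formulas above; the weight value used here is illustrative only:

```python
import torch
import torch.nn.functional as F

def global_loss(pe_logits, pe_target, y_pred, y_true, w: float = 0.25):
    """Loss_total = Loss_wav + w * Loss_pe (w = 0.25 is an illustrative value).

    pe_logits: (batch, T, 5)        unnormalized prosody predictions
    pe_target: (batch, T)           integer break-pattern labels (from one-hot annotation)
    y_pred:    (batch, frames, 38)  predicted acoustic features
    y_true:    (batch, frames, 38)  WORLD acoustic features extracted from the audio
    """
    loss_pe = F.cross_entropy(pe_logits.transpose(1, 2), pe_target)  # prosody loss
    loss_wav = F.mse_loss(y_pred, y_true)                            # acoustic feature loss
    return loss_wav + w * loss_pe

loss = global_loss(torch.randn(4, 7, 5), torch.randint(0, 5, (4, 7)),
                   torch.randn(4, 120, 38), torch.randn(4, 120, 38))
print(loss.item())
```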
The training batch size (batch_size) is set to 32, and the model is trained with an Adam optimizer with learning-rate decay, with β1 = 0.9 and β2 = 0.999; starting at step 50k, the learning rate decays dynamically from 10^-3 to 10^-5. The acoustic model and the prosody generation model are jointly trained for 200k steps to obtain the final speech synthesis model.
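The training configuration could be set up as in the following sketch; the exact shape of the decay curve is not specified above, so an exponential decay from 10^-3 toward 10^-5 starting at step 50k is assumed, and the model is a stand-in:

```python
import torch
import torch.nn as nn

model = nn.Linear(38, 38)   # stand-in for the joint acoustic model + prosody generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

def learning_rate(step: int) -> float:
    """1e-3 until step 50k, then exponential decay floored at 1e-5 (assumed curve)."""
    if step < 50_000:
        return 1e-3
    return max(1e-5, 1e-3 * 0.5 ** ((step - 50_000) / 25_000))

for step in range(200_000):                        # joint training for 200k steps
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(step)
    optimizer.zero_grad()
    batch = torch.randn(32, 38)                    # batch_size = 32 (placeholder data)
    loss = model(batch).pow(2).mean()              # placeholder for Loss_total
    loss.backward()
    optimizer.step()
```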
With this scheme, the network parameters are updated by back-propagation training on the global loss function, and the two tasks are trained simultaneously under a multi-task learning framework: the prosody generator is trained and updated together with the other Tacotron modules, prosodic information is integrated into the end-to-end acoustic modeling process, and the two models optimize each other during training. The accuracy of the prosody generation model is ultimately improved with the help of the end-to-end acoustic model, so that more accurate prosody vectors are obtained and the performance of speech synthesis is finally improved.
Step 106, inputting the text to be processed into the speech synthesis model for processing, and synthesizing the resulting second acoustic feature sequence into a sound waveform.
Specifically, the text to be processed may be the sample text or other text; it is input into the speech synthesis model, and the corresponding second acoustic feature sequence is finally obtained through the processing of the above steps.
The second acoustic signature sequence is input into the vocoder, and the corresponding acoustic waveform is obtained at the output of the vocoder.
Optionally, the vocoder is a world vocoder.
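Waveform generation with the WORLD vocoder can be sketched with pyworld (an assumption of this sketch, since the patent only names the world vocoder); the compressed 32/5-dimensional parameters are first expanded back to full resolution, here with zero-padding and an inverse DCT, mirroring the extraction sketch above:

```python
import numpy as np
import pyworld as pw
from scipy.fftpack import idct

def synthesize_waveform(f0, sp_32, ap_5, fs=16000, fft_size=1024, frame_period=5.0):
    """Expand compressed WORLD features and synthesize a waveform.
    Zero-padding before the inverse DCT is an assumption of this sketch."""
    bins = fft_size // 2 + 1                                        # 513 spectral bins
    sp = np.exp(idct(np.pad(sp_32, ((0, 0), (0, bins - 32))), axis=1, norm="ortho"))
    ap = np.clip(idct(np.pad(ap_5, ((0, 0), (0, bins - 5))), axis=1, norm="ortho"), 0.0, 1.0)
    return pw.synthesize(np.ascontiguousarray(f0, dtype=np.float64),
                         np.ascontiguousarray(sp),
                         np.ascontiguousarray(ap),
                         fs, frame_period)
```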
Therefore, on the basis of the existing tacotron model, prosodic information is integrated into the end-to-end acoustic modeling process and a dual-task learning framework is introduced: the main task is an improved tacotron model that learns acoustic feature parameter prediction based on character-level embedded representations, and the auxiliary task is a prosody generation model, i.e. the prosody generator, which learns prosody prediction based on word-level embeddings. In the training stage, the joint training of the two tasks allows more explicit prosodic knowledge to be learned in model training, thereby optimizing the quality of the output speech; meanwhile, a world vocoder is used at the output end, which greatly reduces the size of the model and increases the speed of speech synthesis.
A second embodiment of the invention relates to a world-tacotron-based speech synthesis system.
Please refer to fig. 2, which specifically includes:
the input module is used for acquiring a sample text and acquiring a text to be processed; wherein the sample text carries corresponding voice audio.
And the text preprocessing module is used for converting the sample text into a word sequence and a character sequence.
The encoder is used for encoding the character sequence to obtain an encoded representation.
Optionally, the encoder used in this embodiment is an encoder based on a tacotron model, and includes a character embedding layer (character embedding), a pre-network (pre-net), and a CBHG module, which are connected in sequence.
Specifically, the input character sequence passes through the character embedding layer, which outputs character embedding vectors of fixed dimension; the pre-net applies a series of nonlinear transformations to the character embedding vectors; and the CBHG module converts the nonlinearly transformed character embedding vectors into the encoded representation of the character sequence.
The prosody generator is used for carrying out phrase break prediction on the word sequence to obtain a prosody vector.
optionally, the prosody generator comprises an input layer, a bi-directional LSTM layer, and an output layer.
Specifically, the input layer converts the input word sequence into a word vector, and the word vector is used as a feature representation; the bidirectional LSTM layer reads the input word vector and converts the word vector into high-level feature representation; the output layer decodes the hidden state output by the bidirectional LSTM layer by using a Softmax function and outputs a prosody vector corresponding to each word.
A processing module for concatenating the prosody vector and the encoded representation into a joint vector; the processing module is further used for calculating prosodic parameter prediction loss according to the prosody vector, calculating acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model.
and the decoder is used for decoding the joint vector to obtain a first acoustic feature sequence or a second acoustic feature sequence.
Optionally, the decoder is a tacotron-model-based decoder with an attention mechanism, comprising the attention mechanism, a pre-net, a long short-term memory (LSTM) network and a linear projection layer.
Specifically, the decoder is an autoregressive recurrent neural network. The LSTM network concatenates the output of the pre-net with the attention context vector computed by the attention mechanism and outputs a new context vector to the linear projection layer, and the linear projection layer performs a linear projection on the new context vector and outputs the acoustic feature sequence.
A vocoder for synthesizing the second acoustic signature sequence into a sound waveform.
Optionally, the vocoder is a world vocoder.
It should be noted that the processing module is typically the central processing unit (CPU) of the whole system and may be configured with a corresponding operating system and control interface. Specifically, it may be a digital logic processor usable for automatic control, such as a single-chip microcomputer, a DSP (digital signal processor) or an ARM processor, which can load control instructions into memory for storage and execution at any time, and may have built-in units such as CPU instruction and data memories, input/output units, a power module and digital/analog units; these may be configured according to the actual use conditions, and the present scheme is not limited thereto.
Therefore, this embodiment combines the prosody generator, the world vocoder and the optimized tacotron model on the basis of the existing tacotron model; it avoids the costly training of WaveNet and its long synthesis time, and uses acoustic features of lower dimensionality, which not only increases the speech synthesis speed but also improves the speech synthesis quality of the tacotron model.
Referring to fig. 3, a third embodiment of the present invention relates to a server, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the processor implements any one of the methods described in the first embodiment.
Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
In summary, the world-tacotron-based speech synthesis method, system and server of the present invention integrate prosodic information into the end-to-end acoustic modeling process on the basis of the existing tacotron model and introduce a dual-task learning framework: the main task is an improved tacotron model that learns acoustic feature parameter prediction based on character-level embedded representations, and the auxiliary task is a prosody generation model, i.e. a prosody generator, that learns prosody prediction based on word-level embeddings. In the training stage, the joint training of the two tasks allows more explicit prosodic knowledge to be learned in model training, thereby optimizing the quality of the output speech. The invention therefore effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A world-tacotron-based speech synthesis method, characterized by comprising the following steps:
obtaining a sample text, and respectively converting the sample text into a word sequence and a character sequence;
coding the character sequence to obtain a coded representation;
performing phrase break prediction on the word sequence to obtain a prosody vector;
connecting the prosody vector and the coding representation into a joint vector, and decoding the joint vector to obtain a first acoustic feature sequence;
calculating prosodic parameter prediction loss according to the prosodic vector, calculating acoustic feature loss according to the first acoustic feature sequence, and training the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;
and inputting the text to be processed into the speech synthesis model for processing, and synthesizing the resulting second acoustic feature sequence into a sound waveform.
2. The world-tacotron-based speech synthesis method of claim 1, wherein: the step of converting the sample text into a word sequence and a character sequence respectively comprises:
performing word segmentation on the sample text to obtain a word sequence;
and converting the sample text into a pinyin form sequence with tones, and converting the pinyin form sequence into a character sequence.
3. The world-tacotron-based speech synthesis method of claim 1, wherein: the step of encoding the character sequence to obtain an encoded representation comprises:
inputting the character sequence into an encoder, and obtaining an encoded representation at an output end of the encoder; wherein the encoder is a tacotron-model-based encoder.
4. The world-tacotron-based speech synthesis method of claim 1, wherein: the step of performing phrase break prediction on the word sequence to obtain a prosody vector includes:
and inputting the word sequence into a prosody generator, and obtaining a prosody vector at the output end of the prosody generator.
5. The world-tacotron-based speech synthesis method of claim 1, wherein: the step of concatenating the prosody vector and the coded representation into a joint vector comprises:
distributing the prosody vector corresponding to each word in the word sequence to all characters of each word to obtain a character-level prosody vector;
and splicing the coded representation and the character-level prosody vector to obtain a joint vector.
6. The world-tacotron-based speech synthesis method of claim 5, wherein: the step of decoding the joint vector to obtain an acoustic feature sequence includes:
inputting the joint vector into a decoder, and obtaining an acoustic feature sequence at the output end of the decoder; wherein the decoder is a decoder with attention mechanism based on a tacotron model.
7. The world-tacotron-based speech synthesis method of claim 1, wherein: the step of calculating the prosodic parameter prediction loss according to the prosody vector, calculating the acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model comprises:
calculating prosodic parameter prediction loss by taking the real prosody vector as a label; wherein the real prosody is obtained by one-hot encoding the word sequence according to the speech audio corresponding to the sample text;
recording the mean square error of the first acoustic feature sequence and the multi-dimensional acoustic features as acoustic feature loss; wherein the multi-dimensional acoustic features are extracted from the speech audio;
and weighting and summing the prosodic parameter prediction loss and the acoustic characteristic loss to obtain global loss, and performing back propagation training by using the global loss to obtain a trained speech synthesis model.
8. The world-tacotron-based speech synthesis method of claim 1, wherein: the step of synthesizing the second acoustic feature sequence into a sound waveform comprises:
inputting the second acoustic feature sequence into a vocoder, and obtaining a corresponding sound waveform at the output end of the vocoder; wherein the vocoder is a world vocoder.
9. A world-tacotron-based speech synthesis system, characterized by comprising:
the input module is used for acquiring a sample text and a text to be processed;
the text preprocessing module is used for converting the sample text or the text to be processed into a word sequence and a character sequence;
the coder is used for coding the character sequence to obtain coded representation;
the prosody generator is used for carrying out phrase break prediction on the word sequence to obtain a prosody vector;
a processing module for concatenating the prosody vector and the encoded representation into a joint vector;
the processing module is further used for calculating prosodic parameter prediction loss according to the prosody vector, calculating acoustic feature loss according to the first acoustic feature sequence, and training with the prosodic parameter prediction loss and the acoustic feature loss to obtain a speech synthesis model;
the decoder is used for decoding the joint vector to obtain a first acoustic characteristic sequence or a second acoustic characteristic sequence;
a vocoder for synthesizing the second acoustic feature sequence into a sound waveform.
10. A server comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that: the processor, when executing the program, implements the world-tacotron-based speech synthesis method of any one of claims 1-8.
CN202110436317.XA 2021-04-22 2021-04-22 Voice synthesis method, system and server based on world-tacotron Active CN113129862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110436317.XA CN113129862B (en) 2021-04-22 2021-04-22 Voice synthesis method, system and server based on world-tacotron

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110436317.XA CN113129862B (en) 2021-04-22 2021-04-22 Voice synthesis method, system and server based on world-tacotron

Publications (2)

Publication Number Publication Date
CN113129862A true CN113129862A (en) 2021-07-16
CN113129862B CN113129862B (en) 2024-03-12

Family

ID=76779099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110436317.XA Active CN113129862B (en) 2021-04-22 2021-04-22 Voice synthesis method, system and server based on world-tacotron

Country Status (1)

Country Link
CN (1) CN113129862B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal


Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6961704B1 (en) * 2003-01-31 2005-11-01 Speechworks International, Inc. Linguistic prosodic model-based text to speech
US20120290302A1 (en) * 2011-05-10 2012-11-15 Yang Jyh-Her Chinese speech recognition system and method
CN105989833A (en) * 2015-02-28 2016-10-05 讯飞智元信息科技有限公司 Multilingual mixed-language text character-pronunciation conversion method and system
US20200394998A1 (en) * 2018-08-02 2020-12-17 Neosapience, Inc. Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN111354333A (en) * 2018-12-21 2020-06-30 中国科学院声学研究所 Chinese prosody hierarchy prediction method and system based on self-attention
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure
CN110459202A (en) * 2019-09-23 2019-11-15 浙江同花顺智能科技有限公司 A kind of prosodic labeling method, apparatus, equipment, medium
KR20200111608A (en) * 2019-12-16 2020-09-29 휴멜로 주식회사 Apparatus for synthesizing speech and method thereof
CN110797006A (en) * 2020-01-06 2020-02-14 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium
CN111326136A (en) * 2020-02-13 2020-06-23 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and storage medium
CN111640418A (en) * 2020-05-29 2020-09-08 数据堂(北京)智能科技有限公司 Prosodic phrase identification method and device and electronic equipment
CN111667816A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Model training method, speech synthesis method, apparatus, device and storage medium
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN111883149A (en) * 2020-07-30 2020-11-03 四川长虹电器股份有限公司 Voice conversion method and device with emotion and rhythm
CN112331177A (en) * 2020-11-05 2021-02-05 携程计算机技术(上海)有限公司 Rhythm-based speech synthesis method, model training method and related equipment
CN112365880A (en) * 2020-11-05 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112669810A (en) * 2020-12-16 2021-04-16 平安科技(深圳)有限公司 Speech synthesis effect evaluation method and device, computer equipment and storage medium

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
DIAOHAN LUO: "On End-to-End Chinese Speech Synthesis Based on World-Tacotron", 2020 International Conference on Culture-oriented Science & Technology
党建成; 周晶: "Speech Synthesis Technology and Its Applications" (语音合成技术及其应用), Computer & Information Technology, no. 06
崔鑫彤: "Patent Analysis of Speech Synthesis Technology" (语音合成技术专利分析), Electronic Technology & Software Engineering, no. 04
张斌 et al.: "A Survey of Speech Synthesis Methods and Development" (语音合成方法和发展综述), Journal of Chinese Computer Systems
张鹏远; 卢春晖; 王睿敏: "Chinese Prosodic Structure Prediction Based on a Pre-trained Language Representation Model" (基于预训练语言表示模型的汉语韵律结构预测), Journal of Tianjin University (Science and Technology), no. 03
朱维彬: "A Chinese Speech Synthesis System Supporting Stress Synthesis" (支持重音合成的汉语语音合成系统), Journal of Chinese Information Processing, no. 03, 15 May 2007
林举; 解焱陆; 张劲松; 张微: "Prosodic Boundary Detection Based on Tone Nucleus Parameters and DNN Modeling" (基于声调核参数及DNN建模的韵律边界检测研究), Journal of Chinese Information Processing, no. 06
王国梁: "An End-to-End Chinese Speech Synthesis Scheme Based on Tacotron2" (一种基于Tacotron2的端到端中文语音合成方案), Journal of East China Normal University (Natural Science)
谢永斌: "Research on End-to-End Speech Synthesis Based on Small Datasets" (基于少量数据集的端到端语音合成技术研究), China Master's Theses Full-text Database (Information Science and Technology)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
US11727915B1 (en) 2022-10-24 2023-08-15 Fujian TQ Digital Inc. Method and terminal for generating simulated voice of virtual teacher

Also Published As

Publication number Publication date
CN113129862B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
US11881205B2 (en) Speech synthesis method, device and computer readable storage medium
CN110264991A (en) Training method, phoneme synthesizing method, device, equipment and the storage medium of speech synthesis model
CN110782870A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112863483A (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN110767213A (en) Rhythm prediction method and device
CN111276120A (en) Speech synthesis method, apparatus and computer-readable storage medium
US8626510B2 (en) Speech synthesizing device, computer program product, and method
CN112352275A (en) Neural text-to-speech synthesis with multi-level textual information
CN111354333B (en) Self-attention-based Chinese prosody level prediction method and system
JP7112075B2 (en) Front-end training method for speech synthesis, computer program, speech synthesis system, and front-end processing method for speech synthesis
CN111967334B (en) Human body intention identification method, system and storage medium
CN112599113B (en) Dialect voice synthesis method, device, electronic equipment and readable storage medium
CN113257220B (en) Training method and device of speech synthesis model, electronic equipment and storage medium
CN113205792A (en) Mongolian speech synthesis method based on Transformer and WaveNet
CN114882862A (en) Voice processing method and related equipment
Lazaridis et al. Improving phone duration modelling using support vector regression fusion
CN113129862B (en) Voice synthesis method, system and server based on world-tacotron
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN114999447B (en) Speech synthesis model and speech synthesis method based on confrontation generation network
CN112802451B (en) Prosodic boundary prediction method and computer storage medium
CN112735379B (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN115547293A (en) Multi-language voice synthesis method and system based on layered prosody prediction
CN113823259B (en) Method and device for converting text data into phoneme sequence
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
CN117524193B (en) Training method, device, equipment and medium for Chinese-English mixed speech recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant