CN112802450B - Prosody-controllable Chinese-English mixed speech synthesis method and system

Info

Publication number: CN112802450B
Authority: CN (China)
Application number: CN202110008079.2A
Other versions: CN112802450A (Chinese)
Prior art keywords: text, duration, pitch, module, energy
Inventor: 盛乐园
Assignee (current and original): Hangzhou Yizhi Intelligent Technology Co., Ltd.
Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Filed: 2021-01-05 by Hangzhou Yizhi Intelligent Technology Co., Ltd.
Published: 2021-05-14 as CN112802450A; granted: 2022-11-18 as CN112802450B

Classifications

    • G10L 13/08: Speech synthesis; text-to-speech systems; text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; stress or intonation
    • G06N 3/044: Neural network architectures; recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G10L 25/30: Speech or voice analysis techniques characterised by the use of neural networks

Abstract

The invention discloses a prosody-controllable Chinese-English mixed speech synthesis method and system, belonging to the field of speech synthesis. 1) Phoneme pronunciation durations, energy, pitch and mel spectrograms are extracted from prosody-labelled Chinese and English texts and audio as a training set, and a text encoding aligned to the mel-spectrogram length is learned. 2) An energy and pitch prediction model generates predicted energy and pitch, enabling energy and pitch control. 3) The predicted energy and pitch are combined with the text encoding, a decoder outputs the synthesized mel spectrogram, and a vocoder then synthesizes the speech. The invention uses a skip neural network encoder to better control the prosodic pause information in the synthesized speech, while the predicted duration, energy and pitch finely control the per-frame prosodic pronunciation information, generating speech with richer prosodic variation. The whole pipeline is completed by a single model, without first distinguishing the language of the text.

Description

Prosody-controllable Chinese-English mixed speech synthesis method and system
Technical Field
The invention belongs to the field of speech synthesis, relates to Chinese-English mixed speech synthesis, and in particular provides a prosody-controllable Chinese-English mixed speech synthesis method and system.
Background
With the development of deep learning in recent years, speech synthesis technology has improved greatly. Speech synthesis has moved from the traditional parametric and concatenative approaches towards end-to-end approaches, which typically first generate a mel spectrogram from text features and then synthesize speech from the mel spectrogram with a vocoder. These end-to-end methods can be divided by structure into autoregressive and non-autoregressive models. Autoregressive models usually generate output with an Encoder-Attention-Decoder mechanism: to generate the current data point, all previous data points in the time series must first be generated and fed back as model input, as in Tacotron, Tacotron 2, Deep Voice 3, ClariNet and Transformer TTS. Although autoregressive models can produce satisfactory results, the attention alignment may fail, leading to repeated or missing words in the synthesized speech. Non-autoregressive models such as ParaNet, FastSpeech, AlignTTS and FastSpeech 2 generate the mel spectrogram from text features in parallel, much faster than autoregressive models.
Many different voices can be synthesized for a given text and speaker, because the synthesized speech is also affected by prosody, which in turn can be controlled through fundamental frequency, energy and duration. Prosody can be finely controlled along energy, duration and pitch in an unsupervised manner; fine control of prosody in a supervised manner, as in FastSpeech 2, has also been studied. However, the currently available speech synthesis models have several disadvantages in practical production due to their complex network structures and autoregressive form:
(1) The models are complex and demand substantial computing resources, so they cannot run on hardware with limited computing resources.
(2) The naturalness of long-sentence synthesis decreases because of the defects of the autoregressive structure.
(3) For speech control, mostly only duration, energy and pitch are adjusted, so the prosody of the speech is controlled incompletely.
(4) Chinese-English mixed text cannot be synthesized.
Therefore, how to run a Chinese-English mixed synthesis system with all-round prosody adjustment on hardware with limited computing resources, and to synthesize more natural speech, remains an unsolved problem in the field of computer speech synthesis.
Disclosure of Invention
The present invention is directed to solving the problems of the prior art. On the one hand, to overcome the incomplete prosody control of existing synthesized speech, the invention controls the prosody of synthesized speech through four features: prosody-labelled text, duration, energy and pitch. On the other hand, to overcome the need of existing Chinese-English mixed synthesis systems to first identify the language of the text and then synthesize each language with its own model, the invention provides a single Chinese-English mixed model that synthesizes text directly without language identification. By optimizing the speech synthesis model, the invention reduces the computing-resource requirements of traditional complex speech models, overcomes the defects of the autoregressive network structure, synthesizes Chinese-English mixed text, and comprehensively controls the prosody of the synthesized speech.
To achieve this purpose, the invention adopts the following technical scheme:
A prosody-controllable Chinese-English mixed speech synthesis method comprises the following steps:
1) Acquiring Chinese-English mixed sample texts with prosody labels and the corresponding reference speech audio, converting the audio into real mel spectrograms, and extracting the real energy, real pitch and real duration of the audio; processing the prosody-labelled sample texts to obtain phoneme sequences with prosody tags;
2) Constructing a mixed speech synthesis model comprising a skip neural network encoder, a duration prediction module, a pitch prediction module, an energy prediction module and a decoder, wherein the skip neural network encoder consists of an Embedding layer, a CBHG module and a skip module;
3) Training the constructed mixed speech synthesis model with the prosody-tagged phoneme sequences, specifically:
3.1) Processing the prosody-tagged phoneme sequence through the Embedding layer and the CBHG module in sequence to obtain text encodings; removing the prosody tags from the text encodings with the skip module, and obtaining the predicted duration of the text encodings with the duration prediction module;
3.2) Adjusting the length of the tag-free text encodings according to the duration information, then feeding them to the pitch prediction module and the energy prediction module respectively to obtain the predicted pitch and predicted energy; combining the predicted pitch, the predicted energy and the duration-adjusted text encodings as the decoder input to obtain the predicted mel spectrogram;
3.3) Computing a duration loss from the predicted and real durations, a pitch loss from the predicted and real pitches, an energy loss from the predicted and real energies, and a mel-spectrogram loss from the predicted and real mel spectrograms; training the mixed speech synthesis model end to end on the combination of these losses (a sketch of the combined loss follows this list);
4) Preprocessing the text to be synthesized and feeding it to the trained mixed speech synthesis model to obtain the predicted mel spectrogram, which a vocoder then converts into the output speech.
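As an illustration of the combined objective in step 3.3), the following is a minimal sketch. The loss types (MSE for duration, pitch and energy, L1 for the mel spectrogram) and the equal weighting are assumptions; the patent only states that the four losses are combined.

```python
import torch.nn.functional as F

def hybrid_tts_loss(pred, target):
    """Combine the four training losses of step 3.3.

    `pred` and `target` are dicts of tensors with matching shapes for
    "duration", "pitch", "energy" and "mel". The MSE/L1 choices and
    equal weights are assumptions, not values from the patent.
    """
    duration_loss = F.mse_loss(pred["duration"], target["duration"])
    pitch_loss = F.mse_loss(pred["pitch"], target["pitch"])
    energy_loss = F.mse_loss(pred["energy"], target["energy"])
    mel_loss = F.l1_loss(pred["mel"], target["mel"])
    return duration_loss + pitch_loss + energy_loss + mel_loss
```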
Further, the prosody tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends and character boundaries. The tags are added by a pre-trained prosodic phrase boundary prediction model: the text to be synthesized is input into this model, which outputs the text annotated with prosody tags.
Further, the decoder consists of a bi-directional LSTM and a linear affine transform.
Furthermore, the duration prediction module, the pitch prediction module and the energy prediction module each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit (GRU), and a linear affine transformation, as sketched below.
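A minimal PyTorch sketch of that shared predictor structure. The hidden size, kernel width and dropout rate are assumptions; the patent fixes only the layer types.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Duration/pitch/energy predictor: three Conv1D layers, each with a
    regularization (LayerNorm) layer, then a bidirectional GRU and a
    linear affine transformation giving one scalar per position."""

    def __init__(self, dim=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            for _ in range(3))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))
        self.gru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(dim, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                 # x: (batch, length, dim)
        for conv, norm in zip(self.convs, self.norms):
            y = conv(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (B, C, T)
            x = self.dropout(torch.relu(norm(y)))
        x, _ = self.gru(x)                                # bidirectional context
        return self.proj(x).squeeze(-1)                   # (batch, length) scalars
```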
The invention realizes prosody control in the synthesized speech through the skip neural network encoder, the duration/energy/pitch prediction modules and the neural network decoder. Compared with the prior art, the invention has the following beneficial effects:
(1) Compared with the traditional approach of building several separate models, the invention maps text directly to acoustic features, so a prediction error in one component cannot degrade the whole pipeline; this improves the fault tolerance of the model and reduces deployment cost. The CBHG module effectively models the current symbol together with its context, extracts higher-level features and captures the context features of the sequence, so the model can learn pronunciation features from data that are hard to define by hand, effectively improving the pronunciation quality of the speech.
(2) Before the text is input to the speech synthesis model it is preprocessed: prosody tags are added through prosodic phrase boundary prediction. This makes the text prosody controllable and removes the drop in naturalness on long sentences caused by unnatural prosodic pauses in traditional speech synthesis methods. Through supervision, the prosody-labelled text, duration, energy and pitch are finely controlled, so the prosody of the synthesized speech is controlled more comprehensively and speech with richer, more natural prosodic variation is synthesized.
(3) The duration prediction module simplifies the training of the speech synthesis model: traditional end-to-end models align text and audio dynamically with an attention module, which consumes substantial computing resources and time, while the invention avoids the autoregressive attention alignment of text and audio altogether. This reduces the computing-resource requirements and computation cost of the model, and allows a single Chinese-English mixed model to replace the traditional pair of independent Chinese and English models. The method is not limited to Chinese-English mixed speech synthesis and can also be applied to other language mixtures.
Drawings
FIG. 1 is an overall schematic diagram of the prosody-controllable Chinese-English mixed speech synthesis model used in the present invention;
FIG. 2 is a schematic diagram of a pitch/energy/duration prediction model used in the present invention;
FIG. 3 shows the experimental result without text prosody control;
FIG. 4 shows the experimental result of the method of the present invention (text prosody tags added; normal energy, normal duration, normal pitch);
FIG. 5 shows the experimental result with increased energy;
FIG. 6 shows the experimental result with increased duration;
FIG. 7 shows the experimental result with normal pitch;
FIG. 8 shows the experimental result with the pitch scaled by 1.5.
Detailed Description
The invention is further illustrated and described below with reference to the drawings and the detailed description.
Compared with a typical Chinese-English mixed speech synthesis solution, the skip neural network encoder gives better control over the prosodic pause information in the synthesized speech, while the predicted duration, energy and pitch finely control the per-frame prosodic pronunciation information, so the characteristics of the synthesized speech can be controlled more precisely and speech with richer prosodic variation is generated. Moreover, the whole solution is completed by a single model without first identifying the language of the text, which reduces model complexity.
As shown in fig. 1, the prosody-controllable Chinese-English mixed speech synthesis method of the present invention includes the following steps:
Step 1: the input Chinese-English text sequence with prosody labels is preprocessed and fed to the skip neural network encoder, which consists of an Embedding layer, a CBHG module and a skip module;
Step 2: the output of the skip neural network encoder, combined with the output of the CBHG module, is length-adjusted to obtain the duration-adjusted text encodings;
Step 3: the duration-adjusted text encodings are fed to the pitch prediction module and the energy prediction module respectively to obtain the predicted pitch and predicted energy; the predicted pitch, predicted energy and duration-adjusted text encodings are combined as the decoder input to obtain the predicted mel spectrogram, from which the vocoder outputs the synthesized speech.
In one specific implementation, the mixed speech synthesis model adopted by the invention consists of a skip neural network encoder, a duration prediction module, a pitch prediction module, an energy prediction module and a decoder, where the skip neural network encoder consists of an Embedding layer, a CBHG module and a skip module. The input text is processed by the mixed speech synthesis model as follows:
1) The Chinese and English in the text are each converted into their corresponding pronunciation phonemes, and a Chinese-English mixed phoneme dictionary is constructed. Using this dictionary, the prosody-labelled Chinese and English phonemes are mapped to serialized data, giving a phoneme sequence w_1, w_2, …, w_U, where U is the length of the text and w_i is the phoneme information of the i-th token.
2) The serialized text data (phoneme sequence w_1, w_2, …, w_U) is converted by the Embedding layer into a phoneme vector sequence x_1, x_2, …, x_U. The prosody tags comprise prosodic words (#1), prosodic phrases (#2), intonation phrases (#3), sentence ends (#4) and character boundaries (#S):
x_1, x_2, …, x_U = Embedding(w_1, w_2, …, w_U)
where x_i is the phoneme vector of the i-th token and Embedding(·) denotes the embedding operation. For example, the text "She holds her shoes in her hands and deliberately steps on the bottom of the puddle with her bare feet." becomes, after prosody labelling, "she holds #1 shoes #1 in #1 hand #3 bare #1 feet #2 deliberately steps on #1 puddle #4".
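A toy sketch of this serialization, assuming a one-token-per-word segmentation. The dictionary entries are illustrative only; a real front end would use pypinyin for Chinese and a CMUdict-style lexicon for English.

```python
# Hypothetical mini dictionary mapping tokens to pronunciation phonemes.
PHONE_DICT = {
    "她": ["t", "a1"], "拿": ["n", "a2"], "着": ["zh", "e5"],
    "hello": ["HH", "AH0", "L", "OW1"],
}
PROSODY_TAGS = {"#1", "#2", "#3", "#4", "#S"}

def text_to_phonemes(tokens):
    """Map a prosody-labelled mixed-language token list to one flat
    phoneme sequence, keeping each prosody tag as a symbol of its own
    (an assumed representation of the serialized data)."""
    sequence = []
    for token in tokens:
        sequence.extend([token] if token in PROSODY_TAGS else PHONE_DICT[token])
    return sequence

# text_to_phonemes(["她", "拿", "着", "#1", "hello", "#4"])
# -> ['t', 'a1', 'n', 'a2', 'zh', 'e5', '#1', 'HH', 'AH0', 'L', 'OW1', '#4']
```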
3) The converted phoneme vector sequence x_1, x_2, …, x_U is input to the CBHG module, and the result is passed through the duration prediction module and the skip module to produce the predicted durations and the skip-encoded text encodings without prosody-tag positions. The CBHG module used in this embodiment comprises a bank of one-dimensional convolution filters whose kernels effectively model the current symbol together with its context, followed by a multi-layer highway network that extracts higher-level features, and finally a bidirectional GRU recurrent neural network that extracts the context features of the sequence. Expressed as a formula:
t_1, t_2, …, t_U = CBHG(x_1, x_2, …, x_U)
where t_i is the encoding of the i-th token in the text.
4) Because prosody tags were added to the serialized text data but the tags themselves have no pronunciation duration, skip encoding removes them, producing t′_1, t′_2, …, t′_U′ with U′ < U, where U′ is the text length after the prosody tags are removed (a sketch follows):
t′_1, t′_2, …, t′_U′ = Skip_state(t_1, t_2, …, t_U)
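A minimal sketch of the Skip_state step; how the tag positions are marked is an assumption, since the patent only states that they are removed before duration expansion.

```python
import torch

def skip_state(encodings, tag_mask):
    """Drop encoder outputs at prosody-tag positions so only
    pronounceable symbols remain. `encodings` is (U, dim); `tag_mask`
    is a (U,) boolean tensor, True at #1/#2/#3/#4/#S positions."""
    return encodings[~tag_mask]                    # (U', dim) with U' < U

# Example: drop positions 2 and 5 of a six-symbol encoding.
t = torch.randn(6, 256)
mask = torch.tensor([False, False, True, False, False, True])
t_prime = skip_state(t, mask)                      # shape (4, 256)
```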
5) The skip-encoded text encodings t′_1, t′_2, …, t′_U′ without prosody-tag positions are length-expanded with the help of the duration prediction module. The expansion criterion is: in the training stage, the length must match the real mel spectrogram; in the prediction stage, the trained duration prediction module outputs a predicted duration for each token, and each token is expanded according to that duration. The expansion yields the duration-adjusted text encodings t″_1, t″_2, …, t″_T, where T is the number of frames of the extracted real mel spectrogram:
t″_1, t″_2, …, t″_T = State_Expand(t′_1, t′_2, …, t′_U′)
The duration prediction model and the energy and pitch prediction models share the same network structure: three one-dimensional convolution layers with regularization layers for feature extraction, a bidirectional GRU that learns the relationship between neighbouring phoneme features, and finally a linear affine transformation that predicts the duration/energy/pitch.
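A minimal sketch of State_Expand under the assumption that durations are integer frame counts (predicted real-valued durations would be rounded first):

```python
import torch

def state_expand(encodings, durations):
    """Repeat each tag-free encoding by its duration in frames, so the
    output length equals the mel-spectrogram length T."""
    return torch.repeat_interleave(encodings, durations, dim=0)

# Three phonemes lasting 2, 5 and 3 frames expand to T = 10 frames.
t_prime = torch.randn(3, 256)
t_double_prime = state_expand(t_prime, torch.tensor([2, 5, 3]))  # (10, 256)
```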
6) Pitch and energy prediction: the tag-free text encodings, after duration adjustment with the predicted duration information, are fed to the pitch prediction module and the energy prediction module respectively to obtain the predicted pitch and predicted energy, which control the energy and pitch of the generated audio.
7) The predicted pitch and energy are combined with the text encodings t″_1, t″_2, …, t″_T into the combined text encoding features E_1, E_2, …, E_T. For the duration-adjusted encodings, the energy and pitch prediction modules give:
e_1, e_2, …, e_T = Energy_Predictor(t″_1, t″_2, …, t″_T)
p_1, p_2, …, p_T = Pitch_Predictor(t″_1, t″_2, …, t″_T)
E_1, E_2, …, E_T = ((e_1, e_2, …, e_T) + (p_1, p_2, …, p_T)) * (t″_1, t″_2, …, t″_T)
where E_1, E_2, …, E_T are the combined text encodings, e_1, e_2, …, e_T the predicted energies, p_1, p_2, …, p_T the predicted pitches, and t″_1, t″_2, …, t″_T the duration-adjusted text encodings.
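A sketch of one literal reading of that combination formula; the broadcast of the scalar sum over the feature dimension is an assumed interpretation:

```python
import torch

T, dim = 120, 256
t = torch.randn(T, dim)          # duration-adjusted text encodings t''
e = torch.randn(T)               # predicted per-frame energy
p = torch.randn(T)               # predicted per-frame pitch

# E = ((e) + (p)) * (t''): each frame vector is scaled by the sum of
# its scalar energy and pitch predictions.
E = (e + p).unsqueeze(-1) * t    # (T, dim), input to the decoder
```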
8) The text encoding features E_1, E_2, …, E_T are decoded to generate the predicted mel spectrogram.
In this embodiment of the present invention, the decoder consists of a bidirectional LSTM and a linear affine transformation, which can be expressed as follows.
Encoding by the BLSTM:
h_fw,t = LSTM_fw(E_t, h_fw,t-1)
h_bw,t = LSTM_bw(E_t, h_bw,t+1)
Combining the hidden states of the two directions gives the representation h*:
h*_t = [h_fw,t ; h_bw,t]
A linear affine transformation of the obtained h* then generates the predicted mel spectrogram:
M_1, M_2, …, M_T = Linear(h*)
finally, the generated Mel frequency spectrum is synthesized into the voice with controllable rhythm by a common vocoder.
In one embodiment of the present invention, as shown in fig. 2, the duration prediction module, the pitch prediction module and the energy prediction module each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit (GRU), and a linear affine transformation.
Prosody tags are added to the text to be synthesized by a pre-trained prosodic phrase boundary prediction model: the text to be synthesized is input into this model, which outputs the text annotated with prosody tags. The pre-trained model uses a decision tree or a BLSTM-CRF to predict phrase boundaries and insert the prosody tags.
Compared with the traditional approach of building several separate models, the invention maps text directly to acoustic features and trains end to end: a duration loss is computed from the predicted and real durations, a pitch loss from the predicted and real pitches, an energy loss from the predicted and real energies, and a mel-spectrogram loss from the predicted and real mel spectrograms, and the mixed speech synthesis model is trained end to end on the combination of these losses. This avoids a single model's prediction errors degrading the whole pipeline and therefore improves fault tolerance.
The invention also discloses a prosody-controllable Chinese-English mixed speech synthesis system, comprising:
Text preprocessing module (front end): converts the text into a phoneme sequence with prosody tags and, when the mixed speech synthesis system is in training mode, outputs the real mel spectrogram, real energy, real pitch and real duration from the reference speech audio corresponding to the text.
Encoder: encodes the prosody-tagged phoneme sequence; the encoder contains an Embedding layer, a CBHG module and a skip module.
Duration prediction module: predicts the duration of the text encodings output by the CBHG module and outputs the predicted durations.
Alignment module: aligns the text encodings output by the encoder using the predicted durations. In the training stage the length must match the real mel spectrogram; in the prediction stage the trained duration prediction module outputs a predicted duration for each token and each token is expanded accordingly, yielding the duration-adjusted text encodings t″_1, t″_2, …, t″_T, where T is the number of frames of the extracted real mel spectrogram.
Pitch prediction module: reads the output sequence of the alignment module and predicts (or controls) its pitch.
Energy prediction module: reads the output sequence of the alignment module and predicts (or controls) its energy.
Decoder: combines the duration-adjusted text encodings, the predicted pitch and the predicted energy, and decodes the combined encoding into the speech mel spectrogram.
Vocoder: enabled when the mixed speech synthesis system is in speech synthesis mode; it automatically reads the mel spectrogram output by the decoder and converts it into a sound signal to play the speech.
The speech synthesis system must be trained before use. During training, a duration loss is computed from the predicted and real durations, a pitch loss from the predicted and real pitches, an energy loss from the predicted and real energies, and a mel-spectrogram loss from the predicted and real mel spectrograms; the mixed speech synthesis model is trained end to end on the combination of these losses.
Specifically, the text preprocessing module (front end) receives text data, normalizes the text, parses XML tags, and maps the normalized text to serialized data with the Chinese-English mixed phoneme dictionary, giving a phoneme sequence w_1, w_2, …, w_U, where U is the length of the text. The prosody labelling process is as follows: the prosody tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends and character boundaries, and are added by a pre-trained prosodic phrase boundary prediction model into which the text to be synthesized is input and from which the tagged text is output. The training samples used in the training stage may be prosody-labelled data from an open-source database.
Specifically, the main function of the encoder is to learn the text features of each sample's phoneme sequence, so that the sequence can be converted into a fixed-dimension vector representing those features. Compared with the manual feature extraction step of traditional parametric speech synthesis, the encoder learns representative feature vectors from data, whereas manual feature extraction requires a large amount of human effort for statistical analysis and greatly increases labour cost. Moreover, whereas manually extracted features may be incomplete, the learned feature vectors can capture sufficient feature information provided the data coverage is comprehensive.
Specifically, the duration prediction module and the alignment module length-expand the encodings output by the encoder. Introducing the duration prediction module simplifies the training of the speech synthesis model: traditional end-to-end models align text and audio dynamically with an attention module, which consumes substantial computing resources and time.
In addition, the duration prediction module, the pitch prediction module and the energy prediction module allow prosody to be adjusted along duration, pitch and energy; concretely, an adjustable parameter is applied after each module's output, multiplying the output result by a coefficient, as sketched below.
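A minimal sketch of that adjustment; the function and parameter names are illustrative, and the plain multiplicative form follows the description (pitch_scale=1.5 corresponds to the setting used for fig. 8).

```python
def apply_prosody_controls(duration, pitch, energy,
                           duration_scale=1.0, pitch_scale=1.0,
                           energy_scale=1.0):
    """Multiply each predictor's output by an adjustable coefficient
    before it is used downstream, e.g. a larger duration_scale slows
    the speech and a larger energy_scale makes it louder."""
    return (duration * duration_scale,
            pitch * pitch_scale,
            energy * energy_scale)
```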
Specifically, compared with traditional decoders, this decoder has a simple structure consisting only of a bidirectional LSTM and a linear affine transformation, which greatly improves decoding speed.
The method is applied in the following embodiments to achieve the technical effects of the present invention; implementation details already described above are not repeated.
Examples
The invention was tested on a dataset of 12500 audio recordings with corresponding prosody-labelled texts: 10000 Chinese, 2000 English and 500 Chinese-English mixed. The dataset was preprocessed as follows:
1) The Chinese and English phoneme files and corresponding audio were extracted, and the pronunciation duration of each phoneme was obtained with the open-source Montreal Forced Aligner.
2) An 80-dimensional mel spectrogram was extracted for each audio, with a window size of 50 milliseconds and a frame shift of 12.5 milliseconds.
3) For each audio, the pitch was extracted with the World vocoder.
4) The energy was obtained by summing the extracted mel spectrogram over its frequency dimension.
A sketch of this preprocessing follows.
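A minimal sketch of steps 2) to 4) under stated assumptions: the sample rate and the use of librosa are assumptions, while the window size, frame shift, mel dimension, World pitch extraction and energy-as-sum follow the text above.

```python
import librosa
import numpy as np
import pyworld

SAMPLE_RATE = 22050                      # assumed; the patent does not state it
WIN_LENGTH = int(0.050 * SAMPLE_RATE)    # 50 ms window
HOP_LENGTH = int(0.0125 * SAMPLE_RATE)   # 12.5 ms frame shift

def extract_features(wav_path):
    """Extract the 80-dim mel spectrogram, World pitch contour, and
    energy (per-frame sum over the mel bins) for one audio file."""
    wav, _ = librosa.load(wav_path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=WIN_LENGTH,
        win_length=WIN_LENGTH, hop_length=HOP_LENGTH, n_mels=80)
    f0, _ = pyworld.harvest(wav.astype(np.float64), SAMPLE_RATE,
                            frame_period=12.5)   # pitch via the World vocoder
    energy = mel.sum(axis=0)                     # sum over the 80 mel bins
    return mel, f0, energy
```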
To objectively evaluate the performance of the algorithm, the following comparative tests on text prosody, energy, duration and pitch were performed on a selected test set containing Chinese-English mixed text:
1. Figs. 4 and 3, experiment one: text with prosody control added is compared with text without it.
2. Figs. 5 and 4, experiment two: increased energy is compared with normal energy.
3. Figs. 6 and 4, experiment three: increased duration is compared with normal duration.
4. Figs. 7 and 8, experiment four: changed pitch is compared with normal pitch.
5. Control group: experiments performed with the method of the present invention at normal settings.
Experiment one shows that with text prosody added (fig. 4), the same text content yields longer synthesized speech than without it (fig. 3), as seen on the abscissa. Because the text prosody tags represent pauses of varying degrees, adding them makes the synthesized speech more natural.
Experiment two shows that with the energy information increased relative to fig. 4, the synthesized speech of fig. 5 is louder for the same input and appears more intense in the spectrogram (in the black-and-white image, a larger off-white area).
Experiment three shows that with the duration of each phoneme increased relative to fig. 4, the synthesized speech of fig. 6 is longer and slower, as seen on the abscissa; at the same time the energy is spread over more frames, so the speech sounds softer.
To verify the influence of pitch change more clearly and accurately, longer utterances were synthesized for observation. Only the pitch parameter differs between figs. 7 and 8, each of which shows two curves: the upper curve is energy and the lower curve is pitch. Fig. 8 increases the pitch relative to fig. 7; comparing the lower curves, the increased-pitch curve fluctuates more and is higher. Since speech with higher pitch generally carries more energy, a slight change can also be found when comparing the upper curves. Listening to the synthesized speech confirms these changes.
The above experiments show that the mixed speech synthesis system of the invention can control text prosody, energy, duration and pitch during synthesis, which benefits the wide application of speech synthesis systems in industrial scenarios.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.

Claims (10)

1. A prosody-controllable Chinese-English mixed speech synthesis method, characterized by comprising the following steps:
1) acquiring Chinese-English mixed sample texts with prosody labels and the corresponding reference speech audio, converting the audio into real mel spectrograms, and extracting the real energy, real pitch and real duration of the audio; processing the prosody-labelled sample texts to obtain phoneme sequences with prosody tags;
2) constructing a mixed speech synthesis model comprising a skip neural network encoder, a duration prediction module, a pitch prediction module, an energy prediction module and a decoder, wherein the skip neural network encoder consists of an Embedding layer, a CBHG module and a skip module;
3) training the constructed mixed speech synthesis model with the prosody-tagged phoneme sequences, specifically:
3.1) processing the prosody-tagged phoneme sequence through the Embedding layer and the CBHG module in sequence to obtain text encodings; removing the prosody tags from the text encodings with the skip module, and obtaining the predicted duration of the text encodings with the duration prediction module;
3.2) adjusting the length of the tag-free text encodings according to the duration information, then feeding them to the pitch prediction module and the energy prediction module respectively to obtain the predicted pitch and predicted energy; combining the predicted pitch, the predicted energy and the duration-adjusted text encodings as the decoder input to obtain the predicted mel spectrogram;
3.3) computing a duration loss from the predicted and real durations, a pitch loss from the predicted and real pitches, an energy loss from the predicted and real energies, and a mel-spectrogram loss from the predicted and real mel spectrograms, and training the mixed speech synthesis model end to end on the combination of these losses;
4) preprocessing the text to be synthesized and feeding it to the trained mixed speech synthesis model to obtain the predicted mel spectrogram, which a vocoder then converts into the output speech.
2. The prosody-controlled Chinese-English mixed speech synthesis method according to claim 1, characterized in that processing the prosody-labelled Chinese-English mixed sample text in step 1) to obtain the prosody-tagged phoneme sequence specifically comprises:
converting the Chinese and English in the text into their respective pronunciation phonemes and constructing a Chinese-English mixed phoneme dictionary; mapping the prosody-labelled Chinese and English phonemes to serialized data with this dictionary to obtain a phoneme sequence w_1, w_2, …, w_U, where U is the length of the text.
3. The prosody-controlled Chinese-English mixed speech synthesis method according to claim 1, characterized in that step 3.1) specifically comprises:
3.1.1) converting the prosody-tagged phoneme sequence w_1, w_2, …, w_U through the Embedding layer into a phoneme vector sequence x_1, x_2, …, x_U, where U is the length of the text;
3.1.2) using the converted phoneme vector sequence as input to the CBHG module to generate the text encodings t_1, t_2, …, t_U, and passing the output of the CBHG module through the duration prediction module and the skip module respectively to generate the predicted durations and the skip-encoded text encodings t′_1, t′_2, …, t′_U′ without prosody-tag positions, where U′ < U and U′ is the text length after the prosody tags are removed.
4. The method as claimed in claim 1, characterized in that the prosody tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends and character boundaries.
5. The prosody-controlled Chinese-English mixed speech synthesis method according to claim 1, characterized in that the preprocessing in step 4) specifically comprises: adding prosody tags to the text to be synthesized with a pre-trained prosodic phrase boundary prediction model, into which the text to be synthesized is input and from which the prosody-tagged text is output.
6. The prosody-controlled Chinese-English mixed speech synthesis method according to claim 1, characterized in that the duration adjustment in step 3.2) specifically comprises: length-expanding the skip-encoded text encodings t′_1, t′_2, …, t′_U′ without prosody-tag positions with the help of the duration prediction module, the expansion criterion being: in the training stage, the length must match the real mel spectrogram; in the prediction stage, the trained duration prediction module outputs a predicted duration for each token and each token is expanded according to that duration; the expansion yields the duration-adjusted text encodings t″_1, t″_2, …, t″_T, where T is the number of frames of the extracted real mel spectrogram and U′ is the text length after the prosody tags are removed.
7. The prosody-controlled Chinese-English mixed speech synthesis method according to claim 1, characterized in that the predicted pitch, predicted energy and duration-adjusted text encodings in step 3.2) are combined by the formula:
E_1, E_2, …, E_T = ((e_1, e_2, …, e_T) + (p_1, p_2, …, p_T)) * (t″_1, t″_2, …, t″_T)
where E_1, E_2, …, E_T are the combined text encodings, e_1, e_2, …, e_T the predicted energies, p_1, p_2, …, p_T the predicted pitches, t″_1, t″_2, …, t″_T the duration-adjusted text encodings, and T the number of frames of the extracted real mel spectrogram.
8. The prosody-controlled Chinese-English mixed speech synthesis method according to claim 1, characterized in that the decoder comprises a bidirectional LSTM and a linear affine transformation.
9. The prosody-controlled Chinese-English mixed speech synthesis method according to claim 1, characterized in that the duration prediction module, the pitch prediction module and the energy prediction module each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit (GRU), and a linear affine transformation.
10. A speech synthesis system based on the prosody-controlled Chinese-English mixed speech synthesis method of claim 1, comprising:
a text preprocessing module for converting the Chinese-English mixed text into a phoneme sequence with prosody tags and, when the mixed speech synthesis system is in training mode, outputting the real mel spectrogram, real energy, real pitch and real duration from the reference speech audio corresponding to the text;
an encoder for encoding the prosody-tagged phoneme sequence, the encoder containing an Embedding layer, a CBHG module and a skip module;
a duration prediction module for predicting the duration of the text encodings output by the CBHG module and outputting the predicted durations;
an alignment module for aligning the text encodings output by the encoder using the predicted durations, wherein in the training stage the length must match the real mel spectrogram, and in the prediction stage the trained duration prediction module outputs a predicted duration for each token and each token is expanded accordingly, yielding the duration-adjusted text encodings;
a pitch prediction module for reading the output of the alignment module and generating the predicted pitch;
an energy prediction module for reading the output of the alignment module and generating the predicted energy;
a decoder for combining the duration-adjusted text encodings, the predicted pitch and the predicted energy, and decoding the combined encoding into the predicted mel spectrogram; and
a vocoder, enabled when the mixed speech synthesis system is in speech synthesis mode, for automatically reading the predicted mel spectrogram output by the decoder and converting it into a sound signal for speech playback.
CN202110008079.2A (priority date 2021-01-05, filing date 2021-01-05) Prosody-controllable Chinese-English mixed speech synthesis method and system. Active as CN112802450B.

Priority Applications (1)

CN202110008079.2A, priority date 2021-01-05, filing date 2021-01-05: Prosody-controllable Chinese-English mixed speech synthesis method and system
Publications (2)

CN112802450A, published 2021-05-14
CN112802450B (granted), published 2022-11-18

Family ID: 75808187
Family applications (1): CN202110008079.2A, Active, filed 2021-01-05
Country status (1): CN

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant