CN117789771A - Cross-language end-to-end emotion voice synthesis method and system - Google Patents
- Publication number
- CN117789771A (application CN202311545240.5A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- voice
- module
- text
- language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the field of intelligent digital signal processing, in particular to a cross-language end-to-end emotion voice synthesis method and system. By training a deep neural network model with this method, given a text to be synthesized in language A and an emotional reference speech in language B, natural, fluent speech of the target speaker in language A with good emotional expression can be synthesized. The specific method comprises the following steps: collecting original training data of voice-text pairs, extracting frequency-domain features of the voice, discretely encoding the text, extracting language-independent emotion embedded codes, constructing a complete end-to-end emotion voice synthesis model, and performing supervised training. The speech synthesis model comprises an emotion text fusion coding module, a target duration prediction module, a posterior coding module, an audio decoding module and a discrimination module. After the speech synthesis model is trained to convergence, the required emotional speech of the target speaker can be inferred through the prior encoding module, the duration prediction module and the audio decoding module.
Description
Technical Field
The invention relates to the field of intelligent digital signal processing, in particular to a cross-language end-to-end emotion voice synthesis method and system.
Background
Speech synthesis is a technique that converts text into natural, fluent speech audio. In recent years, deep learning has proven highly effective in digital signal processing: deep neural networks, with their strong generalization and fitting capability, have driven continuous advances in speech synthesis technology, achieving remarkable progress. Traditional speech synthesis systems tend to sound flat and monotonous, whereas emotion speech synthesis endows the synthesized speech with vivid emotional color, making the system more human-like and natural. Research in this direction also relies mainly on deep learning, especially models such as the Recurrent Neural Network (RNN) and the Variational Autoencoder (VAE); by introducing emotion-related labels or features during training, the synthesized speech can be made to express emotional states such as anger or fear.
The Chinese patent application with publication number CN115359774A discloses a cross-language speech synthesis method based on end-to-end timbre and emotion transfer. That method collects a small amount of a user's speech, preprocesses it into speech data, and feeds the data into a trained learning network architecture: a speaker encoder extracts an embedded vector carrying the speaker's timbre and emotion characteristics, a synthesizer generates the Mel spectrum of the target speech, and finally a vocoder converts the Mel spectrum into the target speech. The processing pipeline is complex and the synthesis quality is limited.
Moreover, that method lacks a targeted decoupling process for the emotion contained in the speech, so the emotional effect of the generated speech is mediocre and overly dependent on the quality of the original speaker data, leaving insufficient flexibility in use.
Disclosure of Invention
The invention aims to meet the need for generating emotional speech in a given language under the low-resource condition in which emotion data for that language are missing.
In order to achieve the above purpose, the present invention is realized by the following technical scheme.
The invention provides a cross-language end-to-end emotion voice synthesis method, which comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: the system comprises an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion coding module is used for acquiring, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme duration;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence and the emotion embedded code into a complete end-to-end emotion voice synthesis model trained to be converged, and outputting to obtain a synthesized emotion voice audio result.
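A common way to connect the duration prediction module to the audio decoding module is to expand each phoneme-level embedding into frame-level features by its predicted duration. The following Python sketch of such a "length regulation" step is illustrative only; the function name and data layout are assumptions, not taken from the patent:

```python
from typing import List, Sequence

def length_regulate(embeddings: Sequence[List[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Expand phoneme-level embeddings to frame level: repeat each
    embedding vector by its predicted duration (in frames)."""
    assert len(embeddings) == len(durations)
    frames: List[List[float]] = []
    for emb, dur in zip(embeddings, durations):
        frames.extend([emb] * max(dur, 0))  # guard against negative predictions
    return frames

# two phonemes with predicted durations 2 and 3 -> 5 output frames
frames = length_regulate([[0.1, 0.2], [0.3, 0.4]], [2, 3])
```

The expanded frame sequence then has the temporal resolution the up-sampling decoder expects.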
As one of the improvements of the above technical solution, the full end-to-end emotion voice synthesis model further includes: the posterior coding module and the discrimination module;
the posterior coding module is used for extracting posterior coding of the voice in the training stage; the posterior coding and the prior coding are constrained by the same Gaussian distribution, and the prior coding comprises an intermediate embedded representation and a phoneme duration;
the discrimination module forms a generative-adversarial relationship with the audio decoding module and is used for improving the generation capability of the audio decoding module.
As one of the improvements of the above technical solution, the frequency domain features of the extracted speech include: linear spectrum and Mel-domain spectrum; wherein,
the linear spectrum X_k(ω) of the kth short period is expressed as:
X_k(ω) = Σ_n x(n) h(n − kL) e^(−jωn)
wherein x(n) is the original voice signal, h(·) is the selected window function, L is the frame shift, ω represents frequency, e is the base of the natural logarithm, and j is the imaginary unit;
the ith-dimension Mel-domain spectrum Mel_k(ω, i) of the kth short period is expressed as:
Mel_k(ω, i) = Fbank(i, ω_max, ω_min, ω) P_k(ω)
wherein P_k(ω) is the short-time power spectrum of the kth short period, P_k(ω) = |X_k(ω)|²; Fbank is an i-dimensional Mel-domain filter bank constructed from a preset maximum frequency ω_max and a minimum frequency ω_min.
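The Mel-domain computation above can be sketched in Python. The snippet below builds a triangular Mel filter bank as a stand-in for Fbank(i, ω_max, ω_min, ω) and applies it to a single-frame power spectrum P_k(ω) = |X_k(ω)|²; the bin-mapping details are standard conventions assumed here, not specified by the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max):
    """Triangular Mel filter bank over the n_fft//2 + 1 linear-frequency bins,
    spanning [f_min, f_max]."""
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):                      # rising slope
            fbank[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):                     # falling slope
            fbank[i - 1, b] = (right - b) / max(right - center, 1)
    return fbank

n_fft, sr = 512, 16000
frame = np.random.randn(n_fft)
power = np.abs(np.fft.rfft(frame)) ** 2                    # P_k(w) = |X_k(w)|^2
mel = mel_filterbank(20, n_fft, sr, 0.0, sr / 2) @ power   # Mel_k(w, i)
```

The result is one 20-dimensional Mel-domain frame; stacking such frames over all short periods yields the Mel spectrogram.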
As one of the improvements of the above technical solution, the discrete coding of text data includes: all the different phonemes in the text data are mapped to positive integers starting from 1 in any order, and the text data is converted into a sequence comprising a number of positive integers.
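The discrete coding described above amounts to a lookup table from phonemes to positive integers. A minimal sketch (function names are illustrative assumptions):

```python
def build_phoneme_table(texts):
    """Map every distinct phoneme to a positive integer starting from 1.
    Any fixed order is valid; sorting just makes the mapping reproducible."""
    phonemes = sorted({p for text in texts for p in text})
    return {p: idx for idx, p in enumerate(phonemes, start=1)}

def encode_text(text, table):
    """Convert a phoneme-level text into its positive-integer sequence."""
    return [table[p] for p in text]

corpus = [["n", "i", "h", "ao"], ["h", "ao"]]
table = build_phoneme_table(corpus)   # {"ao": 1, "h": 2, "i": 3, "n": 4}
seq = encode_text(corpus[0], table)   # [4, 3, 2, 1]
```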
As one of the improvements of the above technical scheme, the language-independent emotion embedded code is obtained by inputting each voice into a pre-trained emotion judgment system;
the emotion distinguishing system comprises a classifier formed by a two-way parallel neural network and a speaker decoupling network;
the two-path parallel neural network comprises two branches, a Transformer structure and a convolutional neural network, together with a linear projection layer on the trunk for dimension reduction and classification output;
the speaker decoupling network comprises a time-delay neural network (TDNN, Time Delay Neural Network) structure.
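As a rough illustration of the TDNN building block named above, the sketch below implements one time-delay layer in numpy: each output frame is an affine transform of the input frames at fixed temporal offsets, followed by a nonlinearity. The offsets, dimensions, and ReLU are illustrative assumptions, not parameters given by the patent:

```python
import numpy as np

def tdnn_layer(x, weight, bias, context=(-2, 0, 2)):
    """One time-delay layer: every output frame is an affine transform of
    the input frames at the given temporal offsets, followed by a ReLU.
    x: (T, d_in); weight: (len(context) * d_in, d_out); bias: (d_out,)."""
    T, d_in = x.shape
    pad = max(abs(c) for c in context)
    xp = np.pad(x, ((pad, pad), (0, 0)))          # zero-pad the time axis
    out = np.empty((T, weight.shape[1]))
    for t in range(T):
        ctx = np.concatenate([xp[t + pad + c] for c in context])
        out[t] = ctx @ weight + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))                  # 50 frames, 8-dim features
w = rng.standard_normal((3 * 8, 16)) * 0.1
h = tdnn_layer(x, w, np.zeros(16))
```

Stacking several such layers with widening contexts gives the network a large temporal receptive field, which is why TDNNs are popular for speaker representations.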
As an improvement of the above technical solution, the sorting forms a training data set, including:
for each voice, packing an original signal, a speaker ID, an extracted voice linear frequency spectrum, a Mel domain spectrum, a quantized phoneme text sequence and a language independent emotion embedding code into a tuple, and converting the original signal, the frequency spectrum, the Mel domain spectrum and the emotion embedding code into a floating point type high-dimensional tensor, and converting the speaker ID and the quantized phoneme text sequence into a long-integer tensor; all data are divided into a training set and a testing set according to a certain capacity proportion.
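The packing step can be sketched in plain Python, with nested float lists and ints standing in for the floating-point and long-integer tensors; the field order, the 90/10 split ratio, and the fixed seed are illustrative assumptions:

```python
import random

def pack_example(raw, speaker_id, linear_spec, mel_spec, phoneme_seq, emotion_emb):
    """One training tuple: continuous features as floats (stand-ins for
    floating-point tensors), IDs and phoneme sequences as ints (long tensors)."""
    return (
        [float(v) for v in raw],
        int(speaker_id),
        [[float(v) for v in row] for row in linear_spec],
        [[float(v) for v in row] for row in mel_spec],
        [int(p) for p in phoneme_seq],
        [float(v) for v in emotion_emb],
    )

def train_test_split(examples, train_ratio=0.9, seed=0):
    """Shuffle and split by a fixed capacity proportion."""
    pool = list(examples)
    random.Random(seed).shuffle(pool)
    cut = int(len(pool) * train_ratio)
    return pool[:cut], pool[cut:]

data = [pack_example([0.0, 0.1], i, [[1.0]], [[2.0]], [1, 2], [0.5])
        for i in range(10)]
train_set, test_set = train_test_split(data)
```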
The invention also provides a cross-language end-to-end emotion voice synthesis system, and the processing process of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: the system comprises an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion coding module is used for acquiring, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme duration;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence and the emotion embedded code into a fully end-to-end emotion voice synthesis model which is established in advance and trained to be converged, and outputting to obtain a synthesized emotion voice audio result.
Compared with the prior art, the invention has the advantages that:
In general, the method constructs a complete end-to-end speech synthesis model architecture that directly establishes a mapping from text to speech and, by exploiting the strong feature-learning capability of deep learning, can synthesize speech that is natural, fluent, and close to human speech in auditory perception. The method breaks through the bottleneck that emotion is difficult to describe accurately in speech: by decoupling emotion from speech and encoding it independently, it overcomes the obstacle of low-resource data conditions and realizes cross-language emotion synthesis. The method specifically offers the following advantages:
1. The pre-trained emotion discrimination system is used as the encoder. Because the discrimination system is trained on a classification objective, it is better able to decouple emotional expression from speech, so the emotion coding is more accurate and pure and the generated result better matches human hearing;
2. the data collection and training process imposes no restriction on the language of the audio or the text, so the synthesis system can work across languages and is not affected by residual traits of the speaker's own language that arise in single-language coding;
3. inference can proceed from a reference speech alone; no specific emotion category or description needs to be given. Supplying only an "example sentence" whose emotion the output should approximate lets the target speaker's voice be synthesized imitating the emotion of the reference speech, which improves flexibility of use.
Drawings
FIG. 1 is a diagram of the overall architecture of the present invention;
FIG. 2 is a flow chart of a cross-language end-to-end emotion voice synthesis method of the present invention;
FIG. 3 is a diagram showing the constitution of the emotion judging system of the present invention;
FIG. 4 is a diagram of emotion text fusion encoding module components;
FIG. 5 is a block diagram of a posterior coding module;
FIG. 6 is a diagram of a target duration prediction module;
fig. 7 is a diagram showing the constitution of the audio decoding module and the discriminating module.
Detailed Description
As shown in FIG. 1, the invention provides a cross-language end-to-end emotion voice synthesis method. By training a deep neural network model with this method, given a text to be synthesized in language A and an emotional reference speech in language B, natural, fluent speech of the target speaker in language A with good emotional expression can be synthesized. The method comprises the following steps:
step 1) collecting a large number of paired speech-phoneme level text data;
step 2) performing data preprocessing by means of digital signal processing and the like, including extracting frequency domain features of voice and performing discretization coding on text; extracting language independent emotion embedded codes of the voice by utilizing a pre-trained emotion judging system;
step 3) carrying out normalization processing on the results obtained in the steps 1-2 to form a training data set for a training model for standby;
step 4) constructing a complete end-to-end emotion voice synthesis model, training the model by utilizing the data set (original audio, key features and emotion embedded codes) prepared in the step 3 until the model converges, finishing training, and storing the model;
and 5) training to a converged emotion voice synthesis model by utilizing the step 4, inputting parameters to be synthesized and reference voices, and automatically reasoning emotion voice synthesis audio results of the target speaker by the model.
As an improvement of the above method, the step 2) specifically includes:
The frequency-domain features of the speech are extracted as follows: the original speech signal at the sampling-point level is framed and windowed, and the linear spectrum is obtained through the short-time Fourier transform:
X_k(ω) = Σ_n x(n) h(n − kL) e^(−jωn)
where x(n) is the original signal, h(·) is the selected window function, k indexes the kth short period, L is the frame shift, ω represents frequency, e is the base of the natural logarithm, and j is the imaginary unit.
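The framing, windowing, and short-time Fourier transform just described can be sketched as follows; the frame length, hop size, and Hann window are illustrative choices not specified by the patent:

```python
import numpy as np

def stft(x, frame_len=400, hop=160):
    """Frame, apply a Hann window, and take the FFT of each frame:
    row k approximates X_k(w) for frame shift L = hop."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for k in range(n_frames):
        spec[k] = np.fft.rfft(x[k * hop : k * hop + frame_len] * window)
    return spec

t = np.arange(16000) / 16000.0
sig = np.sin(2 * np.pi * 440 * t)     # one second of a 440 Hz tone
X = stft(sig)
power = np.abs(X) ** 2                # short-time power spectrum P_k(w)
```

At a 16 kHz sampling rate these defaults correspond to 25 ms frames with a 10 ms shift, a common speech-processing configuration.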
The power spectrum is obtained by taking the squared modulus of the linear spectrum, and a Mel-domain filter bank is then applied to obtain the short-time Mel-domain spectrum of the signal:
P_k(ω) = |X_k(ω)|²
Mel_k(ω, i) = Fbank(i, ω_max, ω_min, ω) P_k(ω)
wherein P_k(ω) is the kth short-time power spectrum, Fbank is an i-dimensional Mel-domain filter bank constructed from the preset maximum and minimum frequencies ω_max and ω_min, and Mel_k(ω, i) represents the ith-dimension Mel-domain spectrum of the kth short period.
Discretizing the phoneme-level text, mapping all the different phonemes to positive integers starting from 1 in any order, so that the phoneme-level text corresponding to all the voices can be converted into a sequence containing a plurality of positive integers.
Extracting language independent emotion embedded codes of voices by using a pre-trained emotion judging system, and automatically obtaining high-dimensional emotion embedded representation by passing each voice through the emotion judging system.
As an improvement of the above method, the step 3) specifically includes:
for each speech, the original signal, speaker ID, extracted speech linear spectrum, mel domain spectrum, quantized phoneme text sequence, and language independent emotion embedded code are packaged into a tuple, and the data format therein is converted into a long integer or floating point high-dimensional tensor. All data are divided into a training set and a testing set according to a certain capacity proportion.
As an improvement of the above method, the step 4) specifically includes:
the training set is input to an emotion speech synthesis system, which trains the mapping relationship from speaker ID, quantized phone text sequence, and emotion embedding code to audio waveform. The speaker ID, the quantized phoneme text sequence and the emotion embedded code are input to an emotion text fusion priori coding module together, and intermediate embedded representations related to the speaker ID, the text and the emotion are obtained through a self-encoder structure; meanwhile, the information is sent to a target duration prediction coding module, and predicted duration of each phoneme is output. The intermediate embedding and phoneme duration prediction results together form a priori code, and the priori code is input into an audio decoding module through a standardized stream structure, and a final generated waveform is obtained through up-sampling network operation. In the training stage, the posterior coding module is used for extracting posterior coding of the input real voice, and the prior and posterior coding are constrained by the same Gaussian distribution, so that the prior coding is continuously optimized to approach to the real posterior coding in the training process. The judging module and the audio decoding module form an countermeasure generation relation so as to improve the generation capacity of the audio decoding module.
As an improvement of the above method, the step 5) specifically includes:
and (3) training to a converged emotion voice synthesis model by utilizing the step (4), inputting a speaker ID to be synthesized, a quantized phoneme text sequence and emotion embedded codes, and automatically reasoning the synthesized emotion voice audio result by the model through an emotion text fusion coding module, a target duration prediction module and an audio decoding module.
The technical scheme of the invention is described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 2, embodiment 1 of the present invention provides a cross-language end-to-end emotion voice synthesis method, which includes:
step 101) collecting a plurality of paired speech-phoneme level text data;
102) preprocessing data by means of digital signal processing and the like, including extracting frequency domain characteristics of voice and discretizing and encoding text; extracting language independent emotion embedded codes of the voice by utilizing a pre-trained emotion judging system;
the choice of this emotion recognition system is varied. Off-the-shelf, e.g., https:// github.com/audiong/w 2v2-how-to may be employed. The emotion judgment system can also be built by self, as shown in fig. 3, and is a composition diagram of an emotion judgment system built by the applicant of the invention; the emotion distinguishing system comprises a classifier formed by a two-way parallel neural network and a speaker decoupling network; the two-path parallel neural network comprises a transducer structure, two branches of the convolutional neural network and a linear projection layer which is used for dimension reduction and classification output on a bus; the speaker decoupling network includes a TDNN (Time Delay Neural Network ) module.
Step 103) carrying out normalization processing on the results obtained in the steps 101-102 to form a training data set for a training model for standby;
104), constructing a complete end-to-end emotion voice synthesis model, training the model by utilizing the data set (original audio, key features and emotion embedded codes) prepared in the step 103 until the model converges, finishing training, and storing the model; FIGS. 4-7 are diagrams of the modules of the constructed complete end-to-end emotion voice synthesis model, wherein FIG. 4 is a diagram of emotion text fusion encoding module; FIG. 5 is a block diagram of a posterior coding module; FIG. 6 is a diagram of a target duration prediction module; fig. 7 is a diagram showing the constitution of the audio decoding module and the discriminating module.
The training set is input into the complete end-to-end emotion speech synthesis model, which learns the mapping from speaker ID, quantized phoneme text sequence, and emotion embedded code to the audio waveform. The speaker ID, the quantized phoneme text sequence, and the emotion embedded code are input together to the emotion text fusion prior coding module, and intermediate embedded representations related to speaker ID, text, and emotion are obtained through an encoder structure; meanwhile, this information is sent to the target duration prediction module, which outputs the predicted duration of each phoneme. The intermediate embedding and the phoneme duration predictions together form the prior code, which is passed through a normalizing-flow structure into the audio decoding module, where the final generated waveform is obtained through an up-sampling network operation. In the training stage, the posterior coding module extracts the posterior code of the input real speech, and the prior and posterior codes are constrained to the same Gaussian distribution, so that the prior code is continuously optimized toward the real posterior code during training. The discrimination module forms a generative-adversarial relationship with the audio decoding module to improve the latter's generation capability.
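The generative-adversarial relationship between the discrimination module and the audio decoding module requires a pair of adversarial losses. The patent does not specify their form; a least-squares GAN objective, sketched below, is one common choice and an assumption here:

```python
import numpy as np

def disc_loss(scores_real, scores_fake):
    """Least-squares discriminator loss: real scores toward 1, fake toward 0."""
    return float(np.mean((scores_real - 1.0) ** 2) + np.mean(scores_fake ** 2))

def gen_loss(scores_fake):
    """Least-squares generator (audio decoder) loss: fake scores toward 1."""
    return float(np.mean((scores_fake - 1.0) ** 2))

d_real = np.array([0.9, 1.1])    # discriminator scores on real waveforms
d_fake = np.array([0.1, -0.1])   # discriminator scores on generated waveforms
ld = disc_loss(d_real, d_fake)   # small: discriminator separates well
lg = gen_loss(d_fake)            # large: generator not yet fooling it
```

Training alternates between the two losses, pushing the audio decoder to produce waveforms the discriminator cannot tell from real speech.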
Step 105) training to a converged emotion voice synthesis model by utilizing the step 104, inputting parameters to be synthesized and reference voices, and automatically reasoning emotion voice synthesis audio results of the target speaker by the model.
Example 2
The embodiment 2 of the invention provides a cross-language end-to-end emotion voice synthesis system. The processing procedure of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: the system comprises an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion coding module is used for acquiring, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme duration;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence and the emotion embedded code into a fully end-to-end emotion voice synthesis model which is established in advance and trained to be converged, and outputting to obtain a synthesized emotion voice audio result.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.
Claims (7)
1. A cross-language end-to-end emotion voice synthesis method comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the full end-to-end emotion voice synthesis model comprises: the system comprises an emotion text fusion encoding module, a target duration prediction module and an audio decoding module; wherein,
the emotion text fusion coding module is used for acquiring, through an encoder structure, intermediate embedded representations related to the speaker ID, the text and the emotion;
the target duration prediction module is used for outputting predicted phoneme duration;
the audio decoding module is used for obtaining a generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence and the emotion embedded code into a complete end-to-end emotion voice synthesis model trained to be converged, and outputting to obtain a synthesized emotion voice audio result.
2. The cross-language end-to-end emotion voice synthesis method of claim 1, wherein said full end-to-end emotion voice synthesis model further comprises: the posterior coding module and the discrimination module;
the posterior coding module is used for extracting posterior coding of the voice in the training stage; the posterior coding and the prior coding are constrained by the same Gaussian distribution, and the prior coding comprises an intermediate embedded representation and a phoneme duration;
the discrimination module forms a generative-adversarial relationship with the audio decoding module and is used for improving the generation capability of the audio decoding module.
3. The cross-language end-to-end emotion voice synthesis method of claim 1, wherein the extracted frequency domain features of the voice comprise: linear spectrum and Mel-domain spectrum; wherein,
the linear spectrum X_k(ω) of the kth short period is expressed as:
X_k(ω) = Σ_n x(n) h(n − kL) e^(−jωn)
wherein x(n) is the original voice signal, h(·) is the selected window function, L is the frame shift, ω represents frequency, e is the base of the natural logarithm, and j is the imaginary unit;
the ith-dimension Mel-domain spectrum Mel_k(ω, i) of the kth short period is expressed as:
Mel_k(ω, i) = Fbank(i, ω_max, ω_min, ω) P_k(ω)
wherein P_k(ω) is the short-time power spectrum of the kth short period, P_k(ω) = |X_k(ω)|²; Fbank is an i-dimensional Mel-domain filter bank constructed from a preset maximum frequency ω_max and a minimum frequency ω_min.
4. The method for synthesizing end-to-end emotion voice across languages according to claim 1, wherein said discretizing the encoding of the text data comprises: all the different phonemes in the text data are mapped to positive integers starting from 1 in any order, and the text data is converted into a sequence comprising a number of positive integers.
5. The method for synthesizing end-to-end emotion voice across languages according to claim 1, wherein said language independent emotion embedded codes are obtained by inputting each voice into a pre-trained emotion discrimination system;
the emotion distinguishing system comprises a classifier formed by a two-way parallel neural network and a speaker decoupling network;
the two-path parallel neural network comprises two branches, a Transformer structure and a convolutional neural network, together with a linear projection layer on the trunk for dimension reduction and classification output;
the speaker decoupling network comprises a time-delay neural network structure.
6. The method for end-to-end emotion voice synthesis across language of claim 1, wherein said sorting forms a training data set comprising:
for each voice, packing an original signal, a speaker ID, an extracted voice linear frequency spectrum, a Mel domain spectrum, a quantized phoneme text sequence and a language independent emotion embedding code into a tuple, and converting the original signal, the frequency spectrum, the Mel domain spectrum and the emotion embedding code into a floating point type high-dimensional tensor, and converting the speaker ID and the quantized phoneme text sequence into a long-integer tensor; all data are divided into a training set and a testing set according to a certain capacity proportion.
7. A cross-language end-to-end emotion voice synthesis system is characterized in that the processing procedure of the system comprises the following steps:
collecting paired speech-phoneme level text data of a large number of different speakers;
extracting frequency domain features and emotion embedded codes of voice, performing discretization coding on speaker ID and text data, and arranging to form a training data set;
constructing a complete end-to-end emotion voice synthesis model, and training the model by utilizing a training data set until the model converges;
the complete end-to-end emotion voice synthesis model comprises: an emotion-text fusion encoding module, a target duration prediction module, and an audio decoding module; wherein,
the emotion-text fusion encoding module is used for acquiring an intermediate embedded representation related to the speaker ID, the text, and the emotion through an encoder structure;
the target duration prediction module is used for outputting predicted phoneme durations;
the audio decoding module is used for obtaining the generated waveform through an up-sampling network operation according to the intermediate embedded representation and the phoneme durations;
inputting the ID of the speaker to be synthesized, the text sequence, and the emotion embedding code into the complete end-to-end emotion voice synthesis model built in advance and trained to convergence, and outputting a synthesized emotional speech audio result.
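The hand-off between the duration prediction module and the audio decoding module described above is commonly realized by repeating each phoneme's intermediate embedding for its predicted number of frames, then up-sampling frame-rate features toward the sample rate. The sketch below assumes that structure; `length_regulate`, `upsample`, and all dimensions are illustrative, and nearest-neighbour repetition stands in for the model's learned up-sampling network.

```python
import numpy as np

def length_regulate(embeddings, durations):
    """Repeat each phoneme's intermediate embedding by its predicted frame count."""
    return np.repeat(embeddings, durations, axis=0)

def upsample(frames, factor):
    """Toy stand-in for the audio decoder's up-sampling network:
    nearest-neighbour expansion of frame-rate features toward sample rate."""
    return np.repeat(frames, factor, axis=0)

emb = np.arange(12, dtype=np.float32).reshape(4, 3)   # 4 phonemes, embedding dim 3
dur = np.array([2, 1, 3, 2])                          # predicted durations (frames)
frames = length_regulate(emb, dur)                    # (8, 3): one row per frame
wave_feats = upsample(frames, 256)                    # (2048, 3): 256 samples/frame
print(frames.shape, wave_feats.shape)  # (8, 3) (2048, 3)
```

In a trained model, the final projection of `wave_feats` to a 1-D waveform would be learned jointly with the rest of the network, which is what makes the pipeline end-to-end.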
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311545240.5A CN117789771A (en) | 2023-11-20 | 2023-11-20 | Cross-language end-to-end emotion voice synthesis method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117789771A true CN117789771A (en) | 2024-03-29 |
Family
ID=90398992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311545240.5A Pending CN117789771A (en) | 2023-11-20 | 2023-11-20 | Cross-language end-to-end emotion voice synthesis method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117789771A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118135990A (en) * | 2024-05-06 | 2024-06-04 | 厦门立马耀网络科技有限公司 | End-to-end text-to-speech synthesis method and system combining autoregression |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112017644B (en) | Sound transformation system, method and application | |
Morgan et al. | Neural networks and speech processing | |
CN102779508B (en) | Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof | |
CN106971709A (en) | Statistic parameter model method for building up and device, phoneme synthesizing method and device | |
CN116364055B (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
CN105654939A (en) | Voice synthesis method based on voice vector textual characteristics | |
CN112581963B (en) | Voice intention recognition method and system | |
Ghule et al. | Feature extraction techniques for speech recognition: A review | |
CN110010136A (en) | The training and text analyzing method, apparatus, medium and equipment of prosody prediction model | |
KR20230133362A (en) | Generate diverse and natural text-to-speech conversion samples | |
CN111508469A (en) | Text-to-speech conversion method and device | |
KR20200084443A (en) | System and method for voice conversion | |
CN117789771A (en) | Cross-language end-to-end emotion voice synthesis method and system | |
KR20190135853A (en) | Method and system of text to multiple speech | |
KR20200088263A (en) | Method and system of text to multiple speech | |
Wu et al. | Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations | |
CN116092473A (en) | Prosody annotation model, training method of prosody prediction model and related equipment | |
Zhao et al. | Research on voice cloning with a few samples | |
CN116798403A (en) | Speech synthesis model method capable of synthesizing multi-emotion audio | |
CN112242134A (en) | Speech synthesis method and device | |
Sharma et al. | Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art | |
Chandra et al. | Towards the development of accent conversion model for (l1) bengali speaker using cycle consistent adversarial network (cyclegan) | |
CN113436607A (en) | Fast voice cloning method | |
CN113763924B (en) | Acoustic deep learning model training method, and voice generation method and device | |
CN118197277B (en) | Speech synthesis method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||