CN115762464A - Method for training speech synthesis model, electronic device and storage medium - Google Patents

Method for training speech synthesis model, electronic device and storage medium

Info

Publication number
CN115762464A
CN115762464A (application CN202211386817.8A)
Authority
CN
China
Prior art keywords
emotion
training
model
classifier
speech synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211386817.8A
Other languages
Chinese (zh)
Inventor
俞凯 (Kai Yu)
陈谐 (Xie Chen)
郭奕玮 (Yiwei Guo)
杜晨鹏 (Chenpeng Du)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202211386817.8A priority Critical patent/CN115762464A/en
Publication of CN115762464A publication Critical patent/CN115762464A/en
Pending legal-status Critical Current


Abstract

The invention discloses a method for training a speech synthesis model, an electronic device and a storage medium. The method for training the speech synthesis model comprises the following steps: training an acoustic model without emotion input, wherein a diffusion process is used in training the acoustic model, and the training target of the acoustic model is to estimate the logarithmic gradient of the data distribution at any intermediate time in the diffusion process; training an emotion classifier, wherein the input of the emotion classifier at least comprises the logarithmic gradient corresponding to a certain intermediate time in the diffusion process; and performing emotion-controllable speech synthesis sampling by using a soft-label guidance technique, wherein the diffusion process corresponds to a reverse denoising process of the speech synthesis sampling, the gradient term of the reverse denoising process is the target estimated by the speech synthesis model, and the soft-label guidance term is mathematically equivalent to a cross entropy whose one side is the output of the emotion classifier and whose other side is the distribution corresponding to the target emotion intensity.

Description

Method for training speech synthesis model, electronic device and storage medium
Technical Field
The invention belongs to the technical field of speech synthesis model training, and in particular relates to a method for training a speech synthesis model, an electronic device and a storage medium.
Background
Related technologies fall mainly into two categories. The first category uses the Relative Attribute Ranking (RAR) technique, a way of modeling relative information: an optimal ranking matrix is sought with a Support Vector Machine (SVM) through an artificially constructed optimization problem, so that relative emotion intensity values are obtained in advance for training. After training is complete, the synthesized emotion can be controlled using such intensity values. The second category operates on the emotion embedding space, for example by interpolation.
In the process of implementing the present application, the inventors found that the RAR-based methods need to compute the relative emotion intensity values of all training data in advance, and the quality of the solution of the optimization problem at this stage directly affects the subsequent training. The second category likewise hinges on the extraction of emotion embedding representations, and usually requires carefully designed additional constraints on this space to improve the effect. In addition, some methods suffer from poor synthesis quality, most likely caused by this preliminary stage.
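For readers unfamiliar with RAR, the following minimal sketch illustrates the general idea with a simplified pairwise ranking SVM; it is a stand-in under stated assumptions (random emotional/neutral pairing, LinearSVC on difference vectors, min-max normalization), not the exact max-margin formulation used in the prior work.

```python
import numpy as np
from sklearn.svm import LinearSVC

def relative_intensity_scores(feats_emotion, feats_neutral, C=1.0, n_pairs=2000):
    """Simplified RankSVM-style stand-in for Relative Attribute Ranking:
    learn w so that w @ x ranks emotional utterances above neutral ones,
    then reuse the normalised projection as a relative intensity value."""
    rng = np.random.default_rng(0)
    idx_e = rng.integers(0, len(feats_emotion), size=n_pairs)
    idx_n = rng.integers(0, len(feats_neutral), size=n_pairs)
    diffs = feats_emotion[idx_e] - feats_neutral[idx_n]      # ordered pairs
    X = np.concatenate([diffs, -diffs])
    y = np.concatenate([np.ones(len(diffs)), -np.ones(len(diffs))])
    w = LinearSVC(C=C, fit_intercept=False).fit(X, y).coef_[0]
    scores = feats_emotion @ w                               # relative "intensity"
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
```

Such pre-computed scores are what the first category of methods feeds into TTS training, and the quality of this stage is exactly the dependency criticized above.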
Disclosure of Invention
An embodiment of the present invention provides a method for training a speech synthesis model, an electronic device, and a storage medium, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for training a speech synthesis model, including: training an acoustic model without emotion input, wherein a diffusion process is used in training the acoustic model, and the training target of the acoustic model is to estimate the logarithmic gradient of the data distribution for any intermediate time in the diffusion process; training an emotion classifier, wherein the input of the emotion classifier at least comprises the logarithmic gradient corresponding to a certain intermediate time in the diffusion process; and performing emotion-controllable speech synthesis sampling by using a soft-label guidance technique, wherein the diffusion process corresponds to a reverse denoising process of the speech synthesis sampling, the gradient term of the reverse denoising process is the target estimated by the speech synthesis model, and the soft-label guidance term is mathematically equivalent to a cross entropy whose one side is the output of the emotion classifier and whose other side is the distribution corresponding to the target emotion intensity.
In a second aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a speech synthesis model of any of the embodiments of the present invention.
In a third aspect, the present invention further provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium; the computer program includes program instructions that, when executed by a computer, cause the computer to perform the steps of the method for training a speech synthesis model according to any embodiment of the present invention.
In the method of the embodiment of the present application, the emotion intensity can be controlled by the soft-label guidance technique proposed by the inventors, which is obtained by extending the classifier guidance technique, so that emotion-controllable and high-quality speech synthesis can be obtained.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a flowchart of a method for training a speech synthesis model according to an embodiment of the present invention;
FIG. 2 is a training and sampling diagram of EmoDiff according to an embodiment of the present invention;
FIG. 3 is a graph illustrating the evaluation of the mean opinion score (MOS) and mel-cepstral distortion (MCD) according to an embodiment of the present invention;
FIG. 4 is a classification probability chart for the control strength α ∈ {0.0,0.2,0.4,0.6,0.8,1.0}, according to an embodiment of the present invention;
FIG. 5 is a diversity preference test for each emotion provided by an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Please refer to fig. 1, which shows a flowchart of an embodiment of a training method of a speech synthesis model of the present application.
As shown in fig. 1, in step 101, training an acoustic model without emotion input, wherein a diffusion process is used in the training process of the acoustic model, and the training goal of the acoustic model is to estimate the logarithmic gradient of data distribution for any intermediate time in the diffusion process;
in step 102, training an emotion classifier, wherein the input of the emotion classifier at least comprises a logarithmic gradient corresponding to a certain intermediate time in the diffusion process;
in step 103, performing emotion-controllable speech synthesis sampling by using the soft-label guidance technique, wherein the diffusion process corresponds to a reverse denoising process of the speech synthesis sampling, the gradient term of the reverse denoising process is the target estimated by the speech synthesis model, and the soft-label guidance term is mathematically equivalent to a cross entropy whose one side is the output of the emotion classifier and whose other side is the distribution corresponding to the target emotion intensity.
In the method of the embodiment of the present application, the emotion intensity can be controlled by the soft-label guidance technique proposed by the inventors, which is obtained by extending the classifier guidance technique, so that emotion-controllable and high-quality speech synthesis can be obtained.
In some optional embodiments, the emotion classifier freezes the acoustic model parameters in the training process, and only updates the weight of the emotion classifier.
In some optional embodiments, the emotion classifier is trained using a standard cross-entropy loss L_CE.
In some optional embodiments, the acoustic model learns through training how to generate a realistic mel-spectrogram from a given text and a given duration sequence, and the input to the emotion classifier further includes a representation associated with the given text and a corresponding time.
In some optional embodiments, the diffusion process employs a de-noising diffusion model that uses an exponential moving average (EMA) of the model weights. This can improve the performance of the de-noising diffusion model.
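As an illustration of this optional embodiment, the following sketch shows one common way of maintaining an exponential moving average of the model weights during training; the decay value 0.9999 and the class interface are assumptions of this sketch, not values fixed by the present disclosure.

```python
import copy
import torch

class EMA:
    """Keeps an exponential moving average of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()   # frozen copy holds the averaged weights
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow = decay * shadow + (1 - decay) * current
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Typical usage: call ema.update(score_model) after every optimizer step,
# and use ema.shadow (the averaged weights) at sampling time.
```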
It should be noted that the above method steps are not intended to limit the execution order of the steps, and in fact, some steps may be executed simultaneously or in the reverse order of the steps, which is not limited herein.
The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.
The inventors have found that the above-mentioned drawbacks are mainly caused by the following reasons: whether based on RAR or on the emotion embedding space, these methods require a rather complex optimization-problem-solving stage, are likely to produce suboptimal results, and make subsequent model optimization difficult.
In view of the above drawbacks, those skilled in the art would typically resort to methods such as pre-training, adding data for solving the optimization problem, or replacing the solver with a better one. Because the scheme of the embodiment of the present application is based on a denoising diffusion model, which is a relatively novel generative model with deep mathematical principles and still few applications to speech synthesis, the present approach is not easy to arrive at.
The method provided by the embodiment of the present application is based on the classifier guidance technique for denoising diffusion models and extends it into a method named soft-label guidance to directly control the emotion intensity. Specifically, an acoustic model that does not rely on emotion labels is trained first, and then a classifier is trained to classify emotions from the intermediate variables of the diffusion process. At the synthesis stage, the gradient of the classifier can then be used directly as guidance to synthesize a specific emotion.
The training of this method is very straightforward: only an additional emotion classifier needs to be trained, a task that has been widely studied in speech emotion recognition. Since the method is based on neural network optimization, the training quality can be guaranteed. The soft-label guidance technique extended by the inventors allows the emotion intensity to be controlled directly through the classifier.
Please refer to FIG. 2, which illustrates a specific example of a method for training a speech synthesis model according to an embodiment of the present invention. Specifically, FIG. 2 shows the training and sampling diagram of EmoDiff according to an embodiment of the present application. In training, $x_t$ is sampled directly from the known distribution $p(x_t\mid x_0)$; when sampling, the score function $\nabla_{x_t}\log p_t(x_t\mid y)$ is estimated by the score estimator, and "SG" denotes the stop-gradient operation. The labels in the figure are: Mel-spec (mel-spectrogram), Score Estimator, Forward SDE, Phoneme-dependent Gaussian Noise, Duration Predictor, Duration Adaptor, Phoneme Encoder, Phoneme Sequence, Input Encoder, Input Duration, Input Text, Unconditional Acoustic Model Training; Emotion Classifier, Input Emotion (target emotion), Classifier Training; Reverse-time SDE, SDE Numerical Solver, Unconditional Acoustic Model Sampling, Soft-Label Guidance, Neutral, Target Emotion d with Intensity α (e.g., 40% of the target emotion and 60% Neutral), and Intensity-Controllable Emotional Sampling.
As shown in FIG. 2 (a), the inventors trained an acoustic model without emotional input. Through training, the model learns how to generate a realistic mel-spectrogram from a given text and a given duration sequence. The model uses a diffusion process, and the training goal is to estimate the score, i.e. the logarithmic gradient $\nabla_{x_t}\log p_t(x_t)$ of the data distribution, for any intermediate time $t$ of the diffusion process.
Fig. 2 (b): the inventors trained an emotion classifier. The input to this classifier is a time in the diffusion process described aboveIntermediate variable x of moment t t And a text-related representation μ and time t. The training index of the classifier is a common cross entropy classification criterion.
Fig. 2 (c): the inventor uses the soft label guiding technology to carry out speech synthesis sampling with controllable emotion. The diffusion process described in the above diagram (a) corresponds to an inverse denoising process, and the gradient term is the target of the acoustic model estimation in the embodiment of the present application. The soft label guideline of the embodiment of the application is equivalent to a negative cross entropy, one side of which is the output of the inventor classifier, and the other side is the distribution corresponding to the target emotion intensity. The sentiment intensity values of the embodiments of the present application are now defined to be values within 0 to 1. Assuming that the inventor needs an angry emotion of intensity 0.4, this time is equivalent to one class distribution over all emotion tags, a probability of 0.4 over anger and a probability of 0.6 over neutral (neutral). The inventor needs to simulate the denoising process by a numerical method, and the inventor uses a soft label to guide in each iteration process of numerical simulation. Finally, the inventor can generate a sample which meets the emotional strength requirement and is lifelike.
In solving the above technical problem, the inventors also considered including the duration of each phoneme of the speech as part of the diffusion and denoising models, so that durations could be controlled directly by the emotion. However, how to construct a good denoising diffusion model of this kind is a relatively difficult problem with a high implementation cost, so it is not discussed further.
The inventors also tried to improve the generation stage of the denoising diffusion model, for example with DPS (diffusion posterior sampling) or an RK45 ODE solver. These methods theoretically benefit the generation quality of diffusion models, and neither requires additional modification of the training process. DPS first uses the trained model to recover x_0 from the noisy data x_t, then performs one step of re-noising, and uses the re-noised result directly to compute the gradient, which approximates the original denoised result. RK45 (the Runge-Kutta-Fehlberg method, an adaptive scheme with 4th/5th-order error estimation) is a numerical solver for ordinary differential equations (ODEs); since the denoising process is essentially a stochastic differential equation (SDE) that has an equivalent ODE version, ODE solvers such as RK45 can be used to obtain higher-quality samples. However, these methods were found to reduce the diversity of the results and were not adopted in practice.
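For reference only, the sketch below shows how the probability-flow ODE that is equivalent to the denoising SDE could be handed to an off-the-shelf RK45 solver; the drift uses the standard "half score" form of the probability-flow ODE for the mean-reverting SDE described later, the noise schedule and the `score_fn` interface are assumptions, and, as stated above, this route was ultimately not adopted because it reduced sample diversity.

```python
import numpy as np
from scipy.integrate import solve_ivp

BETA0, BETA1 = 0.05, 20.0                                    # assumed linear noise schedule

def sample_probability_flow(score_fn, mu, seed=0):
    """Solve dx/dt = 0.5 * beta_t * ((mu - x) - score(x, t)) backwards from t=1 to t~0
    with RK45. `score_fn(x, t)` is assumed to return the estimated score of p_t."""
    rng = np.random.default_rng(seed)
    x1 = mu + rng.standard_normal(mu.shape)                  # x_1 ~ N(mu, I)

    def ode_rhs(t, x_flat):
        x = x_flat.reshape(mu.shape)
        beta_t = BETA0 + (BETA1 - BETA0) * t
        drift = 0.5 * beta_t * ((mu - x) - score_fn(x, t))
        return drift.reshape(-1)

    sol = solve_ivp(ode_rhs, t_span=(1.0, 1e-3), y0=x1.reshape(-1),
                    method="RK45", rtol=1e-4, atol=1e-4)
    return sol.y[:, -1].reshape(mu.shape)
```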
Besides enabling emotion-controllable and high-quality speech synthesis, the method of the embodiment of the present application has the following further advantages. It exhibits strong diversity even within the same emotion, an advantage brought directly by the use of the denoising diffusion model. Moreover, the soft-label guidance method of the embodiment of the present application can easily be generalized to control arbitrary combinations of emotions, not merely the emotion intensity; in other words, the method makes controlling any combination of multiple emotions feasible.
The following experiments and experimental data verify the beneficial effects of the embodiments of the present application with respect to the prior art.
Although current speech synthesis models can produce high-quality speech, speech synthesis with controllable emotion intensity remains a challenging task. Most current work requires an additional optimization process to compute emotion intensity values, which can lead to suboptimal results or degraded synthesis quality. In the embodiment of the present application, the inventors propose EmoDiff, an emotional speech synthesis model based on a diffusion model. In EmoDiff, the emotion intensity can be manipulated by the proposed soft-label guidance technique, which is extended from the classifier guidance technique. Specifically, EmoDiff uses a soft label, rather than a one-hot vector, to control the emotion intensity. In this soft label, the emotion to be controlled and the neutral emotion are given weights α and 1−α, respectively, where α is the emotion intensity value and can be chosen from 0 to 1. Experiments in the embodiments of the present application show that EmoDiff can accurately control the emotion intensity while maintaining high synthesis quality. Furthermore, EmoDiff can still achieve good diversity for a given emotion.
1. Introduction
Although current TTS (text-to-speech) models such as Grad-TTS, VITS and VQTTS can produce high-quality speech, emotional TTS with controllable intensity remains a challenging task. A task similar to emotional TTS is prosody modeling in TTS, which generally has no data with specific labels; in contrast, emotional TTS typically utilizes a dataset with categorical emotion labels. Mainstream emotional TTS models can only synthesize emotional speech according to an emotion label and have no intensity controllability.
In intensity-controllable TTS models, much of the work focuses on correctly defining and computing the emotion intensity values for the training process. The most popular method for defining and obtaining emotion intensity is Relative Attribute Ranking (RAR). RAR seeks a ranking matrix through a max-margin optimization problem, which is solved with a Support Vector Machine (SVM); the resulting solution is then fed into the model for training. Since this is a separate and artificially constructed stage, it may lead to suboptimal results that introduce bias into the training. In addition to RAR, some work explores operations on the emotion embedding space. One related technique designs an algorithm to maximize the distance between emotion embeddings and interpolates the embedding space to control the emotion intensity; another quantifies the distance between emotion embeddings to obtain the emotion intensity. However, the structure of the embedding space also greatly affects the performance of these models, so additional constraints need to be carefully designed. Intensity control has also been studied in emotional voice conversion with similar methods. Some of the mentioned works also exhibit degraded speech quality. As an example, "MixedEmotion" is an autoregressive model that weights emotion embeddings with intensity values from RAR; it uses pre-training to improve synthesis quality, but still shows significant quality degradation.
To overcome these problems, a conditional sampling method is needed that can directly control the intensity weighting of emotions. In this work, the inventors propose a soft-label guidance technique based on classifier guidance, using an emotion classifier of the kind studied in speech emotion recognition tasks. Classifier guidance is an efficient sampling technique that uses the gradient of a classifier to guide the sampling trajectory toward a given one-hot class label.
In this application, based on the extended soft-label guidance, the inventors propose EmoDiff, an emotional TTS model with sufficient intensity controllability. Specifically, the inventors first train an acoustic model without emotion input. An emotion classifier is then trained on any $x_t$ along the trajectory of the diffusion process, where t is the timestamp (any intermediate time) of the diffusion process. At inference, the inventors guide the reverse denoising process with the classifier and a soft emotion label, in which the values for the specified emotion and the neutral emotion are set to α and 1−α, respectively, rather than a one-hot distribution in which only the specified emotion is 1 and the others are 0. Experiments in the embodiments of the present application show that EmoDiff can accurately control the emotion intensity while maintaining high speech quality. Furthermore, it can generate diverse speech samples even for the same emotion, an advantage of using a diffusion model.
In short, the main advantages of EmoDiff are:
1. When using soft labels, the inventors define the emotion intensity as the weight in classifier guidance. This enables accurate intensity control, in terms of classifier probability, without additional optimization, and thus allows speech with an arbitrarily specified emotion intensity to be generated efficiently.
2. It does not harm the synthesized speech: the generated samples have good quality and naturalness.
3. It can generate diverse samples even under the same emotion.
2. Diffusion model with classifier guidance
2.1. De-noising diffusion model and TTS application
Denoising diffusion probabilistic models have proven successful in many generation tasks. In the score-based interpretation, the diffusion model constructs a forward stochastic differential equation (SDE) that transforms the data distribution $p_0(x_0)$ into a known distribution $p_T(x_T)$, and generates realistic samples starting from noise using the corresponding reverse-time SDE. This reverse process is therefore also referred to as the "denoising" process. A neural network then has to estimate the score function $\nabla_{x_t}\log p_t(x_t)$ for any $t\in[0,T]$, with score matching as the training objective. In application, diffusion models bypass the training instability and mode collapse problems of GANs and surpass previous methods in sample quality and diversity.
The denoising diffusion model has also been used in TTS and vocoding tasks with remarkable results. In the present application, the inventors build EmoDiff on the design of GradTTS. Denote by $x\in\mathbb{R}^d$ one frame of the mel-spectrogram, and construct the forward SDE

$$\mathrm{d}x_t = \tfrac{1}{2}\Sigma^{-1}(\mu - x_t)\,\beta_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}B_t,\qquad t\in[0,1],\tag{1}$$

where $B_t$ is a standard Brownian motion, $t\in[0,1]$ is the SDE time index, and $\beta_t$ is called the noise schedule, which increases with $t$; when $e^{-\int_0^1\beta_s\,\mathrm{d}s}\approx 0$, the terminal distribution satisfies $p_1(x_1)\approx\mathcal N(x;\mu,\Sigma)$. This SDE also yields the conditional distribution $x_t\mid x_0\sim\mathcal N(\rho(x_0,\Sigma,\mu,t),\lambda(\Sigma,t))$, where both $\rho(\cdot)$ and $\lambda(\cdot)$ have known closed forms, so $x_t$ can be sampled directly from $x_0$. In practice, the inventors set $\Sigma$ to the identity matrix, so $\lambda(\Sigma,t)$ becomes $\lambda_t I$, where $\lambda_t$ is a scalar with a known closed form. Meanwhile, the inventors make the terminal distribution $p_1(x_1)$ text-conditional by setting $\mu=\mu_\theta(y)$, where $y$ is the aligned phoneme representation of that frame. The SDE of Equation (1) has a corresponding reverse-time SDE

$$\mathrm{d}x_t = \Big[\tfrac{1}{2}\Sigma^{-1}(\mu - x_t) - \nabla_{x_t}\log p_t(x_t)\Big]\beta_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\bar B_t,\tag{2}$$

where $\nabla_{x_t}\log p_t(x_t)$ is the score function to be estimated and $\bar B_t$ is a reverse-time Brownian motion. It shares the distribution trajectory $p_t(x_t)$ with the forward SDE in Equation (1). Thus, solving from $x_1\sim\mathcal N(\mu,\Sigma)$, a realistic sample $x_0\sim p(x_0\mid y)$ can be obtained. A neural network $s_\theta(x_t,y,t)$ is trained to estimate the score function with the following score matching objective:

$$L_{\mathrm{diff}}=\mathbb{E}_{x_0,\,t}\Big[\lambda_t\,\mathbb{E}_{x_t\mid x_0}\big\|s_\theta(x_t,y,t)-\nabla_{x_t}\log p(x_t\mid x_0)\big\|_2^2\Big].\tag{3}$$
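Under stated assumptions (Σ = I, a linear noise schedule, and a Grad-TTS-style closed form for ρ and λ_t), one training step of the objective in Equation (3) could look like the following sketch; `score_model(x_t, mu, t)` is an assumed network interface, not the exact implementation of this disclosure.

```python
import torch

BETA0, BETA1 = 0.05, 20.0                       # assumed linear noise schedule

def noise_integral(t):
    """I_t = integral of beta_s over [0, t] for the linear schedule."""
    return BETA0 * t + 0.5 * (BETA1 - BETA0) * t ** 2

def diffusion_loss(score_model, x0, mu, t):
    """Denoising score-matching loss of Eq. (3) with Sigma = I.

    x_t | x_0 ~ N(rho, lambda_t * I), where
      rho      = exp(-0.5*I_t) * x0 + (1 - exp(-0.5*I_t)) * mu
      lambda_t = 1 - exp(-I_t)
    """
    i_t = noise_integral(t).view(-1, 1, 1)                    # x0, mu: (batch, n_mel, frames)
    rho = torch.exp(-0.5 * i_t) * x0 + (1.0 - torch.exp(-0.5 * i_t)) * mu
    lam = 1.0 - torch.exp(-i_t)
    eps = torch.randn_like(x0)
    x_t = rho + lam.sqrt() * eps
    target_score = -eps / lam.sqrt()            # score of N(rho, lambda_t I) at x_t
    pred_score = score_model(x_t, mu, t)
    return (lam * (pred_score - target_score) ** 2).mean()

# One illustrative step:
#   t = torch.rand(batch_size)
#   loss = diffusion_loss(score_model, x0, mu, t)
#   loss.backward(); optimizer.step()
```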
2.2. classifier guidance based conditional sampling
The denoising diffusion model provides a new approach to modeling the conditional probability $p(x\mid c)$, where $c$ is a class label. Suppose an unconditional generative model $p(x)$ and a classifier $p(c\mid x)$ are available. By Bayes' rule,

$$\nabla_x \log p(x\mid c) = \nabla_x \log p(x) + \nabla_x \log p(c\mid x).\tag{4}$$

In the diffusion framework, sampling from the conditional distribution $p(x\mid c)$ requires estimating the score function $\nabla_x\log p(x\mid c)$. According to Equation (4), this only requires adding the gradient of the classifier to the unconditional model. This conditional sampling method is named classifier guidance, and it has also been used in unsupervised TTS.

In practice, the classifier gradient is often scaled to control the strength of the guidance, i.e., $\nabla_x\log p(x) + \gamma\,\nabla_x\log p(c\mid x)$ is used instead of the right-hand side of Equation (4), where $\gamma\ge 0$ is called the guidance level. A larger $\gamma$ yields samples that are more strongly class-correlated, while a smaller $\gamma$ encourages sample variation.

Unlike a generic classifier, the classifier used here takes as input any $x_t$ along the trajectory of the SDE in Equation (1), rather than only the clean $x_0$, and the time index $t$ can be any value in $[0,1]$. The classifier can therefore be written as $p(c\mid x_t, t)$.

While such scaled classifier guidance can effectively steer sampling toward a class label $c$, it cannot be applied directly to soft labels, i.e., labels weighted by intensity, because the guidance term $p(c\mid x)$ is then no longer well defined. The inventors therefore extend this technique for emotion-intensity control in the following description.
3.EmoDiff
3.1. Unconditional acoustic model and classifier training
The training of EmoDiff mainly includes training an unconditional acoustic model and an emotion classifier. The inventors first trained a diffusion-based acoustic model on the emotional data, but without providing it any emotion condition. This is called "unconditional acoustic model training", as in FIG. 2 (a). The model is based on GradTTS, except that the inventors provide an explicit duration sequence via a forced aligner to facilitate duration modeling. At this stage, the training target is $L_{dur}+L_{diff}$, where $L_{dur}$ is an L2 loss on log-durations and $L_{diff}$ is the diffusion loss, as in Equation (3). In practice, the inventors also used the prior loss $L_{prior}=-\log\mathcal N(x_0;\mu,I)$, as in GradTTS, to encourage convergence. To simplify notation, $L_{diff}$ in FIG. 2 (a) denotes both the diffusion and prior losses.
After training, given the input phoneme sequence $y$, the acoustic model can estimate the score function $\nabla_{x_t}\log p_t(x_t\mid y)$ of the noisy mel-spectrogram $x_t$, without any constraint on the emotion label. Following Section 2.2, the inventors then need an emotion classifier that distinguishes the emotion class $e$ from the noisy mel-spectrogram $x_t$. Since the text condition $y$ is always available, the classifier is expressed as $p(e\mid x_t,y,t)$. As shown in FIG. 2 (b), the input to the classifier consists of three parts: the SDE timestamp $t$, the noisy mel-spectrogram $x_t$, and the speech-related Gaussian mean $\mu$. The classifier is trained with the standard cross-entropy loss $L_{CE}$. Note that the inventors freeze the acoustic model parameters at this stage and update only the weights of the emotion classifier; otherwise the trajectory of the diffusion model described above would change and the derivation would no longer hold.
Since the text $y$ is always present as a condition in this description, it is omitted in later sections and the classifier is written as $p(e\mid x)$ to simplify notation where no ambiguity arises.
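A minimal sketch of this classifier-training stage is given below; the noise schedule, the closed-form sampling of x_t from p(x_t | x_0), and the module and argument names are assumptions consistent with the description above, not the exact implementation.

```python
import torch
import torch.nn.functional as F

BETA0, BETA1 = 0.05, 20.0                                    # assumed noise schedule

def classifier_step(emotion_clf, optimizer, x0, mu, emotion_ids):
    """One training step of the emotion classifier p(e | x_t, y, t).

    The acoustic model is frozen: x_t is drawn from the closed-form
    p(x_t | x_0) of the forward SDE, so no gradient reaches the acoustic model.
    """
    t = torch.rand(x0.shape[0], device=x0.device)            # t ~ U(0, 1)
    i_t = (BETA0 * t + 0.5 * (BETA1 - BETA0) * t ** 2).view(-1, 1, 1)
    rho = torch.exp(-0.5 * i_t) * x0 + (1.0 - torch.exp(-0.5 * i_t)) * mu
    lam = 1.0 - torch.exp(-i_t)
    x_t = rho + lam.sqrt() * torch.randn_like(x0)            # x_t ~ p(x_t | x_0)

    logits = emotion_clf(x_t, mu.detach(), t)                # "SG": stop-gradient on mu
    loss = F.cross_entropy(logits, emotion_ids)              # standard L_CE
    optimizer.zero_grad()
    loss.backward()                                          # only classifier weights update
    optimizer.step()
    return loss.item()
```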
3.2. Intensity controlled sampling with soft label guidance
In an embodiment of the present application, the inventors extend classifier guidance to soft-label guidance, which can control the intensity weighting of emotions. Assume the number of basic emotions is $m$, and each basic emotion $e_i$ has a one-hot vector form $e_i\in\mathbb R^m$, $i\in\{0,1,\dots,m-1\}$, in which only the $i$-th dimension is 1; in particular, $e_0$ denotes neutral. For emotion $e_i$ weighted with intensity $\alpha$, the inventors define $d=\alpha e_i+(1-\alpha)e_0$. The log-probability gradient of the classifier $p(d\mid x)$ with respect to $x$ can then be defined as:

$$\nabla_x \log p(d\mid x) \triangleq \alpha\,\nabla_x \log p(e_i\mid x) + (1-\alpha)\,\nabla_x \log p(e_0\mid x).\tag{5}$$

An intuitive understanding of this definition is that the intensity $\alpha$ represents the contribution of emotion $e_i$ to the sampling trajectory of $x$: a larger $\alpha$ means that $x$ is sampled along a trajectory in which emotion $e_i$ has greater "power", and otherwise $e_0$ dominates. Therefore, Equation (4) can be extended to:

$$\nabla_x \log p(x\mid d) = \nabla_x \log p(x) + \nabla_x \log p(d\mid x).\tag{6}$$

When the intensity $\alpha$ is 1.0 (100% emotion $e_i$) or 0.0 (100% neutral), the above reduces to Equation (4), the standard classifier-guidance form. Thus, the soft-label guidance of Equation (5) can be used during sampling to generate realistic samples with the specified emotion $d=\alpha e_i+(1-\alpha)e_0$ of intensity $\alpha$.
FIG. 2: training and sampling graphs for EmoDiff. In training, x t Is directly from the known distribution p (x) t |x 0 ) Is sampled in the middle. Score function when sampling with certain emotional intensity
Figure BDA0003930248490000117
Estimated by a score estimator. "SG" indicates stopping the gradient operation.
FIG. 2 (c) illustrates the intensity-controllable sampling process. After feeding the text into the acoustic model and obtaining the speech-related $\mu$ sequence, the inventors sample $x_1\sim\mathcal N(\mu,I)$ and simulate the reverse-time SDE from $t=1$ to $t=0$ with a numerical solver. In each update step of the solver, the current $x_t$ is fed into the classifier to obtain the output probabilities $p_t(\cdot\mid x_t)$. The guidance term is then calculated as in Equation (6), and, similarly to Section 2.2, it is weighted by the guidance level $\gamma$. Finally, the inventors obtain a sample $x_0$ that not only matches the input text but also carries the target emotion $d$ with intensity $\alpha$. This yields an accurate intensity that correlates well with the classifier probability.
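Putting the pieces together, a minimal Euler–Maruyama sketch of this intensity-controllable sampling procedure is given below. The score network, classifier, linear noise schedule, and guidance level γ = 100 follow the descriptions in this application, but all interfaces and step sizes are assumptions of the sketch rather than the exact implementation.

```python
import torch

BETA0, BETA1 = 0.05, 20.0                        # assumed linear noise schedule

def soft_label_grad(clf, x, mu, t, emo_idx, alpha, neutral_idx=0):
    """Eq. (5): alpha * grad log p(e_i|x) + (1 - alpha) * grad log p(e_0|x)."""
    x = x.detach().requires_grad_(True)
    logp = torch.log_softmax(clf(x, mu, t), dim=-1)
    obj = alpha * logp[:, emo_idx] + (1.0 - alpha) * logp[:, neutral_idx]
    return torch.autograd.grad(obj.sum(), x)[0]

def sample_with_intensity(score_fn, clf, mu, emo_idx, alpha,
                          n_steps=100, gamma=100.0):
    """Simulate the reverse-time SDE from t = 1 to t = 0 with soft-label guidance."""
    x = mu + torch.randn_like(mu)                # x_1 ~ N(mu, I)
    h = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = torch.full((x.shape[0],), i * h, device=x.device)
        beta_t = (BETA0 + (BETA1 - BETA0) * t).view(-1, 1, 1)
        with torch.no_grad():
            score = score_fn(x, mu, t)           # unconditional score estimate
        score = score + gamma * soft_label_grad(clf, x, mu, t, emo_idx, alpha)
        drift = 0.5 * (mu - x) - score           # Eq. (2) with Sigma = I
        x = x - beta_t * h * drift + torch.sqrt(beta_t * h) * torch.randn_like(x)
    return x                                     # approx. x_0 ~ p(x_0 | y, d)
```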
More broadly, the soft-label guidance of the embodiments of the present application enables more sophisticated control of mixed emotions beyond intensity control. Denote by $d=\sum_{i=0}^{m-1}w_i e_i$ a combination of all emotions, where $\sum_{i=0}^{m-1}w_i=1$ and $w_i\ge 0$. Equation (5) can then be generalized as:

$$\nabla_x \log p(d\mid x) \triangleq \sum_{i=0}^{m-1} w_i\,\nabla_x \log p(e_i\mid x).\tag{7}$$

Equation (6) can also be expressed in this generalized form. This extension can further be interpreted from a probabilistic perspective. Since the combination weights $w_i$ can be viewed as a categorical distribution $p_e$ over the basic emotions $e_i$, Equation (7) corresponds to

$$\nabla_x \log p(d\mid x) = \nabla_x \sum_{i=0}^{m-1} w_i \log p(e_i\mid x)\tag{8}$$

$$= -\nabla_x\,\mathrm{CE}\big(p_e,\,p(\cdot\mid x)\big),\tag{9}$$

where CE is the cross-entropy function. Equation (9) implies that, when sampling along the gradient $\nabla_x\log p(d\mid x)$, the cross entropy between the target emotion distribution $p_e$ and the classifier output $p(\cdot\mid x)$ is in fact being reduced; the influence of this cross-entropy gradient on $x$ guides the sampling process. Such a soft-label guidance technique can therefore be applied in general to control any complex emotion expressed as a weighted combination of several basic emotions.

In FIG. 2 (c), cross entropy is used as a concise notation for the soft-label guidance term; in the intensity-control scheme of the embodiment of the present application, it simplifies to the aforementioned Equation (5).
4. Experiment and results
4.1. Experimental setup
The inventors performed all experiments using the English portion of the Emotional Speech Dataset (ESD), which has 10 speakers, each covering 4 emotion categories (anger, happiness, sadness, surprise) plus a neutral class. There are 350 parallel utterances per speaker and emotion category, about 1.2 hours of speech per speaker. Mel-spectrograms and forced alignments were extracted by Kaldi [32] with a 12.5 ms frame shift and a 50 ms frame length, followed by cepstral normalization. The audio samples of these experiments are published.
In the present application, the inventors consider only the single-speaker emotional TTS problem. In the following sections, the unconditional GradTTS acoustic model is trained on all 10 English speakers to obtain reasonable data coverage, while classification and control are performed only for one female speaker (ID: 0015). The unconditional GradTTS model was trained with an Adam optimizer at a learning rate of 10^-4 for 11000000 steps. The inventors used an exponential moving average of the model weights, since it improves the performance of diffusion models. The classifier is a 4-layer 1D CNN, each layer with BatchNorm and Dropout. In the inference phase, the guidance level γ is fixed at 100.
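For illustration, a 4-layer 1D-CNN emotion classifier with BatchNorm and Dropout, as described above, could be structured as follows; the channel widths, kernel sizes, the way t is injected, and the pooling choice are assumptions not specified in this disclosure.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """p(e | x_t, mu, t): 4-layer 1D CNN over mel frames with BatchNorm and Dropout."""

    def __init__(self, n_mel=80, n_emotions=5, hidden=256, p_drop=0.1):
        super().__init__()
        layers, in_ch = [], 2 * n_mel + 1                   # concat(x_t, mu, t channel)
        for _ in range(4):
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1),
                       nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop)]
            in_ch = hidden
        self.cnn = nn.Sequential(*layers)
        self.out = nn.Linear(hidden, n_emotions)

    def forward(self, x_t, mu, t):
        # x_t, mu: (batch, n_mel, frames); t: (batch,) broadcast as an extra channel.
        t_chan = t.view(-1, 1, 1).expand(-1, 1, x_t.shape[-1])
        h = self.cnn(torch.cat([x_t, mu, t_chan], dim=1))
        return self.out(h.mean(dim=-1))                     # average-pool over time -> logits
```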
The inventors chose HiFi-GAN, trained on all English speakers, as the vocoder for all experiments below.
4.2. Emotional TTS quality
The inventors first measured speech quality, including audio quality and speech naturalness. The inventors compared the proposed EmoDiff with the following system.
GT and GT (voc): real recordings and analysis of the synthesis results (re-synthesis with real mel-spectrograms via vocoder).
Mixedemion: it is an autoregressive model based on relative attribute ranking for pre-computing trained intensity values. It is very similar to the intensity controlled emotional switch of Emovox.
GradTTS w/emo label: a conditional GradTTS model with hard emotion tags as input. Therefore, it has no strength controllability, but as a certified acoustic model, should have a good sample quality.
Note that in this experiment the samples from EmoDiff and MixedEmotion were generated with an intensity weight of α = 1.0, so that they can be directly compared with the other systems.
FIG. 3 lists the evaluation of the mean opinion score (MOS) and the mel-cepstral distortion (MCD); MOS is reported with a 95% confidence interval, and "GradTTS w/ emo label" does not control the emotional intensity. The results show that the vocoder has little effect on sample quality, and that the EmoDiff of the embodiment of the present application exceeds the MixedEmotion baseline by a large margin. Moreover, EmoDiff and the hard-conditioned GradTTS both have good and very close MOS results, and their MCD results differ only slightly. This means that EmoDiff does not sacrifice sample quality for intensity controllability, as MixedEmotion does.
(Legend of FIG. 3: GT denotes real recordings; GT (voc) denotes analysis-synthesis; GradTTS w/ emo label denotes GradTTS with emotion label input; EmoDiff is the proposed method; MOS is the subjective mean opinion score; MCD is the mel-cepstral distortion. These abbreviations are used throughout the text.)
FIG. 4: the classification probability when the strength alpha is controlled to be 0.0,0.2,0.4,0.6,0.8 and 1.0. Error bars represent standard deviation. Wherein, the Chinese and English translation contrast is as follows: emotion xxx intensity: intensity of xxx emotion, classifier prob: the classifier outputs a probability. Surrise, sad, happy, angry respectively represent corresponding emotions (Surprise, sadness, joy, anger, since the experiment of the present application is done in the English scenario, no translation is used here).
4.3. Controllability of emotional intensity
To assess the controllability of emotional intensity, the inventors classified the synthesized samples, generated with controlled intensity, using the trained classifier, with the time input t of the classifier now set to 0. The average classification probability of the target emotion category is used as the evaluation metric; a larger value indicates greater discrimination confidence. For both EmoDiff and MixedEmotion, the intensity was varied from α = 0.0 to 1.0 for each emotion. An intensity of 0.0 corresponds to synthesizing a 100% neutral sample, and a larger intensity should lead to a larger probability.
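This evaluation could be scripted roughly as follows; `sample_with_intensity` and `clf` refer to the assumed interfaces of the earlier sketches, and t is fixed to 0 as stated above.

```python
import torch

@torch.no_grad()
def mean_target_probability(clf, samples, mu, target_idx):
    """Average (and std of) classifier probability of the target emotion at t = 0."""
    t0 = torch.zeros(samples.shape[0], device=samples.device)
    probs = torch.softmax(clf(samples, mu, t0), dim=-1)[:, target_idx]
    return probs.mean().item(), probs.std().item()

# Illustrative sweep over the controlled intensity, as in FIG. 4:
# for alpha in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
#     x0 = sample_with_intensity(score_fn, clf, mu, emo_idx, alpha)
#     print(alpha, mean_target_probability(clf, x0, mu, emo_idx))
```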
The results are shown in FIG. 4. To demonstrate the capability of the classifier, the inventors plotted the classification probabilities on the ground-truth data. To show the performance of the hard-conditioned GradTTS model, the inventors also plotted the probabilities on its synthesized samples; since it has no intensity controllability, only the value at intensity 1.0 is plotted. The standard deviation of each experiment is also shown in the form of error bars.
It can be seen from the figure that, first, the trained classifier performs reasonably well on the real data; it is worth mentioning that its classification accuracy on the validation set is 93.1%. Samples from GradTTS w/ emo label have somewhat lower classification probabilities. Most importantly, the proposed EmoDiff always covers a larger range than the baseline from intensity α = 0.0 to 1.0, and its error bars are consistently smaller than those of the baseline, which means the control of the embodiment of the present application is more stable. This demonstrates the effectiveness of the soft-label guidance technique proposed by the inventors. The inventors also note that EmoDiff sometimes achieves a higher classification probability at intensity 1.0 than the hard-conditioned GradTTS. This is reasonable, because conditioning on emotion labels during training does not guarantee better class correlation than classifier guidance when a powerful classifier and a sufficient guidance level are available.
4.4. Diversity of emotion samples
Despite generating high-quality and intensity-controlled emotion samples, EmoDiff also exhibits good sample diversity, even within the same emotion, benefiting from the powerful generative capability of diffusion models. To assess the diversity of the emotion samples, the inventors performed a subjective preference test for each emotion between EmoDiff and MixedEmotion. Listeners were asked to select the more diverse system, or "cannot decide". Note that the test was performed for each emotion at a weight of α = 1.0.
FIG. 5: diversity preference test for each mood. The Chinese and English translation controls are as follows: the title "Diversity prediction test of reach observation": diversity preference test, illustrated Cannot Decide: not to be determined, percent: the percentages, surrise, sad, happy, angry, respectively, indicate the corresponding emotions (Surprise, sadness, joy, anger, since the experiments of the present application were done in the English scenario, no translation is used here).
FIG. 5 shows the preference results. Clearly, EmoDiff has a large advantage in diversity for three of the emotion categories: anger, happiness and surprise. Only for sadness does EmoDiff outperform the baseline by a small margin. This is mainly because MixedEmotion is autoregressive, and the inventors found that its variation in duration accounts for a large proportion of the perceived diversity, especially for sad samples.
5. Conclusion
In this application, the inventors investigated the problem of intensity control in emotional speech synthesis systems. The inventors define an emotion with a given intensity as a weighted sum of that emotion and the neutral emotion, weighted by the intensity value. Under this modeling approach, the inventors extend the classifier guidance technique to soft-label guidance, which enables direct control of an arbitrary emotion intensity rather than of a single class label. With this technique, the proposed EmoDiff achieves simple and effective control of the emotion intensity through an unconditional acoustic model and an emotion classifier. Subjective and objective evaluations show that EmoDiff is superior to the baseline in TTS quality, intensity controllability and sample diversity. In addition, the proposed soft-label guidance can in general be applied to control more complex emotions, which the inventors regard as future work.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, where the computer-executable instructions may perform the method for training a speech synthesis model in any of the foregoing method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
training an acoustic model without emotional input, wherein a diffusion process is used in the training process of the acoustic model, and the training target of the acoustic model is to estimate the logarithmic gradient of data distribution for any intermediate time in the diffusion process;
training an emotion classifier, wherein the input of the emotion classifier at least comprises a logarithmic gradient corresponding to a certain intermediate moment in the diffusion process;
and performing emotion-controllable speech synthesis sampling by using the soft-label guidance technique, wherein the diffusion process corresponds to a reverse denoising process of the speech synthesis sampling, the gradient term of the reverse denoising process is the target estimated by the speech synthesis model, and the soft-label guidance term is mathematically equivalent to a cross entropy whose one side is the output of the emotion classifier and whose other side is the distribution corresponding to the target emotion intensity.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a training method and system of a speech synthesis model, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer-readable storage medium optionally includes memory located remotely from the processor, and the remote memory may be connected to the training method of the speech synthesis model over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention further provide a computer program product, where the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, and the computer program includes program instructions, which, when executed by a computer, make the computer execute any one of the above methods for training a speech synthesis model.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 6, the electronic device includes: one or more processors 610 and a memory 620, one processor 610 being illustrated in fig. 6. The device of the training method and system for the speech synthesis model can also comprise: an input device 630 and an output device 640. The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or other means, such as the bus connection in fig. 6. The memory 620 is a non-volatile computer-readable storage medium as described above. The processor 610 executes various functional applications of the server and data processing by running nonvolatile software programs, instructions and modules stored in the memory 620, so as to implement the training method of the speech synthesis model of the above method embodiment. The input means 630 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the training apparatus of the speech synthesis model. The output device 640 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device applied to the method for training a speech synthesis model includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
training an acoustic model without emotional input, wherein a diffusion process is used in the training process of the acoustic model, and the training target of the acoustic model is to estimate the logarithmic gradient of data distribution at any intermediate time in the diffusion process;
training an emotion classifier, wherein the input of the emotion classifier at least comprises a logarithmic gradient corresponding to a certain intermediate moment in the diffusion process;
and performing emotion-controllable speech synthesis sampling by using the soft-label guidance technique, wherein the diffusion process corresponds to a reverse denoising process of the speech synthesis sampling, the gradient term of the reverse denoising process is the target estimated by the speech synthesis model, and the soft-label guidance term is mathematically equivalent to a cross entropy whose one side is the output of the emotion classifier and whose other side is the distribution corresponding to the target emotion intensity.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc.
(3) A portable entertainment device: such devices can display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A method of training a speech synthesis model, comprising:
training an acoustic model without emotional input, wherein a diffusion process is used in the training process of the acoustic model, and the training target of the acoustic model is to estimate the logarithmic gradient of data distribution for any intermediate time in the diffusion process;
training an emotion classifier, wherein the input of the emotion classifier at least comprises a logarithmic gradient corresponding to a certain intermediate moment in the diffusion process;
and performing emotion-controllable speech synthesis sampling by using a soft label guidance technology, wherein the diffusion process corresponds to a reverse denoising process of the speech synthesis sampling, a gradient item of the reverse denoising process is a target estimated by a speech synthesis model, and a soft label guidance item is mathematically equivalent to cross entropy and comprises one side of output of the emotion classifier and the other side of distribution corresponding to target emotion intensity.
2. The method of claim 1, wherein the emotion classifier freezes parameters of the acoustic model during training and updates only weights of the emotion classifier.
3. The method of claim 1, wherein the emotion classifier is trained using a standard cross-entropy loss L_CE.
4. The method of claim 1, wherein the acoustic model learns how to generate realistic mel frequency spectra from a given text, a given sequence of durations through training, and the input to the emotion classifier further comprises tokens associated with the given text and a duration associated with the given text.
5. The method of claim 1, wherein the diffusion process employs a de-noising diffusion model that uses an exponential moving average of the model weights.
6. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 5.
7. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, implementing the steps of the method of any one of claims 1 to 5.
CN202211386817.8A 2022-11-07 2022-11-07 Method for training speech synthesis model, electronic device and storage medium Pending CN115762464A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211386817.8A CN115762464A (en) 2022-11-07 2022-11-07 Method for training speech synthesis model, electronic device and storage medium


Publications (1)

Publication Number Publication Date
CN115762464A true CN115762464A (en) 2023-03-07

Family

ID=85357197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211386817.8A Pending CN115762464A (en) 2022-11-07 2022-11-07 Method for training speech synthesis model, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115762464A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727290A (en) * 2024-02-18 2024-03-19 厦门她趣信息技术有限公司 Speech synthesis method, device, equipment and readable storage medium


Similar Documents

Publication Publication Date Title
Kameoka et al. ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder
EP3971786B1 (en) Feedforward generative neural networks
CN110476206B (en) System for converting text into voice and storage medium thereof
CN107945786B (en) Speech synthesis method and device
EP3926623A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
US9818409B2 (en) Context-dependent modeling of phonemes
CN108305641B (en) Method and device for determining emotion information
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN110706692B (en) Training method and system of child voice recognition model
US8386254B2 (en) Multi-class constrained maximum likelihood linear regression
CN108428446A (en) Audio recognition method and device
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
Guo et al. Emodiff: Intensity controllable emotional text-to-speech with soft-label guidance
CN112214591A (en) Conversation prediction method and device
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN113555032A (en) Multi-speaker scene recognition and network training method and device
CN114154518A (en) Data enhancement model training method and device, electronic equipment and storage medium
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN115762464A (en) Method for training speech synthesis model, electronic device and storage medium
CN114550702A (en) Voice recognition method and device
CN107910005B (en) Target service positioning method and device for interactive text
CN113963679A (en) Voice style migration method and device, electronic equipment and storage medium
Deng et al. MixGAN-TTS: Efficient and stable speech synthesis based on diffusion model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination