CN117496944A - Multi-emotion multi-speaker voice synthesis method and system - Google Patents

Multi-emotion multi-speaker voice synthesis method and system

Info

Publication number
CN117496944A
CN117496944A (application number CN202410006409.8A)
Authority
CN
China
Prior art keywords
emotion
voice
speaker
library
voice library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410006409.8A
Other languages
Chinese (zh)
Other versions
CN117496944B (en)
Inventor
杨继臣
夏佳奇
王泳
伍均达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202410006409.8A priority Critical patent/CN117496944B/en
Publication of CN117496944A publication Critical patent/CN117496944A/en
Application granted granted Critical
Publication of CN117496944B publication Critical patent/CN117496944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

To address the generally small scale of existing open-source emotion voice libraries, which limits the quality of emotional speech synthesis, the invention provides a multi-emotion, multi-speaker speech synthesis method and system. The method comprises the following steps: first, a speech emotion classifier is trained on an existing open-source emotion voice library, and its classification layer is removed to obtain a speech emotion encoder. The speech emotion encoder is then used to extract emotion features from speech and train an emotion converter, with which a new multi-speaker, multi-emotion voice library is constructed. Finally, an emotion voice generator is trained on both the constructed emotion voice library and the open-source emotion voice library to realize multi-emotion, multi-speaker speech synthesis. By building a new emotion voice library with the emotion converter and then training the emotion voice generator on it, the invention expands the emotional speech data set and improves the quality of emotional speech synthesis.

Description

Multi-emotion multi-speaker voice synthesis method and system
Technical Field
The invention relates to the field of voice analysis, in particular to a multi-emotion multi-speaker voice synthesis method and system.
Background
Speech synthesis is the technique of converting a given text into synthesized speech in the voice of a given speaker. Intelligent speech technology is being applied ever more widely, for example in voice broadcasting, AI dubbing, and smart speakers. As expectations for speech synthesis rise, neutral synthesized speech no longer meets practical needs: more flexible and expressive voices are required, voices that carry a range of emotions and that can imitate a variety of timbres. Emotional speech synthesis is therefore a major research direction in the current speech synthesis field. However, publicly available emotional speech data sets are hard to obtain and small in scale, and their annotation is difficult to define, somewhat subjective, and costly. Even when such a data set can be obtained, it typically contains very few speakers, and these factors limit the quality with which current synthesis models can reproduce many different speakers. In addition, existing emotional speech synthesis methods require either emotional speech from a large number of different speakers or multi-emotion recordings of each speaker. The most direct remedy is cross-speaker emotion transfer: an emotion converter is first trained so that audio from an originally neutral multi-speaker data set can be given different categories of emotion, thereby expanding the emotional data set and alleviating both its scarcity and its shortage of speakers.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a multi-emotion multi-speaker voice synthesis method and a system.
The first aspect of the present invention provides a multi-emotion multi-speaker speech synthesis method, comprising:
step S01, acquiring a data set: acquiring an emotion voice library and a multi-speaker voice library, wherein the emotion voice library comprises audio, corresponding text of the audio, voice emotion labels and speaker labels, and the multi-speaker voice library comprises audio, corresponding text of the audio and speaker labels;
step S02, emotion encoder training: training a preset voice emotion classifier by using an emotion voice library, adjusting parameters of the voice emotion classifier to obtain an optimized voice emotion classifier, and removing a classification layer in the voice emotion classifier to obtain an emotion encoder;
step S03, emotion converter training: training a preset emotion converter by using an emotion voice library, and adjusting emotion converter parameters to obtain an optimized emotion converter;
step S04, constructing a new emotion voice library: performing emotion migration by utilizing the optimized emotion converter obtained in the step S03, so that various target emotions can be expressed in the multi-speaker voice library, and a new emotion voice library is constructed based on synthesized audios of speakers in the multi-speaker voice library containing different target emotions;
Step S05, emotion voice generator training: training a preset emotion voice generator using the emotion voice library from step S01 together with the emotion voice library constructed in step S04 as a complete data set, adjusting the emotion voice generator parameters to obtain an optimized emotion voice generator, and inputting the target text to be processed, the target speech emotion feature and the mel spectrum of a reference audio into the emotion voice generator to obtain the final target synthesized speech.
In this scheme, the specific process of step S01 is as follows:
Step 1.1, acquiring an emotion voice library and a multi-speaker voice library, wherein the emotion voice library comprises audio, the text corresponding to the audio, a voice emotion tag and a speaker tag, and the multi-speaker voice library comprises audio, the text corresponding to the audio and a speaker tag; converting the text corresponding to the audio in both libraries into a phoneme sequence, and then using a forced alignment tool to obtain the duration corresponding to each phoneme;
Step 1.2, extracting acoustic features of the audio in the emotion voice library and the multi-speaker voice library, the acoustic features comprising mel spectrum, pitch, and energy (an illustrative extraction sketch follows this list);
Step 1.3, dividing the emotion voice library into a training set, a test set and a verification set according to a preset ratio.
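As an illustration of step 1.2, the short sketch below extracts the three acoustic features with the librosa library. This is a minimal sketch under assumed frame parameters; the patent does not name a toolkit or specify window sizes, so every parameter value here is illustrative.

```python
# Illustrative sketch of step 1.2 (not from the patent): extracting the mel
# spectrum, pitch, and energy with librosa. Frame parameters are assumed.
import librosa
import numpy as np

def extract_acoustic_features(wav_path, sr=22050, n_fft=1024,
                              hop_length=256, n_mels=80):
    audio, _ = librosa.load(wav_path, sr=sr)

    # Log mel spectrogram, shape (n_mels, frames)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))

    # Pitch (F0) per frame via the YIN estimator
    f0 = librosa.yin(audio, fmin=librosa.note_to_hz("C2"),
                     fmax=librosa.note_to_hz("C7"),
                     sr=sr, frame_length=n_fft, hop_length=hop_length)

    # Energy per frame: L2 norm of the STFT magnitude
    stft = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length))
    energy = np.linalg.norm(stft, axis=0)

    return log_mel, f0, energy
```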
In this scheme, the specific process of step S02 is as follows:
step 2.1, constructing a voice emotion classifier model;
Step 2.2, the speech emotion classifier model comprises a series of convolution blocks and a linear unit, wherein each convolution block comprises a convolution layer, a batch normalization layer and a ReLU activation function; the convolution blocks convolve the mel spectrum input to the model, the input of each convolution block being the output of the previous one, yielding the final convolution feature; the final convolution feature is reduced in dimension by a pooling operation and regularized with dropout; the regularized result is input to the linear unit; the first layer of the linear unit outputs a feature vector, and the second layer maps the feature vector via a sigmoid function to a probability space over the emotion categories for classification; the loss between the output of the second layer and the speech emotion labels in the emotion voice library is computed with a cross-entropy loss function to obtain an emotion prediction loss value, which is used to adjust the parameters of the speech emotion classifier;
Step 2.3, calculating the accuracy of the adjusted speech emotion classifier on the test set; when the accuracy exceeds 90%, a speech emotion classifier meeting the requirements is obtained;
Step 2.4, removing the second layer of the linear unit in the speech emotion classifier model while keeping the other network layers unchanged, to obtain the emotion encoder (an illustrative sketch follows).
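The following PyTorch sketch illustrates steps 2.1 to 2.4. It is not the patent's implementation: the number of channels, the embedding size, and the dropout rate are assumptions; only the overall structure (convolution blocks with batch normalization and ReLU, pooling, dropout, two linear layers, cross-entropy training, and removal of the second linear layer to obtain the encoder) follows the description.

```python
# Illustrative sketch of steps 2.1-2.4 (layer widths, embedding size and
# dropout rate are assumed, not specified in the disclosure).
import torch
import torch.nn as nn

class SpeechEmotionClassifier(nn.Module):
    def __init__(self, n_emotions=5, emb_dim=256):
        super().__init__()
        channels = [1, 32, 64, 128, 128, 256]             # assumed widths
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU()]                          # one convolution block
        self.conv_blocks = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)                # dimension reduction
        self.dropout = nn.Dropout(0.3)                     # regularization
        self.fc1 = nn.Linear(channels[-1], emb_dim)        # first linear layer: feature vector
        self.fc2 = nn.Linear(emb_dim, n_emotions)          # second linear layer: classification

    def forward(self, mel):                                # mel: (batch, n_mels, frames)
        x = self.conv_blocks(mel.unsqueeze(1))             # add a channel dimension
        x = self.dropout(self.pool(x).flatten(1))
        feature = self.fc1(x)                              # emotion feature vector
        return feature, self.fc2(feature)                  # feature + class logits

classifier = SpeechEmotionClassifier()
loss_fn = nn.CrossEntropyLoss()
# ... train on (mel, emotion_label) pairs from the emotion voice library,
# then check that test-set accuracy exceeds 90% as in step 2.3 ...

def emotion_encoder(mel):
    """Step 2.4: the classifier with its second linear layer removed."""
    feature, _ = classifier(mel)
    return feature
```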
In this scheme, the specific process of step S03 is as follows:
step 3.1, constructing an emotion converter model, wherein the emotion converter model comprises a preprocessing module, a text encoder, a speaker encoder, a duration predictor, a pitch predictor, an energy predictor and a decoder;
Step 3.2, extracting the emotion feature corresponding to each audio in the emotion voice library with the emotion encoder of step 2.4, as shown in formula (1):
$h_{e} = E_{emo}(X_{mel})$ (1);
where $E_{emo}$ denotes the emotion encoder of step 2.4, $X_{mel}$ denotes the mel spectrum corresponding to an audio in the emotion voice library, and $h_{e}$ denotes the output of the emotion encoder, namely the emotion feature corresponding to that audio;
Step 3.3, inputting the mel spectra contained in the emotion voice library training set to the speaker encoder in the emotion converter to obtain speaker features, inputting the phoneme sequences contained in the training set to the text encoder in the converter to obtain text features, fusing the speaker features, text features and emotion features, and then inputting the fused features in parallel to the duration predictor, pitch predictor and energy predictor in the emotion converter to obtain the predicted phoneme duration, pitch and energy;
Step 3.4, calculating the loss between the predicted pitch, energy and phoneme duration and the actual pitch, energy and phoneme duration, and the loss between the actual mel spectrum and the mel spectrum output by the decoder in the emotion converter, denoting these as the first loss and the second loss respectively, and adjusting the emotion converter parameters according to the sum of the first loss and the second loss to obtain the optimized emotion converter (a condensed sketch of this objective follows).
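A condensed sketch of the training objective in steps 3.3 and 3.4 is given below. The module internals are omitted and all attribute names (speaker_encoder, text_encoder, and so on) are assumptions; the additive feature fusion and the mean-squared-error and L1 losses are plausible choices, not details stated in the patent.

```python
# Condensed sketch of steps 3.3-3.4 (all attribute names are assumptions;
# emotion_encoder is the sketch from step S02).
import torch.nn.functional as F

def converter_training_step(converter, batch):
    emo_feat = emotion_encoder(batch["mel"])               # formula (1)
    spk_feat = converter.speaker_encoder(batch["mel"])
    txt_feat = converter.text_encoder(batch["phonemes"])

    # Feature fusion: broadcast the utterance-level speaker and emotion
    # vectors over the phoneme-level text features (an assumed scheme).
    fused = txt_feat + spk_feat.unsqueeze(1) + emo_feat.unsqueeze(1)

    dur_pred = converter.duration_predictor(fused)
    pitch_pred = converter.pitch_predictor(fused)
    energy_pred = converter.energy_predictor(fused)
    mel_pred = converter.decoder(fused, dur_pred, pitch_pred, energy_pred)

    # First loss: predicted prosody vs. ground truth.
    first_loss = (F.mse_loss(dur_pred, batch["duration"]) +
                  F.mse_loss(pitch_pred, batch["pitch"]) +
                  F.mse_loss(energy_pred, batch["energy"]))
    # Second loss: decoder output vs. the real mel spectrum.
    second_loss = F.l1_loss(mel_pred, batch["mel"])
    return first_loss + second_loss        # parameters are adjusted on this sum
```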
In this solution, the step S05 specifically includes:
Step 5.1, merging the phoneme sequences of the new emotion voice library into the folder corresponding to the existing emotion voice library;
Step 5.2, obtaining the duration corresponding to each phoneme in the new emotion voice library with a forced alignment tool and merging the durations into the folder corresponding to the existing emotion voice library;
Step 5.3, extracting the emotion feature corresponding to each audio in the new emotion voice library with the emotion encoder of step 2.4, extracting the acoustic features of the audio in the new emotion voice library, including mel spectrum, pitch and energy, and merging them into the folder corresponding to the existing emotion voice library;
Step 5.4, obtaining the merged feature files of the existing emotion voice library and the new emotion voice library, and dividing them into a training set, a test set and a verification set according to a suitable ratio;
Step 5.5, constructing an emotion voice generator model, wherein the emotion voice generator model comprises a preprocessing module, a text encoder, a speaker encoder, a duration predictor, a pitch predictor, an energy predictor and a decoder;
Step 5.6, inputting the mel spectra contained in the training set to the speaker encoder in the emotion voice generator to obtain speaker features, inputting the phoneme sequences contained in the training set to the text encoder in the emotion voice generator to obtain text features, fusing the speaker features, text features and emotion features, and then inputting the fused features in parallel to the duration predictor, pitch predictor and energy predictor in the emotion voice generator to obtain the predicted phoneme duration, pitch and energy;
Step 5.7, calculating the loss between the predicted pitch, energy and phoneme duration and the actual pitch, energy and phoneme duration, and the loss between the actual mel spectrum and the mel spectrum output by the decoder in the emotion voice generator, and adjusting the emotion voice generator parameters according to the sum of these losses to obtain the optimized emotion voice generator (an illustrative inference sketch follows).
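For orientation, the sketch below shows how the optimized emotion voice generator might be called at inference time, following the description of step S05: a phoneme sequence of the target text, a target emotion feature, and the mel spectrum of a reference audio go in, and synthesized speech comes out. The module names and the use of an external neural vocoder are assumptions.

```python
# Illustrative inference call for the optimized emotion voice generator
# (module names and the external vocoder are assumptions).
def synthesize(generator, vocoder, phonemes, emotion_feature, reference_mel):
    spk_feat = generator.speaker_encoder(reference_mel)   # timbre from the reference audio
    txt_feat = generator.text_encoder(phonemes)
    fused = txt_feat + spk_feat.unsqueeze(1) + emotion_feature.unsqueeze(1)

    duration = generator.duration_predictor(fused)
    pitch = generator.pitch_predictor(fused)
    energy = generator.energy_predictor(fused)
    mel = generator.decoder(fused, duration, pitch, energy)

    # A neural vocoder (e.g. HiFi-GAN) is assumed to turn the predicted mel
    # spectrum into a waveform; the patent does not name one.
    return vocoder(mel)
```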
The second aspect of the present invention also provides a multi-emotion multi-speaker speech synthesis system, comprising:
The voice library acquisition module: used for acquiring the required emotion voice library and multi-speaker voice library;
The feature extraction module: used for converting the texts corresponding to the audio in the voice libraries into phoneme sequences and for extracting the mel spectrum, pitch, energy and phoneme duration corresponding to the audio in the voice libraries;
The emotion encoder module: used for extracting the emotion feature corresponding to a mel spectrum;
The emotion conversion module: used for transferring emotions of different categories to each speaker in the multi-speaker voice library to obtain a new emotion voice library;
The data set construction module: used for merging the files of the existing emotion voice library and the new emotion voice library, including phoneme sequences, mel spectra, pitch, energy and phoneme durations, and then dividing them into a training set, a test set and a verification set according to a reasonable ratio;
The emotion voice generation module: used for receiving the mel spectra, emotion features and phoneme sequences in the training set and producing the target synthesized speech.
The invention discloses a multi-emotion, multi-speaker speech synthesis method and system. First, the required emotion voice library and multi-speaker voice library are acquired. A speech emotion classifier model is then trained on the emotion voice library, and its classification layer is removed to obtain an emotion encoder, which is used to extract emotion features from the mel spectra corresponding to the audio in the emotion voice library. The texts, emotion features and mel spectra corresponding to the audio in the emotion voice library are input into a preset emotion converter, whose parameters are adjusted; the optimized emotion converter then transfers emotion features of different categories onto each speaker in the multi-speaker voice library, generating predicted speech that carries the different target emotion features, from which a new emotion voice library is constructed. Finally, the new emotion voice library and the existing open-source emotion voice library are merged into a complete data set used as training data for a preset emotion voice generator. By using the emotion converter to obtain a new emotion voice library, the invention expands the existing public emotion data sets and addresses the problems that emotional speech data are hard to obtain and contain few speakers; training the emotion voice generator on the expanded data set improves the quality of multi-emotion, multi-speaker speech synthesis, making the invention well suited to the field of speech synthesis.
Drawings
FIG. 1 is a flow chart of a multi-emotion multi-speaker speech synthesis method of the present invention;
fig. 2 shows a flow chart of step S02 of the present invention;
fig. 3 shows a flow chart of step S03 of the present invention;
fig. 4 shows a block diagram of a multi-emotion multi-speaker speech synthesis system of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
Fig. 1 shows a flowchart of a multi-emotion multi-speaker speech synthesis method according to an embodiment of the present invention, and the speech synthesis method according to an embodiment of the present invention includes, but is not limited to, steps S01 to S05, and these five steps are described in detail below with reference to fig. 1.
Step S01, acquiring a data set: firstly, an emotion voice library and a multi-speaker voice library are obtained, wherein the emotion voice library comprises audio, a text corresponding to the audio, a voice emotion tag and a speaker tag. The multi-speaker voice library comprises audio, text corresponding to the audio and speaker tags. Converting the text corresponding to the audio in the emotion voice library and the multi-speaker voice library into a phoneme sequence, and obtaining the duration corresponding to each phoneme by using a forced alignment tool. Then extracting acoustic features of audio in the emotion voice library and the multi-speaker voice library, wherein the acoustic features to be extracted comprise: mel spectrum, pitch, energy. Dividing training sets, test sets and verification sets of the emotion voice library according to reasonable proportions;
Step S02, emotion encoder training: first, a speech emotion classifier model is constructed; it comprises a series of convolution blocks and a linear unit, each convolution block comprising a convolution layer, a batch normalization layer and a ReLU activation function. The convolution blocks convolve the mel spectrum input to the model, the input of each block being the output of the previous one, yielding the final convolution feature. The final convolution feature is reduced in dimension by a pooling operation and regularized with dropout, and the regularized result is input to the linear unit. The first layer of the linear unit outputs a feature vector, and the second layer maps the feature vector via a sigmoid function to a probability space over the emotion categories for classification. The loss between the second-layer output and the speech emotion labels in the emotion voice library is computed with a cross-entropy loss function to obtain an emotion prediction loss value, and the classifier parameters are adjusted accordingly. Finally, the accuracy of the adjusted classifier is computed on the test set; once the accuracy exceeds 90%, a speech emotion classifier meeting the requirements is obtained, its second linear layer is removed, and the other network layers are kept unchanged to yield the emotion encoder;
Step S03, emotion converter training: an emotion converter model is first constructed, comprising a preprocessing module, a text encoder, a speaker encoder, a duration predictor, a pitch predictor, an energy predictor and a decoder. The emotion encoder obtained in step S02 is then used to extract the emotion feature corresponding to each audio in the emotion voice library, as shown in formula (1):
$h_{e} = E_{emo}(X_{mel})$ (1);
where $E_{emo}$ denotes the emotion encoder described in step S02, $X_{mel}$ denotes the mel spectrum input to the emotion encoder, and $h_{e}$ denotes the output of the emotion encoder, namely the emotion feature corresponding to the audio in the emotion voice library. The mel spectra contained in the training set of the emotion voice library are then input to the speaker encoder in the emotion converter to obtain speaker features, and the phoneme sequences contained in the training set are input to the text encoder in the converter to obtain text features. The speaker features, text features and emotion features are fused, and the fused features are input in parallel to the duration predictor, pitch predictor and energy predictor in the emotion converter to obtain the predicted phoneme duration, pitch and energy. The loss between the predicted pitch, energy and phoneme duration and the actual pitch, energy and phoneme duration, and the loss between the actual mel spectrum and the mel spectrum output by the decoder in the emotion converter, are computed, and the emotion converter parameters are adjusted according to the sum of these losses to obtain the optimized emotion converter;
step S04, constructing a new emotion voice library: firstly, inputting a Mel frequency spectrum corresponding to audio in a multi-speaker voice library to a speaker encoder in an emotion converter to obtain speaker characteristics, inputting a phoneme sequence corresponding to the multi-speaker voice library to a text encoder in the converter to obtain text characteristics, randomly selecting one emotion characteristic from each emotion, and inputting the speaker characteristics, the text characteristics and the emotion characteristics to the emotion converter after characteristic fusion. Then, the emotion characteristics are migrated to speakers in the multi-speaker voice library through an emotion converter, and audio with poor effects is removed, so that a new emotion voice library is obtained;
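The loop below sketches how step S04 could be organized: for every utterance in the multi-speaker library, one emotion feature is drawn at random per emotion category, the converter produces an emotional version of the utterance, and a quality check discards poor conversions. All names and the form of the quality check are assumptions; the patent only states that low-quality audio is removed.

```python
# Illustrative sketch of step S04 (names and the quality filter are assumptions).
import random

def build_new_emotion_library(converter, multi_speaker_set, emotion_bank, quality_check):
    new_library = []
    for item in multi_speaker_set:                      # neutral audio + phonemes + speaker tag
        for emotion, features in emotion_bank.items():  # emotion features grouped by category
            emo_feat = random.choice(features)          # one randomly chosen feature per emotion
            mel = converter.convert(item["mel"], item["phonemes"], emo_feat)
            if quality_check(mel):                      # discard poor conversions
                new_library.append({"mel": mel,
                                    "phonemes": item["phonemes"],
                                    "speaker": item["speaker"],
                                    "emotion": emotion})
    return new_library
```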
Step S05, emotion voice generator training: the phoneme sequences of the new emotion voice library are merged into the folder corresponding to the existing emotion voice library, and the duration of each phoneme in the new emotion voice library is then obtained with the forced alignment tool and merged into the same folder. The emotion encoder of step S02 extracts the emotion feature corresponding to each audio in the new emotion voice library, and the acoustic features of the audio in the new emotion voice library, including mel spectrum, pitch and energy, are extracted and merged into the folder corresponding to the existing emotion voice library, yielding the merged feature files of the existing and new emotion voice libraries. These are divided into a training set, a test set and a verification set according to a suitable ratio. An emotion voice generator model is constructed, comprising a preprocessing module, a text encoder, a speaker encoder and a variance adaptor, the variance adaptor further comprising a duration predictor, a pitch predictor, an energy predictor and a decoder. The mel spectra contained in the training set are input to the speaker encoder in the emotion voice generator to obtain speaker features, the phoneme sequences contained in the training set are input to the text encoder in the emotion voice generator to obtain text features, and the speaker features, text features and emotion features are fused and input in parallel to the duration predictor, pitch predictor and energy predictor in the emotion voice generator to obtain the predicted phoneme duration, pitch and energy. The loss between the predicted pitch, energy and phoneme duration and the actual pitch, energy and phoneme duration, and the loss between the actual mel spectrum and the mel spectrum output by the decoder in the emotion voice generator, are computed, and the generator parameters are adjusted according to the sum of these losses to obtain the optimized emotion voice generator.
It should be noted that the emotion labels may include happy, sad, angry, surprised, and the like; the speech emotion label of the present invention is any one of the preset emotion category labels. The losses referred to above are the loss between the predicted pitch, energy and phoneme duration and their actual values, and the loss between the actual mel spectrum and the mel spectrum output by the decoder in the emotion converter, denoted as the first loss and the second loss respectively.
It should be noted that the text corresponding to the audio in the voice libraries of the embodiments of the present invention may come from any field, such as science and technology, news, entertainment, travel, or history; that is, the speech synthesis method provided by the invention can be applied in different fields. For example, at a tourist attraction, prompt voices with different emotions and speaker styles can be played in a targeted way: a neutral male voice may broadcast polite reminders near a rubbish bin, while a synthesized voice with a "happy" or "excited" emotional color may broadcast a welcome message and an introduction at the scenic-spot entrance, improving the sense of interaction and the visitors' experience.
It should be noted that the format and sampling rate of the audio in the voice libraries of the present invention are not limited; it is only required that the sampling rate of the sample data used in executing the multi-emotion multi-speaker speech synthesis method provided by the invention be unified.
In step S02 of the embodiment, the speech emotion classifier model consists of five convolution blocks and two linear layers; each convolution block consists of two convolution layers, each followed by a normalization layer and a ReLU activation function, and the resulting output finally undergoes average pooling.
Fig. 2 shows a flow chart of step S02 of the present invention;
in the embodiment of the present invention, the step S02 specifically includes the following four steps:
Step 2.1, constructing a speech emotion classifier model;
Step 2.2, the speech emotion classifier model comprises a series of convolution blocks and a linear unit, each convolution block comprising a convolution layer, a batch normalization layer and a ReLU activation function; the convolution blocks convolve the mel spectra in the emotion voice library, the input of each block being the output of the previous one, yielding the final convolution feature; the final convolution feature is reduced in dimension by a pooling operation and regularized with dropout; the regularized result is input to the linear unit; the first layer of the linear unit outputs a feature vector, and the second layer maps the feature vector via a sigmoid function to a probability space over the emotion categories for classification; the loss between the second-layer output and the speech emotion labels in the emotion voice library is computed with a cross-entropy loss function to obtain an emotion prediction loss value, and the classifier parameters are adjusted accordingly. The training of the speech emotion classifier model may refer to FIG. 2; in the Softmax step illustrated in the figure, a sigmoid function may be selected by the user instead, as determined by the actual data.
Step 2.3, calculating the accuracy of the adjusted speech emotion classifier on the test set; once the accuracy exceeds 90%, a speech emotion classifier meeting the requirements is obtained;
Step 2.4, removing the second layer of the linear unit in the speech emotion classifier model while keeping the other network layers unchanged, to obtain the emotion encoder.
In step S03 and step S05 of the embodiment, the emotion converter and the emotion voice generator adopt the same network architecture, which must include a trained speaker encoder for extracting speaker features. The speaker encoder is derived from a speaker classifier model whose parameters have been adjusted and whose classification layer has been removed. The process of obtaining the speaker encoder used in the emotion converter and the emotion voice generator of steps S03 and S05 is described in detail below, and may include, but is not limited to, steps S310 to S340.
Step S310, inputting a Mel frequency spectrum corresponding to audio in a multi-speaker voice library into a preset speaker classifier model to carry out speaker recognition, so as to obtain speaker tags, speaker characteristics and classification predicted values of the speaker tags, wherein the speaker classifier model comprises a coding module and a classification module, the coding module uses three one-dimensional convolution layers and two LSTM layers, a normalization layer and a ReLU activation function are arranged behind each convolution layer, and the classification module adopts three linear layers;
step S320, calculating the loss of the speaker tag and the classification predicted value of the speaker tag by using the classification loss function, and adjusting the parameters of the speaker classifier model to obtain an optimized speaker classifier model;
step S330, removing the last linear layer of the speaker classifier classification module, and reserving other layers to obtain an optimized speaker encoder;
step S340, inputting the Mel frequency spectrum corresponding to the target voice into the optimized speaker encoder to obtain the speaker characteristic corresponding to the target voice.
It should be noted that step S340 satisfies formula (2):
$h_{s} = E_{spk}(Y_{mel})$ (2);
where $E_{spk}$ denotes the optimized speaker encoder, $Y_{mel}$ denotes the mel spectrum converted from the target voice, and $h_{s}$ denotes the speaker feature corresponding to the target voice, obtained by inputting $Y_{mel}$ into $E_{spk}$.
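The sketch below illustrates steps S310 to S330 in PyTorch: an encoder of three one-dimensional convolution layers followed by two LSTM layers, a three-layer linear classifier, and removal of the last linear layer to expose the speaker feature of formula (2). Hidden sizes and the number of speakers are assumed values.

```python
# Illustrative PyTorch sketch of steps S310-S330 (hidden sizes and speaker
# count are assumed values).
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_speakers=100):
        super().__init__()
        convs, c_in = [], n_mels
        for _ in range(3):                                 # three 1-D convolution layers
            convs += [nn.Conv1d(c_in, hidden, kernel_size=5, padding=2),
                      nn.BatchNorm1d(hidden),
                      nn.ReLU()]
            c_in = hidden
        self.convs = nn.Sequential(*convs)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.fc1 = nn.Linear(hidden, hidden)               # classification module:
        self.fc2 = nn.Linear(hidden, hidden)               # three linear layers; the
        self.fc3 = nn.Linear(hidden, n_speakers)           # last one is removed in S330

    def forward(self, mel):                                # mel: (batch, n_mels, frames)
        x = self.convs(mel).transpose(1, 2)                # (batch, frames, hidden)
        _, (h, _) = self.lstm(x)
        speaker_feature = self.fc2(self.fc1(h[-1]))        # the formula (2) output after S330
        return speaker_feature, self.fc3(speaker_feature)  # feature + speaker logits
```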
Fig. 3 shows a flow chart of step S03 of the present invention;
The emotion converter training process corresponding to step S03 may refer to FIG. 3; the content encoder in the figure corresponds to the text encoder.
FIG. 4 shows a block diagram of a multi-emotion multi-speaker speech synthesis system of the present invention, comprising the following modules:
the voice data set acquisition module 5010 is used for acquiring a required emotion voice library and a multi-speaker voice library, extracting relevant characteristics, and dividing a training set, a testing set and a verification set;
an emotion encoder acquisition module 5020 for acquiring an emotion encoder for extracting emotion characteristics in audio;
an emotion converter acquisition module 5030, for acquiring an emotion converter used to migrate different emotion features onto each speaker in the multi-speaker voice library;
the new emotion voice library construction module 5040 is configured to construct the generated voice obtained by the emotion converter into a new emotion voice library;
an emotion voice generator acquisition module 5050 for acquiring an emotion voice generator for generating a target voice containing a given emotion feature;
in one embodiment of the present invention, the voice data set acquisition module may include the following sub-modules:
the voice library acquisition sub-module is used for acquiring an emotion voice library acquired by the emotion encoder and the emotion converter and a multi-speaker voice library used for constructing a new emotion voice library, wherein the emotion voice library comprises audio, a text corresponding to the audio, a voice emotion tag and a speaker tag, and the multi-speaker voice library comprises the audio, the text corresponding to the audio and the speaker tag;
the feature extraction submodule is used for converting texts in the voice library into phoneme sequences for storage, extracting three acoustic features of mel frequency spectrum, pitch and energy required by training, and extracting phoneme duration;
The data set dividing sub-module is used for dividing the voice library used for training into a training set, a test set and a verification set according to a custom, suitable ratio.
In one embodiment of the invention, the emotion converter and emotion voice generator include the following submodules:
an encoder sub-module for extracting text features in the input phoneme sequence;
a speaker encoder sub-module for extracting speaker characteristics in the inputted mel spectrum;
the emotion feature extraction submodule is used for extracting emotion features in the input Mel frequency spectrum by using an emotion encoder;
the feature fusion sub-module is used for carrying out feature fusion on the text features, the emotion features and the speaker features;
a duration predictor sub-module for predicting a duration of each phoneme;
the phoneme expansion sub-module is used for expanding each phoneme-level feature according to its predicted duration (see the sketch after this list);
a pitch predictor sub-module for predicting a pitch corresponding to each phoneme;
an energy predictor sub-module for predicting energy corresponding to each phoneme;
and the decoder submodule is used for reconstructing the Mel frequency spectrum according to the fused characteristics, the predicted duration, the pitch and the energy.
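The phoneme expansion sub-module corresponds to a FastSpeech-style length regulator (an interpretation, not a term used in the patent): each phoneme-level feature is repeated as many times as its predicted duration so that the sequence length matches the mel-spectrum frame rate, as sketched below.

```python
# Sketch of the phoneme expansion sub-module (a FastSpeech-style length
# regulator is assumed): repeat each phoneme-level feature by its duration.
import torch

def expand_phonemes(features, durations):
    """features: (num_phonemes, dim); durations: integer frame counts per phoneme."""
    expanded = [feat.repeat(int(d), 1) for feat, d in zip(features, durations)]
    return torch.cat(expanded, dim=0)      # (total_frames, dim)
```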
According to an embodiment of the present invention, further comprising:
acquiring target voice emotion characteristics;
extracting audio characteristic data of the reference audio;
constructing a GAN-based feature generation model, wherein the generation model comprises a generator and a discriminator;
taking the target voice emotion features as the initial discrimination criterion of the discriminator and the audio feature data as the initial input of the generator, and training the feature generation model adversarially until it reaches Nash equilibrium;
generating a preset amount of simulated feature data with the feature generation model;
extracting features of the target synthesized speech to obtain target feature data;
calculating the similarity between the simulated feature data and the target feature data based on the standardized Euclidean distance, and taking the similarity as a first audio expected value;
acquiring a second audio expected value of the target synthesized speech from user feedback data;
and carrying out a weighted average of the first audio expected value and the second audio expected value to obtain a speech synthesis quality evaluation index.
Speech synthesis generally lacks an objective way of judging the quality of the result; assessment mainly relies on human listeners. The audio expected value is a quality evaluation value of the synthesized data: the larger the value, the higher the quality and the closer the synthesized speech is to the characteristics of real emotional audio. The weights of the weighted average are preset values.
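A small sketch of the quality index computation follows. The standardized Euclidean distance, the mapping from distance to a score, and the weights are illustrative assumptions consistent with the description above.

```python
# Illustrative computation of the quality evaluation index (weights and the
# distance-to-score mapping are assumptions).
import numpy as np

def quality_index(simulated_features, target_features, user_feedback_score,
                  w_model=0.6, w_user=0.4):
    # Standardized Euclidean distance between the mean simulated feature and
    # the mean target feature.
    variance = np.var(np.vstack([simulated_features, target_features]), axis=0) + 1e-8
    diff = simulated_features.mean(axis=0) - target_features.mean(axis=0)
    distance = np.sqrt(np.sum(diff ** 2 / variance))
    first_expected_value = 1.0 / (1.0 + distance)     # higher similarity, higher score
    # Weighted average with the user-feedback (second) expected value.
    return w_model * first_expected_value + w_user * user_feedback_score
```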
Compared with the traditional approach, which judges quality only by human listening, the invention greatly improves the objectivity and accuracy of the assessment of synthesized speech, offers a useful reference, effectively helps technicians tune the model parameters used in the speech synthesis method, and improves synthesis efficiency.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above described device embodiments are only illustrative, e.g. the division of the units is only one logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the various components shown or discussed may be coupled or directly coupled or communicatively coupled to each other via some interface, whether indirectly coupled or communicatively coupled to devices or units, whether electrically, mechanically, or otherwise.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk or an optical disk, or the like, which can store program codes.
Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A multi-emotion multi-speaker speech synthesis method, comprising:
step S01, acquiring a data set: acquiring an emotion voice library and a multi-speaker voice library, wherein the emotion voice library comprises audio, corresponding text of the audio, voice emotion labels and speaker labels, and the multi-speaker voice library comprises audio, corresponding text of the audio and speaker labels;
step S02, emotion encoder training: training a preset voice emotion classifier by using an emotion voice library, adjusting parameters of the voice emotion classifier to obtain an optimized voice emotion classifier, and removing a classification layer in the voice emotion classifier to obtain an emotion encoder;
step S03, emotion converter training: training a preset emotion converter by using an emotion voice library, and adjusting emotion converter parameters to obtain an optimized emotion converter;
step S04, constructing a new emotion voice library: performing emotion migration by utilizing the optimized emotion converter obtained in the step S03, so that various target emotions can be expressed in the multi-speaker voice library, and a new emotion voice library is constructed based on synthesized audios of speakers in the multi-speaker voice library containing different target emotions;
Step S05, emotion voice generator training: training a preset emotion voice generator using the emotion voice library from step S01 together with the emotion voice library constructed in step S04 as a complete data set, adjusting the emotion voice generator parameters to obtain an optimized emotion voice generator, and inputting the target text to be processed, the target speech emotion feature and the mel spectrum of a reference audio into the emotion voice generator to obtain the final target synthesized speech.
2. The method for synthesizing multi-emotion multi-speaker speech according to claim 1, wherein the specific process of step S01 is as follows:
Step 1.1, acquiring an emotion voice library and a multi-speaker voice library, wherein the emotion voice library comprises audio, the text corresponding to the audio, a voice emotion tag and a speaker tag, and the multi-speaker voice library comprises audio, the text corresponding to the audio and a speaker tag; converting the text corresponding to the audio in both libraries into a phoneme sequence, and then using a forced alignment tool to obtain the duration corresponding to each phoneme;
step 1.2, extracting acoustic features of audio in an emotion voice library and a multi-speaker voice library, wherein the acoustic features comprise: mel spectrum, pitch, and energy;
and 1.3, dividing a training set, a testing set and a verification set of the emotion voice library according to a preset proportion.
3. The method for synthesizing multi-emotion multi-speaker speech according to claim 2, wherein the specific process of step S02 is as follows:
step 2.1, constructing a voice emotion classifier model;
Step 2.2, the speech emotion classifier model comprises a series of convolution blocks and a linear unit, wherein each convolution block comprises a convolution layer, a batch normalization layer and a ReLU activation function; the convolution blocks convolve the mel spectrum input to the model, the input of each convolution block being the output of the previous one, yielding the final convolution feature; the final convolution feature is reduced in dimension by a pooling operation and regularized with dropout; the regularized result is input to the linear unit; the first layer of the linear unit outputs a feature vector, and the second layer maps the feature vector via a sigmoid function to a probability space over the emotion categories for classification; the loss between the output of the second layer and the speech emotion labels in the emotion voice library is computed with a cross-entropy loss function to obtain an emotion prediction loss value, which is used to adjust the parameters of the speech emotion classifier;
Step 2.3, calculating the accuracy of the adjusted speech emotion classifier on the test set; when the accuracy exceeds 90%, a speech emotion classifier meeting the requirements is obtained;
Step 2.4, removing the second layer of the linear unit in the speech emotion classifier model while keeping the other network layers unchanged, to obtain the emotion encoder.
4. The method for synthesizing multi-emotion multi-speaker speech according to claim 3, wherein the specific process in step S03 is as follows:
step 3.1, constructing an emotion converter model, wherein the emotion converter model comprises a preprocessing module, a text encoder, a speaker encoder, a duration predictor, a pitch predictor, an energy predictor and a decoder;
Step 3.2, extracting the emotion feature corresponding to each audio in the emotion voice library with the emotion encoder of step 2.4, as shown in formula (1):
$h_{e} = E_{emo}(X_{mel})$ (1);
where $E_{emo}$ denotes the emotion encoder of step 2.4, $X_{mel}$ denotes the mel spectrum corresponding to an audio in the emotion voice library, and $h_{e}$ denotes the output of the emotion encoder, namely the emotion feature corresponding to that audio;
Step 3.3, inputting the mel spectra contained in the emotion voice library training set to the speaker encoder in the emotion converter to obtain speaker features, inputting the phoneme sequences contained in the training set to the text encoder in the converter to obtain text features, fusing the speaker features, text features and emotion features, and then inputting the fused features in parallel to the duration predictor, pitch predictor and energy predictor in the emotion converter to obtain the predicted phoneme duration, pitch and energy;
Step 3.4, calculating the loss between the predicted pitch, energy and phoneme duration and the actual pitch, energy and phoneme duration, and the loss between the actual mel spectrum and the mel spectrum output by the decoder in the emotion converter, denoting these as the first loss and the second loss respectively, and adjusting the emotion converter parameters according to the sum of the first loss and the second loss to obtain the optimized emotion converter.
5. The method of claim 4, wherein the step S05 specifically comprises:
Step 5.1, merging the phoneme sequences of the new emotion voice library into the folder corresponding to the existing emotion voice library;
Step 5.2, obtaining the duration corresponding to each phoneme in the new emotion voice library with a forced alignment tool and merging the durations into the folder corresponding to the existing emotion voice library;
Step 5.3, extracting the emotion feature corresponding to each audio in the new emotion voice library with the emotion encoder of step 2.4, extracting the acoustic features of the audio in the new emotion voice library, including mel spectrum, pitch and energy, and merging them into the folder corresponding to the existing emotion voice library;
Step 5.4, obtaining the merged feature files of the existing emotion voice library and the new emotion voice library, and dividing them into a training set, a test set and a verification set according to a suitable ratio;
Step 5.5, constructing an emotion voice generator model, wherein the emotion voice generator model comprises a preprocessing module, a text encoder, a speaker encoder, a duration predictor, a pitch predictor, an energy predictor and a decoder;
Step 5.6, inputting the mel spectra contained in the training set to the speaker encoder in the emotion voice generator to obtain speaker features, inputting the phoneme sequences contained in the training set to the text encoder in the emotion voice generator to obtain text features, fusing the speaker features, text features and emotion features, and then inputting the fused features in parallel to the duration predictor, pitch predictor and energy predictor in the emotion voice generator to obtain the predicted phoneme duration, pitch and energy;
Step 5.7, calculating the loss between the predicted pitch, energy and phoneme duration and the actual pitch, energy and phoneme duration, and the loss between the actual mel spectrum and the mel spectrum output by the decoder in the emotion voice generator, and adjusting the emotion voice generator parameters according to the sum of these losses to obtain the optimized emotion voice generator.
6. A multi-emotion multi-speaker speech synthesis system, the system comprising:
a voice library acquisition module, used for acquiring the required emotion voice library and multi-speaker voice library;
a feature extraction module, used for converting the texts corresponding to the audio in the voice libraries into phoneme sequences and for extracting the mel spectrum, pitch, energy and phoneme duration corresponding to the audio in the voice libraries;
an emotion encoder module, used for extracting the emotion feature corresponding to a mel spectrum;
an emotion conversion module, used for transferring emotions of different categories to each speaker in the multi-speaker voice library to obtain a new emotion voice library;
a data set construction module, used for merging the files of the existing emotion voice library and the new emotion voice library, including phoneme sequences, mel spectra, pitch, energy and phoneme durations, and then dividing them into a training set, a test set and a verification set according to a reasonable ratio; and
an emotion voice generation module, used for receiving the mel spectra, emotion features and phoneme sequences in the training set and producing the target synthesized speech.
CN202410006409.8A 2024-01-03 2024-01-03 Multi-emotion multi-speaker voice synthesis method and system Active CN117496944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410006409.8A CN117496944B (en) 2024-01-03 2024-01-03 Multi-emotion multi-speaker voice synthesis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410006409.8A CN117496944B (en) 2024-01-03 2024-01-03 Multi-emotion multi-speaker voice synthesis method and system

Publications (2)

Publication Number Publication Date
CN117496944A true CN117496944A (en) 2024-02-02
CN117496944B CN117496944B (en) 2024-03-22

Family

ID=89680502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410006409.8A Active CN117496944B (en) 2024-01-03 2024-01-03 Multi-emotion multi-speaker voice synthesis method and system

Country Status (1)

Country Link
CN (1) CN117496944B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN108831435A (en) * 2018-06-06 2018-11-16 安徽继远软件有限公司 A kind of emotional speech synthesizing method based on susceptible sense speaker adaptation
CN112802448A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Speech synthesis method and system for generating new tone
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN113327580A (en) * 2021-06-01 2021-08-31 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113707125A (en) * 2021-08-30 2021-11-26 中国科学院声学研究所 Training method and device for multi-language voice synthesis model
CN113920977A (en) * 2021-09-30 2022-01-11 宿迁硅基智能科技有限公司 Speech synthesis model, model training method and speech synthesis method
CN115359775A (en) * 2022-07-05 2022-11-18 华南理工大学 End-to-end tone and emotion migration Chinese voice cloning method
CN115359778A (en) * 2022-08-23 2022-11-18 慧言科技(天津)有限公司 Confrontation and meta-learning method based on speaker emotion voice synthesis model
CN116110369A (en) * 2023-01-17 2023-05-12 鼎富新动力(北京)智能科技有限公司 Speech synthesis method and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727290A (en) * 2024-02-18 2024-03-19 厦门她趣信息技术有限公司 Speech synthesis method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN117496944B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN117496944B (en) Multi-emotion multi-speaker voice synthesis method and system
Zhou et al. Limited data emotional voice conversion leveraging text-to-speech: Two-stage sequence-to-sequence training
Choi et al. Sequence-to-sequence emotional voice conversion with strength control
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN110047462B (en) Voice synthesis method and device and electronic equipment
CN113611293B (en) Mongolian data set expansion method
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN113053357A (en) Speech synthesis method, apparatus, device and computer readable storage medium
Teye et al. Evaluation of conversational agents: understanding culture, context and environment in emotion detection
Sorin et al. Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS.
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
Shah et al. Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing
CN106297766B (en) Phoneme synthesizing method and system
CN115938338A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN102063897B (en) Sound library compression for embedded type voice synthesis system and use method thereof
CN114218936A (en) Automatic generation algorithm for high-quality comments in media field
CN117095669A (en) Emotion voice synthesis method, system, equipment and medium based on variation automatic coding
CN110047463A (en) A kind of phoneme synthesizing method, device and electronic equipment
CN114944146A (en) Voice synthesis method and device
CN113990295A (en) Video generation method and device
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium
CN115731917A (en) Voice data processing method, model training method, device and storage medium
CN113257225A (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant