CN117953855A - Training method of speech synthesis model, speech synthesis method and equipment - Google Patents

Training method of speech synthesis model, speech synthesis method and equipment

Info

Publication number
CN117953855A
Authority
CN
China
Prior art keywords
training
voice
speech
model
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410346345.6A
Other languages
Chinese (zh)
Other versions
CN117953855B (en)
Inventor
赵之源
李昱
余飞
周昌印
幺宝刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Gaishi Technology Co ltd
International Digital Economy Academy IDEA
Original Assignee
Hangzhou Gaishi Technology Co ltd
International Digital Economy Academy IDEA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Gaishi Technology Co ltd, International Digital Economy Academy IDEA filed Critical Hangzhou Gaishi Technology Co ltd
Priority to CN202410346345.6A priority Critical patent/CN117953855B/en
Publication of CN117953855A publication Critical patent/CN117953855A/en
Application granted granted Critical
Publication of CN117953855B publication Critical patent/CN117953855B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a training method for a speech synthesis model, a speech synthesis method and a device. The training method comprises: training an initial speech conversion model based on a first training speech data set to obtain a target speech conversion model; determining, based on the target speech conversion model, the first converted speech corresponding to each second training speech in a second training speech data set, and constructing training speech groups from each second training speech and its corresponding first converted speech to obtain a third training speech data set; training an initial speech reconstruction model based on each training speech group to obtain a target speech reconstruction model; and determining a speech synthesis model based on the target speech conversion model and the target speech reconstruction model. The application strengthens generalization with low-quality speech and performs reconstruction with high-quality speech, thereby reducing the amount of high-quality speech required, lowering the training cost of a high-quality zero-shot speech synthesis model, and in turn lowering the cost of zero-shot speech synthesis.

Description

Training method of speech synthesis model, speech synthesis method and equipment
Technical Field
The present application relates to the field of speech synthesis technology, and in particular, to a training method for a speech synthesis model, a speech synthesis method and a device.
Background
With the development of Text To Speech (TTS) technology, more and more fields adopt this technology to improve the user experience. For example, the answer that a voice assistant on a smart device gives to a certain question can be preset, so that when a user later asks the voice assistant that question, the answer is output in the form of speech. In the prior art, in order to improve the quality of synthesized speech, training of the speech synthesis model is usually completed with high-quality training speech, so that speech information corresponding to text information can be obtained from the trained speech synthesis model. However, high-quality training speech is costly to obtain, which increases the training cost of the speech synthesis model and thus the cost of speech synthesis.
There is thus a need for improvement in the art.
Disclosure of Invention
In view of the above shortcomings of the prior art, the application aims to provide a training method for a speech synthesis model, a speech synthesis method and a device.
In order to solve the above technical problems, a first aspect of the present application provides a training method of a speech synthesis model, where the training method of the speech synthesis model specifically includes:
Training the initial speech conversion model based on each first training speech in a first training speech data set to obtain a target speech conversion model, wherein the first training speech comprises a speaker voice and speaking content;
Determining first conversion voices corresponding to second training voices in a second training voice data set based on the target voice conversion model, and constructing a training voice group based on the second training voices and the corresponding first conversion voices to obtain a third training voice data set, wherein the voice quality of the second training voices is higher than that of the first training voices;
training an initial speech reconstruction model based on each training speech group in the third training speech data set to obtain a target speech reconstruction model;
and determining a speech synthesis model based on the target speech conversion model and the target speech reconstruction model.
The method for training a speech synthesis model, wherein the training the initial speech conversion model based on each first training speech in the first training speech data set to obtain a target speech conversion model specifically includes:
Determining a first predicted mel frequency spectrum of training data in the first training speech data set based on a first codec module in the initial speech conversion model, and optimizing model parameters of the first codec module based on the first predicted mel frequency spectrum and an original mel frequency spectrum of the speaker's voice;
When the training of the first encoding and decoding module is completed, model parameters of a first vocoder in the initial voice conversion model are optimized based on first training voices in the first training voice data set so as to obtain a target voice conversion model.
The method for training a speech synthesis model, wherein training the initial speech conversion model based on each first training speech in the first training speech data set to obtain a target speech conversion model specifically includes:
Determining a first predicted mel frequency spectrum of a first training speech in the first training speech dataset based on a first codec module in the initial speech conversion model;
inputting the first predicted mel frequency spectrum into a first vocoder in the initial voice conversion model to obtain second converted voice;
Determining a first loss function term based on the first predicted mel spectrum and the original mel spectrum of the speaker's voice, and determining a second loss function term based on the second converted speech and the first training speech;
And determining a third loss function term according to the first loss function term and the second loss function term, and training an initial voice conversion model by adopting the third loss function term to obtain a target voice conversion model.
The method for training a speech synthesis model, wherein the determining, based on a first codec module in the initial speech conversion model, a first predicted mel spectrum of a first training speech in the first training speech dataset specifically includes:
Encoding the speaker voice through a speaker encoder in the first encoding and decoding module to obtain a speaker characteristic vector;
Encoding the speaking content by a content encoder in the first encoding and decoding module to obtain a content characteristic vector;
splicing the speaker characteristic vector and the content characteristic vector to obtain a spliced vector;
and decoding the spliced vector by a decoder in the first encoding and decoding module to obtain a first predicted Mel frequency spectrum.
The training method of the speech synthesis model, wherein optimizing the model parameters of the first codec module based on the first predicted mel spectrum and the original mel spectrum of the speaker's voice specifically includes:
Determining a first loss function term according to the first predicted mel spectrum and the original mel spectrum of the speaker's voice;
And optimizing model parameters of the first coding and decoding module based on the first loss function item until a training ending condition is reached, so as to complete training of the first coding and decoding module.
The method for training a speech synthesis model, wherein optimizing model parameters of the first vocoder based on the first training speech in the first training speech data set to obtain a target speech conversion model specifically includes:
inputting training data in the first training voice data set into the first coding and decoding module, and outputting a first predicted mel frequency spectrum through the first coding and decoding module;
Inputting the first predicted mel frequency spectrum into the first vocoder, outputting a second converted voice through the first vocoder, and determining a second loss function term based on the second converted voice and the first training voice;
training the first vocoder based on the second loss function term to obtain a target speech conversion model.
The method for training the speech synthesis model, wherein the determining, based on the target speech conversion model, the first converted speech corresponding to each second training speech in the second training speech data set specifically includes:
Downsampling each second training voice in a second training voice data set to obtain downsampled voices corresponding to each second training voice, wherein the sampling rate of the downsampled voices is the same as that of the first training voice in the first training voice data set;
And inputting each downsampled voice into the target voice conversion model, and outputting a first conversion voice corresponding to each downsampled voice through the target voice conversion model.
The training method of the speech synthesis model, wherein the training the initial speech reconstruction model based on each training speech group in the third training speech data set to obtain the target speech reconstruction model specifically includes:
Extracting a second predicted Mel spectrum of the first converted voice in the training voice group based on the initial voice reconstruction model, and determining a predicted reconstructed voice based on the second predicted Mel spectrum;
And training the initial voice reconstruction model based on the predicted reconstruction voice and a second training voice in the training voice group so as to obtain a target voice reconstruction model.
The second aspect of the present application provides a speech synthesis method, using a speech synthesis model obtained by the training method of a speech synthesis model as described above, the speech synthesis method specifically comprising:
Inputting the speaking content to be synthesized and the target speaker voice corresponding to the speaking content into a target voice conversion model in the voice synthesis model, and obtaining a third conversion voice corresponding to the speaking content to be synthesized through the target voice conversion model;
and inputting the third converted voice into a target voice reconstruction model in the voice synthesis model, and outputting target synthesized voice corresponding to the speaking content to be synthesized through the target voice reconstruction model.
A third aspect of the present application provides a training device for a speech synthesis model, where the device specifically includes:
The first training module is used for training the initial voice conversion model based on each first training voice in the first training voice data set to obtain a target voice conversion model, wherein the first training voices comprise speaker voices and speaking contents;
The building module is used for determining first conversion voices corresponding to the second training voices in the second training voice data set based on the target voice conversion model, and building a training voice group based on the second training voices and the corresponding first conversion voices to obtain a third training voice data set, wherein the voice quality of the second training voices is higher than that of the first training voices;
The second training module is used for training an initial voice reconstruction model based on each training voice group in the third training voice data set so as to obtain a target voice reconstruction model;
And the determining module is used for determining a voice synthesis model based on the target voice conversion model and the target voice reconstruction model.
A fourth aspect of the application provides a computer readable storage medium storing one or more programs executable by one or more processors to implement steps in a training method of a speech synthesis model as described above and/or to implement steps in a speech synthesis method as described above.
A fifth aspect of the present application provides a terminal device, comprising: a processor and a memory;
The memory has stored thereon a computer readable program executable by the processor;
The processor, when executing the computer readable program, implements steps in a training method of a speech synthesis model as described above and/or implements steps in a speech synthesis method as described above.
The beneficial effects are that: compared with the prior art, the application provides a training method for a speech synthesis model, a speech synthesis method and a device. The training method comprises: training an initial speech conversion model based on each first training speech in a first training speech data set to obtain a target speech conversion model; determining, based on the target speech conversion model, the first converted speech corresponding to each second training speech in a second training speech data set, and constructing training speech groups from each second training speech and its corresponding first converted speech to obtain a third training speech data set; training an initial speech reconstruction model based on each training speech group in the third training speech data set to obtain a target speech reconstruction model; and determining a speech synthesis model based on the target speech conversion model and the target speech reconstruction model. In the embodiments of the application, the target speech conversion model is first trained on the low-quality first training speech data set, the target speech reconstruction model is then trained using the target speech conversion model and the high-quality second training speech data set, and the speech synthesis model is obtained from the target speech conversion model and the target speech reconstruction model. Generalization is thereby strengthened with low-quality speech and reconstruction is performed with high-quality speech, which reduces the amount of high-quality speech required, lowers the training cost of a high-quality zero-shot speech synthesis model, and in turn lowers the cost of zero-shot speech synthesis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a training method of a speech synthesis model according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a speech synthesis method according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a training device for a speech synthesis model according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide a training method for a speech synthesis model, a speech synthesis method and a device. To make the purposes, technical solutions and effects of the application clearer and more definite, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an order of execution; the execution order of the processes is determined by their functions and internal logic, and should not be construed as limiting the implementation of the embodiments of the present application.
The application will be further described by the description of embodiments with reference to the accompanying drawings.
The present embodiment provides a training method of a speech synthesis model, as shown in fig. 1, where the method includes:
s10, training the initial voice conversion model based on each first training voice in the first training voice data set to obtain a target voice conversion model.
Specifically, the first training speech data set includes a plurality of first training speeches, and each first training speech includes a speaker voice and speaking content, where the speaking content may be text data or speech data. For example, the first training speech data set is a low-quality data set sampled at a 16kHz sampling rate, each first training speech including a speaker voice and the corresponding speaking content, where the speaker voice may comprise multiple utterances.
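As a minimal sketch of how one such training sample could be represented (assuming PyTorch tensors for audio; the field names are illustrative and not specified by the application):

```python
from dataclasses import dataclass
from typing import Union
import torch

@dataclass
class FirstTrainingSpeech:
    """One sample of the low-quality (e.g. 16 kHz) first training speech data set."""
    speaker_voice: torch.Tensor                 # waveform(s) of the speaker, 16 kHz
    speaking_content: Union[str, torch.Tensor]  # text data or speech data
```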
The initial speech conversion model is an initial network model, and the target speech conversion model can be obtained by optimizing the model parameters of the initial speech conversion model. That is, the model structure of the initial speech conversion model is the same as that of the target speech conversion model; the difference is that the model parameters of the initial speech conversion model are initial parameters, whereas the model parameters of the target speech conversion model are the parameters trained on the first training speech data set, where the model parameters may be the layer parameters, weights and the like of each network layer included in the initial speech conversion model. The model structure is explained here taking the speech conversion model as an example. The initial speech conversion model may adopt an existing model such as the YourTTS speech conversion model or another TTS model; in the embodiment of the application, the YourTTS speech conversion model is adopted as the initial speech conversion model. Specifically, the initial speech conversion model includes a first codec module and a first vocoder, the first codec module is connected to the first vocoder, the first codec module is configured to extract the mel spectrum corresponding to the first training speech, and the first vocoder is configured to convert the mel spectrum into speech data. Of course, in practical applications, the initial speech conversion model may also adopt other speech conversion models, such as a TTS model.
It should be noted that the first codec module and the first vocoder in the initial speech conversion model may be trained synchronously or separately. Synchronous training refers to optimizing the model parameters of the first codec module and the first vocoder at the same time during training of the initial speech conversion model. Separate training refers to first optimizing the first codec module with the model parameters of the first vocoder frozen and, when the first codec module meets the training requirement, freezing the model parameters of the first codec module and optimizing the model parameters of the first vocoder.
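A minimal structural sketch of this two-part model is shown below, assuming PyTorch; the class and attribute names are hypothetical and the internals of the codec module and vocoder are left abstract.

```python
import torch
import torch.nn as nn

class SpeechConversionModel(nn.Module):
    """Sketch of the speech conversion model: a codec module that predicts a
    mel spectrum from the speaker voice and speaking content, followed by a
    vocoder that converts the mel spectrum into a speech waveform."""

    def __init__(self, codec: nn.Module, vocoder: nn.Module):
        super().__init__()
        self.codec = codec      # first codec module (speaker/content encoders + decoder)
        self.vocoder = vocoder  # first vocoder (mel spectrum -> waveform)

    def forward(self, speaker_wav: torch.Tensor, content):
        mel = self.codec(speaker_wav, content)  # first predicted mel spectrum
        wav = self.vocoder(mel)                 # converted speech waveform
        return mel, wav
```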
In one implementation of the embodiment of the application, the first codec module and the first vocoder in the initial speech conversion model are trained separately. Correspondingly, training the initial speech conversion model based on each first training speech in the first training speech data set to obtain the target speech conversion model specifically includes:
S11, determining a first predicted Mel frequency spectrum of training data in the first training voice data set based on a first encoding and decoding module in the initial voice conversion model, and optimizing model parameters of the first encoding and decoding module based on the first predicted Mel frequency spectrum and an original Mel frequency spectrum of the speaker voice;
and S12, when the training of the first encoding and decoding module is completed, optimizing model parameters of a first vocoder in the initial voice conversion model based on first training voice in the first training voice data set so as to obtain a target voice conversion model.
Specifically, in step S11, the input item of the first codec module is the first training speech, and the output item is the first predicted mel spectrum. The first training speech comprises a speaker voice and speaking content; correspondingly, the input item of the first codec module comprises the speaker voice and the speaking content, and the first codec module obtains the first predicted mel spectrum by encoding and decoding the speaker voice and the speaking content. The model parameters of the first vocoder may be frozen while training the first codec module.
The first codec module includes an encoder and a decoder, and since the input includes a speaker voice and a speaking content, the encoder includes a speaker encoder for encoding the speaker voice and a content encoder for encoding the speaking content, and the decoder is for decoding features encoded by the speaker encoder and the content encoder. Based on this, the determining, based on the first codec module, a first predicted mel spectrum of a first training speech in the first training speech data set specifically includes:
s111, encoding the speaker sound through a speaker encoder in the first encoding and decoding module so as to obtain a speaker characteristic vector;
s112, coding the speaking content through a content coder in the first coding and decoding module to obtain a content characteristic vector;
S113, splicing the speaker feature vector and the content feature vector to obtain a spliced vector;
s114, decoding the spliced vector through a decoder in the first encoding and decoding module to obtain a first predicted Mel frequency spectrum.
Specifically, the content encoder may be a text encoder or a speech encoder. That is, the content encoder may include a text encoder and a speech encoder, and when encoding the speaking content, the corresponding encoder is selected according to the data type of the speaking content: when the speaking content is of the text data type, the text encoder is selected as the content encoder, and when the speaking content is of the speech data type, the speech encoder is selected as the content encoder. Of course, in practical applications the speaking content may be partly text data and partly speech data, in which case the text encoder and the speech encoder are used as content encoders at the same time: the text encoder encodes the speaking content of the text data type, and the speech encoder encodes the speaking content of the speech data type.
When the speaker feature vector is extracted by the speaker encoder, the speaker encoder determines it from the sentence feature vectors corresponding to all sentences included in the speaker voice. When determining the speaker feature vector from the obtained sentence feature vectors, the average of the sentence feature vectors may be taken as the speaker feature vector, which makes it easier for the target speech conversion model to learn the overall style of the speaker; alternatively, the sentence feature vectors may be concatenated to obtain the speaker feature vector, which makes it easier for the target speech conversion model to learn the prosody of the speaker voice.
After the speaker feature vector and the content feature vector are obtained, the speaker feature vector and the content feature vector are spliced to obtain a spliced vector, wherein the splicing sequence of the speaker feature vector and the content feature vector can be the speaker feature vector-the content feature vector or the content feature vector-the speaker feature vector. In the implementation of the present application, the concatenation order is the content feature vector-speaker feature vector. Further, after the splice vector is obtained, the splice vector is used as an input to a decoder, a first predicted mel spectrum is output by the decoder, and then the first codec module is trained based on the first predicted mel spectrum and the original mel spectrum of the speaker's voice.
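A minimal sketch of steps S111-S114 is given below, assuming PyTorch and the sentence-averaging variant of the speaker encoder; the encoder and decoder modules, their interfaces and the tensor shapes are assumptions, not details specified by the application.

```python
import torch
import torch.nn as nn

class FirstCodecModule(nn.Module):
    """Sketch of steps S111-S114: encode the speaker voice and the speaking
    content separately, concatenate the two feature vectors in the order
    content vector -> speaker vector, and decode the result into the first
    predicted mel spectrum."""

    def __init__(self, speaker_encoder, text_encoder, speech_encoder, decoder):
        super().__init__()
        self.speaker_encoder = speaker_encoder
        self.text_encoder = text_encoder      # used when the content is text data
        self.speech_encoder = speech_encoder  # used when the content is speech data
        self.decoder = decoder

    def forward(self, speaker_wav: torch.Tensor, content):
        # S111: one feature vector per sentence, averaged into the speaker feature vector
        sentence_vecs = self.speaker_encoder(speaker_wav)      # (num_sentences, d)
        speaker_vec = sentence_vecs.mean(dim=0, keepdim=True)  # (1, d), overall style
        # S112: pick the encoder that matches the data type of the speaking content
        if isinstance(content, str):
            content_vec = self.text_encoder(content)           # (T, d)
        else:
            content_vec = self.speech_encoder(content)         # (T, d)
        # S113: splice in the order content feature vector - speaker feature vector
        spliced = torch.cat(
            [content_vec, speaker_vec.expand(content_vec.size(0), -1)], dim=-1)
        # S114: decode the spliced vector into the first predicted mel spectrum
        return self.decoder(spliced)
```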
In one implementation, the optimizing the model parameters of the first codec module based on the first predicted mel spectrum and the original mel spectrum of the speaker's voice specifically includes:
Determining a first loss function term according to the first predicted mel spectrum and the original mel spectrum of the speaker's voice;
And optimizing model parameters of the first coding and decoding module based on the first loss function item until a training ending condition is reached, so as to complete training of the first coding and decoding module.
Specifically, after the first predicted mel spectrum is obtained, a first loss function term may be constructed based on the first predicted mel spectrum and the original mel spectrum of the speaker voice, and the model parameters of the first codec module may be trained based on the first loss function term until the training end condition is reached, so that the first codec module learns the speech style of the speaker. The original mel spectrum may be obtained by performing a Fourier transform on the speaker voice and reflects the speech style of the speaker, where the speech style may include timbre, prosody and the like. The training end condition may be that the number of training iterations reaches a preset number, that the first loss function term satisfies a loss requirement (for example, the first loss function term is smaller than a preset loss threshold), or that either of these two conditions is met. The first loss function term may be determined using an L1 loss function and may correspondingly be expressed as:

L_mel = || M - M_hat ||_1

where L_mel denotes the first loss function term, M denotes the original mel spectrum, and M_hat denotes the first predicted mel spectrum.
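A minimal sketch of this first loss function term in PyTorch follows; the L1 form matches the expression above, which is an assumption reconstructed from the context, and the surrounding training loop is only indicated in comments.

```python
import torch
import torch.nn.functional as F

def first_loss_term(pred_mel: torch.Tensor, orig_mel: torch.Tensor) -> torch.Tensor:
    """L1 distance between the first predicted mel spectrum and the original
    mel spectrum of the speaker voice (the first loss function term)."""
    return F.l1_loss(pred_mel, orig_mel)

# usage sketch: optimize only the first codec module, with the vocoder frozen
# pred_mel = codec(speaker_wav, content)
# loss = first_loss_term(pred_mel, orig_mel)
# loss.backward(); optimizer.step()
```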
Further, in step S12, the model parameters of the first codec module are frozen when the model parameters of the first vocoder are optimized. When the model parameters of the first vocoder are optimized, the input item of the first vocoder is a first predicted mel frequency spectrum, the output item is a second converted voice, and then the model parameters of the first vocoder are optimized based on the second converted voice and the speaker voice. Specifically, in one implementation, the optimizing the model parameters of the first vocoder based on the first training speech in the first training speech data set may specifically include:
S121, inputting first training voices in the first training voice data set into the first coding and decoding module, and outputting a first predicted Mel frequency spectrum through the first coding and decoding module;
s122, inputting the first predicted Mel frequency spectrum into the first vocoder, and outputting second converted voice through the first vocoder;
and S123, optimizing model parameters of the first vocoder based on the second converted voice and the first training voice to obtain a target voice conversion model.
Specifically, when the model parameters of the first vocoder are optimized based on the second converted speech and the first training speech, a predicted speech error between the second converted speech and the first training speech may be calculated, and then the model parameters of the first vocoder are optimized based on the predicted speech error to obtain the target speech conversion model. The predicted speech error may be calculated using a least squares loss function and a feature matching loss function.
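The least-squares and feature-matching losses mentioned above are typically computed against a discriminator, as in GAN-based vocoders; the following sketch assumes such a discriminator that returns a score together with its intermediate feature maps. This interface is an assumption for illustration, not something specified by the application.

```python
import torch
import torch.nn.functional as F

def vocoder_losses(disc, real_wav: torch.Tensor, fake_wav: torch.Tensor):
    """Sketch of the predicted speech error: a least-squares adversarial loss
    plus a feature-matching loss between the discriminator's intermediate
    features on the first training speech and on the second converted speech.
    `disc` is assumed to return (score, list_of_feature_maps)."""
    real_score, real_feats = disc(real_wav)
    fake_score, fake_feats = disc(fake_wav)

    # least-squares (LSGAN) losses for discriminator and generator
    d_loss = F.mse_loss(real_score, torch.ones_like(real_score)) + \
             F.mse_loss(fake_score, torch.zeros_like(fake_score))
    g_adv = F.mse_loss(fake_score, torch.ones_like(fake_score))

    # feature-matching loss over the discriminator feature maps
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))

    g_loss = g_adv + fm  # loss used to optimize the first vocoder
    return d_loss, g_loss
```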
In another implementation, the first codec module and the first vocoder in the initial speech conversion model may be trained synchronously, or after the first codec module and the first vocoder are trained separately to obtain the target speech conversion model, the first codec module and the first vocoder in the target speech conversion model may be trained synchronously. That is, the target speech conversion model may be taken as an initial speech conversion model, and a process of synchronously training the first codec module and the first vocoder in the initial speech conversion model may be performed. The process of performing synchronous training on the first coding and decoding module and the first vocoder in the initial voice conversion model to obtain the target voice conversion model may be:
Determining a first predicted mel frequency spectrum of a first training speech in the first training speech dataset based on a first codec module in the initial speech conversion model;
inputting the first predicted mel frequency spectrum into a first vocoder to obtain second converted voice;
Determining a first loss function term based on the first predicted mel spectrum and the original mel spectrum of the speaker's voice, and determining a second loss function term based on the second converted speech and the first training speech;
And determining a third loss function term according to the first loss function term and the second loss function term, and training an initial voice conversion model by adopting the third loss function term to obtain a target voice conversion model.
Specifically, the processing procedures of the first codec module and the first vocoder are the same as those described above, and will not be described again here, but only the determination procedure of the third loss function term in the synchronization training procedure will be described here. During synchronous training of the first codec module and the first vocoder, the first training voice may be input into the first codec module to obtain a first predicted mel spectrum, then the first predicted mel spectrum is input into the first vocoder to obtain a second converted voice, then a first loss function term is determined based on the first predicted mel spectrum and the original mel spectrum, a second loss function term is determined based on the second converted voice and the speaker voice, then the first loss function term and the second loss function term are added or weighted to determine a third loss function term, and finally the third loss function term is adopted to train the initial voice conversion model to obtain a trained target voice conversion model.
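A minimal sketch of one synchronous training step is shown below; the waveform L1 term stands in for the adversarial and feature-matching losses sketched earlier, and the weighting factor `lam` is an assumed hyperparameter.

```python
import torch.nn.functional as F

def synchronous_training_step(model, optimizer, speaker_wav, content, orig_mel, lam=1.0):
    """Jointly optimize the first codec module and first vocoder with the
    third loss function term (a weighted sum of the first and second terms)."""
    pred_mel, converted = model(speaker_wav, content)  # codec + vocoder forward pass
    first_term = F.l1_loss(pred_mel, orig_mel)         # mel-spectrum loss
    second_term = F.l1_loss(converted, speaker_wav)    # converted speech vs. training speech
    loss = first_term + lam * second_term              # third loss function term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```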
S20, determining first conversion voices corresponding to the second training voices in the second training voice data set based on the target voice conversion model, and constructing a training voice group based on the second training voices and the corresponding first conversion voices so as to obtain a third training voice data set.
Specifically, the second training speech data set includes a plurality of second training speeches, and the speech quality of each second training speech is higher than that of the speaker voices in the first training speech data set. That is, the second training speech data set is a high-quality data set relative to the first training speech data set; for example, the second training speech data set is the LibriTTS-R speech data set, while the first training speech data set is sampled at a 16kHz sampling rate. In addition, since the speech quality of the second training speech is higher than that of the speaker voices in the first training speech data set, the sampling rate of the second training speech is also higher than that of the speaker voices in the first training speech data set; for example, the second training speech is speech data with a 44kHz sampling rate, while the speaker voices in the first training speech data set are speech data with a 16kHz sampling rate.
The training voice group comprises a second training voice and a first converted voice of the training voice group, wherein the first converted voice is obtained by converting the second training voice through a target voice conversion model, and the voice quality of the first converted voice is lower than that of the second training voice. Specifically, the specific determining process of determining the first converted voice corresponding to each second training voice in the second training voice data set based on the target voice conversion model may be:
Downsampling each second training voice in the second training voice data set to obtain downsampled voices corresponding to each second training voice;
And inputting each downsampled voice into the target voice conversion model, and outputting a first conversion voice corresponding to each downsampled voice through the target voice conversion model.
Specifically, the sampling rate of the second training speech is higher than the sampling rate of the speaker's voice in the first training speech, and the input term of the target speech conversion model is the speaker's voice. Thus, the second training speech needs to be downsampled so that the sample rate of the downsampled speech is the same as the sample rate of the speaker's voice in the first training speech. For example, the second training speech has a sampling rate of 44kHz and the speaker's voice in the first training speech has a sampling rate of 16kHz, then the second training speech needs to be downsampled to 16kHz, that is, the downsampled speech has a sampling rate of 16kHz.
After the downsampled speech is obtained, the downsampled speech may be used as an input item of the target speech conversion model, and the first converted speech corresponding to the downsampled speech is determined by the target speech conversion model. When determining the first converted speech corresponding to the downsampled speech through the target speech conversion model, the downsampled speech may be used simultaneously as the speaker voice and as the speaking content and fed respectively into the speaker encoder and the content encoder of the target speech conversion model; alternatively, the downsampled speech may first be converted to text, the converted text data may then be used as the speaking content and input into the content encoder of the target speech conversion model, and the downsampled speech may be used as the speaker voice and input into the speaker encoder of the target speech conversion model. In the embodiment of the application, the target speech conversion model converts the second training speech into the low-quality first converted speech while keeping the speech characteristics of the second training speech unchanged. In this way, paired low-quality and high-quality speech with identical speech characteristics is obtained, and existing high-quality speech data can be used directly for training the target speech reconstruction model, which further reduces the training cost of the target speech reconstruction model and hence the training cost of the speech synthesis model. The speech characteristics that remain unchanged may include the speaker characteristics, prosody, speaking content and the like.
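A minimal sketch of building one training speech group is shown below, assuming torchaudio for resampling and the conversion-model interface from the earlier sketch (returning a mel spectrum and a waveform); the 44 kHz and 16 kHz rates follow the example above.

```python
import torch
import torchaudio

def build_training_group(second_wav: torch.Tensor, conversion_model,
                         sr_high: int = 44_000, sr_low: int = 16_000):
    """Downsample a high-quality second training speech to the sampling rate
    of the first training speeches, convert it with the target speech
    conversion model, and pair the result with the original speech."""
    resampler = torchaudio.transforms.Resample(orig_freq=sr_high, new_freq=sr_low)
    down_wav = resampler(second_wav)                 # downsampled speech, 16 kHz
    # the downsampled speech serves as both the speaker voice and the speaking content
    _, first_converted = conversion_model(down_wav, down_wav)
    return first_converted, second_wav               # one training speech group
```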
S30, training an initial voice reconstruction model based on each training voice group in the third training voice data set to obtain a target voice reconstruction model.
Specifically, the initial speech reconstruction model is an initial network model, for example an SVC model without a speaker encoding module, and the target speech reconstruction model can be obtained by optimizing the model parameters of the initial speech reconstruction model. That is, the model structure of the initial speech reconstruction model is the same as that of the target speech reconstruction model; the difference is that the model parameters of the initial speech reconstruction model are initial parameters, whereas the model parameters of the target speech reconstruction model are the parameters trained on the third training speech data set. The target speech reconstruction model is used to reconstruct low-quality speech into high-quality speech, so that the speech quality of the first converted speech produced by the target speech conversion model can be improved by the target speech reconstruction model, and the speech quality of the synthesized speech determined based on the speech synthesis model can be improved accordingly.
In one implementation manner, the training the initial speech reconstruction model based on each training speech group in the third training speech data set to obtain the target speech reconstruction model specifically includes:
s31, extracting a second predicted Mel frequency spectrum of the first converted voice in the training voice group based on the initial voice reconstruction model, and determining a predicted reconstructed voice based on the second predicted Mel frequency spectrum;
s32, training the initial speech reconstruction model based on the predicted reconstruction speech and a second training speech in the training speech group to obtain a target speech reconstruction model.
Specifically, the initial speech reconstruction model is a neural network model constructed based on deep learning and is used to reconstruct low-quality speech into high-quality speech; it may adopt an existing neural network model, for example an SVC model without a speaker encoder. In the embodiment of the application, the initial speech reconstruction model comprises a second codec module and a second vocoder, the second codec module being connected to the second vocoder. The input item of the second codec module is the first converted speech, its output item is the second predicted mel spectrum, the input item of the second vocoder is the second predicted mel spectrum, and its output item is the predicted reconstructed speech. The sampling rate of the predicted reconstructed speech is the same as that of the second training speech.
After the predicted reconstructed voice is obtained, a loss function term can be constructed according to the predicted reconstructed voice and the second training voice, and then the initial voice reconstruction model is trained based on the constructed loss function term so as to obtain a target voice reconstruction model, wherein the loss function term can be calculated by adopting a least square loss function and a feature matching loss function.
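A minimal sketch of one training step of the speech reconstruction model follows, assuming the same two-part structure (second codec module plus second vocoder) and using a waveform L1 loss as a simplified stand-in for the least-squares and feature-matching losses named above; the shapes of the predicted and target waveforms are assumed to be aligned.

```python
import torch.nn.functional as F

def reconstruction_training_step(recon_model, optimizer, first_converted, second_wav):
    """Steps S31-S32: predict a mel spectrum from the low-quality first
    converted speech, reconstruct a waveform with the second vocoder, and
    compare it with the high-quality second training speech."""
    pred_mel = recon_model.codec(first_converted)  # second predicted mel spectrum
    pred_wav = recon_model.vocoder(pred_mel)       # predicted reconstructed speech
    loss = F.l1_loss(pred_wav, second_wav)         # simplified reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```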
S40, determining a voice synthesis model based on the target voice conversion model and the target voice reconstruction model.
Specifically, the speech synthesis model comprises the target speech conversion model and the target speech reconstruction model, and the output item of the target speech conversion model is the input item of the target speech reconstruction model. According to the application, the target speech conversion model and the target speech reconstruction model included in the speech synthesis model are trained in stages, making full use of the low-quality first training speech data set and the high-quality second training speech data set, so that the speech synthesis model can obtain high-quality synthesized speech at low cost, the generalization of the speech synthesis model is improved, and the computing-power requirement of the speech synthesis model is reduced.
In summary, the present embodiment provides a training method for a speech synthesis model, a speech synthesis method and a device. The training method comprises: training an initial speech conversion model based on each first training speech in a first training speech data set to obtain a target speech conversion model; determining, based on the target speech conversion model, the first converted speech corresponding to each second training speech in a second training speech data set, and constructing training speech groups from each second training speech and its corresponding first converted speech to obtain a third training speech data set; training an initial speech reconstruction model based on each training speech group in the third training speech data set to obtain a target speech reconstruction model; and determining the speech synthesis model based on the target speech conversion model and the target speech reconstruction model. In the embodiment of the application, the target speech conversion model is first trained on the low-quality first training speech data set, the target speech reconstruction model is then trained using the target speech conversion model and the high-quality second training speech data set, and the speech synthesis model is obtained from the target speech conversion model and the target speech reconstruction model. Generalization is thereby strengthened with low-quality speech and reconstruction is performed with high-quality speech, which reduces the amount of high-quality speech required, lowers the training cost of a high-quality zero-shot speech synthesis model, and in turn lowers the cost of zero-shot speech synthesis.
Based on the above-mentioned training method of the speech synthesis model, the present embodiment provides a speech synthesis method, as shown in fig. 2, using the speech synthesis model obtained by the training method of the speech synthesis model as described above, where the speech synthesis method specifically includes:
Inputting the speaking content to be synthesized and the target speaker voice corresponding to the speaking content into a target voice conversion model in the voice synthesis model, and obtaining a third conversion voice corresponding to the speaking content to be synthesized through the target voice conversion model;
and inputting the third converted voice into a target voice reconstruction model in the voice synthesis model, and outputting target synthesized voice corresponding to the speaking content to be synthesized through the target voice reconstruction model.
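A minimal end-to-end inference sketch is shown below, chaining the two trained models with the interfaces assumed in the earlier sketches; the function and attribute names are illustrative only.

```python
def synthesize(conversion_model, reconstruction_model, content, target_speaker_wav):
    """Speech synthesis: the target speech conversion model produces the third
    converted speech in the target speaker's voice, and the target speech
    reconstruction model upgrades it into the target synthesized speech."""
    _, third_converted = conversion_model(target_speaker_wav, content)
    pred_mel = reconstruction_model.codec(third_converted)
    target_synth = reconstruction_model.vocoder(pred_mel)
    return target_synth
```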
Based on the above training method of the speech synthesis model, the present embodiment provides a training device for the speech synthesis model, as shown in fig. 3, where the training device specifically includes:
a first training module 100, configured to train the initial speech conversion model based on each first training speech in the first training speech data set to obtain a target speech conversion model, where the first training speech includes a speaker voice and speaking content;
the construction module 200 is configured to determine a first conversion voice corresponding to each second training voice in the second training voice data set based on the target voice conversion model, and construct a training voice group based on each second training voice and the corresponding first conversion voice thereof to obtain a third training voice data set, where the voice quality of the second training voice is higher than that of the first training voice;
A second training module 300, configured to train an initial speech reconstruction model based on each training speech group in the third training speech data set, so as to obtain a target speech reconstruction model;
a determining module 400 is configured to determine a speech synthesis model based on the target speech conversion model and the target speech reconstruction model.
Based on the above-described training method of the speech synthesis model, the present embodiment provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the training method of the speech synthesis model as described in the above-described embodiment.
Based on the training method of the above-mentioned speech synthesis model, the present application also provides a terminal device, as shown in fig. 4, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, which may also include a communication interface (Communications Interface) 23 and a bus 24. Wherein the processor 20, the display 21, the memory 22 and the communication interface 23 may communicate with each other via a bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may invoke logic instructions in the memory 22 to perform the methods of the embodiments described above.
Further, the logic instructions in the memory 22 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product.
The memory 22, as a computer readable storage medium, may be configured to store a software program, a computer executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 performs functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area, which may store an operating system and at least one application program required for functions, and a data storage area, which may store data created according to the use of the terminal device, etc. In addition, the memory 22 may include high-speed random access memory and may also include nonvolatile memory. For example, various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or a transitory storage medium, may be used.
In addition, the specific processes by which the processors of the terminal device load and execute the instructions in the storage medium are described in detail in the method above and are not repeated here.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. The training method of the speech synthesis model is characterized by specifically comprising the following steps of:
Training the initial speech conversion model based on each first training speech in a first training speech data set to obtain a target speech conversion model, wherein the first training speech comprises a speaker voice and speaking content;
Determining first conversion voices corresponding to second training voices in a second training voice data set based on the target voice conversion model, and constructing a training voice group based on the second training voices and the corresponding first conversion voices to obtain a third training voice data set, wherein the voice quality of the second training voices is higher than that of the first training voices;
training an initial speech reconstruction model based on each training speech group in the third training speech data set to obtain a target speech reconstruction model;
and determining a speech synthesis model based on the target speech conversion model and the target speech reconstruction model.
2. The method for training a speech synthesis model according to claim 1, wherein the training the initial speech conversion model based on each first training speech in the first training speech data set to obtain the target speech conversion model specifically comprises:
Determining a first predicted mel frequency spectrum of a first training voice in the first training voice data set based on a first codec module in the initial voice conversion model, and optimizing model parameters of the first codec module based on the first predicted mel frequency spectrum and an original mel frequency spectrum of the speaker's voice;
When the training of the first encoding and decoding module is completed, model parameters of a first vocoder in the initial voice conversion model are optimized based on first training voices in the first training voice data set so as to obtain a target voice conversion model.
3. The method for training a speech synthesis model according to claim 1 or 2, wherein training the initial speech conversion model based on each first training speech in the first training speech data set to obtain the target speech conversion model specifically comprises:
Determining a first predicted mel frequency spectrum of a first training speech in the first training speech dataset based on a first codec module in the initial speech conversion model;
inputting the first predicted mel frequency spectrum into a first vocoder in the initial voice conversion model to obtain second converted voice;
Determining a first loss function term based on the first predicted mel spectrum and the original mel spectrum of the speaker's voice, and determining a second loss function term based on the second converted speech and the first training speech;
And determining a third loss function term according to the first loss function term and the second loss function term, and training an initial voice conversion model by adopting the third loss function term to obtain a target voice conversion model.
4. The method for training a speech synthesis model according to claim 2, wherein the determining a first predicted mel spectrum of a first training speech in the first training speech dataset based on a first codec module in the initial speech conversion model specifically comprises:
Encoding the speaker voice through a speaker encoder in the first encoding and decoding module to obtain a speaker characteristic vector;
Encoding the speaking content by a content encoder in the first encoding and decoding module to obtain a content characteristic vector;
splicing the speaker characteristic vector and the content characteristic vector to obtain a spliced vector;
and decoding the spliced vector by a decoder in the first encoding and decoding module to obtain a first predicted Mel frequency spectrum.
5. The method according to claim 2, wherein optimizing model parameters of the first codec module based on the first predicted mel spectrum and the original mel spectrum of the speaker's voice specifically comprises:
Determining a first loss function term according to the first predicted mel spectrum and the original mel spectrum of the speaker's voice;
And optimizing model parameters of the first coding and decoding module based on the first loss function item until a training ending condition is reached, so as to complete training of the first coding and decoding module.
6. The method for training a speech synthesis model according to claim 2, wherein optimizing model parameters of the first vocoder based on the first training speech in the first training speech dataset to obtain the target speech conversion model specifically comprises:
inputting training data in the first training voice data set into the first coding and decoding module, and outputting a first predicted mel frequency spectrum through the first coding and decoding module;
Inputting the first predicted mel frequency spectrum into the first vocoder, outputting a second converted voice through the first vocoder, and determining a second loss function term based on the second converted voice and the first training voice;
training the first vocoder based on the second loss function term to obtain a target speech conversion model.
7. The method for training a speech synthesis model according to claim 1, wherein determining, based on the target speech conversion model, the first converted speech corresponding to each second training speech in the second training speech data set specifically comprises:
Downsampling each second training voice in a second training voice data set to obtain downsampled voices corresponding to each second training voice, wherein the sampling rate of the downsampled voices is the same as that of the first training voice in the first training voice data set;
And inputting each downsampled voice into the target voice conversion model, and outputting a first conversion voice corresponding to each downsampled voice through the target voice conversion model.
8. The method for training a speech synthesis model according to claim 1 or 7, wherein training an initial speech reconstruction model based on each training speech group in the third training speech dataset to obtain a target speech reconstruction model specifically comprises:
Extracting a second predicted mel spectrum of the first converted voice in the training voice group based on the initial voice reconstruction model, and determining a predicted reconstructed voice based on the second predicted mel spectrum;
And training the initial voice reconstruction model based on the predicted reconstructed voice and the second training voice in the training voice group, so as to obtain the target voice reconstruction model.
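A minimal sketch of the reconstruction training of claim 8, assuming the reconstruction model internally extracts the second predicted mel spectrum and outputs a predicted reconstructed waveform; the L1 waveform loss and the loop structure are assumptions:

```python
import torch
import torch.nn.functional as F

def train_reconstruction_model(recon_model, speech_groups, epochs=50, lr=1e-4):
    optimizer = torch.optim.Adam(recon_model.parameters(), lr=lr)
    for _ in range(epochs):
        for converted_wav, hq_wav in speech_groups:   # (first converted voice, second training voice)
            pred_wav = recon_model(converted_wav)     # model extracts the second predicted mel spectrum internally
            loss = F.l1_loss(pred_wav, hq_wav)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return recon_model
```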
9. A speech synthesis method, characterized in that a speech synthesis model is obtained using the training method of a speech synthesis model according to any one of claims 1-8, the speech synthesis method specifically comprising:
Inputting the speaking content to be synthesized and the target speaker voice corresponding to the speaking content into a target voice conversion model in the voice synthesis model, and obtaining a third conversion voice corresponding to the speaking content to be synthesized through the target voice conversion model;
and inputting the third converted voice into a target voice reconstruction model in the voice synthesis model, and outputting target synthesized voice corresponding to the speaking content to be synthesized through the target voice reconstruction model.
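The two-stage inference of claim 9 reduces to conversion followed by reconstruction; in the sketch below the attribute names `conversion` and `reconstruction` on the composite model are assumptions:

```python
import torch

def synthesize(speech_synthesis_model, content_to_synthesize, target_speaker_wav):
    with torch.no_grad():
        # Stage 1: target voice conversion model produces the third converted voice
        third_converted = speech_synthesis_model.conversion(content_to_synthesize, target_speaker_wav)
        # Stage 2: target voice reconstruction model produces the target synthesized voice
        return speech_synthesis_model.reconstruction(third_converted)
```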
10. A training device for a speech synthesis model, characterized by specifically comprising:
The first training module is used for training the initial voice conversion model based on each first training voice in the first training voice data set to obtain a target voice conversion model, wherein the first training voices comprise speaker voices and speaking contents;
The building module is used for determining first conversion voices corresponding to the second training voices in the second training voice data set based on the target voice conversion model, and building a training voice group based on the second training voices and the corresponding first conversion voices to obtain a third training voice data set, wherein the voice quality of the second training voices is higher than that of the first training voices;
The second training module is used for training an initial voice reconstruction model based on each training voice group in the third training voice data set so as to obtain a target voice reconstruction model;
And the determining module is used for determining a voice synthesis model based on the target voice conversion model and the target voice reconstruction model.
11. A computer readable storage medium storing one or more programs executable by one or more processors to implement steps in a method of training a speech synthesis model according to any one of claims 1-8 and/or to implement steps in a method of speech synthesis according to claim 9.
12. A terminal device, comprising: a processor and a memory;
The memory has stored thereon a computer readable program executable by the processor;
the processor, when executing the computer readable program, implements the steps of the training method of the speech synthesis model according to any of claims 1-8 and/or the steps of the speech synthesis method according to claim 9.
CN202410346345.6A 2024-03-26 2024-03-26 Training method of speech synthesis model, speech synthesis method and equipment Active CN117953855B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410346345.6A CN117953855B (en) 2024-03-26 2024-03-26 Training method of speech synthesis model, speech synthesis method and equipment

Publications (2)

Publication Number Publication Date
CN117953855A true CN117953855A (en) 2024-04-30
CN117953855B CN117953855B (en) 2024-06-18

Family

ID=90803271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410346345.6A Active CN117953855B (en) 2024-03-26 2024-03-26 Training method of speech synthesis model, speech synthesis method and equipment

Country Status (1)

Country Link
CN (1) CN117953855B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466276A (en) * 2020-11-27 2021-03-09 出门问问(苏州)信息科技有限公司 Speech synthesis system training method and device and readable storage medium
CN113658581A (en) * 2021-08-18 2021-11-16 北京百度网讯科技有限公司 Acoustic model training method, acoustic model training device, acoustic model speech processing method, acoustic model speech processing device, acoustic model speech processing equipment and storage medium
CN114283783A (en) * 2021-12-31 2022-04-05 科大讯飞股份有限公司 Speech synthesis method, model training method, device and storage medium
CN116129852A (en) * 2022-09-02 2023-05-16 马上消费金融股份有限公司 Training method of speech synthesis model, speech synthesis method and related equipment
CN116959465A (en) * 2023-06-09 2023-10-27 平安科技(深圳)有限公司 Voice conversion model training method, voice conversion method, device and medium

Also Published As

Publication number Publication date
CN117953855B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN111081259B (en) Speech recognition model training method and system based on speaker expansion
CN112687259B (en) Speech synthesis method, device and readable storage medium
CN112530403B (en) Voice conversion method and system based on semi-parallel corpus
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN113450765B (en) Speech synthesis method, device, equipment and storage medium
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN111508470A (en) Training method and device of speech synthesis model
WO2022252904A1 (en) Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN114464162B (en) Speech synthesis method, neural network model training method, and speech synthesis model
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN117953855B (en) Training method of speech synthesis model, speech synthesis method and equipment
CN116958343A (en) Facial animation generation method, device, equipment, medium and program product
CN115828943A (en) Speech translation model modeling method and device based on speech synthesis data
CN111357049A (en) Automatic speech recognition device and method
CN111524500B (en) Speech synthesis method, apparatus, device and storage medium
CN113593534A (en) Method and apparatus for multi-accent speech recognition
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
CN111583902A (en) Speech synthesis system, method, electronic device, and medium
CN114420087B (en) Acoustic feature determination method, device, equipment, medium and product
CN116129858A (en) Speech synthesis method, training method and device of speech posterior probability generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant