CN118262696A - Singing voice synthesis model training method, singing voice synthesis method, device and storage medium

Info

Publication number: CN118262696A
Authority: CN (China)
Prior art keywords: singing voice, synthesis model, information, encoder, sample
Legal status: Pending
Application number: CN202410322582.9A
Other languages: Chinese (zh)
Inventors: 刘若澜 (Liu Ruolan), 陈梦 (Chen Meng)
Assignee (current and original): Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202410322582.9A
Publication of CN118262696A

Landscapes

  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The present application relates to a singing voice synthesis model training method, a singing voice synthesis method, a computer device, and a storage medium. The training method comprises: acquiring first singing voice waveform information of a first sample singing voice and second singing voice waveform information of a second sample singing voice, where the first sample singing voice has corresponding music score information, the second sample singing voice does not, and the number of first sample singing voices is smaller than the number of second sample singing voices; inputting the first singing voice waveform information and the music score information into a singing voice synthesis model to be trained, and training the encoder and decoder of that model with them to obtain an initial singing voice synthesis model; and inputting the second singing voice waveform information into the initial singing voice synthesis model and training its decoder with that information to obtain a trained singing voice synthesis model. This method reduces the time spent annotating singing audio with music scores and thus improves the training efficiency of the singing voice synthesis model.

Description

Singing voice synthesis model training method, singing voice synthesis method, device and storage medium
Technical Field
The present application relates to the field of audio processing technology, and in particular, to a singing voice synthesis model training method, a singing voice synthesis method, a computer device, and a storage medium.
Background
With the development of text-to-speech (TTS) synthesis, singing voice synthesis has emerged as a sub-field. It differs from conventional speech synthesis in its strict requirements on pitch and tempo: given the music score of a song and audio data of a user, it generates singing that follows the score in the user's own timbre.
In conventional approaches, user singing voice synthesis is usually implemented with a pre-trained singing voice synthesis model, which is obtained by training on a large amount of audio data carrying music score labels.
However, training the model on a large amount of score-labeled audio data entails heavy data cleaning and processing work; music score annotation of audio is also time-consuming and such data is difficult to acquire, so the training efficiency of the singing voice synthesis model is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a singing voice synthesis model training method, a singing voice synthesis method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product that can improve the efficiency of singing voice synthesis model training.
In a first aspect, the present application provides a singing voice synthesis model training method, including:
Acquiring first singing voice waveform information corresponding to a first sample singing voice and second singing voice waveform information corresponding to a second sample singing voice, where the first sample singing voice has corresponding music score information comprising phoneme information, phoneme duration information, and pitch information, the second sample singing voice has no corresponding music score information, and the number of first sample singing voices is smaller than the number of second sample singing voices;
inputting the first singing voice waveform information and the corresponding music score information into a singing voice synthesis model to be trained, and training the encoder and decoder of the singing voice synthesis model to be trained with them to obtain an initial singing voice synthesis model;
and inputting the second singing voice waveform information into the initial singing voice synthesis model, and training the decoder of the initial singing voice synthesis model with the second singing voice waveform information to obtain a trained singing voice synthesis model.
In one embodiment, training the encoder and decoder of the singing voice synthesis model to be trained with the first singing voice waveform information and the corresponding music score information to obtain an initial singing voice synthesis model includes: obtaining, through the encoder of the singing voice synthesis model to be trained, a prior distribution, a first posterior distribution, and a first timbre code corresponding to the first sample singing voice; inputting the first posterior distribution and the first timbre code into the decoder of the singing voice synthesis model to be trained to obtain first predicted waveform information of the first sample singing voice; and training the encoder and decoder of the singing voice synthesis model to be trained according to the difference between the first predicted waveform information and the first singing voice waveform information and the difference between the prior distribution and the first posterior distribution, to obtain the initial singing voice synthesis model.
In one embodiment, the encoder of the singing voice synthesis model to be trained comprises a timbre encoder, a text encoder, and a posterior encoder, and obtaining the prior distribution, first posterior distribution, and first timbre code corresponding to the first sample singing voice through the encoder includes: inputting the first singing voice waveform information into the timbre encoder to obtain the first timbre code, and obtaining a first linear spectrum corresponding to the first sample singing voice; inputting the first timbre code and the first linear spectrum into the posterior encoder to obtain the first posterior distribution; and inputting the music score information of the first sample singing voice into the text encoder to obtain the prior distribution corresponding to the first sample singing voice.
In one embodiment, inputting the music score information of the first sample singing voice into the text encoder to obtain the prior distribution corresponding to the first sample singing voice includes: inputting the phoneme information and the pitch information into the text encoder to obtain a phoneme text encoding; and performing phoneme expansion on the phoneme text encoding using the phoneme duration information to obtain the prior distribution corresponding to the first sample singing voice.
In one embodiment, training the encoder and decoder of the singing voice synthesis model to be trained according to the difference between the first predicted waveform information and the first singing voice waveform information and the difference between the prior distribution and the first posterior distribution includes: acquiring first predicted mel spectrum information corresponding to the first predicted waveform information and first actual mel spectrum information corresponding to the first singing voice waveform information; applying normalizing flow processing to the first posterior distribution to obtain a flow-processed first posterior distribution; and training the encoder and decoder based on the difference between the first predicted mel spectrum information and the first actual mel spectrum information and the difference between the prior distribution and the flow-processed first posterior distribution, to obtain the initial singing voice synthesis model.
In one embodiment, training the decoder of the initial singing voice synthesis model with the second singing voice waveform information to obtain a trained singing voice synthesis model includes: obtaining, through the encoder of the initial singing voice synthesis model, a second posterior distribution and a second timbre code corresponding to the second sample singing voice; inputting the second posterior distribution and the second timbre code into the decoder of the initial singing voice synthesis model to obtain second predicted waveform information of the second sample singing voice; and training the decoder of the initial singing voice synthesis model according to the difference between the second predicted waveform information and the second singing voice waveform information to obtain the trained singing voice synthesis model.
In one embodiment, the encoder of the initial singing voice synthesis model includes a timbre encoder and a posterior encoder, and obtaining the second posterior distribution and second timbre code corresponding to the second sample singing voice includes: inputting the second singing voice waveform information into the timbre encoder to obtain the second timbre code corresponding to the second sample singing voice, and obtaining a second linear spectrum corresponding to the second sample singing voice; and inputting the second timbre code and the second linear spectrum into the posterior encoder to obtain the second posterior distribution.
In one embodiment, training the decoder of the initial singing voice synthesis model according to the difference between the second predicted waveform information and the second singing voice waveform information includes: acquiring second predicted mel spectrum information corresponding to the second predicted waveform information and second actual mel spectrum information corresponding to the second singing voice waveform information; and training the decoder of the initial singing voice synthesis model based on the difference between the second predicted mel spectrum information and the second actual mel spectrum information to obtain the trained singing voice synthesis model.
In a second aspect, the present application also provides a singing voice synthesis method, including:
acquiring singing voice waveform information corresponding to a singing voice to be synthesized and music score information corresponding to the singing voice to be synthesized, where the music score information comprises phoneme information, phoneme duration information, and pitch information;
and inputting the singing voice waveform information and the music score information into the encoder of a trained singing voice synthesis model to obtain a posterior distribution and a timbre code corresponding to the singing voice to be synthesized, and inputting the posterior distribution and the timbre code into the decoder of the trained singing voice synthesis model to generate the singing voice to be synthesized; the singing voice synthesis model is trained by the singing voice synthesis model training method according to any embodiment of the first aspect.
In one embodiment, the encoder of the trained singing voice synthesis model comprises a text encoder and a timbre encoder, and inputting the singing voice waveform information and the music score information into the encoder to obtain the posterior distribution and timbre code corresponding to the singing voice to be synthesized includes: inputting the music score information into the text encoder to obtain a prior distribution corresponding to the singing voice to be synthesized; applying inverse normalizing flow processing to the prior distribution to obtain the posterior distribution; and inputting the singing voice waveform information into the timbre encoder to obtain the timbre code corresponding to the singing voice to be synthesized.
In one embodiment, the music score information includes the phoneme information, phoneme duration information, and pitch information of the singing voice to be synthesized, and inputting the music score information into the text encoder to obtain the prior distribution corresponding to the singing voice to be synthesized includes: inputting the phoneme information and the pitch information into the text encoder to obtain a phoneme text encoding of the singing voice to be synthesized; and performing phoneme expansion on the phoneme text encoding of the singing voice to be synthesized using the phoneme duration information, to obtain the prior distribution corresponding to the singing voice to be synthesized.
In a third aspect, the present application also provides a singing voice synthesis model training device, including:
a sample singing voice acquisition module, configured to acquire first singing voice waveform information corresponding to a first sample singing voice and second singing voice waveform information corresponding to a second sample singing voice, where the first sample singing voice has corresponding music score information comprising phoneme information, phoneme duration information, and pitch information, the second sample singing voice has no corresponding music score information, and the number of first sample singing voices is smaller than the number of second sample singing voices;
a first model training module, configured to input the first singing voice waveform information and the corresponding music score information into a singing voice synthesis model to be trained, and to train the encoder and decoder of the singing voice synthesis model to be trained with them to obtain an initial singing voice synthesis model;
and a second model training module, configured to input the second singing voice waveform information into the initial singing voice synthesis model, and to train the decoder of the initial singing voice synthesis model with the second singing voice waveform information to obtain a trained singing voice synthesis model.
In a fourth aspect, the present application also provides a singing voice synthesizing apparatus, including:
a waveform and music score acquisition module, configured to acquire singing voice waveform information corresponding to a singing voice to be synthesized and music score information corresponding to the singing voice to be synthesized;
and a singing voice synthesis module, configured to input the singing voice waveform information and the music score information into the encoder of a trained singing voice synthesis model to obtain a posterior distribution and a timbre code corresponding to the singing voice to be synthesized, and to input the posterior distribution and the timbre code into the decoder of the trained singing voice synthesis model to generate the singing voice to be synthesized; the singing voice synthesis model is trained by the singing voice synthesis model training method according to any embodiment of the first aspect.
In a fifth aspect, the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the singing voice synthesis model training method according to any embodiment of the first aspect or the singing voice synthesis method according to any embodiment of the second aspect.
In a sixth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the singing voice synthesis model training method according to any embodiment of the first aspect or the singing voice synthesis method according to any embodiment of the second aspect.
In a seventh aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the singing voice synthesis model training method according to any embodiment of the first aspect or the singing voice synthesis method according to any embodiment of the second aspect.
With the above singing voice synthesis model training method, singing voice synthesis method, apparatus, computer device, storage medium, and computer program product, a first sample singing voice carrying corresponding music score information and a second sample singing voice without corresponding music score information can both be obtained. The encoder and decoder of the singing voice synthesis model are first trained with the first singing voice waveform information and music score information of the small number of first sample singing voices to obtain an initial singing voice synthesis model, and the decoder of the initial singing voice synthesis model is then trained with the second singing voice waveform information of the large number of second sample singing voices to obtain the trained singing voice synthesis model. Because only a small amount of audio needs music score annotation, the time spent on annotation is reduced and the training efficiency of the singing voice synthesis model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application and the technical solutions in the related art, the drawings required by the embodiments and the related descriptions are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow diagram of a singing voice synthesis model training method in one embodiment;
FIG. 2 is a flow diagram of an embodiment of obtaining an initial singing voice synthesis model;
FIG. 3 is a flow chart of obtaining a priori distribution, posterior distribution, and timbre coding in one embodiment;
FIG. 4 is a flow chart of an embodiment of obtaining an initial singing voice synthesis model;
FIG. 5 is a flow chart of obtaining a trained singing voice synthesis model in one embodiment;
FIG. 6 is a flow chart of a singing voice synthesizing method in one embodiment;
FIG. 7 is a block diagram of a singing voice synthesis basic model training system in one embodiment;
FIG. 8 is a block diagram of an audio training system in one embodiment;
FIG. 9 is a system block diagram of the singing voice synthesis stage in one embodiment;
FIG. 10 is a block diagram showing the structure of a singing voice synthesis model training apparatus in one embodiment;
FIG. 11 is a block diagram showing the structure of a singing voice synthesizing apparatus in one embodiment;
Fig. 12 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in FIG. 1, a singing voice synthesis model training method is provided. This embodiment is illustrated by applying the method to a server; it is understood that the method may also be applied to a terminal, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
Step S101: obtain first singing voice waveform information corresponding to a first sample singing voice and second singing voice waveform information corresponding to a second sample singing voice; the first sample singing voice has corresponding music score information comprising phoneme information, phoneme duration information, and pitch information, the second sample singing voice has no corresponding music score information, and the number of first sample singing voices is smaller than the number of second sample singing voices.
Here, the first sample singing voice is singing audio used for training the singing voice synthesis model that has been annotated in advance with corresponding music score information, and the second sample singing voice is singing audio used for training that carries no music score annotation. The music score information consists of the phonemes, the phoneme durations (i.e., the number of frames each phoneme spans), and the pitch (i.e., the fundamental frequency), and the number of first sample singing voices is much smaller than the number of second sample singing voices. Specifically, when training the singing voice synthesis model, sample singing voices can first be collected; a small portion can be annotated with music score information and used as the first sample singing voices, while the remaining majority serve as the second sample singing voices. The singing voice waveforms of both are then extracted, yielding the first singing voice waveform information and the second singing voice waveform information respectively.
Step S102: input the first singing voice waveform information and the corresponding music score information into a singing voice synthesis model to be trained, and train the encoder and decoder of the singing voice synthesis model to be trained with them to obtain an initial singing voice synthesis model.
The singing voice synthesis model to be trained consists of an encoder and a decoder, and the initial singing voice synthesis model is the model obtained by training on the small number of first sample singing voices, which are used to update both the encoder parameters and the decoder parameters. Specifically, the server inputs the first singing voice waveform information and the music score information corresponding to the first sample singing voices into the singing voice synthesis model to be trained, and updates the encoder and decoder parameters with them, thereby training the encoder and decoder and obtaining the initial singing voice synthesis model.
Step S103: input the second singing voice waveform information into the initial singing voice synthesis model, and train the decoder of the initial singing voice synthesis model with the second singing voice waveform information to obtain a trained singing voice synthesis model.
The trained singing voice synthesis model is the neural network model finally used for singing voice synthesis. It is obtained by further training the initial singing voice synthesis model from step S102 on the large number of second sample singing voices that carry no music score information. In this further training stage, the encoder parameters of the initial singing voice synthesis model are fixed and only its decoder is trained. Specifically, after obtaining the initial singing voice synthesis model, the server fixes its encoder parameters, inputs the second singing voice waveform information corresponding to the large number of second sample singing voices into the model, and updates the decoder parameters with that information, thereby training the decoder and finally obtaining the trained singing voice synthesis model.
In the above singing voice synthesis model training method, a first sample singing voice carrying corresponding music score information and a second sample singing voice without corresponding music score information are both obtained. The encoder and decoder of the singing voice synthesis model are trained with the first singing voice waveform information and music score information of the small number of first sample singing voices to obtain an initial singing voice synthesis model, and the decoder of the initial singing voice synthesis model is then trained with the second singing voice waveform information of the large number of second sample singing voices to obtain the trained singing voice synthesis model. Only a small amount of audio therefore requires music score annotation, which reduces annotation time and improves training efficiency.
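To make the two-stage procedure concrete, the following is a minimal PyTorch-style sketch, not the patent's implementation; the model structure (encoder, decoder attributes), the loss helpers, and the data loaders are all illustrative assumptions.

```python
import torch

# Sketch of the two-stage schedule; model, stage1_loss, and stage2_loss
# are hypothetical stand-ins, not components published by the patent.

def train_two_stage(model, labeled_loader, unlabeled_loader, stage1_loss, stage2_loss):
    # Stage 1: the small score-labeled set trains encoder and decoder jointly.
    opt1 = torch.optim.Adam(model.parameters(), lr=2e-4)
    for waveform, score in labeled_loader:   # score: phonemes, durations, pitch
        loss = stage1_loss(model, waveform, score)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze the encoder; the large unlabeled set trains the decoder only.
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(model.decoder.parameters(), lr=2e-4)
    for waveform in unlabeled_loader:        # no music score annotation needed
        loss = stage2_loss(model, waveform)
        opt2.zero_grad(); loss.backward(); opt2.step()
    return model
```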
In an exemplary embodiment, as shown in FIG. 2, step S102 may further include:
Step S201: obtain a prior distribution, a first posterior distribution, and a first timbre code corresponding to the first sample singing voice through the encoder of the singing voice synthesis model to be trained.
The prior distribution is the prior probability distribution of the first sample singing voice, obtained by having the encoder of the singing voice synthesis model to be trained process the music score information corresponding to the first sample singing voice. The first posterior distribution is the posterior probability distribution corresponding to the first sample singing voice, obtained by having the encoder process the first singing voice waveform information. The first timbre code is the timbre code corresponding to the first sample singing voice, for example the timbre code of its singer, and is likewise obtained by having the encoder process the first singing voice waveform information.
Specifically, after the server inputs the first singing voice waveform information and the music score information corresponding to the first sample singing voice into the singing voice synthesis model to be trained, the encoder of the model outputs the corresponding prior distribution, posterior distribution, and timbre code.
Step S202: input the first posterior distribution and the first timbre code into the decoder of the singing voice synthesis model to be trained to obtain first predicted waveform information of the first sample singing voice.
The first predicted waveform information is the singing voice waveform information of the first sample singing voice as predicted by the singing voice synthesis model to be trained; the decoder produces it from the first posterior distribution and the first timbre code. Specifically, the server inputs the first posterior distribution and the first timbre code into the decoder, which outputs the first predicted waveform information of the first sample singing voice.
Step S203: train the encoder and decoder of the singing voice synthesis model to be trained according to the difference between the first predicted waveform information and the first singing voice waveform information and the difference between the prior distribution and the first posterior distribution, to obtain the initial singing voice synthesis model.
After obtaining the first predicted waveform information, the server trains the encoder and decoder of the singing voice synthesis model to be trained (that is, updates the encoder and decoder parameters) using the difference between the first predicted waveform information output by the decoder and the first singing voice waveform information corresponding to the first sample singing voice, together with the difference between the prior distribution and the first posterior distribution, thereby obtaining the initial singing voice synthesis model.
In this embodiment, the encoder is used to obtain the prior distribution, first posterior distribution, and first timbre code corresponding to the first sample singing voice, and the decoder then produces the first predicted waveform information; the model is trained on the difference between the first predicted waveform information and the first singing voice waveform information and the difference between the prior and posterior distributions, which further improves the accuracy of the resulting initial singing voice synthesis model.
Further, the encoder of the singing voice synthesis model to be trained comprises a timbre encoder, a text encoder, and a posterior encoder. As shown in FIG. 3, step S201 may further include:
Step S301: input the first singing voice waveform information into the timbre encoder to obtain the first timbre code, and obtain a first linear spectrum corresponding to the first sample singing voice.
In this embodiment, the encoder of the singing voice synthesis model comprises three components: a timbre encoder, a text encoder, and a posterior encoder. By inputting the first singing voice waveform information into the timbre encoder, the server obtains the timbre code corresponding to that waveform, i.e., the first timbre code. The server also obtains the linear spectrum corresponding to the first sample singing voice, i.e., the first linear spectrum; the linear spectrum may be extracted from the first singing voice waveform information inside the singing voice synthesis model, or extracted in advance and then input into the singing voice synthesis model to be trained.
Step S302: input the first timbre code and the first linear spectrum into the posterior encoder to obtain the first posterior distribution.
The posterior encoder is the encoder that generates the posterior distribution. After obtaining the first timbre code and the first linear spectrum, the server inputs them into the posterior encoder, which outputs the corresponding first posterior distribution.
Step S303: input the music score information of the first sample singing voice into the text encoder to obtain the prior distribution corresponding to the first sample singing voice.
The text encoder outputs the prior distribution of the first sample singing voice: after obtaining the music score information, the server inputs it into the text encoder, which outputs the prior distribution.
In this embodiment, the encoder of the singing voice synthesis model comprises a timbre encoder for obtaining the timbre code, a text encoder for obtaining the prior distribution, and a posterior encoder for obtaining the posterior distribution.
Further, step S303 may include: inputting the phoneme information and the pitch information into the text encoder to obtain a phoneme text encoding; and performing phoneme expansion on the phoneme text encoding using the phoneme duration information to obtain the prior distribution corresponding to the first sample singing voice.
The phoneme text encoding is the output of the text encoder. After obtaining the music score information of the first sample singing voice, the server inputs the phoneme information and the pitch information into the text encoder, which outputs the corresponding phoneme text encoding. The phoneme duration information is then used to expand each phoneme encoding over its frames, yielding the prior distribution of the first sample singing voice.
In this embodiment, the prior distribution is obtained by inputting the phoneme and pitch information from the music score information of the first sample singing voice into the text encoder and then expanding the text encoder's output with the phoneme duration information.
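As a concrete illustration of phoneme expansion, the sketch below repeats each phoneme-level encoding for the number of frames its duration specifies, giving a frame-level sequence; the dimensions used are illustrative assumptions.

```python
import torch

def expand_phonemes(phoneme_encoding: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # phoneme_encoding: (num_phonemes, dim); durations: (num_phonemes,) frame counts.
    # Each phoneme's encoding is repeated over the frames it spans.
    return torch.repeat_interleave(phoneme_encoding, durations, dim=0)

enc = torch.randn(3, 192)           # 3 phonemes, 192-dim text encodings (illustrative)
dur = torch.tensor([2, 4, 3])       # frames per phoneme
frames = expand_phonemes(enc, dur)  # shape (9, 192): a frame-level sequence
```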
In addition, as shown in FIG. 4, step S203 may further include:
Step S401: obtain first predicted mel spectrum information corresponding to the first predicted waveform information, and first actual mel spectrum information corresponding to the first singing voice waveform information.
The first predicted mel spectrum information is the mel spectrum of the first predicted waveform information, and the first actual mel spectrum information is the mel spectrum of the first singing voice waveform information. After obtaining the two waveforms, the server extracts a mel spectrum from each, yielding the first predicted mel spectrum information and the first actual mel spectrum information respectively.
Step S402: apply normalizing flow processing to the first posterior distribution to obtain a flow-processed first posterior distribution.
Normalizing flow processing transforms a simple distribution (such as a Gaussian) into an arbitrarily complex one. After the first posterior distribution is obtained, normalizing flow processing is applied to it to obtain the processed first posterior distribution.
Step S403: train the encoder and decoder of the singing voice synthesis model to be trained based on the difference between the first predicted mel spectrum information and the first actual mel spectrum information and the difference between the prior distribution and the flow-processed first posterior distribution, to obtain the initial singing voice synthesis model.
Finally, the server constructs a loss function from these differences to train the encoder and decoder, for example the L2 loss between the first predicted mel spectrum information and the first actual mel spectrum information plus the KL divergence loss between the prior distribution and the flow-processed posterior distribution, and thereby obtains the initial singing voice synthesis model.
In this embodiment, the server trains the encoder and decoder using the difference between the mel spectra of the waveforms and the difference between the prior distribution and the flow-processed first posterior distribution, which further improves the accuracy of the resulting initial singing voice synthesis model.
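A loss of this shape could be assembled as follows, assuming the prior and flow-processed posterior are diagonal Gaussians and mel_fn extracts a mel spectrum from a waveform; this is a sketch of the described objective, not the patent's exact formulation.

```python
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def stage1_loss(pred_wave, real_wave, mel_fn, prior_mu, prior_sigma, post_mu, post_sigma):
    # L2 loss between mel spectra extracted from predicted and real waveforms.
    mel_loss = F.mse_loss(mel_fn(pred_wave), mel_fn(real_wave))
    # KL divergence between the flow-processed posterior and the prior.
    kl = kl_divergence(Normal(post_mu, post_sigma), Normal(prior_mu, prior_sigma)).mean()
    return mel_loss + kl
```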
In one embodiment, as shown in FIG. 5, step S103 may further include:
Step S501: obtain a second posterior distribution and a second timbre code corresponding to the second sample singing voice through the encoder of the initial singing voice synthesis model.
The second posterior distribution is the posterior probability distribution corresponding to the second sample singing voice, obtained by having the encoder of the initial singing voice synthesis model process the second singing voice waveform information; the second timbre code is the timbre code corresponding to the second sample singing voice, for example the timbre code of its singer, likewise obtained by having the encoder process the second singing voice waveform information.
Specifically, analogously to the first posterior distribution and first timbre code, after the initial singing voice synthesis model is obtained, the second singing voice waveform information is input into the initial singing voice synthesis model, and its encoder outputs the corresponding posterior distribution and timbre code.
Step S502: input the second posterior distribution and the second timbre code into the decoder of the initial singing voice synthesis model to obtain second predicted waveform information of the second sample singing voice.
The second predicted waveform information is the singing voice waveform information of the second sample singing voice as predicted by the initial singing voice synthesis model; the decoder produces it from the second posterior distribution and the second timbre code. Specifically, the server inputs the second posterior distribution and the second timbre code into the decoder, which outputs the second predicted waveform information.
Step S503: train the decoder of the initial singing voice synthesis model according to the difference between the second predicted waveform information and the second singing voice waveform information, to obtain the trained singing voice synthesis model.
After obtaining the second predicted waveform information, the server further trains the decoder of the initial singing voice synthesis model: the encoder parameters are fixed, and only the decoder parameters are updated using the difference between the second predicted waveform information output by the decoder and the second singing voice waveform information corresponding to the second sample singing voice, thereby obtaining the trained singing voice synthesis model.
In this embodiment, further training the decoder on the second sample singing voices improves the timbre generalization capability of the model; and because no music score information is needed at this training stage, training can be completed simply by collecting singing audio, without any music score annotation.
In one embodiment, the encoder of the initial singing voice synthesis model includes a timbre encoder and a posterior encoder, and step S501 may further include: inputting the second singing voice waveform information into the timbre encoder to obtain the second timbre code corresponding to the second sample singing voice, and obtaining a second linear spectrum corresponding to the second sample singing voice; and inputting the second timbre code and the second linear spectrum into the posterior encoder to obtain the second posterior distribution.
In this embodiment, the encoder of the initial singing voice synthesis model comprises a timbre encoder for obtaining the timbre code and a posterior encoder for generating the posterior distribution. By inputting the second singing voice waveform information into the timbre encoder, the server obtains the timbre code corresponding to that waveform, i.e., the second timbre code. The server also obtains the second linear spectrum corresponding to the second sample singing voice; as with the first linear spectrum, it may be extracted from the second singing voice waveform information inside the singing voice synthesis model, or extracted in advance and then input into the initial singing voice synthesis model. After the second timbre code and the second linear spectrum are obtained, they are input into the posterior encoder to obtain the second posterior distribution.
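A posterior encoder of this kind can be sketched as a network that conditions the linear spectrum on the timbre code and outputs the mean and log-variance of a Gaussian; all layer sizes here are illustrative assumptions, and real systems typically use convolutional stacks rather than this small MLP.

```python
import torch
import torch.nn as nn

class PosteriorEncoder(nn.Module):
    # Maps (linear spectrum, timbre code) to a sampled Gaussian posterior.
    def __init__(self, spec_dim=513, timbre_dim=256, hidden=192, latent=192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spec_dim + timbre_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent),   # mean and log-variance
        )

    def forward(self, linear_spec, timbre_code):
        # linear_spec: (frames, spec_dim); timbre_code: (timbre_dim,), broadcast per frame.
        cond = timbre_code.expand(linear_spec.size(0), -1)
        mu, logvar = self.net(torch.cat([linear_spec, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar
```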
Further, step S503 may include: acquiring second predicted mel spectrum information corresponding to the second predicted waveform information and second actual mel spectrum information corresponding to the second singing voice waveform information; and training the decoder of the initial singing voice synthesis model based on the difference between the second predicted mel spectrum information and the second actual mel spectrum information, to obtain the trained singing voice synthesis model.
The second predicted mel spectrum information is the mel spectrum of the second predicted waveform information, and the second actual mel spectrum information is the mel spectrum of the second singing voice waveform information. After obtaining the two waveforms, the server extracts a mel spectrum from each, then constructs a loss function from their difference, for example the L2 loss between the second predicted mel spectrum information and the second actual mel spectrum information, to train the decoder and obtain the trained singing voice synthesis model.
In this embodiment, the server trains the decoder using the difference between the mel spectra of the waveforms, which further improves the accuracy of the resulting trained singing voice synthesis model.
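Expanding the stage-2 step from the earlier sketch, the decoder-only training loop might look as follows; encoder.posterior_path and mel_fn are assumed names for the frozen posterior/timbre pathway and the mel extraction, not identifiers from the patent.

```python
import torch
import torch.nn.functional as F

def train_decoder_stage2(model, unlabeled_loader, mel_fn, lr=2e-4):
    # Encoder parameters are fixed; only the decoder is updated.
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(model.decoder.parameters(), lr=lr)
    for waveform in unlabeled_loader:
        with torch.no_grad():                 # frozen posterior/timbre pathway
            z, timbre = model.encoder.posterior_path(waveform)
        pred_wave = model.decoder(z, timbre)
        # L2 loss between the mel spectra of the predicted and real waveforms.
        loss = F.mse_loss(mel_fn(pred_wave), mel_fn(waveform))
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```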
In one embodiment, as shown in FIG. 6, a singing voice synthesis method is further provided. This embodiment is illustrated by applying the method to a server; it is understood that the method may also be applied to a terminal, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
Step S601: obtain singing voice waveform information corresponding to a singing voice to be synthesized and music score information corresponding to the singing voice to be synthesized.
The singing voice to be synthesized is the singing voice a user wants generated through singing voice synthesis, and the corresponding singing voice waveform information and music score information are the waveform and score data used to synthesize it. Specifically, when the user needs a singing voice synthesized, the singing voice waveform information and music score information for synthesizing it are first provided to the server.
Step S602: input the singing voice waveform information and the music score information into the encoder of a trained singing voice synthesis model to obtain a posterior distribution and a timbre code corresponding to the singing voice to be synthesized; input the posterior distribution and the timbre code into the decoder of the trained singing voice synthesis model to generate the singing voice to be synthesized. The singing voice synthesis model is trained by the singing voice synthesis model training method of any embodiment above.
The server inputs the singing voice waveform information and music score information corresponding to the singing voice to be synthesized into the trained singing voice synthesis model, which outputs the synthesized singing voice. The model consists of an encoder and a decoder: the encoder generates the posterior distribution and the timbre code, and the decoder generates the singing voice to be synthesized from them. The server inputs the singing voice waveform information and music score information into the encoder of the trained model to obtain the posterior distribution and timbre code, then inputs these into the decoder, which generates the singing voice to be synthesized.
Because the singing voice synthesis model is trained on a small amount of score-annotated sample singing and a large amount of unannotated sample singing, the time spent on music score annotation is reduced while an accurate synthesized singing voice is still obtained, improving the training efficiency of the singing voice synthesis model.
In the above singing voice synthesis method, the singing voice waveform information and the music score information corresponding to the singing voice to be synthesized are input into the trained singing voice synthesis model, and the singing voice to be synthesized is generated by the model.
Further, the encoder of the trained singing voice synthesis model comprises a text encoder and a timbre encoder, and inputting the singing voice waveform information and the music score information into the encoder to obtain the posterior distribution and timbre code corresponding to the singing voice to be synthesized further includes: inputting the music score information into the text encoder to obtain a prior distribution corresponding to the singing voice to be synthesized; applying inverse normalizing flow processing to the prior distribution to obtain the posterior distribution; and inputting the singing voice waveform information into the timbre encoder to obtain the timbre code corresponding to the singing voice to be synthesized.
In this embodiment, the encoder used to generate the singing voice to be synthesized consists of a text encoder and a timbre encoder: the text encoder generates the prior probability distribution of the singing voice to be synthesized from the music score information, and the timbre encoder generates its timbre code. Inverse normalizing flow processing is the inverse of the normalizing flow processing used during training.
Specifically, the server inputs the music score information into the text encoder, which outputs the prior distribution of the singing voice to be synthesized; inverse normalizing flow processing is then applied to the prior distribution to obtain the posterior distribution. The server also inputs the singing voice waveform information of the singing voice to be synthesized into the timbre encoder to obtain the corresponding timbre code.
In this embodiment, the posterior distribution is obtained by inputting the music score information into the text encoder and applying inverse normalizing flow processing to the prior distribution, and the timbre code is obtained by inputting the singing voice waveform information into the timbre encoder, which further improves the accuracy of the posterior distribution and the timbre code.
Further, the music score information may include: phoneme information, phoneme duration information and pitch information of singing voice to be synthesized; inputting the music spectrum information into a text encoder to obtain a priori distribution corresponding to singing voice to be synthesized, and further comprising: inputting the phoneme information and the pitch information into a text encoder to obtain a phoneme text code of singing voice to be synthesized; and carrying out phoneme expansion on the phoneme text codes of the singing voice to be synthesized by utilizing the phoneme duration information to obtain the prior distribution corresponding to the singing voice to be synthesized.
Phoneme text encoding refers to the result output by the text encoder, in this embodiment, the music score information may be composed of the following elements: the phonemes, the duration of the phonemes, i.e. the number of frames the phonemes contain, and the pitch, i.e. the base frequency. After obtaining the music score information of the singing voice of the first sample, the server can input the phoneme information and the pitch information into a text encoder, and the text encoder outputs the corresponding phoneme text code. And then, phoneme expansion can be carried out on the phonemes by utilizing the phoneme duration information, so that the priori distribution of singing voice to be synthesized is obtained.
In this embodiment, the prior distribution may be obtained by inputting the phoneme information and the pitch information in the music spectrum information of the singing voice to be synthesized into the text encoder, and then performing phoneme expansion on the output result of the text encoder by using the phoneme duration information in the music spectrum information.
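Phoneme expansion simply repeats each phoneme-level vector for its frame count. The sketch below assumes PyTorch tensors; the shapes and the channel count are illustrative, not values from the patent.

    import torch

    def expand_by_duration(phoneme_code, durations):
        # phoneme_code: (num_phonemes, channels) phoneme-level text codes.
        # durations:    (num_phonemes,) frame count of each phoneme.
        # Returns frame-level codes of shape (sum(durations), channels).
        return torch.repeat_interleave(phoneme_code, durations, dim=0)

    code = torch.randn(3, 192)                  # e.g. 3 phonemes, 192 channels
    dur = torch.tensor([4, 2, 6])               # frames per phoneme, from the score
    frame_code = expand_by_duration(code, dur)  # shape (12, 192)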
In one embodiment, there is also provided a singing voice synthesis method that can directly generate singing voice with a target timbre from audio carrying that timbre. To realize this, the training of the singing voice synthesis model is divided into two stages, basic model training and timbre training, and the parameters updated in the first and second stages are used for synthesizing the singing voice.
FIG. 7 is an overall block diagram of the singing voice synthesis system for basic model training. The system incorporates pitch and phoneme duration information to enable generation of singing voice at a specified pitch, and a timbre encoder extracts timbre codes from the singing waveforms as a speaker representation that participates in the modeling. It contains a text encoder, a timbre encoder, a posterior encoder, a decoder and a normalizing flow. The loss function of the model mainly comprises the decoder loss, i.e. the L2 loss between the mel spectrum extracted from the predicted singing waveform and the mel spectrum extracted from the real waveform, and the KL loss between Z after normalizing-flow processing and Zp. The training data include singing voice data from multiple speakers with the corresponding phoneme sequences, duration information and pitch.
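For orientation, these five components can be pictured as sub-modules of one model object. The following PyTorch skeleton is purely illustrative; the attribute names are assumptions reused by the later sketches in this section, not the patent's actual code.

    import torch.nn as nn

    class SingingVoiceSynthesisModel(nn.Module):
        # Illustrative container for the five components named above;
        # each argument is assumed to be an nn.Module supplied by the caller.
        def __init__(self, text_encoder, timbre_encoder,
                     posterior_encoder, flow, decoder):
            super().__init__()
            self.text_encoder = text_encoder            # score -> phoneme text code
            self.timbre_encoder = timbre_encoder        # waveform -> timbre code
            self.posterior_encoder = posterior_encoder  # linear spectrum (+ timbre) -> Z
            self.flow = flow                            # normalizing flow between Z and Zp
            self.decoder = decoder                      # (Z, timbre code) -> waveform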
After the first-stage training is completed, audio data from as many speakers as possible should be fed to the model to improve its timbre generalization. The timbre training stage uses only the left half of the model shown in FIG. 7, and only the parameters of the decoder are updated; the system block diagram simplifies to that shown in FIG. 8. The training data of this stage require only audio, without phoneme, duration or other annotations, so they are not limited to high-quality singing data: separated vocals, high-quality speech and crawled speech can all be used to train the decoder. The more data the decoder sees, the easier zero-shot learning becomes.
In the synthesis stage, as shown in FIG. 9, the phonemes, pitch and durations serve as input: Zp is obtained through the text encoder and phoneme expansion, Z is obtained through the inverse normalizing flow, the singing waveform is processed by the timbre encoder to obtain the timbre code, and the timbre code and Z are processed together by the decoder updated in the second stage, so that singing voice with the target timbre is obtained.
The embodiment can be realized by the following steps:
Training phase:
(1) Data processing:
The training data are divided into two parts, corresponding to the two training stages of the model. The first-stage data contain singing voice and scores from multiple speakers. For example: ten speakers in total, each with 100 songs and the corresponding scores, where a score contains the phonemes, the phoneme durations and the pitch (fundamental frequency). Data processing includes extracting mel spectra, linear spectra and timbre codes from the audio; one way to extract the spectra is sketched below. After data preparation, each song contains: phonemes, phoneme durations, fundamental frequency, mel spectrum, linear spectrum and timbre code. The second-stage data require audio from thousands of speakers, either speech or singing; here data processing only involves extracting mel spectra, linear spectra and timbre codes from the audio.
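A possible feature-extraction routine using librosa; the sample rate, FFT size, hop length and mel-band count are common defaults assumed for illustration, not values given in the patent.

    import librosa
    import numpy as np

    def extract_spectra(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
        # Returns (mel spectrum, linear magnitude spectrum) for one audio file.
        y, _ = librosa.load(wav_path, sr=sr)
        linear = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
        mel = librosa.feature.melspectrogram(S=linear ** 2, sr=sr, n_mels=n_mels)
        return mel, linear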
(2) First-stage training of the singing voice synthesis model:
As shown in fig. 7, the text encoder, normalizing flow, posterior encoder and decoder may all follow the corresponding modules in VITS, or other acoustic models may be used. The timbre encoder may use a speaker recognition model to extract the speaker code, such as a deepspk model or another speaker recognition model. In this embodiment the text encoder is a stack of FFT blocks, the normalizing flow consists of 1×1 invertible convolutions and affine coupling layers, and the decoder is a GAN-structured waveform generator similar to HiFi-GAN. Phoneme expansion means copying each result of the text encoder according to the phoneme's duration (the number of frames the phoneme spans), extending the output from the phoneme level to the frame level. In the training stage, the phonemes and the pitch are used as inputs to the text encoder, and its output is expanded by phoneme duration to obtain the prior Zp. Meanwhile, the linear spectrum and the speaker code are used as inputs to the posterior encoder to obtain the posterior Z, and Z is passed through the normalizing flow to obtain a distribution in Zp space for computing the KL loss. Finally, the posterior Z and the speaker code are input to the decoder to obtain the predicted singing waveform. The model loss has two parts. One part is the decoder loss, mainly the L2 loss between the mel spectrum of the predicted singing voice and the mel spectrum of the real singing data. The other is the KL loss between the posterior Z after normalizing-flow processing and the prior Zp after phoneme expansion; a simplified sketch of both terms follows. The training data of this stage are singing voice data with scores.
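The two loss terms can be sketched as follows. This is a simplification under assumed tensor names and shapes: masking, the adversarial and feature-matching terms of the GAN decoder, and loss weights are omitted.

    import torch
    import torch.nn.functional as F

    def kl_loss(z_p, logs_q, mu_p, logs_p):
        # KL between the flow-mapped posterior sample z_p = flow(Z), whose
        # posterior log-stddevs are logs_q, and the phoneme-expanded prior
        # N(mu_p, exp(logs_p)); all tensors share one shape.
        kl = logs_p - logs_q - 0.5
        kl = kl + 0.5 * (z_p - mu_p) ** 2 * torch.exp(-2.0 * logs_p)
        return kl.mean()

    def stage1_loss(mel_pred, mel_real, z_p, logs_q, mu_p, logs_p):
        recon = F.mse_loss(mel_pred, mel_real)  # decoder loss: L2 on mel spectra
        return recon + kl_loss(z_p, logs_q, mu_p, logs_p)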
(3) Second-stage training of the singing voice synthesis model:
As shown in fig. 8, only the posterior encoder and the decoder are involved in training at this stage, where only the decoder parameters are updated and the posterior encoder parameters are fixed. The loss function is only the decoder loss, i.e. the mel-spectrum loss.
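A minimal sketch of this stage, assuming the model skeleton above and a loader yielding (waveform, linear spectrum, real mel spectrum) batches of unlabeled audio; mel_transform stands for mel extraction as sketched earlier, and the learning rate is an arbitrary placeholder.

    import torch
    import torch.nn.functional as F

    def train_stage2(model, loader, mel_transform, lr=2e-4):
        for p in model.posterior_encoder.parameters():
            p.requires_grad = False                 # posterior encoder stays fixed
        opt = torch.optim.AdamW(model.decoder.parameters(), lr=lr)
        for wav, linear_spec, mel_real in loader:   # plain audio, no phoneme labels
            timbre = model.timbre_encoder(wav)      # speaker/timbre code
            z = model.posterior_encoder(linear_spec, timbre)
            mel_pred = mel_transform(model.decoder(z, timbre))
            loss = F.mse_loss(mel_pred, mel_real)   # decoder (mel-spectrum) loss only
            opt.zero_grad()
            loss.backward()
            opt.step()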
Online synthesis stage:
As shown in fig. 9, in the synthesis stage the music score information is first input; the prior Zp is obtained through the text encoder and phoneme expansion, and the posterior Z is then obtained through the inverse normalizing flow. The audio recorded by the target user is passed through the timbre encoder to extract the speaker code, and the decoder then synthesizes a singing waveform with the target speaker's timbre.
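Putting the pieces together, the online synthesis path might look like the sketch below, reusing the assumed attribute names from the skeleton above; sampling noise on the prior side is folded into z_p for brevity.

    import torch

    def synthesize(model, phonemes, pitch, durations, target_wav):
        h = model.text_encoder(phonemes, pitch)             # phoneme-level text code
        z_p = torch.repeat_interleave(h, durations, dim=0)  # phoneme expansion -> prior Zp
        z = model.flow.inverse(z_p)                         # inverse normalizing flow -> Z
        timbre = model.timbre_encoder(target_wav)           # target speaker's timbre code
        return model.decoder(z, timbre)                     # waveform in the target timbre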
With the above singing voice synthesis method, the posterior encoder and the decoder can be trained from audio waveform files alone, without additional inputs such as phonemes. Therefore, after the rest of the model has been trained with a small amount of phoneme-annotated audio waveforms, the decoder can be further trained with a large amount of plain audio data carrying no phonemes. This reduces the phoneme annotation required for training and improves the training efficiency of the model.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in these flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
Based on the same inventive concept, embodiments of the application also provide a singing voice synthesis model training apparatus for implementing the above singing voice synthesis model training method, and a singing voice synthesis apparatus for implementing the above singing voice synthesis method. The implementation of these apparatuses is similar to that described in the method embodiments above, so for the specific limitations in the apparatus embodiments below, reference may be made to the limitations of the singing voice synthesis model training method or the singing voice synthesis method above, which are not repeated here.
In an exemplary embodiment, as shown in fig. 10, there is provided a singing voice synthesis model training apparatus including: a sample singing voice acquisition module 1001, a first model training module 1002, and a second model training module 1003, wherein:
A sample singing voice acquisition module 1001, configured to acquire first singing voice waveform information corresponding to a first sample singing voice and second singing voice waveform information corresponding to a second sample singing voice; the first sample singing voice corresponds to music spectrum information, the music spectrum information comprises phoneme information, phoneme duration information and pitch information, and the second sample singing voice does not correspond to the music spectrum information; the number of first sample singing sounds is smaller than the number of second sample singing sounds;
a first model training module 1002, configured to input first singing voice waveform information and music spectrum information corresponding to a first sample singing voice into a singing voice synthesis model to be trained, and train an encoder and a decoder of the singing voice synthesis model to be trained by using the first singing voice waveform information and the music spectrum information corresponding to the first sample singing voice to obtain an initial singing voice synthesis model;
The second model training module 1003 is configured to input the second singing voice waveform information into the initial singing voice synthesis model, train a decoder of the initial singing voice synthesis model using the second singing voice waveform information, and obtain a trained singing voice synthesis model.
In an exemplary embodiment, as shown in fig. 11, there is provided a singing voice synthesis apparatus including: a waveform and score acquisition module 1101 and a singing voice synthesis module 1102, wherein:

The waveform and score acquisition module 1101 is configured to acquire singing voice waveform information corresponding to the singing voice to be synthesized and music score information corresponding to the singing voice to be synthesized;

The singing voice synthesis module 1102 is configured to input the singing voice waveform information and the music score information into the encoder of the trained singing voice synthesis model to obtain the posterior distribution and timbre code corresponding to the singing voice to be synthesized, and to input the posterior distribution and the timbre code into the decoder of the trained singing voice synthesis model to generate the singing voice to be synthesized; the singing voice synthesis model is trained by the singing voice synthesis model training method according to any one of the above embodiments.
The above-described singing voice synthesis model training apparatus or the respective modules in the singing voice synthesis apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one exemplary embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 12. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing singing voice waveform information. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a singing voice synthesis model training method, or implements a singing voice synthesis method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 12 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the procedures of the above method embodiments. Any reference to memory, database or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like, but are not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features have been described; however, as long as there is no contradiction in a combination, it should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application in relative detail, but they are not therefore to be construed as limiting the scope of the application. It should be noted that several variations and improvements may be made by those skilled in the art without departing from the concept of the application, and all of these fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be subject to the appended claims.

Claims (13)

1. A singing voice synthesis model training method, the method comprising:
Acquiring first singing voice waveform information corresponding to a first sample singing voice and second singing voice waveform information corresponding to a second sample singing voice; the first sample singing voice corresponds to music spectrum information, the music spectrum information comprises phoneme information, phoneme duration information and pitch information, and the second sample singing voice does not correspond to the music spectrum information; the number of the first sample singing sounds is smaller than the number of the second sample singing sounds;
Inputting the first singing voice waveform information and the music spectrum information corresponding to the first sample singing voice into a singing voice synthesis model to be trained, and training an encoder and a decoder of the singing voice synthesis model to be trained by utilizing the first singing voice waveform information and the music spectrum information corresponding to the first sample singing voice to obtain an initial singing voice synthesis model;
And inputting the second singing voice waveform information into the initial singing voice synthesis model, and training a decoder of the initial singing voice synthesis model by utilizing the second singing voice waveform information to obtain a trained singing voice synthesis model.
2. The method of claim 1, wherein the training the encoder and decoder of the singing voice synthesis model to be trained using the first singing voice waveform information and the music spectrum information corresponding to the first sample singing voice to obtain an initial singing voice synthesis model comprises:
Obtaining a prior distribution, a first posterior distribution and a first timbre code corresponding to the first sample singing voice through an encoder of the singing voice synthesis model to be trained;

Inputting the first posterior distribution and the first timbre code into a decoder of the singing voice synthesis model to be trained to obtain first predicted waveform information of the first sample singing voice;
And training an encoder and a decoder of the singing voice synthesis model to be trained according to the difference between the first predicted waveform information and the first singing voice waveform information and the difference between the prior distribution and the first posterior distribution to obtain the initial singing voice synthesis model.
3. The method of claim 2, wherein the encoder of the singing voice synthesis model to be trained comprises: a timbre encoder, a text encoder, and a posterior encoder;
The obtaining, by the encoder of the singing voice synthesis model to be trained, a priori distribution, a first posterior distribution, and a first timbre code corresponding to the first sample singing voice includes:
Inputting the first singing voice waveform information into the timbre encoder to obtain the first timbre code, and obtaining a first linear spectrum corresponding to the first sample singing voice;

Inputting the first timbre code and the first linear spectrum into the posterior encoder to obtain the first posterior distribution;
and inputting the music spectrum information of the singing voice of the first sample into the text encoder to obtain the prior distribution corresponding to the singing voice of the first sample.
4. The method of claim 3, wherein said inputting the music spectrum information of the first sample singing voice into the text encoder to obtain a priori distribution corresponding to the first sample singing voice comprises:
Inputting the phoneme information and the pitch information into the text encoder to obtain a phoneme text code;
and carrying out phoneme expansion on the phoneme text codes by utilizing the phoneme duration information to obtain prior distribution corresponding to the singing voice of the first sample.
5. The method according to any one of claims 2 to 4, wherein training the encoder and decoder of the singing voice synthesis model to be trained based on the difference between the first predicted waveform information and the first singing voice waveform information, and the difference between the prior distribution and the first posterior distribution, to obtain an initial singing voice synthesis model, comprises:
Acquiring first predicted Mel spectrum information corresponding to the first predicted waveform information and first actual Mel spectrum information corresponding to the first singing voice waveform information;
Performing normalizing-flow processing on the first posterior distribution to obtain a flow-processed first posterior distribution;

And training an encoder and a decoder of the singing voice synthesis model to be trained based on the difference between the first predicted mel spectrum information and the first actual mel spectrum information and the difference between the prior distribution and the flow-processed first posterior distribution to obtain an initial singing voice synthesis model.
6. The method of claim 1, wherein the training the decoder of the initial singing voice synthesis model using the second singing voice waveform information to obtain a trained singing voice synthesis model comprises:
Obtaining a second posterior distribution and a second timbre code corresponding to the second sample singing voice through an encoder of the initial singing voice synthesis model;

inputting the second posterior distribution and the second timbre code into a decoder of the initial singing voice synthesis model to obtain second predicted waveform information of the second sample singing voice;
And training a decoder of the initial singing voice synthesis model according to the difference between the second predicted waveform information and the second singing voice waveform information to obtain the trained singing voice synthesis model.
7. The method of claim 6, wherein the encoder of the initial singing voice synthesis model comprises: a timbre encoder and a posterior encoder;
The obtaining, through the encoder of the initial singing voice synthesis model, a second posterior distribution and a second timbre code corresponding to the second sample singing voice comprises:
inputting the second singing voice waveform information into the timbre encoder to obtain the second timbre code corresponding to the second sample singing voice, and obtaining a second linear spectrum corresponding to the second sample singing voice;

And inputting the second timbre code and the second linear spectrum into the posterior encoder to obtain the second posterior distribution.
8. The method of claim 6 or 7, wherein said training the decoder of the initial singing voice synthesis model based on the difference between the second predicted waveform information and the second singing voice waveform information to obtain a trained singing voice synthesis model, comprises:
Acquiring second predicted Mel spectrum information corresponding to the second predicted waveform information and second actual Mel spectrum information corresponding to the second singing voice waveform information;
And training a decoder of the initial singing voice synthesis model based on the difference between the second predicted mel spectrum information and the second actual mel spectrum information to obtain a trained singing voice synthesis model.
9. A singing voice synthesizing method, characterized in that the method comprises:
acquiring singing voice waveform information corresponding to singing voice to be synthesized and music spectrum information corresponding to the singing voice to be synthesized, wherein the music spectrum information comprises phoneme information, phoneme duration information and pitch information;
Inputting the singing voice waveform information and the music spectrum information into an encoder of a trained singing voice synthesis model to obtain a posterior distribution and a timbre code corresponding to the singing voice to be synthesized, and inputting the posterior distribution and the timbre code into a decoder of the trained singing voice synthesis model to generate the singing voice to be synthesized; the singing voice synthesis model being trained by the singing voice synthesis model training method as claimed in any one of claims 1 to 8.
10. The method of claim 9, wherein the encoder of the trained singing voice synthesis model comprises a text encoder and a timbre encoder;
Inputting the singing voice waveform information and the music spectrum information into an encoder of the trained singing voice synthesis model to obtain the posterior distribution and the timbre code corresponding to the singing voice to be synthesized comprises:

inputting the music spectrum information into the text encoder to obtain a prior distribution corresponding to the singing voice to be synthesized;

performing inverse normalizing-flow processing on the prior distribution to obtain the posterior distribution;

and inputting the singing voice waveform information into the timbre encoder to obtain the timbre code corresponding to the singing voice to be synthesized.
11. The method of claim 10, wherein said inputting the music score information into the text encoder to obtain a priori distributions corresponding to the singing voice to be synthesized comprises:
inputting the phoneme information and the pitch information into the text encoder to obtain a phoneme text code of the singing voice to be synthesized;
And carrying out phoneme expansion on the phoneme text codes of the singing voice to be synthesized by utilizing the phoneme duration information to obtain prior distribution corresponding to the singing voice to be synthesized.
12. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when the computer program is executed.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 11.