CN112509600A - Model training method and device, voice conversion method and device and storage medium

Model training method and device, voice conversion method and device and storage medium

Info

Publication number
CN112509600A
CN112509600A (application number CN202011446585.1A)
Authority
CN
China
Prior art keywords
audio
frequency spectrum
network
mel frequency
label
Prior art date
Legal status
Pending
Application number
CN202011446585.1A
Other languages
Chinese (zh)
Inventor
陈闽川
马骏
王少军
肖京
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011446585.1A
Publication of CN112509600A
Priority to PCT/CN2021/084219 (WO2022121180A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing

Abstract

The application relates to the field of voice processing and discloses a method and a device for training a voice conversion model, a voice conversion method, voice conversion equipment and a storage medium. The method comprises the following steps: obtaining sample audio and converting the sample audio into a sample Mel frequency spectrum, wherein the sample audio comprises unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio and the sample Mel frequency spectrum together into a generation network to obtain an output Mel frequency spectrum; inputting the output Mel frequency spectrum into a discrimination network to obtain the type probability of the output Mel frequency spectrum and its label; and performing alternate iterative training on the generation network and the discrimination network according to the type probability and the label of the output Mel frequency spectrum, and taking the trained generation network as the voice conversion model. This reduces the requirements on audio corpus construction and the complexity of model construction.

Description

Model training method and device, voice conversion method and device and storage medium
Technical Field
The present application relates to the field of speech processing, and in particular, to a method and an apparatus for training a speech conversion model, a speech conversion method, a speech conversion device, and a storage medium.
Background
As voice conversion technology develops, its application prospects are increasingly broad; for example, it can be used for dubbing film or television works, or for generating various synthetic results in speech synthesis. Most existing voice conversion approaches use a generative adversarial network, require every audio corpus to carry a corresponding label, and, for multi-speaker voice conversion, require the speaker label of each audio to be identified, so the complexity of model construction is high.
Therefore, how to reduce the model's requirement for labeled audio corpora and reduce the complexity of model construction has become an urgent problem to be solved.
Disclosure of Invention
The application provides a training method and device of a voice conversion model, a voice conversion method and device and a storage medium, so as to reduce the requirement of the model for audio corpus construction and reduce the complexity of the model construction.
In a first aspect, the present application provides a method for training a speech conversion model, the method including:
obtaining sample audio, and converting the sample audio into a sample Mel frequency spectrum, wherein the sample audio comprises unlabeled audio and labeled audio; collecting noise audio, and inputting the noise audio and the sample Mel frequency spectrum together into a generation network to obtain an output Mel frequency spectrum, wherein the noise audio is a non-label audio; inputting the output Mel frequency spectrum into a discrimination network to obtain the type probability of the output Mel frequency spectrum and the label of the output Mel frequency spectrum; and performing alternate iterative training on the generation network and the discrimination network according to the type probability of the output Mel frequency spectrum and the label of the output Mel frequency spectrum, and taking the trained generation network as a voice conversion model to finish model training.
In a second aspect, the present application provides a method of voice conversion, the method comprising:
acquiring audio data to be converted and a target conversion label of a user; inputting the audio data to be converted and the target conversion label into a pre-trained voice conversion model to obtain converted audio data; the pre-trained voice conversion model is a generated network obtained by training by adopting the training method of the voice conversion model.
In a third aspect, the present application further provides an apparatus for training a speech conversion model, where the apparatus includes:
the system comprises a sample acquisition module, a data processing module and a data processing module, wherein the sample acquisition module is used for acquiring sample audio and converting the sample audio into a sample Mel frequency spectrum, and the sample audio comprises unlabeled audio and labeled audio; the noise acquisition module is used for acquiring a noise audio, and inputting the noise audio and the sample Mel frequency spectrum together into a generation network to obtain an output Mel frequency spectrum, wherein the noise audio is a non-label audio; the judgment output module is used for inputting the output Mel frequency spectrum into a judgment network to obtain the type probability of the output Mel frequency spectrum and the label of the output Mel frequency spectrum; and the model training module is used for performing alternate iterative training on the generation network and the discrimination network according to the type probability of the output Mel frequency spectrum and the label of the output Mel frequency spectrum, and finishing model training by taking the generated network after training as a voice conversion model.
In a fourth aspect, the present application further provides a speech conversion apparatus, including:
the data acquisition module is used for acquiring audio data to be converted and a target conversion label of a user; the audio conversion module is used for inputting the audio data to be converted and the target conversion label into a pre-trained voice conversion model to obtain converted audio data; the pre-trained voice conversion model is a generated network obtained by training by adopting the training method of the voice conversion model.
In a fifth aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the training method of the speech conversion model and the speech conversion method as described above when the computer program is executed.
In a sixth aspect, the present application further provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, causes the processor to implement the training method of the speech conversion model and the speech conversion method as described above.
The application discloses a training method and device of a voice conversion model, a voice conversion method, equipment and a storage medium, wherein sample audio comprising label audio and non-label audio is obtained, the sample audio is converted into a sample Mel frequency spectrum, then the noise audio is collected, the noise audio and the sample Mel frequency spectrum are jointly input into a generating network to obtain an output Mel frequency spectrum, the output Mel frequency spectrum is input into a judging network to obtain type probability and a label of the output Mel frequency spectrum, finally, the generating network and the judging network are alternately and iteratively trained according to the type probability and the label of the output Mel frequency spectrum, and the trained generating network is used as the voice conversion model to finish model training. The labels of the output Mel frequency spectrum are obtained by using the discrimination network, so that only a small amount of labeled audio can be used for training when the network is generated and the discrimination network is trained, the requirement on audio corpus when a speech conversion model is trained is reduced, and the complexity of model construction is reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart diagram illustrating a method for training a speech conversion model according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a speech conversion method provided in an embodiment of the present application;
FIG. 3 is a schematic block diagram of a training apparatus for a speech conversion model according to an embodiment of the present application;
fig. 4 is a schematic block diagram of a speech conversion apparatus provided in an embodiment of the present application;
fig. 5 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a training method and device of a voice conversion model, a voice conversion method and device and a storage medium. The training method trains the voice conversion model on the basis of a generative adversarial network: the discrimination network is trained so that it can output the label of an input Mel frequency spectrum, so only a small amount of labeled audio is needed for training. This reduces the difficulty of obtaining sample audio, lowers the requirement on audio corpora when training the voice conversion model, and reduces the complexity of model construction.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a method for training a speech conversion model according to an embodiment of the present application. The training method of the voice conversion model carries out alternate iterative training on the generation network and the discrimination network, and the generated network after training is used as the voice conversion model.
As shown in fig. 1, the training method of the speech conversion model specifically includes: step S101 to step S104.
S101, obtaining sample audio, and converting the sample audio into a sample Mel frequency spectrum, wherein the sample audio comprises unlabeled audio and labeled audio.
The sample audio includes unlabeled audio and labeled audio. Labeled audio refers to audio that carries a definite label; for example, an audio may be labeled man, woman, girl, boy, and the like, and such audio with a definite label is referred to as labeled audio.
Unlabeled audio is audio that does not carry a corresponding label of its own; for such audio the label is set to unknown. In other words, unlabeled audio is audio whose label is unknown, indicating that it has no determined label.
The sample audio may be obtained in a variety of ways, for example by using a web crawler to collect it from the network. The obtained sample audio is converted into a sample Mel frequency spectrum by a Mel filter, and each sample Mel frequency spectrum carries the label of its corresponding audio.
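For illustration only, the following is a minimal sketch of this conversion using the librosa library; the sampling rate, FFT size, hop length, number of Mel bands and the file name are assumed values, not parameters specified by this application.

```python
import librosa
import numpy as np

def audio_to_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Load an audio file and convert it into a (log-)Mel frequency spectrum."""
    audio, _ = librosa.load(path, sr=sr)          # resample to a common rate
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )                                             # apply the Mel filter bank
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, frames)

# sample_mel = audio_to_mel("sample.wav")  # hypothetical file name
```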
S102, collecting noise audio, and inputting the noise audio and the sample Mel frequency spectrum together into a generation network to obtain an output Mel frequency spectrum, wherein the noise audio is non-label audio.
The generation network is used to generate, from the collected noise audio, a noise Mel frequency spectrum corresponding to that noise audio. In a specific implementation, the structure of the generation network may include a preprocessing layer, a down-sampling layer, a bottleneck layer, and an up-sampling layer.
The preprocessing layer consists of a convolution layer, a batch normalization layer and a nonlinear affine transformation layer; the down-sampling layer consists of several convolution and batch normalization layers; the bottleneck layer consists of convolutions with residual connections; the up-sampling layer consists of a dilated convolution and a batch normalization layer.
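For illustration, a minimal PyTorch sketch of a generation network built from these four kinds of layers follows; the channel sizes, kernel sizes, number of bottleneck blocks and the omission of label conditioning are assumptions made for the sketch, not details taken from this application.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Bottleneck block: convolution with a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.conv(x)

class Generator(nn.Module):
    def __init__(self, n_mels=80, channels=256):
        super().__init__()
        # preprocessing: convolution + batch normalization + non-linear transform
        self.pre = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(channels), nn.ReLU())
        # down-sampling: convolution + batch normalization
        self.down = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(channels), nn.ReLU())
        # bottleneck: convolutions with residual connections
        self.bottleneck = nn.Sequential(*[ResidualBlock(channels) for _ in range(4)])
        # up-sampling: dilated convolution + batch normalization, back to Mel size
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(channels, n_mels, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm1d(n_mels))

    def forward(self, mel):
        return self.up(self.bottleneck(self.down(self.pre(mel))))
```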
A noise audio is acquired randomly; the acquired noise audio needs to obey a prior probability distribution, which can be a uniform distribution or a Gaussian distribution. The label of the collected noise audio is then set to unknown, and the noise audio, as unlabeled audio, is input into the generation network together with the sample Mel frequency spectrum; the generation network processes them to produce the output Mel frequency spectrum.
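A minimal sketch of drawing such noise from an assumed prior and marking its label as unknown; for simplicity the noise is drawn here directly in Mel-spectrum shape, and the array shape and distribution parameters are assumptions.

```python
import numpy as np

def sample_noise(num_frames, n_mels=80, prior="gaussian"):
    """Draw a noise input from an assumed prior distribution and label it unknown."""
    if prior == "gaussian":
        noise = np.random.normal(0.0, 1.0, size=(n_mels, num_frames))
    else:  # uniform prior
        noise = np.random.uniform(-1.0, 1.0, size=(n_mels, num_frames))
    return noise, "unknown"   # the label of the noise audio is set to unknown
```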
Since the input to the generation network is the noise audio and the sample mel spectrum, the resulting output mel spectrum includes both the sample mel spectrum corresponding to the sample audio and the noise mel spectrum corresponding to the noise audio.
S103, inputting the output Mel frequency spectrum into a discrimination network to obtain the type probability and the prediction label of the output Mel frequency spectrum.
The type of the output mel spectrum includes a sample mel spectrum and a noise mel spectrum, and the type probability of the output mel spectrum specifically refers to the probability that the output mel spectrum is the sample mel spectrum.
The judgment network is used for judging the probability that the input output Mel frequency spectrum is the sample Mel frequency spectrum and determining the prediction label corresponding to the output Mel frequency spectrum.
In a specific implementation, the backbone of the discrimination network can be composed of several nonlinear affine transformations and convolution layers, and the last layer consists of two linear mappings, one for binary classification and one for multi-class classification; the outputs of the discrimination network are, respectively, the probability that the input output Mel frequency spectrum is a sample Mel frequency spectrum and the prediction label of the output Mel frequency spectrum.
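As an illustration, a minimal PyTorch sketch of such a discrimination network with a shared backbone and two output heads; the channel sizes and the number of label classes (here the man/woman/girl/boy example) are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Convolutional backbone with a binary head (sample vs. noise) and a label head."""
    def __init__(self, n_mels=80, channels=256, num_labels=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),   # pool over time frames
        )
        # last layer: two linear mappings, binary classification and multi-class classification
        self.real_head = nn.Linear(channels, 1)
        self.label_head = nn.Linear(channels, num_labels)

    def forward(self, mel):
        h = self.backbone(mel).squeeze(-1)
        type_prob = torch.sigmoid(self.real_head(h))   # probability of being a sample Mel spectrum
        label_logits = self.label_head(h)              # predicted label distribution (logits)
        return type_prob, label_logits
```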
And taking the output Mel frequency spectrum of the generated network output as the input of the discrimination network, and obtaining the probability that the output Mel frequency spectrum predicted by the discrimination network is the sample Mel frequency spectrum and the prediction label of the output Mel frequency spectrum.
And S104, performing alternate iterative training on the generation network and the discrimination network according to the type probability of the output Mel frequency spectrum and the prediction label, and using the trained generation network as a voice conversion model to finish model training.
The generation network and the discrimination network are trained by alternate iteration according to the probability, predicted by the discrimination network, that the output Mel frequency spectrum is a sample Mel frequency spectrum and according to the prediction label of the output Mel frequency spectrum. When the training of both networks is finished, the discrimination network is no longer used; the trained generation network is used as the voice conversion model, completing training of the voice conversion model.
Because, with limited training data, overfitting can occur if the discrimination network alone is optimized first, preventing the final model from converging, the optimization of the generation network and the discrimination network must be performed alternately during training.
In the alternating training of the generation network and the discrimination network, the discrimination network is optimized first. When training starts, the discrimination network can easily tell the noise Mel frequency spectrum and the sample Mel frequency spectrum apart within the output Mel frequency spectrum, which shows that, at the beginning, the noise Mel frequency spectrum generated by the generation network from the noise audio deviates greatly from the sample Mel frequency spectrum. The generation network is then optimized so that its loss function gradually decreases; in this process the binary classification capability of the discrimination network gradually improves, as does its accuracy in judging the output Mel frequency spectrum produced by the generation network. The generation network tries to generate a noise Mel frequency spectrum as close to real data as possible in order to deceive the discrimination network, while the discrimination network tries to distinguish the sample Mel frequency spectrum from the noise Mel frequency spectrum generated by the generation network, so the two networks form a dynamic game.
Finally, when the discrimination network can no longer judge whether the output Mel frequency spectrum is a sample Mel frequency spectrum or a noise Mel frequency spectrum, the training of the generation network is finished, and the trained generation network is used as the voice conversion model.
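A minimal sketch of this alternating optimization, written as a standard GAN training loop; the optimizer settings, the data loader and the simplification that the generation network receives only the noise input are assumptions, not the application's exact procedure.

```python
import torch
import torch.nn.functional as F

# G and D are assumed to be a generation network and a discrimination network
# like the sketches above; sample_loader is a hypothetical data loader.
g_opt = torch.optim.Adam(G.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=1e-4)

for sample_mel, _ in sample_loader:
    noise_in = torch.randn_like(sample_mel)       # noise input obeying a Gaussian prior
    fake_mel = G(noise_in)                        # generated (noise) Mel spectrum

    # 1) optimize the discrimination network first
    d_real, _ = D(sample_mel)
    d_fake, _ = D(fake_mel.detach())
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) then optimize the generation network
    d_fake, _ = D(fake_mel)
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    # training ends when d_fake stays near the preset value (e.g. 0.5)
```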
In an embodiment, the method further comprises: and when the accuracy of the prediction label of the output Mel frequency spectrum output by the discrimination network reaches a preset value, inputting the sample Mel frequency spectrum of the unlabeled audio frequency into the discrimination network, and taking the obtained prediction label as the label of the unlabeled audio frequency.
Since both the noise audio and the sample audio have corresponding labels, the resulting output mel-frequency spectrum also has labels corresponding to the corresponding audio.
When the accuracy of the predicted label of the output Mel frequency spectrum output by the discrimination network reaches a preset value, the discrimination network is considered to be capable of accurately judging the label corresponding to the Mel frequency spectrum.
Therefore, the sample mel frequency spectrum of the unlabeled audio is input into the discrimination network, the discrimination network predicts the label corresponding to the sample mel frequency spectrum of the unlabeled audio, and the predicted label is used as the label of the unlabeled audio.
At this point, the unlabeled audio becomes labeled audio whose label is the prediction label. After the unlabeled audio has been converted into labeled audio, it can be added back into the training of the discrimination network, and this cycle repeats, so that the discrimination network can learn to classify labels even when only a small portion of the sample audio is labeled.
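A minimal sketch of this pseudo-labeling step, assuming the discrimination network exposes a label head as in the sketch above and that an accuracy threshold (here 0.9) stands in for the preset value.

```python
import torch

@torch.no_grad()
def pseudo_label_unlabeled(discriminator, unlabeled_mels, accuracy, threshold=0.9):
    """Once label accuracy reaches the preset value, predicted labels become the labels of unlabeled audio."""
    if accuracy < threshold:          # threshold is an assumed preset value
        return None
    labels = []
    for mel in unlabeled_mels:        # each mel: tensor of shape (n_mels, frames)
        _, label_logits = discriminator(mel.unsqueeze(0))
        labels.append(label_logits.argmax(dim=-1).item())   # predicted label becomes the audio's label
    return labels
```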
In one embodiment, the method comprises: adjusting the speech speed of the sample audio to obtain a speed-adjusting sample audio, and converting the speed-adjusting sample audio into a speed-adjusting Mel frequency spectrum; and training a discrimination network according to the speed-regulating Mel frequency spectrum, so that the discrimination network outputs the speech speed corresponding to the speed-regulating Mel frequency spectrum.
The speech rate of the sample audio is adjusted, for example to 0.9 times, 1.0 times and 1.1 times the original rate, to obtain speed-adjusted sample audio. The speed-adjusted sample audio is then converted into a speed-adjusted Mel frequency spectrum by a Mel filter, and the discrimination network is trained with the speed-adjusted Mel frequency spectrum so that it outputs the speech rate corresponding to that spectrum.
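A minimal sketch of this speed adjustment using librosa's time stretching; the rates follow the 0.9/1.0/1.1 example above, while the sampling rate and Mel settings are assumed values.

```python
import librosa

def make_speed_adjusted_mels(audio, sr=16000, rates=(0.9, 1.0, 1.1)):
    """Time-stretch sample audio to several speech rates and convert each to a Mel spectrum."""
    mels = {}
    for rate in rates:
        stretched = librosa.effects.time_stretch(audio, rate=rate)   # change speech rate, keep pitch
        mels[rate] = librosa.feature.melspectrogram(y=stretched, sr=sr, n_mels=80)
    return mels   # (rate -> Mel spectrum) pairs used as (input, speech-rate target) pairs
```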
By training the discrimination network in this way, it can recognize the speech rate, which improves the training stability of the generative adversarial network and reduces training errors caused by differing speech rates in the sample audio.
In an embodiment, the performing, according to the type probability of the output Mel frequency spectrum, alternate iterative training on the generation network and the discrimination network includes: calculating the value of the type loss function of the generation network and the value of the type loss function of the discrimination network according to the type probability of the output Mel frequency spectrum; performing alternate iterative training on the generation network and the discrimination network according to the value of the type loss function of the generation network and the value of the type loss function of the discrimination network, respectively; and finishing the training of the generation network when the type probability output by the discrimination network reaches a preset value.
Calculating the value of the type loss function of the generated network and the value of the type loss function of the discriminant network according to the type probability of the output Mel frequency spectrum output by the discriminant network, then adjusting the network parameters of the generated network and the discriminant network according to the value of the type loss function of the generated network and the value of the type loss function of the discriminant network, performing iterative training on the generated network and the discriminant network, and gradually reducing the value of the type loss function of the generated network.
A preset value is set to gauge the binary classification capability of the model; when the type probability reaches it, the noise Mel frequency spectrum generated by the generation network from the noise audio is already similar to the sample Mel frequency spectrum. The preset value may be 0.5: when the type probability output by the discrimination network reaches 0.5, the discrimination network can no longer judge whether the Mel frequency spectrum generated by the generation network is a noise Mel frequency spectrum or a sample Mel frequency spectrum, and the training of the generation network is complete.
It should be noted that, when the type probability of the discrimination network output reaches a preset value, the values of the loss functions of the generation network and the discrimination network both approach to be stable.
For example, the formula for generating the type loss function for a network may be as follows:
L_{G1} = -\mathbb{E}_{x \sim p(x),\, c \sim p(c)}\left[\log D(G(x, c), c)\right]
the formula for the type loss function of the discrimination network can be as follows:
L_{D1} = -\mathbb{E}_{(y, c) \sim p(y, c)}\left[\log D(y, c)\right] - \mathbb{E}_{x \sim p(x),\, c \sim p(c)}\left[\log\left(1 - D(G(x, c), c)\right)\right]
wherein L_{G1} denotes the type loss function of the generation network, L_{D1} denotes the type loss function of the discrimination network, D(G(x, c), c) denotes the probability that the discrimination network judges the Mel frequency spectrum generated from noise x with label c to be a sample Mel frequency spectrum, and D(y, c) denotes the probability that the sample Mel frequency spectrum y with label c is judged to be a sample Mel frequency spectrum.
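A sketch of computing these two type losses, assuming the discrimination network outputs the probability that its input is a sample Mel frequency spectrum.

```python
import torch

def type_losses(d_sample_prob, d_fake_prob, eps=1e-8):
    """d_sample_prob = D(y, c) on sample Mel spectra; d_fake_prob = D(G(x, c), c) on generated ones."""
    loss_g1 = -torch.log(d_fake_prob + eps).mean()                   # L_G1
    loss_d1 = -(torch.log(d_sample_prob + eps).mean()
                + torch.log(1.0 - d_fake_prob + eps).mean())         # L_D1
    return loss_g1, loss_d1
```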
In an embodiment, the performing, according to the type probability of the output Mel frequency spectrum and the prediction label, alternate iterative training on the generation network and the discrimination network includes: if the audio corresponding to the output Mel frequency spectrum is determined to be sample audio according to the type probability of the output Mel frequency spectrum, and the prediction label of the output Mel frequency spectrum is different from the label of the corresponding sample audio, the error is counted into the label loss function of the discrimination network; if the audio corresponding to the output Mel frequency spectrum is determined to be noise audio according to the type probability of the output Mel frequency spectrum, and the prediction label of the output Mel frequency spectrum is different from the label of the corresponding noise audio, the error is counted into the label loss function of the generation network; and performing iterative training on the generation network according to the label loss function of the generation network, and performing iterative training on the discrimination network according to the label loss function of the discrimination network.
Because the output of the discrimination network also comprises a prediction label for outputting the Mel frequency spectrum, label loss functions of the generation network and the discrimination network are determined according to the prediction label, so that the generation network and the discrimination network are optimized, and the generation network can generate audio with a specific label.
When the discrimination network predicts labels for the output Mel frequency spectrum, if the prediction label for a sample Mel frequency spectrum differs from that spectrum's label, the prediction is considered wrong and the error is added to the label loss function of the discrimination network.
If the prediction label for a noise Mel frequency spectrum differs from that spectrum's label, the prediction is considered wrong and the error is counted into the label loss function of the generation network.
For example, the formula for generating the label loss function for a network may be as follows:
L_{G2} = -\mathbb{E}_{x \sim p(x),\, c \sim p(c)}\left[\log p_c\left(c \mid G(x, c)\right)\right]
the formula for the tag loss function of the discrimination network can be as follows:
L_{D2} = -\mathbb{E}_{(y, c) \sim p(y, c)}\left[\log p_c(c \mid y)\right]
wherein L_{G2} denotes the label loss function of the generation network, L_{D2} denotes the label loss function of the discrimination network, p_c(c | G(x, c)) denotes the probability that the discrimination network assigns the label c to the Mel frequency spectrum generated from noise x with label c, and p_c(c | y) denotes the probability that the discrimination network assigns the label c to the sample Mel frequency spectrum y with label c.
After the value of the label loss function of the generation network and the value of the label loss function of the discrimination network are calculated from these formulas, the generation network and the discrimination network are trained by alternate iteration so that the values of both label loss functions gradually decrease, indicating that the generation network can generate audio with a specific label.
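A sketch of the two label losses written as cross-entropy terms, assuming the label head of the discrimination network outputs logits as in the discriminator sketch above.

```python
import torch
import torch.nn.functional as F

def label_losses(fake_label_logits, fake_labels, sample_label_logits, sample_labels):
    """Cross-entropy form of the label losses: L_G2 on generated spectra, L_D2 on sample spectra."""
    loss_g2 = F.cross_entropy(fake_label_logits, fake_labels)       # penalizes the generation network
    loss_d2 = F.cross_entropy(sample_label_logits, sample_labels)   # penalizes the discrimination network
    return loss_g2, loss_d2
```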
The training method of the voice conversion model provided in the above embodiment includes obtaining a sample audio including a labeled audio and a non-labeled audio, converting the sample audio into a sample mel spectrum, then collecting a noise audio, and inputting the noise audio and the sample mel spectrum together into the generation network to obtain an output mel spectrum, inputting the output mel spectrum into the discrimination network to obtain a type probability and a label of the output mel spectrum, and finally performing alternate iterative training on the generation network and the discrimination network according to the type probability and the label of the output mel spectrum, and using the generated network after training as the voice conversion model to complete model training. The labels of the output Mel frequency spectrum are obtained by using the discrimination network, so that only a small amount of labeled audio can be used for training when the network is generated and the discrimination network is trained, the requirement on audio corpus when a speech conversion model is trained is reduced, and the complexity of model construction is reduced.
Referring to fig. 2, fig. 2 is a schematic flow chart of a voice conversion method according to an embodiment of the present application.
As shown in fig. 2, the voice conversion method includes: step S201 to step S202.
S201, audio data to be converted and a target conversion label of a user are obtained.
The audio to be converted refers to the audio which needs to be converted by the user, and the target conversion label refers to a label when the audio to be converted is converted.
For example, the audio to be converted is the audio of a woman's timbre, and the target conversion label is a girl.
S202, inputting the audio data to be converted and the target conversion label into a pre-trained voice conversion model to obtain converted audio data.
The pre-trained speech conversion model is a generated network obtained by training by using any one of the speech conversion model training methods provided in the above embodiments.
And inputting the audio data to be converted and the target conversion label into a pre-trained voice conversion model, wherein the voice conversion model can perform audio synthesis according to the audio data to be converted and the target conversion label so as to output the converted audio data. Therefore, the purpose of voice conversion is achieved, and user experience is improved.
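For illustration, a minimal sketch of this conversion step; the label-conditioned forward pass of the generation network and the vocoder that turns the converted Mel frequency spectrum back into a waveform are assumptions here.

```python
import torch

@torch.no_grad()
def convert_voice(generator, vocoder, mel_to_convert, target_label):
    """Run the audio to be converted through the trained generation network with the target label."""
    generator.eval()
    converted_mel = generator(mel_to_convert, target_label)   # assumed label-conditioned forward pass
    return vocoder(converted_mel)                             # hypothetical vocoder: Mel spectrum -> waveform
```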
Referring to fig. 3, fig. 3 is a schematic block diagram of a training apparatus for a speech conversion model according to an embodiment of the present application, the training apparatus for a speech conversion model being used for executing the aforementioned training method for a speech conversion model. The training device of the speech conversion model can be configured in a server or a terminal.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
As shown in fig. 3, the training apparatus 300 for a speech conversion model includes: the system comprises a sample acquisition module 301, a noise acquisition module 302, a judgment output module 303 and a model training module 304.
A sample obtaining module 301, configured to obtain sample audio, and convert the sample audio into a sample mel-frequency spectrum, where the sample audio includes unlabeled audio and labeled audio.
The noise acquisition module 302 is configured to acquire a noise audio, and input the noise audio and the sample mel spectrum together into a generation network to obtain an output mel spectrum, where the noise audio is an untagged audio.
And a decision output module 303, configured to input the output mel spectrum into a decision network, so as to obtain a type probability of the output mel spectrum and a label of the output mel spectrum.
And the model training module 304 is configured to perform alternating iterative training on the generation network and the discrimination network according to the type probability of the output mel frequency spectrum and the label of the output mel frequency spectrum, and use the generated network after training as a speech conversion model to complete model training.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the training apparatus for a speech conversion model and each module described above may refer to the corresponding processes in the foregoing embodiment of the training method for a speech conversion model, and are not described herein again.
Referring to fig. 4, fig. 4 is a schematic block diagram of a speech conversion apparatus according to an embodiment of the present application, where the speech conversion apparatus is configured to perform the foregoing speech conversion method. The voice conversion device can be configured in a server or a terminal.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
As shown in fig. 4, the voice conversion apparatus 400 includes: a data acquisition module 401 and an audio conversion module 402.
The data obtaining module 401 is configured to obtain audio data to be converted and a target conversion tag of a user.
An audio conversion module 402, configured to input the audio data to be converted and the target conversion label into a pre-trained voice conversion model, so as to obtain converted audio data; the pre-trained voice conversion model is a generated network obtained by training by adopting the training method of the voice conversion model.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the voice conversion apparatus and the modules described above may refer to the corresponding processes in the foregoing voice conversion method embodiment, and are not described herein again.
Both the training means of the speech conversion model and the speech conversion means described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 5, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of a method for training a speech conversion model and a method for speech conversion.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by the processor causes the processor to perform any one of a method for training a speech conversion model and a method for speech conversion.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
obtaining sample audio, and converting the sample audio into a sample Mel frequency spectrum, wherein the sample audio comprises unlabeled audio and labeled audio; collecting noise audio, and inputting the noise audio and the sample Mel frequency spectrum together into a generation network to obtain an output Mel frequency spectrum, wherein the noise audio is a non-label audio; inputting the output Mel frequency spectrum into a discrimination network to obtain the type probability and the prediction label of the output Mel frequency spectrum; and performing alternate iterative training on the generation network and the discrimination network according to the type probability of the output Mel frequency spectrum and the prediction label, and using the trained generation network as a voice conversion model to finish model training.
In one embodiment, the processor is further configured to implement:
and when the accuracy of the prediction label of the output Mel frequency spectrum output by the discrimination network reaches a preset value, inputting the sample Mel frequency spectrum of the unlabeled audio frequency into the discrimination network, and taking the obtained prediction label as the label of the unlabeled audio frequency.
In one embodiment, the processor is configured to implement:
adjusting the speech speed of the sample audio to obtain a speed-adjusting sample audio, and converting the speed-adjusting sample audio into a speed-adjusting Mel frequency spectrum; and training a discrimination network according to the speed-regulating Mel frequency spectrum, so that the discrimination network outputs the speech speed corresponding to the speed-regulating Mel frequency spectrum.
In one embodiment, the processor, when implementing the alternating iterative training of the generation network and the discrimination network according to the type probability of the output mel frequency spectrum, is configured to implement:
calculating the value of the type loss function of the generation network and the value of the type loss function of the discrimination network according to the type probability of the output Mel frequency spectrum; performing alternate iterative training on the generation network and the discrimination network according to the value of the type loss function of the generation network and the value of the type loss function of the discrimination network, respectively; and finishing the training of the generation network when the type probability output by the discrimination network reaches a preset value.
In one embodiment, the processor, when implementing the alternating iterative training of the generation network and the discrimination network according to the type probability and the prediction label of the output mel frequency spectrum, is configured to implement:
if the audio corresponding to the output Mel frequency spectrum is determined to be sample audio according to the type probability of the output Mel frequency spectrum, and the prediction label of the output Mel frequency spectrum is different from the label of the corresponding sample audio, the error is counted into the label loss function of the discrimination network; if the audio corresponding to the output Mel frequency spectrum is determined to be noise audio according to the type probability of the output Mel frequency spectrum, and the prediction label of the output Mel frequency spectrum is different from the label of the corresponding noise audio, the error is counted into the label loss function of the generation network; and performing iterative training on the generation network according to the label loss function of the generation network, and performing iterative training on the discrimination network according to the label loss function of the discrimination network.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring audio data to be converted and a target conversion label of a user; inputting the audio data to be converted and the target conversion label into a pre-trained voice conversion model to obtain converted audio data; the pre-trained voice conversion model is a generated network obtained by training by adopting the training method of the voice conversion model.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the processor executes the program instructions to implement any one of the speech conversion model training methods and the speech conversion method provided in the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for training a speech conversion model, comprising:
obtaining sample audio, and converting the sample audio into a sample Mel frequency spectrum, wherein the sample audio comprises unlabeled audio and labeled audio;
collecting noise audio, and inputting the noise audio and the sample Mel frequency spectrum together into a generation network to obtain an output Mel frequency spectrum, wherein the noise audio is a non-label audio;
inputting the output Mel frequency spectrum into a discrimination network to obtain the type probability and the prediction label of the output Mel frequency spectrum;
and performing alternate iterative training on the generation network and the discrimination network according to the type probability of the output Mel frequency spectrum and the prediction label, and using the trained generation network as a voice conversion model to finish model training.
2. The method of training a speech conversion model according to claim 1, further comprising:
and when the accuracy of the prediction label of the output Mel frequency spectrum output by the discrimination network reaches a preset value, inputting the sample Mel frequency spectrum of the unlabeled audio frequency into the discrimination network, and taking the obtained prediction label as the label of the unlabeled audio frequency.
3. The method of claim 1, wherein the method comprises:
adjusting the speech speed of the sample audio to obtain a speed-adjusting sample audio, and converting the speed-adjusting sample audio into a speed-adjusting Mel frequency spectrum;
and training a discrimination network according to the speed-regulating Mel frequency spectrum, so that the discrimination network outputs the speech speed corresponding to the speed-regulating Mel frequency spectrum.
4. The method for training the speech conversion model according to claim 1, wherein the training of the generation network and the discriminant network in an alternating iterative manner according to the type probability of the output mel spectrum comprises:
calculating the value of the type loss function of the generation network and the value of the type loss function of the discrimination network according to the type probability of the output Mel frequency spectrum;
performing alternate iterative training on the generation network and the discrimination network, respectively, according to the value of the type loss function of the generation network and the value of the type loss function of the discrimination network;
and finishing the training of the generated network when the type probability output by the discrimination network reaches a preset value.
5. The method for training the speech conversion model according to claim 1, wherein the training of the generation network and the discriminant network in an alternating iterative manner according to the type probability of the output mel-frequency spectrum and the prediction label comprises:
if the audio corresponding to the output Mel frequency spectrum is determined to be sample audio according to the type probability of the output Mel frequency spectrum, and the prediction label of the output Mel frequency spectrum is different from the label of the corresponding sample audio, the error is counted into the label loss function of the discrimination network;
if the audio corresponding to the output Mel frequency spectrum is determined to be noise audio according to the type probability of the output Mel frequency spectrum, and the predicted label of the output Mel frequency spectrum is different from the label of the corresponding noise audio, the error is counted into the label loss function of the generation network;
and performing iterative training on the generation network according to the label loss function of the generation network, and performing iterative training on the discrimination network according to the label loss function of the discrimination network.
6. A method of speech conversion, comprising:
acquiring audio data to be converted and a target conversion label of a user;
inputting the audio data to be converted and the target conversion label into a pre-trained voice conversion model to obtain converted audio data;
wherein the pre-trained speech conversion model is a generation network trained by the training method of the speech conversion model according to any one of claims 1 to 5.
7. An apparatus for training a speech conversion model, comprising:
the system comprises a sample acquisition module, a data processing module and a data processing module, wherein the sample acquisition module is used for acquiring sample audio and converting the sample audio into a sample Mel frequency spectrum, and the sample audio comprises unlabeled audio and labeled audio;
the noise acquisition module is used for acquiring a noise audio, and inputting the noise audio and the sample Mel frequency spectrum together into a generation network to obtain an output Mel frequency spectrum, wherein the noise audio is a non-label audio;
the judgment output module is used for inputting the output Mel frequency spectrum into a judgment network to obtain the type probability of the output Mel frequency spectrum and the label of the output Mel frequency spectrum;
and the model training module is used for performing alternate iterative training on the generation network and the discrimination network according to the type probability of the output Mel frequency spectrum and the label of the output Mel frequency spectrum, and finishing model training by taking the generated network after training as a voice conversion model.
8. A speech conversion apparatus, comprising:
the data acquisition module is used for acquiring audio data to be converted and a target conversion label of a user;
the audio conversion module is used for inputting the audio data to be converted and the target conversion label into a pre-trained voice conversion model to obtain converted audio data;
wherein the pre-trained speech conversion model is a generation network trained by the training method of the speech conversion model according to any one of claims 1 to 5.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor, configured to execute the computer program and to implement the training method of the speech conversion model according to any one of claims 1 to 5 and the speech conversion method according to claim 6 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the method of training a speech conversion model according to any one of claims 1 to 5 and the method of speech conversion according to claim 6.
CN202011446585.1A 2020-12-11 2020-12-11 Model training method and device, voice conversion method and device and storage medium Pending CN112509600A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011446585.1A CN112509600A (en) 2020-12-11 2020-12-11 Model training method and device, voice conversion method and device and storage medium
PCT/CN2021/084219 WO2022121180A1 (en) 2020-12-11 2021-03-31 Model training method and apparatus, voice conversion method, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011446585.1A CN112509600A (en) 2020-12-11 2020-12-11 Model training method and device, voice conversion method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112509600A true CN112509600A (en) 2021-03-16

Family

ID=74971318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011446585.1A Pending CN112509600A (en) 2020-12-11 2020-12-11 Model training method and device, voice conversion method and device and storage medium

Country Status (2)

Country Link
CN (1) CN112509600A (en)
WO (1) WO2022121180A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241054A (en) * 2021-05-10 2021-08-10 北京声智科技有限公司 Speech smoothing model generation method, speech smoothing method and device
CN113257283A (en) * 2021-03-29 2021-08-13 北京字节跳动网络技术有限公司 Audio signal processing method and device, electronic equipment and storage medium
CN113780454A (en) * 2021-09-17 2021-12-10 平安科技(深圳)有限公司 Model training and calling method and device, computer equipment and storage medium
WO2022121180A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Model training method and apparatus, voice conversion method, device, and storage medium
CN115065482A (en) * 2022-06-16 2022-09-16 平安银行股份有限公司 Voice recognition method and device, terminal equipment and storage medium
CN115424604A (en) * 2022-07-20 2022-12-02 南京硅基智能科技有限公司 Training method of voice synthesis model based on confrontation generation network
CN116705055A (en) * 2023-08-01 2023-09-05 国网福建省电力有限公司 Substation noise monitoring method, system, equipment and storage medium
CN115065482B (en) * 2022-06-16 2024-05-17 平安银行股份有限公司 Voice recognition method, voice recognition device, terminal equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10971142B2 (en) * 2017-10-27 2021-04-06 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks
JP7070584B2 (en) * 2017-11-07 2022-05-18 日本電気株式会社 Discriminant model generator, discriminant model generation method and discriminant model generator
CN110136686A (en) * 2019-05-14 2019-08-16 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu i vector
CN110136690B (en) * 2019-05-22 2023-07-14 平安科技(深圳)有限公司 Speech synthesis method, device and computer readable storage medium
CN110706692B (en) * 2019-10-21 2021-12-14 思必驰科技股份有限公司 Training method and system of child voice recognition model
CN112509600A (en) * 2020-12-11 2021-03-16 平安科技(深圳)有限公司 Model training method and device, voice conversion method and device and storage medium

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121180A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Model training method and apparatus, voice conversion method, device, and storage medium
CN113257283A (en) * 2021-03-29 2021-08-13 北京字节跳动网络技术有限公司 Audio signal processing method and device, electronic equipment and storage medium
CN113257283B (en) * 2021-03-29 2023-09-26 北京字节跳动网络技术有限公司 Audio signal processing method and device, electronic equipment and storage medium
CN113241054A (en) * 2021-05-10 2021-08-10 北京声智科技有限公司 Speech smoothing model generation method, speech smoothing method and device
CN113780454A (en) * 2021-09-17 2021-12-10 平安科技(深圳)有限公司 Model training and calling method and device, computer equipment and storage medium
CN113780454B (en) * 2021-09-17 2023-10-24 平安科技(深圳)有限公司 Model training and calling method and device, computer equipment and storage medium
CN115065482A (en) * 2022-06-16 2022-09-16 平安银行股份有限公司 Voice recognition method and device, terminal equipment and storage medium
CN115065482B (en) * 2022-06-16 2024-05-17 平安银行股份有限公司 Voice recognition method, voice recognition device, terminal equipment and storage medium
CN115424604A (en) * 2022-07-20 2022-12-02 南京硅基智能科技有限公司 Training method of voice synthesis model based on confrontation generation network
CN115424604B (en) * 2022-07-20 2024-03-15 南京硅基智能科技有限公司 Training method of voice synthesis model based on countermeasure generation network
CN116705055A (en) * 2023-08-01 2023-09-05 国网福建省电力有限公司 Substation noise monitoring method, system, equipment and storage medium
CN116705055B (en) * 2023-08-01 2023-10-17 国网福建省电力有限公司 Substation noise monitoring method, system, equipment and storage medium

Also Published As

Publication number Publication date
WO2022121180A1 (en) 2022-06-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination