CN111429893A - Many-to-many speaker conversion method based on Transitive STARGAN - Google Patents

Many-to-many speaker conversion method based on Transitive STARGAN

Info

Publication number
CN111429893A
CN111429893A CN202010168932.2A CN202010168932A CN111429893A CN 111429893 A CN111429893 A CN 111429893A CN 202010168932 A CN202010168932 A CN 202010168932A CN 111429893 A CN111429893 A CN 111429893A
Authority
CN
China
Prior art keywords
speaker
network
generator
stargan
many
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010168932.2A
Other languages
Chinese (zh)
Inventor
李燕萍
何铮韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010168932.2A priority Critical patent/CN111429893A/en
Publication of CN111429893A publication Critical patent/CN111429893A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a many-to-many speaker conversion method based on Transitive STARGAN. The method combines the STARGAN generator with a transitive network that passes the features extracted by the encoding network of the generator to the corresponding layers of the decoding network. This improves the ability of the decoding network to learn semantic features at different scales, enables the model to learn deep features of the spectrum, and raises the quality of the spectra generated by the decoding network. Because the semantic features and the speaker personality features are learned more fully, the personality similarity and speech quality of the converted synthesized speech are improved, the poor personality similarity and naturalness of speech converted with the STARGAN model are overcome, and high-quality many-to-many speaker conversion is realized under the non-parallel text condition.

Description

Many-to-many speaker conversion method based on Transitive STARGAN
Technical Field
The invention relates to a many-to-many speaker conversion method, in particular to a many-to-many speaker conversion method based on Transitive STARGAN.
Background
Speech conversion is a branch of research in the field of speech signal processing, and is developed and extended based on the research of speech analysis, synthesis, and speaker recognition. The goal of speech conversion is to change the speech personality characteristics of the source speaker to have the speech personality characteristics of the target speaker while retaining semantic information, i.e., to make the source speaker's speech sound like the target speaker's speech after conversion.
After years of research, many classical conversion methods have emerged. According to the training corpus, speech conversion techniques can be divided into methods for the parallel text condition and methods for the non-parallel text condition. Conversion under the parallel text condition requires collecting a large number of parallel training texts in advance, which is time-consuming and labor-intensive, and parallel texts cannot be collected at all in cross-language conversion or medical assistance systems; speech conversion research under the non-parallel text condition therefore has broader application prospects and greater practical significance.
Existing voice conversion methods under the non-parallel text condition include methods based on the Cycle-Consistent Adversarial Network (Cycle-GAN), the Conditional Variational Auto-Encoder (C-VAE), and Disco-GAN (Discover Cross-Domain Relations with Generative Adversarial Networks). Compared with a conventional GAN, the Disco-GAN based conversion method improves speech quality by adding a style discriminator to extract speaker personality features, but it can only realize one-to-one conversion. The C-VAE based method builds a conversion system directly from the speaker identity label: the encoder separates the semantics from the speaker information in the speech, and the decoder reconstructs the speech from the semantics and the speaker identity label, which removes the dependence on parallel texts. The Cycle-GAN based method uses an adversarial loss and a cycle-consistency loss to learn the forward and inverse mappings of the acoustic features simultaneously, which effectively alleviates over-smoothing and improves the quality of the converted speech.
A voice conversion method based on the Star Generative Adversarial Network (STARGAN) model combines the advantages of Disco-GAN, C-VAE, and Cycle-GAN: its generator has an encoding-decoding structure that can learn many-to-many mappings simultaneously, and the attributes of the generator output are controlled by the speaker identity label, so many-to-many conversion under the non-parallel text condition can be realized. However, because the encoding network and the decoding network in the generator are independent of each other, the encoding network alone cannot cleanly separate the semantic features from the speaker personality features, and the decoding network cannot fully recombine them. Semantic features of the spectrum are therefore easily lost while the encoding network repeatedly extracts features, and since the decoding network depends only on the information that survives the multi-layer down-sampling of the encoding network, the generator as a whole is limited in its ability to retain semantic features.
Style transfer in the image domain and voice conversion share a common goal: the original content features are preserved while the style features are changed. In the image domain, the content of the image is retained while its style, such as color or texture, is converted; in voice conversion, the semantic features of the spectrum are retained while the personality features are converted. The transitive network (TransNet) has been applied in the image domain; its core idea is to pass the features of the encoding stage of the generator to the corresponding decoding stage, strengthening the generator's ability to learn and express semantic features. The transitive network optimizes the encoding-decoding structure of the generator, better preserves the content of the source image, avoids gradient vanishing or gradient explosion during back-propagation, and makes deep networks easier to train.
Disclosure of Invention
The purpose of the invention is as follows: the technical problem to be solved by the invention is to provide a many-to-many speaker conversion method based on Transitive STARGAN, and a computer storage medium, which solve the network degradation problem in the training process of existing methods. By building multiple layers of TransNet between the encoding and decoding networks of the STARGAN generator, the method improves the ability of the decoding network to learn semantic features at different scales, enables the model to learn deep features of the spectrum, improves the quality of the spectra generated by the decoding network, and learns the semantic features and the speaker personality features more fully, thereby improving the personality similarity and speech quality of the converted synthesized speech.
The technical scheme is as follows: the invention relates to a many-to-many speaker conversion method based on Transitive STARGAN, which comprises a training phase and a conversion phase, wherein the training phase comprises the following steps:
(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers and comprises a source speaker and a target speaker;
(1.2) extracting the spectral feature x, the aperiodic feature, and the fundamental frequency feature of each speaker's speech from the training corpus through the WORLD speech analysis/synthesis model;
(1.3) inputting the spectral feature x_s of the source speaker, the spectral feature x_t of the target speaker, the source speaker label c_s, and the target speaker label c_t into the Transitive STARGAN network for training, wherein the Transitive STARGAN network consists of a generator G, a discriminator D, and a classifier C, the generator G consists of an encoding network and a decoding network, and multiple layers of TransNet used for optimizing the generator network structure are built between the encoding network and the decoding network;
(1.4) in the training process, making the loss function of the generator G, the loss function of the discriminator D, and the loss function of the classifier C as small as possible until the set number of iterations is reached, so as to obtain the trained Transitive STARGAN network;
(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;
the transition phase comprises the steps of:
(2.1) extracting the spectral feature x_s′, the aperiodic feature, and the fundamental frequency feature from the source speaker's speech in the corpus to be converted through the WORLD speech analysis/synthesis model;
(2.2) inputting the source speaker spectral feature x_s′ and the target speaker label feature c_t′ into the Transitive STARGAN network trained in step (1.4) to obtain the target speaker spectral feature x_tc′;
(2.3) converting the source speaker fundamental frequency feature extracted in the step (2.1) into the fundamental frequency feature of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);
(2.4) synthesizing the target speaker spectral feature x_tc′ generated in step (2.2), the target speaker fundamental frequency feature obtained in step (2.3), and the aperiodic feature extracted in step (2.1) through the WORLD speech analysis/synthesis model to obtain the converted speaker speech.
Further, the encoding network of the generator G includes 5 convolutional layers, the decoding network of the generator G includes 5 deconvolution layers, and the constructed TransNet is 4 layers, specifically, the output of the first convolutional layer of the encoding network is spliced with the output of the fourth convolutional layer of the decoding network, and then the spliced output is input to the fifth convolutional layer of the decoding network; splicing the output of the second convolution layer of the coding network with the output of the third convolution layer of the decoding network, and then inputting the spliced output to the fourth convolution layer of the decoding network; splicing the output of the third convolutional layer of the coding network with the output of the second convolutional layer of the decoding network, and then inputting the spliced output to the third convolutional layer of the decoding network; the output of the fourth convolutional layer of the coding network is spliced with the output of the first convolutional layer of the decoding network and then input to the second convolutional layer of the decoding network.
Further, the filter sizes of 5 convolution layers of the coding network of the generator G are 3 × 9, 4 × 8, 3 × 5, 9 × 5, the step sizes are 1 × 1, 2 × 2, 1 × 1, 9 × 1, and the filter depths are 32, 64, 128, 64, 5, respectively; the filter sizes of the 5 deconvolution layers of the decoding network of the generator G are 9 × 5, 3 × 5, 4 × 8, 3 × 9, respectively, the step sizes are 9 × 1, 1 × 1, 2 × 2, 1 × 1, respectively, and the filter depths are 64, 128, 64, 32, 1, respectively; the discriminator D comprises 5 convolution layers, the filter sizes of the 5 convolution layers are respectively 3 × 9, 3 × 8, 3 × 6 and 36 × 5, the step sizes are respectively 1 × 1, 1 × 2 and 36 × 1, and the filter depths are respectively 32, 32 and 1; the classifier C includes 5 convolution layers, the filter sizes of the 5 convolution layers are 4 × 4, 3 × 4, and 1 × 4, the step sizes are 2 × 2, 1 × 2, and the filter depths are 8, 16, 32, 16, and 4, respectively.
Further, the training process in steps (1.3) and (1.4) comprises the following steps:
(1) inputting the source speaker spectral feature x_s into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s);
(2) inputting the semantic feature G(x_s) together with the target speaker label feature c_t into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the target speaker spectral feature x_tc;
(3) inputting the generated target speaker spectral feature x_tc into the encoding network of the generator G again to obtain the speaker-independent semantic feature G(x_tc);
(4) inputting the semantic feature G(x_tc) together with the source speaker label feature c_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral feature x_sc;
(5) inputting the target speaker spectral feature x_tc generated in step (2), the real target speaker spectral feature x_t, and the target speaker label feature c_t into the discriminator D for training, minimizing the loss function of the discriminator D;
(6) inputting the target speaker spectral feature x_tc generated in step (2) and the real target speaker spectral feature x_t into the classifier C for training, minimizing the loss function of the classifier C;
(7) returning to step (1) and repeating the above steps until the set number of iterations is reached, thereby obtaining the trained Transitive STARGAN network.
Further, the input process in step (2.2) comprises the following steps:
(1) inputting the source speaker spectral feature x_s′ into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s′);
(2) inputting the semantic feature G(x_s′) together with the target speaker label feature c_t′ into the decoding network of the generator G to obtain the target speaker spectral feature x_tc′.
The computer storage medium of the present invention has stored thereon a computer program which, when executed by a computer processor, implements the method of any of the above.
Beneficial effects: the method realizes many-to-many speaker voice conversion under the non-parallel text condition using Transitive STARGAN. Its improvement is to build multiple layers of TransNet between the encoding network and the decoding network of the generator, splicing the semantic features of the corresponding encoding layers onto the decoding layers and thereby strengthening the ability of the generator to learn and express semantic features. The beneficial effects are as follows: (1) the converted speech is more refined and realistic: the multi-layer TransNet transmits semantic features of different scales from the encoding stage to the corresponding decoding stage, which improves the ability of the decoding network to learn semantic features at different scales and overcomes the loss of semantic features caused by STARGAN network degradation; (2) the personality similarity of the converted speech is improved: the multi-layer TransNet reduces the burden of learning semantics placed on the generator network, which helps the decoding network learn the conversion of speaker personality features; (3) network training is more stable and efficient: the multi-layer TransNet not only avoids gradient vanishing or gradient explosion during back-propagation, but also lets the generator obtain more semantic information in the decoding stage, accelerating the convergence of training. In summary, Transitive STARGAN realizes high-quality and efficient many-to-many voice conversion under the non-parallel text condition.
Drawings
FIG. 1 is a schematic diagram of the Transitive STARGAN model of the method;
FIG. 2 is a network structure diagram of the generator of the Transitive STARGAN model of the method;
FIG. 3 is a spectrogram comparison of speech synthesized by the Transitive STARGAN model of the method and by the reference STARGAN model;
FIG. 4 is a graph of the generator training loss of the Transitive STARGAN model of the method and of the reference STARGAN model.
Detailed Description
The invention applies the idea of the transmission network to the field of voice conversion, is used for transmitting information characteristics with different scales in a generator network and enhances the learning ability and the expression ability of the generator network. The invention utilizes the transmission network to compensate the semantic information lost by the generator in the encoding and decoding stages, so that the model can fully learn the deep characteristics of the frequency spectrum, thereby obtaining the frequency spectrum with richer details, avoiding the problem of fuzzy details of the frequency spectrum generated by the generator network, and further improving the frequency spectrum generation quality of the decoding network. The structure further reduces the difficulty of learning semantics by the generator network, thereby improving the naturalness and the definition of the converted voice.
As shown in fig. 1, the method implemented in this example is divided into two parts: the training part is used for obtaining parameters and conversion functions required by voice conversion, and the conversion part is used for realizing the conversion from the voice of a source speaker to the voice of a target speaker.
The training stage comprises the following implementation steps:
1.1) obtaining a training corpus of a non-parallel text, wherein the training corpus is a corpus of multiple speakers and comprises a source speaker and a target speaker. The corpus is taken from the VCC2018 corpus. The corpus training set has 6 male and 6 female speakers, each speaker having 81 sentences of corpus.
1.2) Extract the spectral envelope feature x, the aperiodic feature, and the logarithmic fundamental frequency log f_0 of each speaker's sentences from the training corpus through the WORLD speech analysis/synthesis model. Because the Fast Fourier Transform (FFT) length is set to 1024, the resulting spectral envelope feature x and aperiodic feature have 1024/2 + 1 = 513 dimensions. Each speech block contains 512 frames, 36-dimensional Mel cepstral coefficients (MCEP) are extracted from each frame as the spectral feature of the Transitive STARGAN model, and 8 speech blocks are used in one training batch. The training data therefore has dimensions 8 × 36 × 512.
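As a concrete illustration of this extraction step, the following sketch uses the pyworld package as the WORLD implementation and soundfile for reading audio; neither tool is named in the patent, so this is only one possible realization.

import numpy as np
import pyworld
import soundfile as sf

FFT_SIZE = 1024      # 1024 / 2 + 1 = 513-dimensional envelope and aperiodicity
MCEP_DIM = 36        # spectral feature dimension fed to the model
BLOCK_FRAMES = 512   # frames per speech block

def extract_features(wav_path):
    wav, fs = sf.read(wav_path)
    wav = wav.astype(np.float64)
    f0, t = pyworld.harvest(wav, fs)                             # fundamental frequency
    sp = pyworld.cheaptrick(wav, f0, t, fs, fft_size=FFT_SIZE)   # 513-dim spectral envelope
    ap = pyworld.d4c(wav, f0, t, fs, fft_size=FFT_SIZE)          # 513-dim aperiodicity
    mcep = pyworld.code_spectral_envelope(sp, fs, MCEP_DIM)      # 36-dim MCEP per frame
    return f0, sp, ap, mcep, fs

def to_blocks(mcep):
    # Cut a (frames, 36) MCEP sequence into (n_blocks, 36, 512) training blocks.
    n_blocks = mcep.shape[0] // BLOCK_FRAMES
    trimmed = mcep[: n_blocks * BLOCK_FRAMES]
    return trimmed.reshape(n_blocks, BLOCK_FRAMES, MCEP_DIM).transpose(0, 2, 1)

Stacking 8 such blocks per batch gives the 8 × 36 × 512 training tensor described above.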
1.3) The Transitive STARGAN network in this embodiment builds on the Cycle-GAN model and improves its effect by modifying the GAN structure and adding a classifier. Transitive STARGAN consists of three parts: a generator G for generating realistic spectra, a discriminator D for judging whether its input is a real spectrum or a generated spectrum, and a classifier C for judging whether the generated spectrum belongs to the label c_t.
The objective function of the Transitive STARGAN network combines the three loss functions below:

L(G, D, C) = L_G(G) + L_D(D) + L_C(C)
wherein L_G(G) is the loss function of the generator:

L_G(G) = L_adv^G(G) + λ_cls L_cls^G(G) + λ_cyc L_cyc(G) + λ_id L_id(G)
where λ_cls ≥ 0, λ_cyc > 0, and λ_id > 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature-mapping loss, respectively; L_adv^G(G), L_cls^G(G), L_cyc(G), and L_id(G) denote the adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss, and the feature-mapping loss, respectively.
The loss function of the discriminator is:

L_D(D) = -E_{x_t, c_t}[log D(x_t, c_t)] - E_{x_s, c_t}[log(1 - D(G(x_s, c_t), c_t))]

where D(x_t, c_t) denotes the discriminator D discriminating a real spectral feature, G(x_s, c_t) denotes the target speaker spectral feature generated by the generator G, i.e. x_tc, D(G(x_s, c_t), c_t) denotes the discriminator discriminating a generated spectral feature, E_{x_s, c_t}[·] denotes the expectation over the probability distribution generated by the generator G, and E_{x_t, c_t}[·] denotes the expectation over the real probability distribution.
the loss function of the classifier two-dimensional convolutional neural network is:
Figure BDA0002408460870000067
wherein p isC(ct|xt) C, representing the characteristic of the classifier for distinguishing the target speaker as a labeltOf the true spectrum of the spectrum.
1.4) Take the source speaker spectral feature x_s extracted in 1.2) and the target speaker label feature c_t as the joint feature (x_s, c_t) to train the generator, making the generator loss function L_G as small as possible, and obtain the generated target speaker spectral feature x_tc.
As shown in fig. 2, the generator adopts a two-dimensional convolutional neural network, and is composed of an encoding network and a decoding network. The coding network comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers are respectively 3 × 9, 4 × 8, 3 × 5 and 9 × 5, the step sizes are respectively 1 × 1, 2 × 2, 1 × 1 and 9 × 1, and the filter depths are respectively 32, 64, 128, 64 and 5. The decoding network comprises 5 deconvolution layers, the filter sizes of the 5 deconvolution layers are respectively 9 × 5, 3 × 5, 4 × 8 and 3 × 9, the step sizes are respectively 9 × 1, 1 × 1, 2 × 2 and 1 × 1, and the filter depths are respectively 64, 128, 64, 32 and 1; establishing a TransNet between the coding network and the decoding network, splicing the output of the first convolution layer of the coding network with the output of the fourth convolution layer of the decoding network, and inputting the spliced output to the fifth convolution layer of the decoding network; splicing the output of the second convolution layer of the coding network with the output of the third convolution layer of the decoding network, and then inputting the spliced output to the fourth convolution layer of the decoding network; splicing the output of the third convolutional layer of the coding network with the output of the second convolutional layer of the decoding network, and then inputting the spliced output to the third convolutional layer of the decoding network; the output of the fourth convolutional layer of the coding network is spliced with the output of the first convolutional layer of the decoding network and then input to the second convolutional layer of the decoding network.
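For illustration, the following PyTorch sketch shows how the TransNet splicing described above can be wired between a five-layer encoder and a five-layer decoder. The kernel sizes, the use of stride-1 convolutions (so that encoder and decoder feature maps align for concatenation), and the injection of the speaker label as broadcast one-hot channels are simplifying assumptions rather than the exact configuration of the embodiment; only the channel depths and the splicing pattern follow the text.

import torch
import torch.nn as nn

def conv(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU())

class TransitiveGenerator(nn.Module):
    def __init__(self, n_speakers=4):
        super().__init__()
        # Encoder depths 32, 64, 128, 64, 5 as listed above.
        self.e1, self.e2, self.e3 = conv(1, 32), conv(32, 64), conv(64, 128)
        self.e4, self.e5 = conv(128, 64), conv(64, 5)
        # Decoder depths 64, 128, 64, 32, 1; inputs widened by the spliced
        # encoder features (TransNet) and by the speaker-label channels.
        self.d1 = conv(5 + n_speakers, 64)
        self.d2 = conv(64 + 64, 128)     # d1 output + encoder layer 4 output
        self.d3 = conv(128 + 128, 64)    # d2 output + encoder layer 3 output
        self.d4 = conv(64 + 64, 32)      # d3 output + encoder layer 2 output
        self.d5 = nn.Conv2d(32 + 32, 1, kernel_size=3, padding=1)  # d4 output + encoder layer 1 output

    def forward(self, x, c):
        # x: (batch, 1, 36, frames) spectral block; c: (batch, n_speakers) one-hot label.
        h1 = self.e1(x); h2 = self.e2(h1); h3 = self.e3(h2)
        h4 = self.e4(h3); h5 = self.e5(h4)
        label = c[:, :, None, None].expand(-1, -1, h5.size(2), h5.size(3))
        g1 = self.d1(torch.cat([h5, label], dim=1))
        g2 = self.d2(torch.cat([g1, h4], dim=1))    # TransNet: encoder layer 4 -> decoder layer 2
        g3 = self.d3(torch.cat([g2, h3], dim=1))    # encoder layer 3 -> decoder layer 3
        g4 = self.d4(torch.cat([g3, h2], dim=1))    # encoder layer 2 -> decoder layer 4
        return self.d5(torch.cat([g4, h1], dim=1))  # encoder layer 1 -> decoder layer 5

The forward pass makes the four splices explicit: each encoder output is concatenated with the mirrored decoder output before the next layer, which is the part that the TransNet adds to a plain STARGAN generator.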
1.5) Take the generated target speaker spectral feature x_tc obtained in 1.4), the target speaker spectral feature x_t of the corpus obtained in 1.2), and the target speaker label c_t as the input of the discriminator, and train the discriminator so that the discriminator loss function L_D(D) is as small as possible.
The discriminator adopts a two-dimensional convolutional neural network, which comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers are respectively 3 × 9, 3 × 8, 3 × 6 and 36 × 5, the step sizes are respectively 1 × 1, 1 × 2 and 36 × 1, and the filter depths are respectively 32, 32 and 1;
the loss function of the discriminator is:
Figure BDA0002408460870000072
the optimization target is as follows:
Figure BDA0002408460870000073
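A simplified PyTorch sketch of such a conditional discriminator follows. The way the label is injected (broadcast over the time-frequency plane and concatenated as extra input channels) and the layer widths are assumptions for illustration; only the interface, a spectral block plus a speaker label mapped to a realness probability, matters for the training sketch given after step 1.8).

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, n_speakers=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + n_speakers, 32, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 3, padding=1),
            nn.AdaptiveAvgPool2d(1),     # one score per spectral block
            nn.Sigmoid(),                # probability that the input is a real spectrum
        )

    def forward(self, x, c):
        label = c[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, label], dim=1)).view(-1)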
1.6) Input the obtained target speaker spectral feature x_tc into the encoding network of the generator G again to obtain the speaker-independent semantic feature G(x_tc); then input the semantic feature G(x_tc) together with the source speaker label feature c_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtain the reconstructed source speaker spectral feature x_sc. The generator loss minimized during training includes the adversarial loss, the classification loss, the cycle-consistency loss, and the feature-mapping loss. The cycle-consistency loss drives the reconstructed source speaker spectral feature x_sc, obtained after x_s passes through the generator G twice, to be as consistent as possible with x_s. The feature-mapping loss guarantees that the speaker label of x_s remains c_s after passing through the generator G. The classification loss is the loss with which the classifier judges the probability that the target speaker spectrum x_tc generated by the generator belongs to the label c_t.
The loss function of the generator is:

L_G(G) = L_adv^G(G) + λ_cls L_cls^G(G) + λ_cyc L_cyc(G) + λ_id L_id(G)

The optimization target is:

min_G L_G(G)
where λ_cls ≥ 0, λ_cyc > 0, and λ_id > 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature-mapping loss, respectively.
L_adv^G(G) represents the adversarial loss of the generator in the GAN:

L_adv^G(G) = -E_{x_s, c_t}[log D(G(x_s, c_t), c_t)]

where E_{x_s, c_t}[·] denotes the expectation over the probability distribution generated by the generator, and G(x_s, c_t) denotes the spectral feature generated by the generator.
L_adv^G(G) and the discriminator loss L_D(D) together form the usual adversarial loss in a GAN, which is used to discriminate whether the spectrum input to the discriminator is a real spectrum or a generated spectrum. During training, L_adv^G(G) is made as small as possible, and the generator is continuously optimized until it generates spectral features G(x_s, c_t) realistic enough that the discriminator finds it difficult to tell real from fake.
L_cls^G(G) is the classification loss with which the classifier C optimizes the generator:

L_cls^G(G) = -E_{x_s, c_t}[log p_C(c_t | G(x_s, c_t))]

where p_C(c_t | G(x_s, c_t)) denotes the probability with which the classifier judges that the generated target speaker spectrum belongs to the label c_t, and G(x_s, c_t) denotes the target speaker spectrum generated by the generator. During training, L_cls^G(G) is made as small as possible, so that the spectrum G(x_s, c_t) generated by the generator G can be correctly classified as the label c_t by the classifier.
L_cyc(G) and L_id(G) follow the generator losses of the Cycle-GAN model. L_cyc(G) is the cycle-consistency loss of the generator G:

L_cyc(G) = E_{x_s, c_s, c_t}[|| G(G(x_s, c_t), c_s) - x_s ||_1]

where G(G(x_s, c_t), c_s) is the reconstructed source speaker spectral feature and E_{x_s, c_s, c_t}[·] is the expected loss between the reconstructed source speaker spectrum and the real source speaker spectrum. In training the generator, L_cyc(G) is made as small as possible, so that when the generated target spectrum G(x_s, c_t) and the source speaker label c_s are fed into the generator again, the reconstructed source speaker spectrum is as close as possible to x_s. Training with L_cyc(G) effectively ensures that the semantic features of the speaker's speech are not lost after being encoded by the generator.
L_id(G) is the feature-mapping loss of the generator G:

L_id(G) = E_{x_s, c_s}[|| G(x_s, c_s) - x_s ||_1]

where G(x_s, c_s) is the spectral feature obtained after the source speaker spectrum and the source speaker label are input into the generator, and E_{x_s, c_s}[·] is the expected loss between x_s and G(x_s, c_s). Training with L_id(G) effectively ensures that the label c_s of the input spectrum remains unchanged after it is input into the generator.
1.7) Input the generated target speaker spectral feature x_tc and the real target speaker spectral feature x_t into the classifier for training, and minimize the loss function of the classifier.
The classifier uses a two-dimensional convolutional neural network C, including 5 convolutional layers, the filter sizes of the 5 convolutional layers are 4 × 4, 3 × 4, and 1 × 4, respectively, the step sizes are 2 × 2, 1 × 2, and the filter depths are 8, 16, 32, 16, and 4, respectively.
The loss function of the two-dimensional convolutional neural network classifier is:

L_C(C) = -E_{x_t, c_t}[log p_C(c_t | x_t)]

The optimization target is:

min_C L_C(C)
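A simplified PyTorch sketch of such a classifier follows; the layer widths are illustrative, and only the interface, a spectral block mapped to logits over the speaker labels, matters for the training sketch below.

import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, n_speakers=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_speakers, 3, padding=1),
            nn.AdaptiveAvgPool2d(1),     # average the per-label maps over time and frequency
        )

    def forward(self, x):
        return self.features(x).flatten(1)   # (batch, n_speakers) logits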
1.8) Repeat steps 1.4), 1.5), 1.6), and 1.7) until the set number of iterations is reached, thereby obtaining the trained Transitive STARGAN network, whose trained parameters are the generator parameters φ, the discriminator parameters θ, and the classifier parameters ψ. The required number of iterations varies with the specific configuration of the neural network and the performance of the experimental equipment; in this experiment it was set to 100000.
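To make steps 1.4) to 1.8) concrete, the sketch below shows one training iteration over a batch (x_s, c_s, x_t, c_t) with the generator, discriminator, and classifier modules sketched above. The optimizer setup, the one-hot label handling, and the regularization weights λ_cls, λ_cyc, λ_id are assumptions for illustration, not values prescribed by the patent.

import torch
import torch.nn.functional as F

def train_step(x_s, c_s, x_t, c_t, generator, discriminator, classifier,
               opt_g, opt_d, opt_c, lambda_cls=1.0, lambda_cyc=10.0, lambda_id=5.0):
    eps = 1e-8
    # Generator update: adversarial, classification, cycle-consistency, feature-mapping losses.
    x_tc = generator(x_s, c_t)                      # source -> target conversion
    x_sc = generator(x_tc, c_s)                     # cycle back to the source speaker
    x_id = generator(x_s, c_s)                      # feature-mapping (identity) pass
    l_adv_g = -torch.log(discriminator(x_tc, c_t) + eps).mean()
    l_cls_g = F.cross_entropy(classifier(x_tc), c_t.argmax(dim=1))
    l_cyc = F.l1_loss(x_sc, x_s)
    l_id = F.l1_loss(x_id, x_s)
    loss_g = l_adv_g + lambda_cls * l_cls_g + lambda_cyc * l_cyc + lambda_id * l_id
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Discriminator update: real versus generated target spectra (L_D).
    d_real = discriminator(x_t, c_t)
    d_fake = discriminator(generator(x_s, c_t).detach(), c_t)
    loss_d = -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Classifier update: classify real target spectra by speaker label (L_C).
    loss_c = F.cross_entropy(classifier(x_t), c_t.argmax(dim=1))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    return loss_g.item(), loss_d.item(), loss_c.item()

Step 1.8) then simply wraps train_step in a loop over the training batches until the chosen iteration count (here 100000) is reached.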
1.9) Establish the fundamental frequency conversion relation using the mean and standard deviation of the logarithmic fundamental frequency log f_0: compute the mean and standard deviation of the logarithmic fundamental frequency of each speaker, and use a linear transformation in the logarithmic domain to convert the source speaker's logarithmic fundamental frequency log f_0s into the target speaker's logarithmic fundamental frequency log f_0t′.
The fundamental frequency conversion function is:

log f_0t′ = μ_t + (σ_t / σ_s)(log f_0s - μ_s)

where μ_s and σ_s are respectively the mean and standard deviation of the source speaker's fundamental frequency in the logarithmic domain, and μ_t and σ_t are respectively the mean and standard deviation of the target speaker's fundamental frequency in the logarithmic domain.
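In code, the statistics of 1.9) and the conversion function above reduce to a few lines; the sketch below assumes numpy arrays of per-frame f0 values with unvoiced frames marked by zeros, and leaving unvoiced frames untouched is an assumption rather than something the patent specifies.

import numpy as np

def logf0_statistics(f0_list):
    # Mean and standard deviation of log f0 over the voiced frames of a speaker.
    voiced = np.concatenate([f0[f0 > 0] for f0 in f0_list])
    logf0 = np.log(voiced)
    return logf0.mean(), logf0.std()

def convert_f0(f0, mu_s, sigma_s, mu_t, sigma_t):
    # Apply log f0_t' = mu_t + (sigma_t / sigma_s) * (log f0_s - mu_s).
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    converted[voiced] = np.exp(mu_t + (sigma_t / sigma_s) * (np.log(f0[voiced]) - mu_s))
    return converted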
The implementation steps of the conversion stage are as follows:
2.1) Extract the spectral feature x_s′, the aperiodic feature, and the fundamental frequency of different sentences of the source speaker through the WORLD speech analysis/synthesis model.
2.2) Take the source speaker spectral feature x_s′ extracted in 2.1) and the target speaker label feature c_t′ as the joint feature (x_s′, c_t′), and input it into the Transitive STARGAN network trained in 1.8), thereby obtaining the target speaker spectral feature x_tc′.
2.3) converting the fundamental frequency of the source speaker extracted in the step 2.1) into the fundamental frequency of the target speaker by the fundamental frequency conversion function obtained in the step 1.9).
2.4) Synthesize the converted speaker's speech from the target speaker spectral feature x_tc′ generated in 2.2), the target speaker fundamental frequency obtained in 2.3), and the aperiodic feature extracted in 2.1) through the WORLD speech analysis/synthesis model.
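The conversion stage can be assembled as below, reusing extract_features and convert_f0 from the sketches above and again assuming pyworld as the WORLD implementation; c_t is the one-hot target-label tensor of shape (1, n_speakers).

import numpy as np
import pyworld
import soundfile as sf
import torch

def convert(wav_path, out_path, generator, c_t, mu_s, sigma_s, mu_t, sigma_t):
    f0, sp, ap, mcep, fs = extract_features(wav_path)                 # step 2.1
    x = torch.from_numpy(mcep.T[None, None].astype(np.float32))       # (1, 1, 36, frames)
    with torch.no_grad():
        mcep_conv = generator(x, c_t)[0, 0].numpy().T                 # step 2.2
    mcep_conv = np.ascontiguousarray(mcep_conv, dtype=np.float64)
    f0_conv = convert_f0(f0, mu_s, sigma_s, mu_t, sigma_t)            # step 2.3
    sp_conv = pyworld.decode_spectral_envelope(mcep_conv, fs, 1024)   # back to the 513-dim envelope
    wav = pyworld.synthesize(f0_conv, sp_conv, ap, fs)                # step 2.4
    sf.write(out_path, wav, fs)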
FIG. 3 compares spectrograms of speech synthesized by the Transitive STARGAN model of the invention and by the reference STARGAN model. It can be seen that, compared with the reference STARGAN model, the Transitive STARGAN model retains semantics better, and the spectrogram of its synthesized speech has clearer details and more complete pitch and harmonic information, so the synthesized speech is finer and more realistic.
The training speed of the Transitive STARGAN model of the present invention was compared with the training speed of the reference STARGAN model, and the training loss of the generator is shown in fig. 4. It can be seen that compared with the reference STARGAN model, the generator network of the Transitive STARGAN model can reach the convergence state with fewer iterations, and meanwhile, has smaller training loss, so that the Transitive STARGAN model can accelerate the training speed of the network.
The embodiments of the present invention, if implemented in the form of software functional modules and sold or used as independent products, may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present invention also provide a computer storage medium having a computer program stored thereon. The computer program, when executed by a processor, may implement the aforementioned Transitive STARGAN-based many-to-many speaker conversion method. For example, the computer storage medium is a computer-readable storage medium.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (6)

1. A many-to-many speaker conversion method based on Transitive STARGAN is characterized by comprising a training phase and a conversion phase, wherein the training phase comprises the following steps:
(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers and comprises a source speaker and a target speaker;
(1.2) extracting the spectral feature x, the aperiodic feature, and the fundamental frequency feature of each speaker's speech from the training corpus through the WORLD speech analysis/synthesis model;
(1.3) inputting the spectral feature x_s of the source speaker, the spectral feature x_t of the target speaker, the source speaker label c_s, and the target speaker label c_t into the Transitive STARGAN network for training, wherein the Transitive STARGAN network consists of a generator G, a discriminator D, and a classifier C, the generator G consists of an encoding network and a decoding network, and multiple layers of TransNet used for optimizing the generator network structure are built between the encoding network and the decoding network;
(1.4) in the training process, making the loss function of the generator G, the loss function of the discriminator D, and the loss function of the classifier C as small as possible until the set number of iterations is reached, so as to obtain the trained Transitive STARGAN network;
(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;
the transition phase comprises the steps of:
(2.1) extracting the spectral feature x_s′, the aperiodic feature, and the fundamental frequency feature from the source speaker's speech in the corpus to be converted through the WORLD speech analysis/synthesis model;
(2.2) inputting the source speaker spectral feature x_s′ and the target speaker label feature c_t′ into the Transitive STARGAN network trained in step (1.4) to obtain the target speaker spectral feature x_tc′;
(2.3) converting the source speaker fundamental frequency feature extracted in the step (2.1) into the fundamental frequency feature of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);
(2.4) synthesizing the target speaker spectral feature x_tc′ generated in step (2.2), the target speaker fundamental frequency feature obtained in step (2.3), and the aperiodic feature extracted in step (2.1) through the WORLD speech analysis/synthesis model to obtain the converted speaker speech.
2. The Transitive STARGAN-based many-to-many speaker conversion method according to claim 1, wherein: the encoding network of the generator G comprises 5 convolutional layers, the decoding network of the generator G comprises 5 deconvolution layers, and the constructed TransNet is 4 layers; specifically, the output of the first convolutional layer of the encoding network is spliced with the output of the fourth convolutional layer of the decoding network and then input into the fifth convolutional layer of the decoding network; the output of the second convolutional layer of the encoding network is spliced with the output of the third convolutional layer of the decoding network and then input to the fourth convolutional layer of the decoding network; the output of the third convolutional layer of the encoding network is spliced with the output of the second convolutional layer of the decoding network and then input to the third convolutional layer of the decoding network; and the output of the fourth convolutional layer of the encoding network is spliced with the output of the first convolutional layer of the decoding network and then input to the second convolutional layer of the decoding network.
3. The Transitive STARGAN-based many-to-many speaker conversion method according to claim 2, wherein: the filter sizes of the 5 convolution layers of the encoding network of the generator G are respectively 3 × 9, 4 × 8, 3 × 5 and 9 × 5, the step sizes are respectively 1 × 1, 2 × 2, 1 × 1 and 9 × 1, and the filter depths are respectively 32, 64, 128, 64 and 5; the filter sizes of the 5 deconvolution layers of the decoding network of the generator G are respectively 9 × 5, 3 × 5, 4 × 8 and 3 × 9, the step sizes are respectively 9 × 1, 1 × 1, 2 × 2 and 1 × 1, and the filter depths are respectively 64, 128, 64, 32 and 1; the discriminator D comprises 5 convolution layers, the filter sizes of the 5 convolution layers are respectively 3 × 9, 3 × 8, 3 × 6 and 36 × 5, the step sizes are respectively 1 × 1, 1 × 2 and 36 × 1, and the filter depths are respectively 32, 32 and 1; the classifier C comprises 5 convolution layers, the filter sizes of the 5 convolution layers are respectively 4 × 4, 3 × 4 and 1 × 4, the step sizes are respectively 2 × 2 and 1 × 2, and the filter depths are respectively 8, 16, 32, 16 and 4.
4. The Transitive STARGAN-based many-to-many speaker conversion method according to claim 1, wherein: the training process in steps (1.3) and (1.4) comprises the following steps:
(1) inputting the source speaker spectral feature x_s into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s);
(2) inputting the semantic feature G(x_s) together with the target speaker label feature c_t into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the target speaker spectral feature x_tc;
(3) inputting the generated target speaker spectral feature x_tc into the encoding network of the generator G again to obtain the speaker-independent semantic feature G(x_tc);
(4) inputting the semantic feature G(x_tc) together with the source speaker label feature c_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral feature x_sc;
(5) inputting the target speaker spectral feature x_tc generated in step (2), the real target speaker spectral feature x_t, and the target speaker label feature c_t into the discriminator D for training, minimizing the loss function of the discriminator D;
(6) inputting the target speaker spectral feature x_tc generated in step (2) and the real target speaker spectral feature x_t into the classifier C for training, minimizing the loss function of the classifier C;
(7) returning to step (1) and repeating the above steps until the set number of iterations is reached, thereby obtaining the trained Transitive STARGAN network.
5. The Transitive STARGAN-based many-to-many speaker conversion method according to claim 1, wherein: the input process in step (2.2) comprises the following steps:
(1) inputting the source speaker spectral feature x_s′ into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s′);
(2) inputting the semantic feature G(x_s′) together with the target speaker label feature c_t′ into the decoding network of the generator G to obtain the target speaker spectral feature x_tc′.
6. A computer storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a computer processor, implementing the method of any one of claims 1 to 5.
CN202010168932.2A 2020-03-12 2020-03-12 Many-to-many speaker conversion method based on Transitive STARGAN Pending CN111429893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010168932.2A CN111429893A (en) 2020-03-12 2020-03-12 Many-to-many speaker conversion method based on Transitive STARGAN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010168932.2A CN111429893A (en) 2020-03-12 2020-03-12 Many-to-many speaker conversion method based on Transitive STARGAN

Publications (1)

Publication Number Publication Date
CN111429893A true CN111429893A (en) 2020-07-17

Family

ID=71546419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010168932.2A Pending CN111429893A (en) 2020-03-12 2020-03-12 Many-to-many speaker conversion method based on Transitive STARGAN

Country Status (1)

Country Link
CN (1) CN111429893A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259086A (en) * 2020-10-15 2021-01-22 杭州电子科技大学 Speech conversion method based on spectrogram synthesis
CN112863529A (en) * 2020-12-31 2021-05-28 平安科技(深圳)有限公司 Speaker voice conversion method based on counterstudy and related equipment
CN112862909A (en) * 2021-02-05 2021-05-28 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN113053354A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Method and equipment for improving voice synthesis effect
CN113096673A (en) * 2021-03-30 2021-07-09 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113688944A (en) * 2021-09-29 2021-11-23 南京览众智能科技有限公司 Image identification method based on meta-learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060657A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on SN
CN110060690A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN and ResNet
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060657A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on SN
CN110060690A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN and ResNet
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHAOJIN CAI et al.: "Stain Style Transfer using Transitive Adversarial Networks" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259086A (en) * 2020-10-15 2021-01-22 杭州电子科技大学 Speech conversion method based on spectrogram synthesis
CN112863529A (en) * 2020-12-31 2021-05-28 平安科技(深圳)有限公司 Speaker voice conversion method based on counterstudy and related equipment
WO2022142115A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Adversarial learning-based speaker voice conversion method and related device
CN112863529B (en) * 2020-12-31 2023-09-22 平安科技(深圳)有限公司 Speaker voice conversion method based on countermeasure learning and related equipment
CN112862909A (en) * 2021-02-05 2021-05-28 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN113053354A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Method and equipment for improving voice synthesis effect
CN113096673A (en) * 2021-03-30 2021-07-09 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113096673B (en) * 2021-03-30 2022-09-30 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113688944A (en) * 2021-09-29 2021-11-23 南京览众智能科技有限公司 Image identification method based on meta-learning
CN113688944B (en) * 2021-09-29 2022-12-27 南京览众智能科技有限公司 Image identification method based on meta-learning

Similar Documents

Publication Publication Date Title
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
CN111785261B (en) Cross-language voice conversion method and system based on entanglement and explanatory characterization
CN111816156B (en) Multi-to-multi voice conversion method and system based on speaker style feature modeling
CN111429893A (en) Many-to-many speaker conversion method based on Transitive STARGAN
CN109326283B (en) Many-to-many voice conversion method based on text encoder under non-parallel text condition
Gao et al. Nonparallel emotional speech conversion
CN109599091B (en) Star-WAN-GP and x-vector based many-to-many speaker conversion method
CN111429894A (en) Many-to-many speaker conversion method based on SE-ResNet STARGAN
CN110060701B (en) Many-to-many voice conversion method based on VAWGAN-AC
CN111462768B (en) Multi-scale StarGAN voice conversion method based on shared training
CN110335587B (en) Speech synthesis method, system, terminal device and readable storage medium
CN110060657B (en) SN-based many-to-many speaker conversion method
CN111833855B (en) Multi-to-multi speaker conversion method based on DenseNet STARGAN
CN110060691B (en) Many-to-many voice conversion method based on i-vector and VARSGAN
CN112634920A (en) Method and device for training voice conversion model based on domain separation
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN112949255A (en) Word vector training method and device
Malik et al. A preliminary study on augmenting speech emotion recognition using a diffusion model
CN113593534B (en) Method and device for multi-accent speech recognition
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN115019785A (en) Streaming voice recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210003 Gulou District, Jiangsu, Nanjing new model road, No. 66

Applicant after: NANJING University OF POSTS AND TELECOMMUNICATIONS

Address before: Yuen Road Qixia District of Nanjing City, Jiangsu Province, No. 9 210003

Applicant before: NANJING University OF POSTS AND TELECOMMUNICATIONS

RJ01 Rejection of invention patent application after publication

Application publication date: 20200717