CN111429893A - Many-to-many speaker conversion method based on Transitive STARGAN - Google Patents

Many-to-many speaker conversion method based on Transitive STARGAN

Info

Publication number
CN111429893A
CN111429893A CN202010168932.2A CN202010168932A CN111429893A CN 111429893 A CN111429893 A CN 111429893A CN 202010168932 A CN202010168932 A CN 202010168932A CN 111429893 A CN111429893 A CN 111429893A
Authority
CN
China
Prior art keywords
speaker
network
generator
stargan
many
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010168932.2A
Other languages
Chinese (zh)
Inventor
李燕萍
何铮韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010168932.2A priority Critical patent/CN111429893A/en
Publication of CN111429893A publication Critical patent/CN111429893A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a many-to-many speaker conversion method based on Transitive STARGAN. The method combines the STARGAN generator with a transitive network that passes the features extracted by the encoding network of the generator to the corresponding layers of the decoding network. This improves the ability of the decoding network to learn semantic features at different scales, enables the model to learn deep features of the spectrum, and raises the quality of the spectra generated by the decoding network. Because the semantic features and the speaker personality features are learned more fully, the personality similarity and speech quality of the converted synthesized speech are improved, the poor personality similarity and naturalness of speech converted with the STARGAN model are overcome, and high-quality many-to-many speaker conversion is realized under the non-parallel text condition.

Description

Many-to-many speaker conversion method based on Transitive STARGAN
Technical Field
The invention relates to a many-to-many speaker conversion method, in particular to a many-to-many speaker conversion method based on Transitive STARGAN.
Background
Speech conversion is a branch of research in the field of speech signal processing, and is developed and extended based on the research of speech analysis, synthesis, and speaker recognition. The goal of speech conversion is to change the speech personality characteristics of the source speaker to have the speech personality characteristics of the target speaker while retaining semantic information, i.e., to make the source speaker's speech sound like the target speaker's speech after conversion.
After years of research, many classical conversion methods have emerged. According to the training corpus, speech conversion techniques can be divided into methods for the parallel text condition and methods for the non-parallel text condition. Conversion under the parallel text condition requires collecting a large number of parallel training texts in advance, which is time-consuming and labor-intensive, and parallel texts cannot be collected at all in cross-language conversion or medical assistance systems; speech conversion research under the non-parallel text condition therefore has broader application prospects and greater practical significance.
Existing voice conversion methods under the non-parallel text condition include methods based on the Cycle-Consistent Adversarial Network (Cycle-GAN), the Conditional Variational Auto-Encoder (C-VAE), and Disco-GAN (Discover Cross-Domain Relations with Generative Adversarial Networks). Compared with a conventional GAN, the Disco-GAN based conversion method improves speech quality by adding a style discriminator to extract speaker personality features, but it can only realize one-to-one conversion. The C-VAE based method builds a conversion system directly from the speaker identity label: the encoder separates the semantics from the speaker information in the speech, and the decoder reconstructs the speech from the semantics and the speaker identity label, which removes the dependence on parallel texts. The Cycle-GAN based method uses an adversarial loss and a cycle-consistency loss to learn the forward and inverse mappings of the acoustic features simultaneously, which effectively alleviates over-smoothing and improves the quality of the converted speech.
A voice conversion method based on the Star Generative Adversarial Network (STARGAN) model combines the advantages of Disco-GAN, C-VAE, and Cycle-GAN: its generator has an encoding-decoding structure that can learn many-to-many mappings simultaneously, and the attributes of the generator output are controlled by the speaker identity label, so many-to-many conversion under the non-parallel text condition can be realized. However, because the encoding network and the decoding network in the generator are independent of each other, the encoding network alone cannot cleanly separate the semantic features from the speaker personality features, and the decoding network cannot fully recombine them. Semantic features of the spectrum are therefore easily lost while the encoding network repeatedly extracts features, and since the decoding network depends only on the information that survives the multi-layer down-sampling of the encoding network, the generator as a whole is limited in its ability to retain semantic features.
Style transfer in the image domain and voice conversion share a common goal: the original content features are preserved while the style features are changed. In the image domain, the content of the image is retained while its style, such as color or texture, is converted; in voice conversion, the semantic features of the spectrum are retained while the personality features are converted. The transitive network (TransNet) has been applied in the image domain; its core idea is to pass the features of the encoding stage of the generator to the corresponding decoding stage, strengthening the generator's ability to learn and express semantic features. The transitive network optimizes the encoding-decoding structure of the generator, better preserves the content of the source image, avoids gradient vanishing or gradient explosion during back-propagation, and makes deep networks easier to train.
Disclosure of Invention
The purpose of the invention is as follows: the technical problem to be solved by the invention is to provide a many-to-many speaker conversion method based on Transitive STARGAN, and a computer storage medium, which solve the network degradation problem in the training process of existing methods. By building multiple layers of TransNet between the encoding and decoding networks of the STARGAN generator, the method improves the ability of the decoding network to learn semantic features at different scales, enables the model to learn deep features of the spectrum, improves the quality of the spectra generated by the decoding network, and learns the semantic features and the speaker personality features more fully, thereby improving the personality similarity and speech quality of the converted synthesized speech.
The technical scheme is as follows: the invention relates to a many-to-many speaker conversion method based on Transitive STARGAN, which comprises a training phase and a conversion phase, wherein the training phase comprises the following steps:
(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers and comprises a source speaker and a target speaker;
(1.2) extracting the spectral feature x, the aperiodic feature, and the fundamental frequency feature of each speaker's speech from the training corpus through the WORLD speech analysis/synthesis model;
(1.3) inputting the spectral feature x_s of the source speaker, the spectral feature x_t of the target speaker, the source speaker label c_s, and the target speaker label c_t into the Transitive STARGAN network for training, wherein the Transitive STARGAN network consists of a generator G, a discriminator D, and a classifier C, the generator G consists of an encoding network and a decoding network, and multiple layers of TransNet used for optimizing the generator network structure are built between the encoding network and the decoding network;
(1.4) in the training process, making the loss function of the generator G, the loss function of the discriminator D, and the loss function of the classifier C as small as possible until the set number of iterations is reached, so as to obtain the trained Transitive STARGAN network;
(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;
the transition phase comprises the steps of:
(2.1) extracting the spectral feature x_s′, the aperiodic feature, and the fundamental frequency feature from the source speaker's speech in the corpus to be converted through the WORLD speech analysis/synthesis model;
(2.2) inputting the source speaker spectral feature x_s′ and the target speaker label feature c_t′ into the Transitive STARGAN network trained in step (1.4) to obtain the target speaker spectral feature x_tc′;
(2.3) converting the source speaker fundamental frequency feature extracted in the step (2.1) into the fundamental frequency feature of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);
(2.4) synthesizing the target speaker spectral feature x_tc′ generated in step (2.2), the target speaker fundamental frequency feature obtained in step (2.3), and the aperiodic feature extracted in step (2.1) through the WORLD speech analysis/synthesis model to obtain the converted speaker speech.
Further, the encoding network of the generator G includes 5 convolutional layers, the decoding network of the generator G includes 5 deconvolution layers, and the constructed TransNet is 4 layers, specifically, the output of the first convolutional layer of the encoding network is spliced with the output of the fourth convolutional layer of the decoding network, and then the spliced output is input to the fifth convolutional layer of the decoding network; splicing the output of the second convolution layer of the coding network with the output of the third convolution layer of the decoding network, and then inputting the spliced output to the fourth convolution layer of the decoding network; splicing the output of the third convolutional layer of the coding network with the output of the second convolutional layer of the decoding network, and then inputting the spliced output to the third convolutional layer of the decoding network; the output of the fourth convolutional layer of the coding network is spliced with the output of the first convolutional layer of the decoding network and then input to the second convolutional layer of the decoding network.
Further, the filter sizes of 5 convolution layers of the coding network of the generator G are 3 × 9, 4 × 8, 3 × 5, 9 × 5, the step sizes are 1 × 1, 2 × 2, 1 × 1, 9 × 1, and the filter depths are 32, 64, 128, 64, 5, respectively; the filter sizes of the 5 deconvolution layers of the decoding network of the generator G are 9 × 5, 3 × 5, 4 × 8, 3 × 9, respectively, the step sizes are 9 × 1, 1 × 1, 2 × 2, 1 × 1, respectively, and the filter depths are 64, 128, 64, 32, 1, respectively; the discriminator D comprises 5 convolution layers, the filter sizes of the 5 convolution layers are respectively 3 × 9, 3 × 8, 3 × 6 and 36 × 5, the step sizes are respectively 1 × 1, 1 × 2 and 36 × 1, and the filter depths are respectively 32, 32 and 1; the classifier C includes 5 convolution layers, the filter sizes of the 5 convolution layers are 4 × 4, 3 × 4, and 1 × 4, the step sizes are 2 × 2, 1 × 2, and the filter depths are 8, 16, 32, 16, and 4, respectively.
Further, the training process in steps (1.3) and (1.4) comprises the following steps:
(1) inputting the source speaker spectral feature x_s into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s);
(2) inputting the semantic feature G(x_s) together with the target speaker label feature c_t into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the target speaker spectral feature x_tc;
(3) inputting the generated target speaker spectral feature x_tc into the encoding network of the generator G again to obtain the speaker-independent semantic feature G(x_tc);
(4) inputting the semantic feature G(x_tc) together with the source speaker label feature c_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral feature x_sc;
(5) inputting the target speaker spectral feature x_tc generated in step (2), the real target speaker spectral feature x_t, and the target speaker label feature c_t into the discriminator D for training, minimizing the loss function of the discriminator D;
(6) inputting the target speaker spectral feature x_tc generated in step (2) and the real target speaker spectral feature x_t into the classifier C for training, minimizing the loss function of the classifier C;
(7) returning to step (1) and repeating the above steps until the set number of iterations is reached, thereby obtaining the trained Transitive STARGAN network.
Further, the input process in step (2.2) comprises the following steps:
(1) inputting the source speaker spectral feature x_s′ into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s′);
(2) inputting the semantic feature G(x_s′) together with the target speaker label feature c_t′ into the decoding network of the generator G to obtain the target speaker spectral feature x_tc′.
The computer storage medium of the present invention has stored thereon a computer program which, when executed by a computer processor, implements the method of any of the above.
Beneficial effects: the method realizes many-to-many speaker voice conversion under the non-parallel text condition using Transitive STARGAN. Its improvement is to build multiple layers of TransNet between the encoding network and the decoding network of the generator, splicing the semantic features of the corresponding encoding layers onto the decoding layers and thereby strengthening the ability of the generator to learn and express semantic features. The beneficial effects are as follows: (1) the converted speech is more refined and realistic: the multi-layer TransNet transmits semantic features of different scales from the encoding stage to the corresponding decoding stage, which improves the ability of the decoding network to learn semantic features at different scales and overcomes the loss of semantic features caused by STARGAN network degradation; (2) the personality similarity of the converted speech is improved: the multi-layer TransNet reduces the burden of learning semantics placed on the generator network, which helps the decoding network learn the conversion of speaker personality features; (3) network training is more stable and efficient: the multi-layer TransNet not only avoids gradient vanishing or gradient explosion during back-propagation, but also lets the generator obtain more semantic information in the decoding stage, accelerating the convergence of training. In summary, Transitive STARGAN realizes high-quality and efficient many-to-many voice conversion under the non-parallel text condition.
Drawings
FIG. 1 is a schematic diagram of the Transitive STARGAN model of the method;
FIG. 2 is a network structure diagram of the generator of the Transitive STARGAN model of the method;
FIG. 3 is a spectrogram comparison of speech synthesized by the Transitive STARGAN model of the method and by the reference STARGAN model;
FIG. 4 is a graph of the generator training loss of the Transitive STARGAN model of the method and of the reference STARGAN model.
Detailed Description
The invention applies the idea of the transmission network to the field of voice conversion, is used for transmitting information characteristics with different scales in a generator network and enhances the learning ability and the expression ability of the generator network. The invention utilizes the transmission network to compensate the semantic information lost by the generator in the encoding and decoding stages, so that the model can fully learn the deep characteristics of the frequency spectrum, thereby obtaining the frequency spectrum with richer details, avoiding the problem of fuzzy details of the frequency spectrum generated by the generator network, and further improving the frequency spectrum generation quality of the decoding network. The structure further reduces the difficulty of learning semantics by the generator network, thereby improving the naturalness and the definition of the converted voice.
As shown in fig. 1, the method implemented in this example is divided into two parts: the training part is used for obtaining parameters and conversion functions required by voice conversion, and the conversion part is used for realizing the conversion from the voice of a source speaker to the voice of a target speaker.
The training stage comprises the following implementation steps:
1.1) obtaining a training corpus of a non-parallel text, wherein the training corpus is a corpus of multiple speakers and comprises a source speaker and a target speaker. The corpus is taken from the VCC2018 corpus. The corpus training set has 6 male and 6 female speakers, each speaker having 81 sentences of corpus.
1.2) Extract the spectral envelope feature x, the aperiodic feature, and the logarithmic fundamental frequency log f_0 of each speaker's sentences from the training corpus through the WORLD speech analysis/synthesis model. Because the Fast Fourier Transform (FFT) length is set to 1024, the resulting spectral envelope feature x and aperiodic feature have 1024/2 + 1 = 513 dimensions. Each speech block contains 512 frames, 36-dimensional Mel cepstral coefficients (MCEP) are extracted from each frame as the spectral feature of the Transitive STARGAN model, and 8 speech blocks are used in one training batch. The training data therefore has dimensions 8 × 36 × 512.
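As a concrete illustration of this extraction step, the following sketch uses the pyworld package as the WORLD implementation and soundfile for reading audio; neither tool is named in the patent, so this is only one possible realization.

import numpy as np
import pyworld
import soundfile as sf

FFT_SIZE = 1024      # 1024 / 2 + 1 = 513-dimensional envelope and aperiodicity
MCEP_DIM = 36        # spectral feature dimension fed to the model
BLOCK_FRAMES = 512   # frames per speech block

def extract_features(wav_path):
    wav, fs = sf.read(wav_path)
    wav = wav.astype(np.float64)
    f0, t = pyworld.harvest(wav, fs)                             # fundamental frequency
    sp = pyworld.cheaptrick(wav, f0, t, fs, fft_size=FFT_SIZE)   # 513-dim spectral envelope
    ap = pyworld.d4c(wav, f0, t, fs, fft_size=FFT_SIZE)          # 513-dim aperiodicity
    mcep = pyworld.code_spectral_envelope(sp, fs, MCEP_DIM)      # 36-dim MCEP per frame
    return f0, sp, ap, mcep, fs

def to_blocks(mcep):
    # Cut a (frames, 36) MCEP sequence into (n_blocks, 36, 512) training blocks.
    n_blocks = mcep.shape[0] // BLOCK_FRAMES
    trimmed = mcep[: n_blocks * BLOCK_FRAMES]
    return trimmed.reshape(n_blocks, BLOCK_FRAMES, MCEP_DIM).transpose(0, 2, 1)

Stacking 8 such blocks per batch gives the 8 × 36 × 512 training tensor described above.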
1.3) The Transitive STARGAN network in this embodiment builds on the Cycle-GAN model and improves its effect by modifying the GAN structure and adding a classifier. Transitive STARGAN consists of three parts: a generator G for generating realistic spectra, a discriminator D for judging whether its input is a real spectrum or a generated spectrum, and a classifier C for judging whether the generated spectrum belongs to the label c_t.
The objective function of the Transitive STARGAN network combines the three loss functions below:

L(G, D, C) = L_G(G) + L_D(D) + L_C(C)
wherein L_G(G) is the loss function of the generator:

L_G(G) = L_adv^G(G) + λ_cls L_cls^G(G) + λ_cyc L_cyc(G) + λ_id L_id(G)
where λ_cls ≥ 0, λ_cyc > 0, and λ_id > 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature-mapping loss, respectively; L_adv^G(G), L_cls^G(G), L_cyc(G), and L_id(G) denote the adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss, and the feature-mapping loss, respectively.
The loss function of the discriminator is:

L_D(D) = -E_{x_t, c_t}[log D(x_t, c_t)] - E_{x_s, c_t}[log(1 - D(G(x_s, c_t), c_t))]

where D(x_t, c_t) denotes the discriminator D discriminating a real spectral feature, G(x_s, c_t) denotes the target speaker spectral feature generated by the generator G, i.e. x_tc, D(G(x_s, c_t), c_t) denotes the discriminator discriminating a generated spectral feature, E_{x_s, c_t}[·] denotes the expectation over the probability distribution generated by the generator G, and E_{x_t, c_t}[·] denotes the expectation over the real probability distribution.
the loss function of the classifier two-dimensional convolutional neural network is:
Figure BDA0002408460870000067
wherein p isC(ct|xt) C, representing the characteristic of the classifier for distinguishing the target speaker as a labeltOf the true spectrum of the spectrum.
1.4) Take the source speaker spectral feature x_s extracted in 1.2) and the target speaker label feature c_t as the joint feature (x_s, c_t) to train the generator, making the generator loss function L_G as small as possible, and obtain the generated target speaker spectral feature x_tc.
As shown in fig. 2, the generator adopts a two-dimensional convolutional neural network, and is composed of an encoding network and a decoding network. The coding network comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers are respectively 3 × 9, 4 × 8, 3 × 5 and 9 × 5, the step sizes are respectively 1 × 1, 2 × 2, 1 × 1 and 9 × 1, and the filter depths are respectively 32, 64, 128, 64 and 5. The decoding network comprises 5 deconvolution layers, the filter sizes of the 5 deconvolution layers are respectively 9 × 5, 3 × 5, 4 × 8 and 3 × 9, the step sizes are respectively 9 × 1, 1 × 1, 2 × 2 and 1 × 1, and the filter depths are respectively 64, 128, 64, 32 and 1; establishing a TransNet between the coding network and the decoding network, splicing the output of the first convolution layer of the coding network with the output of the fourth convolution layer of the decoding network, and inputting the spliced output to the fifth convolution layer of the decoding network; splicing the output of the second convolution layer of the coding network with the output of the third convolution layer of the decoding network, and then inputting the spliced output to the fourth convolution layer of the decoding network; splicing the output of the third convolutional layer of the coding network with the output of the second convolutional layer of the decoding network, and then inputting the spliced output to the third convolutional layer of the decoding network; the output of the fourth convolutional layer of the coding network is spliced with the output of the first convolutional layer of the decoding network and then input to the second convolutional layer of the decoding network.
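For illustration, the following PyTorch sketch shows how the TransNet splicing described above can be wired between a five-layer encoder and a five-layer decoder. The kernel sizes, the use of stride-1 convolutions (so that encoder and decoder feature maps align for concatenation), and the injection of the speaker label as broadcast one-hot channels are simplifying assumptions rather than the exact configuration of the embodiment; only the channel depths and the splicing pattern follow the text.

import torch
import torch.nn as nn

def conv(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU())

class TransitiveGenerator(nn.Module):
    def __init__(self, n_speakers=4):
        super().__init__()
        # Encoder depths 32, 64, 128, 64, 5 as listed above.
        self.e1, self.e2, self.e3 = conv(1, 32), conv(32, 64), conv(64, 128)
        self.e4, self.e5 = conv(128, 64), conv(64, 5)
        # Decoder depths 64, 128, 64, 32, 1; inputs widened by the spliced
        # encoder features (TransNet) and by the speaker-label channels.
        self.d1 = conv(5 + n_speakers, 64)
        self.d2 = conv(64 + 64, 128)     # d1 output + encoder layer 4 output
        self.d3 = conv(128 + 128, 64)    # d2 output + encoder layer 3 output
        self.d4 = conv(64 + 64, 32)      # d3 output + encoder layer 2 output
        self.d5 = nn.Conv2d(32 + 32, 1, kernel_size=3, padding=1)  # d4 output + encoder layer 1 output

    def forward(self, x, c):
        # x: (batch, 1, 36, frames) spectral block; c: (batch, n_speakers) one-hot label.
        h1 = self.e1(x); h2 = self.e2(h1); h3 = self.e3(h2)
        h4 = self.e4(h3); h5 = self.e5(h4)
        label = c[:, :, None, None].expand(-1, -1, h5.size(2), h5.size(3))
        g1 = self.d1(torch.cat([h5, label], dim=1))
        g2 = self.d2(torch.cat([g1, h4], dim=1))    # TransNet: encoder layer 4 -> decoder layer 2
        g3 = self.d3(torch.cat([g2, h3], dim=1))    # encoder layer 3 -> decoder layer 3
        g4 = self.d4(torch.cat([g3, h2], dim=1))    # encoder layer 2 -> decoder layer 4
        return self.d5(torch.cat([g4, h1], dim=1))  # encoder layer 1 -> decoder layer 5

The forward pass makes the four splices explicit: each encoder output is concatenated with the mirrored decoder output before the next layer, which is the part that the TransNet adds to a plain STARGAN generator.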
1.5) Take the generated target speaker spectral feature x_tc obtained in 1.4), the target speaker spectral feature x_t of the corpus obtained in 1.2), and the target speaker label c_t as the input of the discriminator, and train the discriminator so that the discriminator loss function L_D(D) is as small as possible.
The discriminator adopts a two-dimensional convolutional neural network, which comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers are respectively 3 × 9, 3 × 8, 3 × 6 and 36 × 5, the step sizes are respectively 1 × 1, 1 × 2 and 36 × 1, and the filter depths are respectively 32, 32 and 1;
the loss function of the discriminator is:
Figure BDA0002408460870000072
the optimization target is as follows:
Figure BDA0002408460870000073
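A simplified PyTorch sketch of such a conditional discriminator follows. The way the label is injected (broadcast over the time-frequency plane and concatenated as extra input channels) and the layer widths are assumptions for illustration; only the interface, a spectral block plus a speaker label mapped to a realness probability, matters for the training sketch given after step 1.8).

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, n_speakers=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + n_speakers, 32, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 3, padding=1),
            nn.AdaptiveAvgPool2d(1),     # one score per spectral block
            nn.Sigmoid(),                # probability that the input is a real spectrum
        )

    def forward(self, x, c):
        label = c[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, label], dim=1)).view(-1)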
1.6) Input the obtained target speaker spectral feature x_tc into the encoding network of the generator G again to obtain the speaker-independent semantic feature G(x_tc); then input the semantic feature G(x_tc) together with the source speaker label feature c_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtain the reconstructed source speaker spectral feature x_sc. The generator loss minimized during training includes the adversarial loss, the classification loss, the cycle-consistency loss, and the feature-mapping loss. The cycle-consistency loss drives the reconstructed source speaker spectral feature x_sc, obtained after x_s passes through the generator G twice, to be as consistent as possible with x_s. The feature-mapping loss guarantees that the speaker label of x_s remains c_s after passing through the generator G. The classification loss is the loss with which the classifier judges the probability that the target speaker spectrum x_tc generated by the generator belongs to the label c_t.
The loss function of the generator is:

L_G(G) = L_adv^G(G) + λ_cls L_cls^G(G) + λ_cyc L_cyc(G) + λ_id L_id(G)

The optimization target is:

min_G L_G(G)
where λ_cls ≥ 0, λ_cyc > 0, and λ_id > 0 are regularization parameters representing the weights of the classification loss, the cycle-consistency loss, and the feature-mapping loss, respectively.
L_adv^G(G) represents the adversarial loss of the generator in the GAN:

L_adv^G(G) = -E_{x_s, c_t}[log D(G(x_s, c_t), c_t)]

where E_{x_s, c_t}[·] denotes the expectation over the probability distribution generated by the generator, and G(x_s, c_t) denotes the spectral feature generated by the generator.
L_adv^G(G) and the discriminator loss L_D(D) together form the usual adversarial loss in a GAN, which is used to discriminate whether the spectrum input to the discriminator is a real spectrum or a generated spectrum. During training, L_adv^G(G) is made as small as possible, and the generator is continuously optimized until it generates spectral features G(x_s, c_t) realistic enough that the discriminator finds it difficult to tell real from fake.
L_cls^G(G) is the classification loss with which the classifier C optimizes the generator:

L_cls^G(G) = -E_{x_s, c_t}[log p_C(c_t | G(x_s, c_t))]

where p_C(c_t | G(x_s, c_t)) denotes the probability with which the classifier judges that the generated target speaker spectrum belongs to the label c_t, and G(x_s, c_t) denotes the target speaker spectrum generated by the generator. During training, L_cls^G(G) is made as small as possible, so that the spectrum G(x_s, c_t) generated by the generator G can be correctly classified as the label c_t by the classifier.
L_cyc(G) and L_id(G) follow the generator losses of the Cycle-GAN model. L_cyc(G) is the cycle-consistency loss of the generator G:

L_cyc(G) = E_{x_s, c_s, c_t}[|| G(G(x_s, c_t), c_s) - x_s ||_1]

where G(G(x_s, c_t), c_s) is the reconstructed source speaker spectral feature and E_{x_s, c_s, c_t}[·] is the expected loss between the reconstructed source speaker spectrum and the real source speaker spectrum. In training the generator, L_cyc(G) is made as small as possible, so that when the generated target spectrum G(x_s, c_t) and the source speaker label c_s are fed into the generator again, the reconstructed source speaker spectrum is as close as possible to x_s. Training with L_cyc(G) effectively ensures that the semantic features of the speaker's speech are not lost after being encoded by the generator.
L_id(G) is the feature-mapping loss of the generator G:

L_id(G) = E_{x_s, c_s}[|| G(x_s, c_s) - x_s ||_1]

where G(x_s, c_s) is the spectral feature obtained after the source speaker spectrum and the source speaker label are input into the generator, and E_{x_s, c_s}[·] is the expected loss between x_s and G(x_s, c_s). Training with L_id(G) effectively ensures that the label c_s of the input spectrum remains unchanged after it is input into the generator.
1.7) Input the generated target speaker spectral feature x_tc and the real target speaker spectral feature x_t into the classifier for training, and minimize the loss function of the classifier.
The classifier uses a two-dimensional convolutional neural network C, including 5 convolutional layers, the filter sizes of the 5 convolutional layers are 4 × 4, 3 × 4, and 1 × 4, respectively, the step sizes are 2 × 2, 1 × 2, and the filter depths are 8, 16, 32, 16, and 4, respectively.
The loss function of the two-dimensional convolutional neural network classifier is:

L_C(C) = -E_{x_t, c_t}[log p_C(c_t | x_t)]

The optimization target is:

min_C L_C(C)
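A simplified PyTorch sketch of such a classifier follows; the layer widths are illustrative, and only the interface, a spectral block mapped to logits over the speaker labels, matters for the training sketch below.

import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, n_speakers=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_speakers, 3, padding=1),
            nn.AdaptiveAvgPool2d(1),     # average the per-label maps over time and frequency
        )

    def forward(self, x):
        return self.features(x).flatten(1)   # (batch, n_speakers) logits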
1.8) Repeat steps 1.4), 1.5), 1.6), and 1.7) until the set number of iterations is reached, thereby obtaining the trained Transitive STARGAN network, whose trained parameters are the generator parameters φ, the discriminator parameters θ, and the classifier parameters ψ. The required number of iterations varies with the specific configuration of the neural network and the performance of the experimental equipment; in this experiment it was set to 100000.
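To make steps 1.4) to 1.8) concrete, the sketch below shows one training iteration over a batch (x_s, c_s, x_t, c_t) with the generator, discriminator, and classifier modules sketched above. The optimizer setup, the one-hot label handling, and the regularization weights λ_cls, λ_cyc, λ_id are assumptions for illustration, not values prescribed by the patent.

import torch
import torch.nn.functional as F

def train_step(x_s, c_s, x_t, c_t, generator, discriminator, classifier,
               opt_g, opt_d, opt_c, lambda_cls=1.0, lambda_cyc=10.0, lambda_id=5.0):
    eps = 1e-8
    # Generator update: adversarial, classification, cycle-consistency, feature-mapping losses.
    x_tc = generator(x_s, c_t)                      # source -> target conversion
    x_sc = generator(x_tc, c_s)                     # cycle back to the source speaker
    x_id = generator(x_s, c_s)                      # feature-mapping (identity) pass
    l_adv_g = -torch.log(discriminator(x_tc, c_t) + eps).mean()
    l_cls_g = F.cross_entropy(classifier(x_tc), c_t.argmax(dim=1))
    l_cyc = F.l1_loss(x_sc, x_s)
    l_id = F.l1_loss(x_id, x_s)
    loss_g = l_adv_g + lambda_cls * l_cls_g + lambda_cyc * l_cyc + lambda_id * l_id
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Discriminator update: real versus generated target spectra (L_D).
    d_real = discriminator(x_t, c_t)
    d_fake = discriminator(generator(x_s, c_t).detach(), c_t)
    loss_d = -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Classifier update: classify real target spectra by speaker label (L_C).
    loss_c = F.cross_entropy(classifier(x_t), c_t.argmax(dim=1))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    return loss_g.item(), loss_d.item(), loss_c.item()

Step 1.8) then simply wraps train_step in a loop over the training batches until the chosen iteration count (here 100000) is reached.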
1.9) Establish the fundamental frequency conversion relation using the mean and standard deviation of the logarithmic fundamental frequency log f_0: compute the mean and standard deviation of the logarithmic fundamental frequency of each speaker, and use a linear transformation in the logarithmic domain to convert the source speaker's logarithmic fundamental frequency log f_0s into the target speaker's logarithmic fundamental frequency log f_0t′.
The fundamental frequency conversion function is:

log f_0t′ = μ_t + (σ_t / σ_s)(log f_0s - μ_s)

where μ_s and σ_s are respectively the mean and standard deviation of the source speaker's fundamental frequency in the logarithmic domain, and μ_t and σ_t are respectively the mean and standard deviation of the target speaker's fundamental frequency in the logarithmic domain.
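In code, the statistics of 1.9) and the conversion function above reduce to a few lines; the sketch below assumes numpy arrays of per-frame f0 values with unvoiced frames marked by zeros, and leaving unvoiced frames untouched is an assumption rather than something the patent specifies.

import numpy as np

def logf0_statistics(f0_list):
    # Mean and standard deviation of log f0 over the voiced frames of a speaker.
    voiced = np.concatenate([f0[f0 > 0] for f0 in f0_list])
    logf0 = np.log(voiced)
    return logf0.mean(), logf0.std()

def convert_f0(f0, mu_s, sigma_s, mu_t, sigma_t):
    # Apply log f0_t' = mu_t + (sigma_t / sigma_s) * (log f0_s - mu_s).
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    converted[voiced] = np.exp(mu_t + (sigma_t / sigma_s) * (np.log(f0[voiced]) - mu_s))
    return converted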
The implementation steps of the conversion stage are as follows:
2.1) Extract the spectral feature x_s′, the aperiodic feature, and the fundamental frequency of different sentences of the source speaker through the WORLD speech analysis/synthesis model.
2.2) Take the source speaker spectral feature x_s′ extracted in 2.1) and the target speaker label feature c_t′ as the joint feature (x_s′, c_t′), and input it into the Transitive STARGAN network trained in 1.8), thereby obtaining the target speaker spectral feature x_tc′.
2.3) converting the fundamental frequency of the source speaker extracted in the step 2.1) into the fundamental frequency of the target speaker by the fundamental frequency conversion function obtained in the step 1.9).
2.4) Synthesize the converted speaker's speech from the target speaker spectral feature x_tc′ generated in 2.2), the target speaker fundamental frequency obtained in 2.3), and the aperiodic feature extracted in 2.1) through the WORLD speech analysis/synthesis model.
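The conversion stage can be assembled as below, reusing extract_features and convert_f0 from the sketches above and again assuming pyworld as the WORLD implementation; c_t is the one-hot target-label tensor of shape (1, n_speakers).

import numpy as np
import pyworld
import soundfile as sf
import torch

def convert(wav_path, out_path, generator, c_t, mu_s, sigma_s, mu_t, sigma_t):
    f0, sp, ap, mcep, fs = extract_features(wav_path)                 # step 2.1
    x = torch.from_numpy(mcep.T[None, None].astype(np.float32))       # (1, 1, 36, frames)
    with torch.no_grad():
        mcep_conv = generator(x, c_t)[0, 0].numpy().T                 # step 2.2
    mcep_conv = np.ascontiguousarray(mcep_conv, dtype=np.float64)
    f0_conv = convert_f0(f0, mu_s, sigma_s, mu_t, sigma_t)            # step 2.3
    sp_conv = pyworld.decode_spectral_envelope(mcep_conv, fs, 1024)   # back to the 513-dim envelope
    wav = pyworld.synthesize(f0_conv, sp_conv, ap, fs)                # step 2.4
    sf.write(out_path, wav, fs)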
FIG. 3 compares spectrograms of speech synthesized by the Transitive STARGAN model of the invention and by the reference STARGAN model. It can be seen that, compared with the reference STARGAN model, the Transitive STARGAN model retains semantics better, and the spectrogram of its synthesized speech has clearer details and more complete pitch and harmonic information, so the synthesized speech is finer and more realistic.
The training speed of the Transitive STARGAN model of the present invention was compared with the training speed of the reference STARGAN model, and the training loss of the generator is shown in fig. 4. It can be seen that compared with the reference STARGAN model, the generator network of the Transitive STARGAN model can reach the convergence state with fewer iterations, and meanwhile, has smaller training loss, so that the Transitive STARGAN model can accelerate the training speed of the network.
The embodiments of the present invention, if implemented in the form of software functional modules and sold or used as independent products, may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present invention also provide a computer storage medium having a computer program stored thereon. The computer program, when executed by a processor, may implement the aforementioned Transitive STARGAN-based many-to-many speaker conversion method. For example, the computer storage medium is a computer-readable storage medium.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (6)

1. A many-to-many speaker conversion method based on Transitive STARGAN is characterized by comprising a training phase and a conversion phase, wherein the training phase comprises the following steps:
(1.1) acquiring a training corpus, wherein the training corpus consists of corpora of a plurality of speakers and comprises a source speaker and a target speaker;
(1.2) extracting the spectral feature x, the aperiodic feature, and the fundamental frequency feature of each speaker's speech from the training corpus through the WORLD speech analysis/synthesis model;
(1.3) inputting the spectral feature x_s of the source speaker, the spectral feature x_t of the target speaker, the source speaker label c_s, and the target speaker label c_t into the Transitive STARGAN network for training, wherein the Transitive STARGAN network consists of a generator G, a discriminator D, and a classifier C, the generator G consists of an encoding network and a decoding network, and multiple layers of TransNet used for optimizing the generator network structure are built between the encoding network and the decoding network;
(1.4) in the training process, making the loss function of the generator G, the loss function of the discriminator D, and the loss function of the classifier C as small as possible until the set number of iterations is reached, so as to obtain the trained Transitive STARGAN network;
(1.5) constructing a fundamental frequency conversion function from the voice fundamental frequency of the source speaker to the voice fundamental frequency of the target speaker;
the transition phase comprises the steps of:
(2.1) extracting the spectral feature x_s′, the aperiodic feature, and the fundamental frequency feature from the source speaker's speech in the corpus to be converted through the WORLD speech analysis/synthesis model;
(2.2) inputting the source speaker spectral feature x_s′ and the target speaker label feature c_t′ into the Transitive STARGAN network trained in step (1.4) to obtain the target speaker spectral feature x_tc′;
(2.3) converting the source speaker fundamental frequency feature extracted in the step (2.1) into the fundamental frequency feature of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);
(2.4) synthesizing the target speaker spectral feature x_tc′ generated in step (2.2), the target speaker fundamental frequency feature obtained in step (2.3), and the aperiodic feature extracted in step (2.1) through the WORLD speech analysis/synthesis model to obtain the converted speaker speech.
2. The Transitive STARGAN-based many-to-many speaker conversion method according to claim 1, wherein: the encoding network of the generator G comprises 5 convolutional layers, the decoding network of the generator G comprises 5 deconvolution layers, and the constructed TransNet is 4 layers; specifically, the output of the first convolutional layer of the encoding network is spliced with the output of the fourth convolutional layer of the decoding network and then input into the fifth convolutional layer of the decoding network; the output of the second convolutional layer of the encoding network is spliced with the output of the third convolutional layer of the decoding network and then input to the fourth convolutional layer of the decoding network; the output of the third convolutional layer of the encoding network is spliced with the output of the second convolutional layer of the decoding network and then input to the third convolutional layer of the decoding network; and the output of the fourth convolutional layer of the encoding network is spliced with the output of the first convolutional layer of the decoding network and then input to the second convolutional layer of the decoding network.
3. The Transitive STARGAN-based many-to-many speaker conversion method according to claim 2, wherein: the filter sizes of the 5 convolution layers of the encoding network of the generator G are respectively 3 × 9, 4 × 8, 3 × 5 and 9 × 5, the step sizes are respectively 1 × 1, 2 × 2, 1 × 1 and 9 × 1, and the filter depths are respectively 32, 64, 128, 64 and 5; the filter sizes of the 5 deconvolution layers of the decoding network of the generator G are respectively 9 × 5, 3 × 5, 4 × 8 and 3 × 9, the step sizes are respectively 9 × 1, 1 × 1, 2 × 2 and 1 × 1, and the filter depths are respectively 64, 128, 64, 32 and 1; the discriminator D comprises 5 convolution layers, the filter sizes of the 5 convolution layers are respectively 3 × 9, 3 × 8, 3 × 6 and 36 × 5, the step sizes are respectively 1 × 1, 1 × 2 and 36 × 1, and the filter depths are respectively 32, 32 and 1; the classifier C comprises 5 convolution layers, the filter sizes of the 5 convolution layers are respectively 4 × 4, 3 × 4 and 1 × 4, the step sizes are respectively 2 × 2 and 1 × 2, and the filter depths are respectively 8, 16, 32, 16 and 4.
4. The Transitive STARGAN-based many-to-many speaker conversion method according to claim 1, wherein: the training process in steps (1.3) and (1.4) comprises the following steps:
(1) inputting the source speaker spectral feature x_s into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s);
(2) inputting the semantic feature G(x_s) together with the target speaker label feature c_t into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the target speaker spectral feature x_tc;
(3) inputting the generated target speaker spectral feature x_tc into the encoding network of the generator G again to obtain the speaker-independent semantic feature G(x_tc);
(4) inputting the semantic feature G(x_tc) together with the source speaker label feature c_s into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, and obtaining the reconstructed source speaker spectral feature x_sc;
(5) inputting the target speaker spectral feature x_tc generated in step (2), the real target speaker spectral feature x_t, and the target speaker label feature c_t into the discriminator D for training, minimizing the loss function of the discriminator D;
(6) inputting the target speaker spectral feature x_tc generated in step (2) and the real target speaker spectral feature x_t into the classifier C for training, minimizing the loss function of the classifier C;
(7) returning to step (1) and repeating the above steps until the set number of iterations is reached, thereby obtaining the trained Transitive STARGAN network.
5. The Transitive STARGAN-based many-to-many speaker conversion method according to claim 1, wherein: the input process in step (2.2) comprises the following steps:
(1) inputting the source speaker spectral feature x_s′ into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s′);
(2) inputting the semantic feature G(x_s′) together with the target speaker label feature c_t′ into the decoding network of the generator G to obtain the target speaker spectral feature x_tc′.
6. A computer storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a computer processor, implementing the method of any one of claims 1 to 5.
CN202010168932.2A 2020-03-12 2020-03-12 Many-to-many speaker conversion method based on Transitive STARGAN Pending CN111429893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010168932.2A CN111429893A (en) 2020-03-12 2020-03-12 Many-to-many speaker conversion method based on Transitive STARGAN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010168932.2A CN111429893A (en) 2020-03-12 2020-03-12 Many-to-many speaker conversion method based on Transitive STARGAN

Publications (1)

Publication Number Publication Date
CN111429893A true CN111429893A (en) 2020-07-17

Family

ID=71546419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010168932.2A Pending CN111429893A (en) 2020-03-12 2020-03-12 Many-to-many speaker conversion method based on Transitive STARGAN

Country Status (1)

Country Link
CN (1) CN111429893A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259086A (en) * 2020-10-15 2021-01-22 杭州电子科技大学 Speech conversion method based on spectrogram synthesis
CN112863529A (en) * 2020-12-31 2021-05-28 平安科技(深圳)有限公司 Speaker voice conversion method based on counterstudy and related equipment
CN112862909A (en) * 2021-02-05 2021-05-28 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN113053354A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Method and equipment for improving voice synthesis effect
CN113096673A (en) * 2021-03-30 2021-07-09 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113688944A (en) * 2021-09-29 2021-11-23 南京览众智能科技有限公司 Image identification method based on meta-learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060657A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on SN
CN110060690A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN and ResNet
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060657A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on SN
CN110060690A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN and ResNet
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHAOJIN CAI et al.: "Stain Style Transfer using Transitive Adversarial Networks" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259086A (en) * 2020-10-15 2021-01-22 杭州电子科技大学 Speech conversion method based on spectrogram synthesis
CN112863529A (en) * 2020-12-31 2021-05-28 平安科技(深圳)有限公司 Speaker voice conversion method based on counterstudy and related equipment
WO2022142115A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Adversarial learning-based speaker voice conversion method and related device
CN112863529B (en) * 2020-12-31 2023-09-22 平安科技(深圳)有限公司 Speaker voice conversion method based on countermeasure learning and related equipment
CN112862909A (en) * 2021-02-05 2021-05-28 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN113053354A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Method and equipment for improving voice synthesis effect
CN113096673A (en) * 2021-03-30 2021-07-09 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113096673B (en) * 2021-03-30 2022-09-30 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113688944A (en) * 2021-09-29 2021-11-23 南京览众智能科技有限公司 Image identification method based on meta-learning
CN113688944B (en) * 2021-09-29 2022-12-27 南京览众智能科技有限公司 Image identification method based on meta-learning

Similar Documents

Publication Publication Date Title
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
CN111785261B (en) Cross-language voice conversion method and system based on entanglement and explanatory characterization
CN111816156B (en) Multi-to-multi voice conversion method and system based on speaker style feature modeling
CN111429893A (en) Many-to-many speaker conversion method based on Transitive STARGAN
CN109326283B (en) Many-to-many voice conversion method based on text encoder under non-parallel text condition
Gao et al. Nonparallel emotional speech conversion
CN109599091B (en) Star-WAN-GP and x-vector based many-to-many speaker conversion method
CN111429894A (en) Many-to-many speaker conversion method based on SE-ResNet STARGAN
CN110060701B (en) Many-to-many voice conversion method based on VAWGAN-AC
CN111462768B (en) Multi-scale StarGAN voice conversion method based on shared training
CN110335587B (en) Speech synthesis method, system, terminal device and readable storage medium
CN110060657B (en) SN-based many-to-many speaker conversion method
CN111833855B (en) Multi-to-multi speaker conversion method based on DenseNet STARGAN
CN110060691B (en) Many-to-many voice conversion method based on i-vector and VARSGAN
CN112634920A (en) Method and device for training voice conversion model based on domain separation
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN112949255A (en) Word vector training method and device
Malik et al. A preliminary study on augmenting speech emotion recognition using a diffusion model
CN113593534B (en) Method and device for multi-accent speech recognition
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN115019785A (en) Streaming voice recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210003 Gulou District, Jiangsu, Nanjing new model road, No. 66

Applicant after: NANJING University OF POSTS AND TELECOMMUNICATIONS

Address before: Yuen Road Qixia District of Nanjing City, Jiangsu Province, No. 9 210003

Applicant before: NANJING University OF POSTS AND TELECOMMUNICATIONS

RJ01 Rejection of invention patent application after publication

Application publication date: 20200717