CN111816156B - Many-to-many voice conversion method and system based on speaker style feature modeling - Google Patents

Many-to-many voice conversion method and system based on speaker style feature modeling

Info

Publication number
CN111816156B
Authority
CN
China
Prior art keywords
speaker
network
style
generator
training
Prior art date
Legal status
Active
Application number
CN202010488776.8A
Other languages
Chinese (zh)
Other versions
CN111816156A (en)
Inventor
李燕萍
张成飞
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010488776.8A priority Critical patent/CN111816156B/en
Publication of CN111816156A publication Critical patent/CN111816156A/en
Application granted granted Critical
Publication of CN111816156B publication Critical patent/CN111816156B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 characterised by the type of extracted parameters
    • G10L25/18 the extracted parameters being spectral information of each sub-band
    • G10L25/27 characterised by the analysis technique
    • G10L25/30 using neural networks
    • G10L25/48 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a many-to-many voice conversion method and system based on speaker style feature modeling. First, a multi-layer perceptron and a style encoder are added to a StarGAN neural network to realize effective extraction and constraint of speaker style features, overcoming the defect that the one-hot vectors in the conventional model carry only limited speaker information. Then, an adaptive instance normalization method is adopted to realize full fusion of semantic features and speaker personality features, so that the network can learn more semantic information and more speaker personality information. A lightweight network module, SKNet, is further introduced into the residual network of the generator, so that the network can adaptively adjust the size of its receptive field according to multiple scales of the input information and adjust the weight of each feature channel through an attention mechanism, which enhances the learning ability for spectral features and refines spectral feature details.

Description

Many-to-many voice conversion method and system based on speaker style feature modeling
Technical Field
The invention relates to the technical field of voice conversion, and in particular to a many-to-many voice conversion method based on speaker style feature modeling.
Background
Voice conversion is a branch of research in the field of speech signal processing, developed on the basis of research on speech analysis, speech synthesis and speaker recognition. The goal of voice conversion is to change the voice personality features of the source speaker so that they carry the personality features of the target speaker while keeping the semantic information unchanged; that is, after conversion, the voice of the source speaker sounds like the voice of the target speaker.
Through many years of research, a number of classical conversion methods have emerged in voice conversion technology. Classified according to the training corpus they require, they can be divided into conversion methods under parallel text conditions and conversion methods under non-parallel text conditions. Collecting a large number of parallel training texts in advance is time-consuming and labor-intensive, and parallel texts cannot be collected at all in cross-language conversion or in medical assistance systems. Therefore, research on voice conversion under non-parallel text conditions has broader application prospects and greater practical significance.
Existing voice conversion methods under non-parallel text conditions include methods based on Cycle-Consistent Adversarial Networks (Cycle-GAN) and methods based on the Conditional Variational Auto-Encoder (C-VAE), among others. The voice conversion method based on the C-VAE model directly uses the identity label of the speaker to build a voice conversion system, in which the encoder decouples the semantic information from the personality information of the speech and the decoder reconstructs the speech from the semantics and the speaker identity label, thereby removing the dependence on parallel texts. However, because the C-VAE rests on the idealized assumption that the observed data follow a Gaussian distribution, the output speech of its decoder is over-smoothed and the quality of the converted speech is poor. The voice conversion method based on the Cycle-GAN model uses an adversarial loss and a cycle consistency loss to learn the forward and inverse mappings of the acoustic features simultaneously, which can effectively alleviate the over-smoothing problem and improve the quality of the converted speech, but Cycle-GAN can currently only realize one-to-one voice conversion.
The voice conversion method based on the Star Generative Adversarial Network (StarGAN) model combines the advantages of C-VAE and Cycle-GAN: its generator has an encoding-decoding structure and can learn many-to-many mappings simultaneously, and the attributes of the generator output are controlled by the speaker identity label, so that voice conversion under non-parallel text conditions is realized. However, three problems remain. First, the speaker identity label is only a one-hot vector; although it has an indicating effect, it cannot provide richer speaker identity information, and this lack of speaker identity information makes it difficult for the generator to reconstruct converted speech with high personality similarity. Second, the speaker identity label in the decoding network of the generator controls the output attributes only through simple concatenation, which cannot adequately fuse the semantic features with the speaker personality features, so deep semantic features and speaker personality features in the spectrum are easily lost in transmission. In addition, the encoding network and the decoding network in the generator are independent of each other, and this simple network structure leaves the generator without the ability to extract deep features, which very easily causes loss of information and generation of noise.
Disclosure of Invention
The invention aims to: in order to overcome the defects of the prior art, the invention provides a many-to-many voice conversion method based on speaker style feature modeling, which solves the problems of the existing methods that the speaker label carries insufficient personality information, that semantic features and speaker features are fused only by simple concatenation, and that the receptive field and the channel weights in the residual network are fixed. In another aspect, the invention also provides a many-to-many voice conversion system based on speaker style feature modeling.
The technical scheme is as follows: according to a first aspect of the present invention, a many-to-many voice conversion method based on speaker style feature modeling is presented, comprising a training phase and a conversion phase, the training phase comprising the following steps:
(1.1) obtaining a training corpus, wherein the training corpus consists of corpora of a plurality of speakers, and the speakers comprise source speakers and target speakers;
(1.2) extracting spectral features x of voices of all speakers in the training corpus;
(1.3) inputting the spectral features x of each speaker's voice, the source speaker label c_s, the target speaker label c_t and random noise z obeying a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layers are built into a residual network between the encoding network and the decoding network;
(1.4) during training, the loss function of the generator G and the loss function of the discriminator D are made as small as possible until the set number of iterations is reached, so as to obtain a trained SKNet StarGAN network;
(1.5) constructing a fundamental frequency conversion function from the fundamental frequency of the voice of the source speaker to the fundamental frequency of the voice of the target speaker;
the conversion phase comprises the following steps:
(2.1) extracting the spectral features x_s', aperiodic features and fundamental frequency features from the voice of the source speaker in the corpus to be converted;
(2.2) inputting the source speaker spectral features x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the SKNet StarGAN network trained in step (1.4) to obtain the spectral features x_st' of the target speaker;
(2.3) converting the fundamental frequency characteristics of the source speaker extracted in the step (2.1) into fundamental frequency characteristics of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);
(2.4) synthesizing the spectral features x_st' of the target speaker generated in step (2.2), the fundamental frequency features of the target speaker obtained in step (2.3) and the aperiodic features extracted in step (2.1) through the WORLD speech analysis/synthesis model to obtain the converted speaker voice.
Further, the method comprises the steps of:
and the SKNet built between the coding network and the decoding network is 6 layers.
Further, the method comprises the steps of:
The style encoder S comprises 6 one-dimensional convolutions, with filter sizes of 1, 1 and 16 respectively, step sizes of 1, and filter depths of 32, 64, 128, 256, 512 and 512 respectively; the middle layers comprise 5 one-dimensional average pooling layers and 5 residual networks, each one-dimensional average pooling layer has a filter size of 2 and a step size of 2, and each residual network layer comprises 2 one-dimensional convolutions, each with a filter size of 2, a step size of 2 and a depth of 2 times the filter depth of the previous layer.
Further, the method comprises the steps of:
The multi-layer perceptron M comprises 7 linear layers: the input layer has 16 input neurons and 512 output neurons, the 5 linear layers of the middle layer each have 512 input neurons and 512 output neurons, and the output layer has 512 input neurons and a number of output neurons equal to 64 times the number of speakers involved in the voice conversion.
Further, the method comprises the steps of:
the training process of the steps (1.3) and (1.4) comprises the following steps:
(1) inputting random noise z obeying a normal distribution and the target speaker label feature c_t into the multi-layer perceptron M to obtain the style features s_t of the target speaker;
(2) inputting the spectral features x_s of the source speaker into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s);
(3) inputting the generated semantic features G(x_s) together with the style features s_t of the target speaker into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, so as to obtain the spectral features x_st of the target speaker;
(4) inputting the spectral features x_s of the source speaker and the source speaker label feature c_s into the style encoder S to obtain the style indication feature ŝ_s of the source speaker;
(5) inputting the generated spectral features x_st of the target speaker into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_st);
(6) inputting the generated semantic features G(x_st) together with the style indication feature ŝ_s of the source speaker into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, so as to obtain the reconstructed spectral features of the source speaker;
(7) inputting the spectral features x_st of the target speaker generated in step (3) into the discriminator D and the classifier C for training, minimizing the loss function of the discriminator D and the loss function of the classifier C;
(8) inputting the spectral features x_st of the target speaker generated in step (3) and the target speaker label feature c_t into the style encoder S for training, minimizing the style reconstruction loss function of the style encoder S;
(9) returning to step (1) and repeating the above steps until the set number of iterations is reached, thereby obtaining the trained SKNet StarGAN network.
Further, the method comprises the steps of:
the style reconstruction loss function of the style encoder S is expressed as:
L_sty = E_{x_s, s_t}[ || s_t - S(G(x_s, s_t)) ||_1 ]
where E_{x_s, s_t}[·] denotes the expectation over the probability distribution generated by the generator, S(·) is the style encoder, s_t denotes the style features of the target speaker, G(x_s, s_t) denotes the spectral features of the target speaker generated by the generator, and x_s is the spectral features of the source speaker.
Further, the method comprises the steps of:
the input process of the step (2.2) comprises the following steps:
(1) inputting random noise z' obeying a normal distribution and the target speaker label feature c_t' into the multi-layer perceptron M to obtain the style features s_t' of the target speaker;
(2) inputting the spectral features x_s' of the source speaker into the encoding network of the generator G to obtain speaker-independent semantic features G(x_s');
(3) inputting the generated semantic features G(x_s') together with the style features s_t' of the target speaker into the decoding network of the generator G to obtain the spectral features x_st' of the target speaker.
Further, the method comprises the steps of:
the objective function of the SKNet StarGAN network is expressed as:
L_SKNetStarGAN = L_G + L_D
where L_G is the loss function of the generator and L_D is the loss function of the discriminator;
the loss function L_G of the generator is expressed as:
L_G = L_adv^G + λ_cyc·L_cyc + λ_ds·L_ds + λ_sty·L_sty + λ_cls·L_cls^G
where λ_cyc, λ_ds, λ_sty and λ_cls are a group of regularization hyper-parameters that respectively represent the weights of the cycle consistency loss, the style diversity loss, the style reconstruction loss and the classification loss, and L_adv^G, L_cyc, L_ds, L_sty and L_cls^G respectively represent the adversarial loss of the generator, the cycle consistency loss, the style diversity loss, the style reconstruction loss of the style encoder and the classification loss of the classifier;
the loss function L_D of the discriminator is:
L_D = L_adv^D + λ_cls·L_cls^D
where λ_cls is the weight of the classification loss, and L_adv^D and L_cls^D are respectively the adversarial loss of the discriminator and the classification loss of the classifier.
On the other hand, the invention also provides a many-to-many voice conversion system based on speaker style feature modeling, comprising a training stage and a conversion stage, wherein the training stage comprises:
the corpus acquisition module is used for acquiring training corpus, wherein the training corpus consists of the corpus of a plurality of speakers, and the speakers comprise source speakers and target speakers;
the preprocessing module is used for extracting the spectral characteristics x of the voices of each speaker in the training corpus;
The network training module is used for inputting the spectral features x of each speaker's voice, the source speaker label c_s, the target speaker label c_t and random noise z obeying a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layers are built into a residual network between the encoding network and the decoding network;
the training process makes the loss function of the generator G and the loss function of the discriminator D as small as possible until the set iteration times are reached, so that a trained SKNet StarGAN network is obtained;
a function construction module for constructing a fundamental frequency conversion function from a fundamental frequency of a voice of a source speaker to a fundamental frequency of a voice of a target speaker;
the conversion phase comprises:
the source voice processing module is used for extracting the spectral features x_s', aperiodic features and fundamental frequency features from the voice of the source speaker in the corpus to be converted;
the conversion module is used for inputting the source speaker spectral features x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the trained SKNet StarGAN network to obtain the spectral features x_st' of the target speaker;
the target feature acquisition module is used for converting the fundamental frequency features of the source speaker into the fundamental frequency features of the target speaker through the obtained fundamental frequency conversion function;
the speaker voice acquisition module is used for synthesizing the generated spectral features x_st' of the target speaker, the fundamental frequency features of the target speaker and the aperiodic features through the WORLD speech analysis/synthesis model to obtain the converted speaker voice.
Furthermore, the present invention provides a computer storage medium having a computer program stored thereon which, when executed by a computer processor, implements the method described above.
The beneficial effects are that: (1) the invention adds a multi-layer perceptron and a style encoder to obtain richer speaker personality features and uses these speaker personality features in place of the speaker label, overcoming the defect that one-hot vectors carry only limited speaker information; this helps the decoding network learn more speaker personality features, improves the personality similarity of the converted speech and yields more satisfactory converted speech. (2) By adopting adaptive instance normalization, the invention can fully fuse semantic features with speaker personality features and improves the ability of the decoding network to learn spectral features at different scales; at the same time, an SKNet module is added between the encoding network and the decoding network of the generator, so that the network can adaptively adjust the size of its receptive field according to multiple scales of the input information and adjust the weight of each feature channel through an attention mechanism, refining the spectral feature details and making the generated spectrum clearer, more natural and finer. (3) The training network is more stable and efficient: the normalization method speeds up network training and avoids vanishing or exploding gradients during back-propagation, while the residual network effectively solves the network degradation problem during training. The SKNet StarGAN network therefore realizes a many-to-many voice conversion method with high speech quality and high personality similarity under non-parallel text conditions, and has good application prospects in fields such as cross-language voice conversion, film dubbing and speech translation.
Drawings
FIG. 1 is a schematic diagram of the SKNet StarGAN of the method;
FIG. 2 is a network structure diagram of a generator of a model SKNet StarGAN of the present method;
FIG. 3 is a schematic diagram of the SKNet principle in the model SKNet StarGAN of the method;
FIG. 4 is a network structure diagram of a discriminator of the model SKNet StarGAN of the present method;
FIG. 5 is a network structure diagram of the multi-layer perceptron of the SKNet StarGAN model of the method;
FIG. 6 is a network structure diagram of a style encoder of the model SKNet StarGAN of the present method;
FIG. 7 is a graph comparing the speech spectra of speech synthesized by the SKNet StarGAN model of the method and by the reference StarGAN model under the same-gender conversion condition;
FIG. 8 is a graph comparing the speech spectra of speech synthesized by the SKNet StarGAN model of the method and by the reference StarGAN model under the cross-gender conversion condition;
FIG. 9 is a graph comparing the convergence speed of the generator reconstruction loss for the SKNet StarGAN model of the method and for the reference StarGAN model.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a many-to-many voice conversion method based on speaker style feature modeling. A multi-layer perceptron and a style encoder are added to the conventional StarGAN neural network to realize effective extraction and constraint of the speaker style features, and these speaker style features are used in place of the speaker label features, overcoming the defect that the one-hot vectors in the conventional model carry only limited speaker information. Second, the semantic features and the speaker personality features are fully fused within the generator network by means of adaptive instance normalization, enhancing the learning and expressive capability of the generator network. Further, an SKNet module is added between the encoding network and the decoding network of the generator, so that the network can adaptively adjust the size of its receptive field according to multiple scales of the input information and adjust the weight of each feature channel through an attention mechanism, thereby refining the spectral feature details. The SKNet StarGAN network based on speaker style feature modeling can thus produce converted speech with high quality and high personality similarity. The improved StarGAN is referred to herein as SKNet StarGAN.
As shown in fig. 1, the method implemented in this example is divided into two parts: the training section is used for obtaining parameters and conversion functions required for voice conversion, and the conversion section is used for realizing conversion of the voice of the source speaker into the voice of the target speaker.
The training stage comprises the following implementation steps:
1.1) A training corpus of non-parallel texts is obtained; the training corpus consists of the corpora of several speakers, including source speakers and target speakers. The training corpus is taken from the VCC2018 speech corpus, which contains 6 male and 6 female speakers, each with 81 training sentences and 35 test sentences. In this experiment, 4 female speakers and 4 male speakers were selected, namely VCC2SF3, VCC2SF4, VCC2TF1, VCC2TF2, VCC2SM3, VCC2SM4, VCC2TM1 and VCC2TM2.
1.2) The spectral envelope features x, aperiodic features and logarithmic fundamental frequency log f_0 of each sentence of each speaker in the training corpus are extracted through the WORLD speech analysis/synthesis model. Since the length of the fast Fourier transform (FFT) is set to 1024, the obtained spectral envelope features x and aperiodic features have 1024/2 + 1 = 513 dimensions. Each speech block has 512 frames, and for each frame a 36-dimensional Mel-cepstral coefficient (MCEP) feature is extracted as the spectral feature of the SKNet StarGAN model; 8 speech blocks are trained at a time, so each training batch has dimensions 8 × 36 × 512.
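As an illustration of this step, the following sketch extracts the WORLD features and the 36-dimensional MCEPs with the pyworld and pysptk Python packages; the package choice, the file name and the all-pass constant alpha are assumptions of this description, not requirements of the invention.

```python
# Sketch of step 1.2): WORLD analysis + 36-dim MCEP extraction.
# Assumes the pyworld, pysptk, numpy and soundfile packages; the file path is hypothetical.
import numpy as np
import soundfile as sf
import pyworld
import pysptk

FFT_SIZE = 1024          # spectral envelope / aperiodicity -> 1024/2 + 1 = 513 dims
MCEP_DIM = 36            # Mel-cepstral coefficients used as spectral features
FRAMES_PER_BLOCK = 512   # one speech block = 512 frames

def extract_features(wav_path, alpha=0.455):
    """Return (mcep, f0, ap, sp) for one utterance."""
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)
    f0, timeaxis = pyworld.harvest(x, fs)                             # fundamental frequency
    sp = pyworld.cheaptrick(x, f0, timeaxis, fs, fft_size=FFT_SIZE)   # spectral envelope (513-dim)
    ap = pyworld.d4c(x, f0, timeaxis, fs, fft_size=FFT_SIZE)          # aperiodicity (513-dim)
    mcep = pysptk.sp2mc(sp, order=MCEP_DIM - 1, alpha=alpha)          # 36-dim MCEP per frame
    return mcep, f0, ap, sp

if __name__ == "__main__":
    mcep, f0, ap, sp = extract_features("VCC2SF3_10001.wav")          # hypothetical file name
    # Crop/stack into blocks of 512 frames; a batch of 8 blocks is 8 x 36 x 512.
    n_blocks = mcep.shape[0] // FRAMES_PER_BLOCK
    blocks = mcep[: n_blocks * FRAMES_PER_BLOCK].reshape(n_blocks, FRAMES_PER_BLOCK, MCEP_DIM)
    batch = blocks[:8].transpose(0, 2, 1)                             # (up to 8, 36, 512)
    print(batch.shape)
```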
1.3) The SKNet StarGAN network in this embodiment is based on the StarGAN model. It proposes, on the one hand, adding a style encoder and a multi-layer perceptron to realize effective modeling and extraction of the speaker style features, and on the other hand, realizing full fusion of the semantic features and the speaker style features through an adaptive instance normalization method; a lightweight network module, SKNet, is further introduced to refine the spectral features. SKNet StarGAN consists of five parts: a generator G for generating the spectrum, a discriminator D for judging the source of the spectrum, a classifier C for judging the label attribution of the generated spectrum, a multi-layer perceptron M for generating the speaker style features, and a style encoder S for constraining the speaker style features.
The objective function of the SKNet StarGAN network is:
L_SKNetStarGAN = L_G + L_D
where L_G is the loss function of the generator:
L_G = L_adv^G + λ_cyc·L_cyc + λ_ds·L_ds + λ_sty·L_sty + λ_cls·L_cls^G
where λ_cyc, λ_ds, λ_sty and λ_cls are a group of regularization hyper-parameters that respectively represent the weights of the cycle consistency loss, the style diversity loss, the style reconstruction loss and the classification loss, and L_adv^G, L_cyc, L_ds, L_sty and L_cls^G respectively represent the adversarial loss of the generator, the cycle consistency loss, the style diversity loss, the style reconstruction loss of the style encoder and the classification loss of the classifier.
The loss function of the discriminator is:
L_D = L_adv^D + λ_cls·L_cls^D
where λ_cls is the weight of the classification loss, and L_adv^D and L_cls^D are respectively the adversarial loss of the discriminator and the classification loss of the classifier.
1.4) Random noise z obeying a normal distribution and the target speaker label feature c_t are input as a joint feature (z, c_t) into the multi-layer perceptron to obtain the speaker style features s_t.
1.5) The extracted spectral features x_s of the source speaker and the style features s_t obtained in 1.4) are input as a joint feature (x_s, s_t) into the generator for training, making the loss function L_G of the generator as small as possible, so as to obtain the spectral features x_st of the target speaker.
As shown in FIG. 2, the generator employs a two-dimensional convolutional neural network consisting of an encoding network, a decoding network and several SKNet layers. The encoding network comprises 3 two-dimensional convolution layers with filter sizes (k) of 3×9, 4×8 and 4×8, step sizes (s) of 1×1, 2×2 and 2×2, and filter depths (c) of 64, 128 and 256, respectively. The decoding network comprises 2 two-dimensional deconvolution layers (ConvT2) with filter sizes of 4×4, step sizes of 2×2, and filter depths of 128 and 64, respectively. The output layer contains 1 two-dimensional convolution with a filter size of 3×9, a step size of 1×1 and a filter depth of 1. Several SKNet layers are built into a residual network between the encoding network and the decoding network; the intermediate output of each layer of the residual network is passed through one SKNet, concatenated and fed to the next layer. SKNet is short for Selective Kernel Networks, a lightweight embeddable module inspired by the fact that the receptive field size of visual cortex neurons adjusts according to the stimulus when objects of different sizes and different distances are viewed.
In this embodiment, 6 SKNet layers are preferably used. The principle of SKNet is shown in FIG. 3. First, through a Split operation, the network is split into two branches that are convolved separately. Through a Fuse operation, the output results of the two branches are summed and globally average-pooled, turning each two-dimensional feature channel into a real number with a global receptive field; dimension reduction and dimension expansion are then realized by two convolution layers to obtain two groups of channel information, and the weights of the two groups of channels are further obtained through a Softmax function. Finally, through a Select operation, the two groups of channel weights are applied to the channel features of the convolution outputs of the two branches, completing the recalibration in the channel dimension; the two groups of recalibrated outputs are then summed and output to the next layer.
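The Split-Fuse-Select procedure described above can be illustrated with the following PyTorch sketch of a two-branch selective-kernel block; the kernel sizes, channel count and reduction ratio are illustrative assumptions rather than values prescribed by this embodiment.

```python
# Minimal two-branch Selective Kernel (SK) block, following the Split/Fuse/Select description.
# Channel size and the reduction ratio are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SKBlock(nn.Module):
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        # Split: two branches with different receptive fields (3x3 and 5x5).
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        # Fuse: squeeze each channel to a scalar, then reduce and expand.
        hidden = max(channels // reduction, 8)
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)
        self.expand3 = nn.Conv2d(hidden, channels, kernel_size=1)
        self.expand5 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        u = u3 + u5                                    # Fuse: sum the two branches
        s = F.adaptive_avg_pool2d(u, 1)                # global average pooling -> (B, C, 1, 1)
        z = F.relu(self.reduce(s))                     # dimension reduction
        a3, a5 = self.expand3(z), self.expand5(z)      # two groups of channel logits
        attn = torch.softmax(torch.stack([a3, a5], dim=0), dim=0)  # Select: softmax over branches
        return attn[0] * u3 + attn[1] * u5             # re-calibrated sum passed to the next layer

# Example: a 256-channel feature map from the generator's residual network.
if __name__ == "__main__":
    feat = torch.randn(8, 256, 9, 128)   # (batch, channels, freq, time): illustrative shape
    print(SKBlock(256)(feat).shape)
```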
The block structures of the first three SKNet layers are the same, each consisting in sequence of a convolution layer (Conv2), a normalization layer (Instance Norm), a rectified linear unit (ReLU), an SKNet layer, a convolution layer and a normalization layer, with a filter size of 3×3, a depth of 256 and a step size of 1×1. The block structure of the last three SKNet layers differs slightly from that of the first three: the normalization layer is replaced by adaptive instance normalization (AdaIN), which realizes full fusion of the semantic features and the speaker personality features; the filter size is 3×3, the depth is 256 and the step size is 1×1.
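The adaptive instance normalization used in the last three residual blocks can be sketched as follows; the 64-dimensional style vector and the affine projection layer are assumptions made for illustration.

```python
# Sketch of adaptive instance normalization (AdaIN): the style vector predicts the per-channel
# scale and bias applied after instance normalization. Dimensions are assumptions.
import torch
import torch.nn as nn

class AdaIN2d(nn.Module):
    def __init__(self, channels=256, style_dim=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(style_dim, channels * 2)    # -> gamma and beta

    def forward(self, x, style):
        gamma, beta = self.affine(style).chunk(2, dim=1)    # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta            # fuse style with semantic features

if __name__ == "__main__":
    x = torch.randn(8, 256, 9, 128)     # semantic feature map from the encoding network
    s = torch.randn(8, 64)              # speaker style features from the perceptron M
    print(AdaIN2d()(x, s).shape)
```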
The filter sizes of the 3 two-dimensional convolution layers of the encoding network of the generator G are 3×9, 4×8 and 4×8, the step sizes are 1×1, 2×2 and 2×2, and the filter depths are 64, 128 and 256, respectively; the filter sizes of the 2 two-dimensional deconvolution layers of the decoding network are 4×4, the step sizes are 2×2, and the filter depths are 128 and 64, respectively; the 1 two-dimensional convolution of the output layer has a filter size of 3×9, a step size of 1×1 and a filter depth of 1. The filter sizes of the 5 two-dimensional convolution layers shared by the discriminator D and the classifier C are 4×4, the step sizes are 2×2, and the filter depths are 64, 128, 256, 512 and 1024, respectively; the two-dimensional convolution of the output layer of the discriminator D has a filter size of 1×16, a step size of 1×1 and a filter depth of 1; the two-dimensional convolution of the output layer of the classifier C has a filter size of 1×8, a step size of 1×1 and a filter depth equal to the number of conversion speakers.
Specifically, as shown in FIG. 6, the style encoder S comprises 6 one-dimensional convolutions with filter sizes of 1, 1 and 16, step sizes of 1, and filter depths of 32, 64, 128, 256, 512 and 512, respectively; the middle layers comprise 5 one-dimensional average pooling layers and 5 residual networks, each one-dimensional average pooling layer has a filter size of 2 and a step size of 2, and each residual network layer comprises 2 one-dimensional convolutions, each with a filter size of 2, a step size of 2 and a depth of 2 times the filter depth of the previous layer.
As shown in FIG. 5, the multi-layer perceptron M includes 7 linear layers: the input layer has 16 input neurons and 512 output neurons; the 5 linear layers of the middle layer each have 512 input neurons and 512 output neurons; the output layer has 512 input neurons and a number of output neurons equal to 64 times the number of speakers involved in the voice conversion.
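A sketch of such a multi-layer perceptron is given below; treating the output as 64 dimensions per conversion speaker, with the target speaker's slice selected by the label, is our reading of the above description and should be taken as an assumption.

```python
# Sketch of the multi-layer perceptron M: 7 linear layers mapping (noise z, speaker label) to a
# style vector. The 64-dim-per-speaker output head is an interpretational assumption.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=512, style_dim=64, num_speakers=8):
        super().__init__()
        layers = [nn.Linear(latent_dim, hidden_dim), nn.ReLU()]
        for _ in range(5):                                    # 5 intermediate 512-512 layers
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers += [nn.Linear(hidden_dim, style_dim * num_speakers)]
        self.net = nn.Sequential(*layers)
        self.style_dim = style_dim
        self.num_speakers = num_speakers

    def forward(self, z, speaker_idx):
        out = self.net(z).view(-1, self.num_speakers, self.style_dim)
        return out[torch.arange(z.size(0)), speaker_idx]      # pick the target speaker's style

if __name__ == "__main__":
    z = torch.randn(8, 16)                                    # noise drawn from N(0, I)
    c_t = torch.randint(0, 8, (8,))                           # target speaker indices
    print(MappingNetwork()(z, c_t).shape)                     # (8, 64)
```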
1.6) The spectral features x_st of the target speaker generated in 1.5) and the target speaker spectral features x_t of the training corpus obtained in 1.2) are input together to the discriminator, and the discriminator is trained to make its adversarial loss function L_adv^D as small as possible.
As shown in FIG. 4, the discriminator adopts a two-dimensional convolutional neural network comprising 6 two-dimensional convolution layers: the filter sizes of the first 5 two-dimensional convolution layers are 4×4, the step sizes are 2×2, and the filter depths are 64, 128, 256, 512 and 1024; the two-dimensional convolution layer of the output layer has a filter size of 1×16, a step size of 1×1 and a filter depth of 1.
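The discriminator (with the classifier sharing its convolutional trunk, as described above) can be sketched as follows; the padding values and the activation function are assumptions.

```python
# Sketch of the discriminator D and the classifier C sharing one convolutional trunk.
# Padding values and the use of LeakyReLU are illustrative assumptions.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, num_speakers=8):
        super().__init__()
        depths = [1, 64, 128, 256, 512, 1024]
        trunk = []
        for c_in, c_out in zip(depths[:-1], depths[1:]):
            trunk += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                      nn.LeakyReLU(0.2)]
        self.trunk = nn.Sequential(*trunk)
        self.real_fake = nn.Conv2d(1024, 1, kernel_size=(1, 16), stride=1)             # D output
        self.classifier = nn.Conv2d(1024, num_speakers, kernel_size=(1, 8), stride=1)  # C output

    def forward(self, x):
        h = self.trunk(x)
        return self.real_fake(h), self.classifier(h)

if __name__ == "__main__":
    spec = torch.randn(8, 1, 36, 512)        # batch of 36 x 512 spectral features
    d_out, c_out = Discriminator()(spec)
    print(d_out.shape, c_out.shape)
```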
The loss function of the discriminator is:
L_D = L_adv^D + λ_cls·L_cls^D
where λ_cls is the weight of the classification loss, and L_adv^D and L_cls^D are respectively the adversarial loss of the discriminator and the classification loss of the classifier. The adversarial loss of the discriminator can be written as
L_adv^D = -E_{x_s}[ log D_s(x_s) ] - E_{x_s, s_t}[ log(1 - D_t(G(x_s, s_t))) ]
where D_s(x_s) denotes the discriminator D discriminating the real spectral features, C_t(c_t | G(x_s, s_t)) denotes the classifier C judging the label attribution of the generated spectrum, s_t denotes the style features of the target speaker generated by the multi-layer perceptron M, i.e. M(z, c_t) = s_t, G(x_s, s_t) denotes the spectral features of the target speaker generated by the generator G, i.e. x_st, D_t(G(x_s, s_t)) denotes the discriminator discriminating the generated spectral features, E_{x_s, s_t}[·] denotes the expectation over the generated distribution, and E_{x_s}[·] denotes the expectation over the real probability distribution.
The optimization target is:
min_{D, C} L_D
1.7) The obtained spectral features x_st of the target speaker are input into the encoding network of the generator G again to obtain speaker-independent semantic features G(x_st); the spectral features x_s of the source speaker and the source speaker label feature c_s are input into the style encoder S to obtain the style indication feature ŝ_s of the source speaker; the obtained semantic features G(x_st) and the source speaker style indication feature ŝ_s are input together, as a joint feature (G(x_st), ŝ_s), into the decoding network of the generator G for training, and the loss function of the generator G is minimized during training to obtain the reconstructed spectral features of the source speaker.
The loss functions of the generator minimized during training include the adversarial loss of the generator, the cycle consistency loss, the style reconstruction loss of the style encoder, the style diversity loss and the classification loss of the classifier. The cycle consistency loss makes the reconstructed source speaker spectral features, obtained after the source speaker spectral features x_s have passed through the generator G, as consistent with x_s as possible; the style reconstruction loss constrains the multi-layer perceptron to generate style features s_t that better match the target speaker; the style diversity loss ensures that the generator realizes multi-speaker conversion; and the classification loss is the probability loss of the classifier judging that the target speaker spectrum x_st generated by the generator belongs to the label c_t.
The loss function of the generator is:
L_G = L_adv^G + λ_cyc·L_cyc + λ_ds·L_ds + λ_sty·L_sty + λ_cls·L_cls^G
and the optimization target is:
min_{G, M, S} L_G
where λ_cyc, λ_ds, λ_sty and λ_cls are a group of regularization hyper-parameters that respectively represent the weights of the cycle consistency loss, the style diversity loss, the style reconstruction loss and the classification loss.
L_adv^G denotes the adversarial loss of the generator in the GAN:
L_adv^G = -E_{x_s, s_t}[ log D_t(G(x_s, s_t)) ]
where E_{x_s, s_t}[·] denotes the expectation over the probability distribution generated by the generator, s_t denotes the style features of the target speaker generated by the multi-layer perceptron M, i.e. M(z, c_t) = s_t, and G(x_s, s_t) denotes the spectral features generated by the generator. The adversarial loss of the generator L_adv^G and the adversarial loss of the discriminator L_adv^D together form the usual adversarial losses of a GAN, which decide whether the spectrum input to the discriminator is a real spectrum or a generated one. During training, L_adv^G is made as small as possible, and the generator is continuously optimized until it generates spectral features G(x_s, s_t) that the discriminator can hardly distinguish from real ones.
L_cyc is the cycle consistency loss in the generator G:
L_cyc = E_{x_s, s_t}[ || x_s - G(G(x_s, s_t), ŝ_s) ||_1 ]
where ŝ_s denotes the style indication feature of the source speaker, i.e. ŝ_s = S(x_s, c_s), G(G(x_s, s_t), ŝ_s) denotes the reconstructed source speaker spectral features, and the expectation is of the loss between the reconstructed source speaker spectrum and the real source speaker spectrum. During training, L_cyc is made as small as possible, so that after the generated target spectrum G(x_s, s_t) and the style indication feature ŝ_s of the source speaker are input into the generator again, the obtained reconstructed source speaker speech spectrum is as similar to x_s as possible. Training with L_cyc effectively guarantees that the semantic features of the speaker's speech are not lost after being encoded by the generator.
L_ds is the style diversity loss, which ensures that the generator realizes multi-speaker conversion:
L_ds = -E_{x_s, s_t1, s_t2}[ || G(x_s, s_t1) - G(x_s, s_t2) ||_1 ]
where z_1 and z_2 are random noise vectors obeying a normal distribution, and s_t1 and s_t2 are the style features of the target speaker generated by the multi-layer perceptron M, i.e. M(z_1, c_t) = s_t1 and M(z_2, c_t) = s_t2. During training, L_ds is made as small as possible, so that the conversion from multiple speakers to multiple speakers is realized.
L_sty is the style reconstruction loss of the style encoder S, used to optimize the style features s_t:
L_sty = E_{x_s, s_t}[ || s_t - S(G(x_s, s_t)) ||_1 ]
where s_t denotes the style features of the target speaker and G(x_s, s_t) denotes the spectral features of the target speaker generated by the generator.
The generated spectral features G(x_s, s_t) are input into the style encoder S to obtain reconstructed style features, and the absolute difference between these and the style features s_t of the target speaker generated by the multi-layer perceptron M is computed. During training, L_sty is made as small as possible, so that the style features s_t of the target speaker generated by the multi-layer perceptron M can fully express the personality characteristics of the target speaker.
L_cls^G is the classification loss of the classifier C:
L_cls^G = E_{x_s, s_t}[ -log C_t(c_t | G(x_s, s_t)) ]
where C_t(c_t | G(x_s, s_t)) denotes the classifier judging the label attribution of the generated spectrum. During training, L_cls^G is made as small as possible, minimizing the loss function of the classifier.
1.8) Steps 1.4) to 1.7) are repeated until the set number of iterations is reached, thereby obtaining a trained SKNet StarGAN network; the generator parameters, discriminator parameters, classifier parameters, multi-layer perceptron parameters and style encoder parameters are then the trained parameters. The required number of iterations differs with the specific settings of the neural network and the performance of the experimental equipment; the number of iterations selected in this experiment was 200000.
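For illustration, the following sketch strings the above losses into one training iteration of steps 1.4) to 1.7); the single-layer stand-in modules, the loss weights and the optimizer settings are placeholders chosen only so the snippet runs, and the style diversity term is omitted for brevity.

```python
# One SKNet StarGAN training iteration (steps 1.4-1.7) with tiny stand-in networks so the sketch
# is runnable. All hyper-parameters and module shapes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, MCEP, FRAMES, STYLE, N_SPK, Z = 8, 36, 512, 64, 8, 16

# Stand-ins for the real encoder/decoder/SKNet generator, discriminator, classifier,
# style encoder and multi-layer perceptron described above.
M = nn.Linear(Z + N_SPK, STYLE)                          # multi-layer perceptron (mapping)
G_enc = nn.Linear(MCEP, MCEP)                            # generator encoding network
G_dec = nn.Linear(MCEP + STYLE, MCEP)                    # generator decoding network
D = nn.Linear(MCEP, 1)                                   # discriminator
C = nn.Linear(MCEP, N_SPK)                               # classifier
S = nn.Linear(MCEP + N_SPK, STYLE)                       # style encoder

def generate(x, style):
    sem = G_enc(x)                                       # speaker-independent semantic features
    return G_dec(torch.cat([sem, style.unsqueeze(1).expand(-1, x.size(1), -1)], dim=-1))

opt_g = torch.optim.Adam(list(M.parameters()) + list(G_enc.parameters()) +
                         list(G_dec.parameters()) + list(S.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(list(D.parameters()) + list(C.parameters()), lr=1e-4)
lam_cyc, lam_sty, lam_cls = 10.0, 1.0, 1.0               # assumed loss weights

x_s = torch.randn(B, FRAMES, MCEP)                       # source spectral features (frames x MCEP)
c_s = F.one_hot(torch.randint(0, N_SPK, (B,)), N_SPK).float()
c_t = F.one_hot(torch.randint(0, N_SPK, (B,)), N_SPK).float()

# --- generator / style encoder / perceptron update ---
z = torch.randn(B, Z)
s_t = M(torch.cat([z, c_t], dim=-1))                     # 1.4) target style features
x_st = generate(x_s, s_t)                                # 1.5) converted spectrum
s_hat_s = S(torch.cat([x_s.mean(dim=1), c_s], dim=-1))   # source style indication feature
x_rec = generate(x_st, s_hat_s)                          # 1.7) reconstructed source spectrum

adv_g = F.binary_cross_entropy_with_logits(D(x_st).mean(dim=1), torch.ones(B, 1))
cyc = F.l1_loss(x_rec, x_s)                              # cycle consistency loss
sty = F.l1_loss(S(torch.cat([x_st.mean(dim=1), c_t], dim=-1)), s_t)  # style reconstruction loss
cls_g = F.cross_entropy(C(x_st).mean(dim=1), c_t.argmax(dim=-1))
loss_g = adv_g + lam_cyc * cyc + lam_sty * sty + lam_cls * cls_g
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# --- discriminator / classifier update (step 1.6) ---
adv_d = (F.binary_cross_entropy_with_logits(D(x_s).mean(dim=1), torch.ones(B, 1)) +
         F.binary_cross_entropy_with_logits(D(x_st.detach()).mean(dim=1), torch.zeros(B, 1)))
cls_d = F.cross_entropy(C(x_s).mean(dim=1), c_s.argmax(dim=-1))
loss_d = adv_d + lam_cls * cls_d
opt_d.zero_grad(); loss_d.backward(); opt_d.step()
print(float(loss_g), float(loss_d))
```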
1.9) A pitch frequency conversion relation is established using the logarithmic fundamental frequency log f_0: the mean and standard deviation of the logarithmic fundamental frequency of each speaker are computed, and a log-domain linear transformation is used to convert the logarithmic fundamental frequency log f_0s of the source speaker into the logarithmic fundamental frequency log f_0t' of the target speaker.
The fundamental frequency conversion function is:
log f_0t' = (σ_t / σ_s) · (log f_0s - μ_s) + μ_t
where μ_s and σ_s are respectively the mean and standard deviation of the fundamental frequency of the source speaker in the logarithmic domain, and μ_t and σ_t are respectively the mean and standard deviation of the fundamental frequency of the target speaker in the logarithmic domain.
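This log-domain linear transformation can be written directly as a small function; treating zero F0 values as unvoiced frames is an assumption of the sketch.

```python
# Sketch of the log-domain linear F0 conversion:
# log f0_t' = (log f0_s - mu_s) * (sigma_t / sigma_s) + mu_t.
import numpy as np

def logf0_statistics(f0):
    voiced = f0[f0 > 0]                       # assumption: zero F0 marks unvoiced frames
    logf0 = np.log(voiced)
    return logf0.mean(), logf0.std()

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0
    f0_conv[voiced] = np.exp((np.log(f0_src[voiced]) - mu_s) * sigma_t / sigma_s + mu_t)
    return f0_conv

if __name__ == "__main__":
    f0_src = np.array([0.0, 110.0, 115.0, 0.0, 120.0])   # toy source contour
    f0_ref_tgt = np.array([0.0, 210.0, 220.0, 230.0])    # toy target-speaker data
    mu_s, sigma_s = logf0_statistics(f0_src)
    mu_t, sigma_t = logf0_statistics(f0_ref_tgt)
    print(convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t))
```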
The implementation steps of the conversion stage are as follows:
2.1) The spectral features x_s', aperiodic features and fundamental frequency of different sentences of the source speaker are extracted from the source speaker's voice through the WORLD speech analysis/synthesis model.
2.2) Random noise z' obeying a normal distribution and the target speaker label feature c_t' are input into the multi-layer perceptron M to obtain the style features s_t' of the target speaker.
2.3) The spectral features x_s' of the source speaker's speech extracted in 2.1) and the style features s_t' of the target speaker obtained in 2.2) are input as a joint feature (x_s', s_t') into the SKNet StarGAN network trained in 1.8) to reconstruct the spectral features x_st' of the target speaker.
2.4) The fundamental frequency of the source speaker extracted in 2.1) is converted into the fundamental frequency of the target speaker through the fundamental frequency conversion function obtained in 1.9).
2.5) The spectral features x_st' of the target speaker generated in 2.3), the fundamental frequency obtained in 2.4) and the aperiodic features extracted in 2.1) are synthesized through the WORLD speech analysis/synthesis model to obtain the converted speaker's voice.
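The synthesis step 2.5) can be sketched with pyworld and pysptk as follows; the converted MCEP, F0 and aperiodicity arrays here are placeholders standing in for the outputs of steps 2.3), 2.4) and 2.1).

```python
# Sketch of the conversion stage: convert MCEPs back to a spectral envelope and synthesize with
# WORLD. `converted_mcep` stands in for the output of the trained SKNet StarGAN generator.
import numpy as np
import pyworld
import pysptk

FFT_SIZE, ALPHA = 1024, 0.455   # assumed analysis settings, matching the extraction sketch

def synthesize(converted_mcep, converted_f0, ap, fs):
    sp = pysptk.mc2sp(converted_mcep, alpha=ALPHA, fftlen=FFT_SIZE)  # back to 513-dim envelope
    return pyworld.synthesize(converted_f0, sp, ap, fs)

if __name__ == "__main__":
    fs, n_frames = 22050, 200
    converted_mcep = np.random.randn(n_frames, 36) * 0.01            # placeholder generator output
    converted_f0 = np.full(n_frames, 200.0)                          # placeholder converted F0
    ap = np.ones((n_frames, FFT_SIZE // 2 + 1)) * 0.5                # placeholder aperiodicity
    wav = synthesize(converted_mcep, converted_f0, ap, fs)
    print(wav.shape)
```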
To illustrate the advantages of the method under the same-gender conversion condition, the spectral feature details in three comparison boxes of FIG. 7 are selected for comparison; as can be seen from the figure, the spectral feature details of the converted speech have a higher similarity to those of the target speech.
FIGS. 8a and 8b are respectively a group of speech spectra of the source speech and the target speech, and FIGS. 8d and 8c are speech spectra of the speech synthesized by the model of the present invention and by the reference StarGAN model. To illustrate the advantages of the method in detail, the spectral feature details in three comparison boxes are selected for comparison; as can be seen from the figures, the spectral feature details of the converted speech have a higher similarity to those of the target speech. As shown in FIG. 9, as the number of iterations increases, the method of the present invention converges faster and has a smaller reconstruction loss.
On the other hand, the invention also provides an SKNet StarGAN many-to-many voice conversion system based on speaker style feature modeling, which comprises a training stage and a conversion stage, wherein the training stage comprises:
the corpus acquisition module is used for acquiring training corpus, wherein the training corpus consists of the corpus of a plurality of speakers, and the speakers comprise source speakers and target speakers;
the preprocessing module is used for extracting the spectral characteristics x of the voices of each speaker in the training corpus;
the network training module is used for inputting the spectral features x of each speaker's voice, the source speaker label c_s, the target speaker label c_t and random noise z obeying a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layers are built into a residual network between the encoding network and the decoding network;
the training process makes the loss function of the generator G and the loss function of the discriminator D as small as possible until the set iteration times are reached, so that a trained SKNet StarGAN network is obtained;
A function construction module for constructing a fundamental frequency conversion function from a fundamental frequency of a voice of a source speaker to a fundamental frequency of a voice of a target speaker;
the conversion phase comprises:
the source voice processing module is used for extracting the spectral features x_s', aperiodic features and fundamental frequency features from the voice of the source speaker in the corpus to be converted;
the conversion module is used for inputting the source speaker spectral features x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the trained SKNet StarGAN network to obtain the spectral features x_st' of the target speaker;
the target feature acquisition module is used for converting the fundamental frequency features of the source speaker into the fundamental frequency features of the target speaker through the obtained fundamental frequency conversion function;
the speaker voice acquisition module is used for synthesizing the generated spectral features x_st' of the target speaker, the fundamental frequency features of the target speaker and the aperiodic features through the WORLD speech analysis/synthesis model to obtain the converted speaker voice.
Embodiments of the invention, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored on a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes. Thus, the present examples are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present invention also provide a computer storage medium having a computer program stored thereon. When the computer program is executed by a processor, the aforementioned SKNet StarGAN many-to-many speaker conversion method based on speaker personality trait modeling may be implemented. The computer storage medium is, for example, a computer-readable storage medium.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A many-to-many voice conversion method based on speaker style feature modeling, characterized by comprising a training stage and a conversion stage, wherein the training stage comprises the following steps:
(1.1) obtaining a training corpus, wherein the training corpus consists of corpora of a plurality of speakers, and the speakers comprise source speakers and target speakers;
(1.2) extracting spectral features x of voices of all speakers in the training corpus;
(1.3) inputting the spectral features x of each speaker's voice, the source speaker label c_s, the target speaker label c_t and random noise z obeying a normal distribution into an SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layers are built into a residual network between the encoding network and the decoding network;
(1.4) the training process makes the loss function of the generator G and the loss function of the discriminator D as small as possible until the set iteration times are reached, so that a trained SKNet StarGAN network is obtained;
(1.5) constructing a fundamental frequency conversion function from the fundamental frequency of the voice of the source speaker to the fundamental frequency of the voice of the target speaker;
the conversion phase comprises the following steps:
(2.1) extracting the spectral features x_s', aperiodic features and fundamental frequency features from the voice of the source speaker in the corpus to be converted;
(2.2) inputting the source speaker spectral features x_s', the target speaker label feature c_t' and random noise z' obeying a normal distribution into the SKNet StarGAN network trained in step (1.4) to obtain the spectral features x_st' of the target speaker;
(2.3) converting the fundamental frequency characteristics of the source speaker extracted in the step (2.1) into fundamental frequency characteristics of the target speaker through the fundamental frequency conversion function obtained in the step (1.5);
(2.4) synthesizing the spectral features x_st' of the target speaker generated in step (2.2), the fundamental frequency features of the target speaker obtained in step (2.3) and the aperiodic features extracted in step (2.1) through the WORLD speech analysis/synthesis model to obtain the converted speaker voice.
2. The many-to-many voice conversion method based on speaker style feature modeling according to claim 1, wherein the number of SKNet layers built between the encoding network and the decoding network is 6.
3. The many-to-many voice conversion method based on speaker style feature modeling according to claim 1, wherein the style encoder S comprises 6 one-dimensional convolutions with filter sizes of 1, 1 and 16 respectively, step sizes of 1, and filter depths of 32, 64, 128, 256, 512 and 512 respectively; the middle layers comprise 5 one-dimensional average pooling layers and 5 residual networks, each one-dimensional average pooling layer has a filter size of 2 and a step size of 2, and each residual network layer comprises 2 one-dimensional convolutions, each with a filter size of 2, a step size of 2 and a depth of 2 times the filter depth of the previous layer.
4. The many-to-many voice conversion method based on speaker style feature modeling according to claim 1, wherein the multi-layer perceptron M comprises 7 linear layers: the input layer has 16 input neurons and 512 output neurons, the 5 linear layers of the middle layer each have 512 input neurons and 512 output neurons, and the output layer has 512 input neurons and a number of output neurons equal to 64 times the number of speakers involved in the voice conversion.
5. The many-to-many voice conversion method based on speaker style feature modeling according to claim 1, wherein the training process of steps (1.3) and (1.4) comprises the following steps:
(1) Random noise z and target speaker tag characteristic c subject to normal distribution t Inputting into a multi-layer perceptron M to obtain style characteristics s of a target speaker t
(2) Spectral features x of source speaker s Inputting the code network of the generator G to obtain semantic features G (x) irrelevant to the speaker;
(3) Combining the generated semantic features G (x) with style features s of the target speaker t The decoding network input to the generator G is trained, and the loss function of the generator G is minimized in the training process, so that the spectral characteristic x of the target speaker is obtained st
(4) Spectral features x of source speaker s Source speaker tag feature c s Input to a style encoder S to obtain the style indication characteristic of the speaker
(5) The generated spectral characteristics x of the target speaker st Inputting the semantic features into the coding network of the generator G again to obtain speaker-independent semantic features G (x) st );
(6) The generated semantic feature G (x st ) And (3) withStyle indicating features for speakersThe decoding network input to the generator G is used for training, the loss function of the generator G is minimized in the training process, and the spectrum characteristics of the reconstructed speaker are obtained>
(7) The spectral characteristics x of the target speaker generated in the step (3) are calculated st Inputting the data into a discriminator D and a classifier C for training, and minimizing the loss function of the discriminator D and the loss function of the classifier C;
(8) The spectral characteristics x of the target speaker generated in the step (3) are calculated st Tab feature c of target speaker t Inputting a style encoder S for training, and minimizing a style reconstruction loss function of the style encoder S;
(9) And (3) returning to the step (1) and repeating the steps until the set iteration times are reached, thereby obtaining the trained SKNet StarGAN network.
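The nine steps above translate into one adversarial training iteration. The sketch below (PyTorch) is a heavily simplified illustration of the order of the forward passes and of which losses each update minimizes. The `gen.encode`/`gen.decode` split, the conditional discriminator signature `disc(x, c)`, the hinge-style adversarial loss, the loss weights and the omission of the style diversity term are all assumptions made for illustration, not definitions taken from the patent.

```python
import torch
import torch.nn.functional as F


def train_step(gen, disc, cls, style_enc, mlp, opt_g, opt_d,
               x_s, c_s, c_t, noise_dim=16,
               lam_cyc=10.0, lam_sty=1.0, lam_cls=1.0):
    """One SKNet StarGAN training iteration following steps (1)-(8).
    opt_g is assumed to cover the generator, multi-layer perceptron and
    style encoder; opt_d the discriminator and classifier."""
    z = torch.randn(x_s.size(0), noise_dim, device=x_s.device)
    s_t = mlp(z, c_t)                         # (1) style feature of the target speaker
    sem = gen.encode(x_s)                     # (2) speaker-independent semantic features
    x_st = gen.decode(sem, s_t)               # (3) spectral features of the target speaker

    s_s = style_enc(x_s, c_s)                 # (4) style feature of the source speaker
    sem_t = gen.encode(x_st)                  # (5) re-encode the generated spectrum
    x_rec = gen.decode(sem_t, s_s)            # (6) reconstructed source spectrum

    # Generator-side objective: adversarial + cycle-consistency +
    # style-reconstruction (step 8) + classification terms.
    loss_g = (-disc(x_st, c_t).mean()
              + lam_cyc * F.l1_loss(x_rec, x_s)
              + lam_sty * F.l1_loss(style_enc(x_st, c_t), s_t)
              + lam_cls * F.cross_entropy(cls(x_st), c_t))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # (7) Discriminator and classifier trained on real and generated spectra.
    loss_d = (F.relu(1.0 - disc(x_s, c_s)).mean()
              + F.relu(1.0 + disc(x_st.detach(), c_t)).mean()
              + lam_cls * F.cross_entropy(cls(x_s), c_s))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_g.item(), loss_d.item()
```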
6. The many-to-many voice conversion method based on speaker style feature modeling according to claim 5, wherein the style reconstruction loss function of the style encoder S is expressed as:
L_sty = E[ ‖ s_t − S(G(x_s, s_t)) ‖_1 ]
wherein E[·] denotes the expectation over the distribution generated by the generator, S(·) is the style encoder, s_t denotes the style feature of the target speaker, G(x_s, s_t) denotes the spectral feature of the target speaker generated by the generator, and x_s is the spectral feature of the source speaker.
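In code form, this style reconstruction loss is simply an L1 distance between the injected style and the style recovered from the generated spectrum. The following minimal PyTorch illustration assumes the L1 formulation above; the function and argument names are placeholders, not names from the patent.

```python
import torch


def style_reconstruction_loss(style_enc, x_st, c_t, s_t):
    """L_sty = E[ || s_t - S(G(x_s, s_t)) ||_1 ], where x_st = G(x_s, s_t)
    is the generated spectrum of the target speaker."""
    s_hat = style_enc(x_st, c_t)          # style recovered by the encoder S
    return torch.mean(torch.abs(s_t - s_hat))
```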
7. The many-to-many voice conversion method based on speaker style feature modeling according to claim 1, wherein the input process of step (2.2) comprises the following steps:
(1) inputting the normally distributed random noise z' and the tag feature c_t' of the target speaker into the multi-layer perceptron M to obtain the style feature s_t' of the target speaker;
(2) inputting the spectral features x_s' of the source speaker into the encoding network of the generator G to obtain the speaker-independent semantic features G(x_s');
(3) combining the generated semantic features G(x_s') with the style feature s_t' of the target speaker and inputting them into the decoding network of the generator G to obtain the spectral features x_st' of the target speaker (an illustrative forward-pass sketch is given after this claim).
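Equivalently, the three steps of this claim reduce to a single forward pass at conversion time. The sketch below reuses the placeholder handles `mlp`, `gen.encode` and `gen.decode` introduced earlier; they are assumptions for illustration, not names from the patent.

```python
import torch


@torch.no_grad()
def convert_spectrum(gen, mlp, x_s, c_t, noise_dim=16):
    """Steps (1)-(3) of the input process of step (2.2)."""
    z = torch.randn(x_s.size(0), noise_dim, device=x_s.device)   # noise z'
    s_t = mlp(z, c_t)            # (1) style feature s_t' of the target speaker
    sem = gen.encode(x_s)        # (2) speaker-independent semantic features G(x_s')
    return gen.decode(sem, s_t)  # (3) spectral features x_st' of the target speaker
```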
8. The many-to-many voice conversion method based on speaker style feature modeling according to claim 1, wherein the objective function of the SKNet StarGAN network is expressed as:
L_SKNetStarGAN = L_G + L_D
wherein L_G is the loss function of the generator and L_D is the loss function of the discriminator;
the loss function L_G of the generator is expressed as:
L_G = L_adv^G + λ_cyc·L_cyc − λ_ds·L_ds + λ_sty·L_sty + λ_cls·L_cls^G
wherein λ_cyc, λ_ds, λ_sty and λ_cls are a set of regularization hyper-parameters representing the weights of the cycle consistency loss, the style diversity loss, the style reconstruction loss and the classification loss respectively, and L_adv^G, L_cyc, L_ds, L_sty and L_cls^G denote the adversarial loss of the generator, the cycle consistency loss, the style diversity loss, the style reconstruction loss of the style encoder and the classification loss of the classifier respectively;
the loss function L_D of the discriminator is:
L_D = L_adv^D + λ_cls·L_cls^D
wherein λ_cls is the weight of the classification loss, and L_adv^D and L_cls^D are the adversarial loss of the discriminator and the classification loss of the classifier respectively.
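Combining the terms of this claim, the two objectives can be assembled as in the sketch below. The weights and the sign convention for the style diversity term (which is maximized, hence subtracted) follow common StarGAN v2 practice and are assumptions rather than values given in the patent.

```python
def generator_loss(l_adv_g, l_cyc, l_ds, l_sty, l_cls_g,
                   lam_cyc=10.0, lam_ds=1.0, lam_sty=1.0, lam_cls=1.0):
    # L_G: adversarial term plus weighted cycle-consistency, style-diversity,
    # style-reconstruction and classification terms.
    return (l_adv_g + lam_cyc * l_cyc - lam_ds * l_ds
            + lam_sty * l_sty + lam_cls * l_cls_g)


def discriminator_loss(l_adv_d, l_cls_d, lam_cls=1.0):
    # L_D: adversarial term plus weighted classification term.
    return l_adv_d + lam_cls * l_cls_d
```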
9. A many-to-many speech conversion system based on speaker style feature modeling, comprising a training phase and a conversion phase, the training phase comprising:
the corpus acquisition module is used for acquiring training corpus, wherein the training corpus consists of the corpus of a plurality of speakers, and the speakers comprise source speakers and target speakers;
the preprocessing module is used for extracting the spectral features x of the voice of each speaker in the training corpus;
the network training module is used for inputting the spectral features x of the voice of each speaker, the source speaker tag c_s, the target speaker tag c_t and the normally distributed random noise z into a SKNet StarGAN network for training, wherein the SKNet StarGAN network comprises a generator G, a discriminator D, a classifier C, a style encoder S and a multi-layer perceptron M, the generator G comprises an encoding network, a decoding network and at least one SKNet layer, and the SKNet layer is built in a residual network between the encoding network and the decoding network;
the training process minimizes the loss function of the generator G and the loss function of the discriminator D until the set number of iterations is reached, thereby obtaining a trained SKNet StarGAN network;
a function construction module for constructing a fundamental frequency conversion function from a fundamental frequency of a voice of a source speaker to a fundamental frequency of a voice of a target speaker;
the conversion phase comprises:
the source voice processing module is used for extracting the spectral features x_s', the aperiodic features and the fundamental frequency features from the voice of the source speaker in the corpus to be converted;
the conversion module is used for inputting the spectral features x_s' of the source speaker, the tag feature c_t' of the target speaker and the normally distributed random noise z' into the trained SKNet StarGAN network obtained by the network training module to obtain the spectral features x_st' of the target speaker;
the target feature acquisition module is used for converting the fundamental frequency features of the source speaker into the fundamental frequency features of the target speaker through the constructed fundamental frequency conversion function;
the speaker voice acquisition module is used for synthesizing the generated spectral features x_st' of the target speaker, the fundamental frequency features of the target speaker and the aperiodic features through the WORLD speech analysis/synthesis model to obtain the converted speaker voice.
10. A computer storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a computer processor, implements the method of any one of claims 1 to 8.
CN202010488776.8A 2020-06-02 2020-06-02 Multi-to-multi voice conversion method and system based on speaker style feature modeling Active CN111816156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010488776.8A CN111816156B (en) 2020-06-02 2020-06-02 Multi-to-multi voice conversion method and system based on speaker style feature modeling

Publications (2)

Publication Number Publication Date
CN111816156A CN111816156A (en) 2020-10-23
CN111816156B true CN111816156B (en) 2023-07-21

Family

ID=72848501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010488776.8A Active CN111816156B (en) 2020-06-02 2020-06-02 Multi-to-multi voice conversion method and system based on speaker style feature modeling

Country Status (1)

Country Link
CN (1) CN111816156B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382297A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112562728B (en) * 2020-11-13 2024-06-18 百果园技术(新加坡)有限公司 Method for generating countermeasure network training, method and device for audio style migration
CN112288750B (en) * 2020-11-20 2022-09-20 青岛理工大学 Mechanical assembly image segmentation method and device based on deep learning network
CN112562686B (en) * 2020-12-10 2022-07-15 青海民族大学 Zero-sample voice conversion corpus preprocessing method using neural network
CN112466273A (en) * 2020-12-10 2021-03-09 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112863529B (en) * 2020-12-31 2023-09-22 平安科技(深圳)有限公司 Speaker voice conversion method based on countermeasure learning and related equipment
CN112651468B (en) * 2021-01-18 2024-06-04 佛山职业技术学院 Multi-scale lightweight image classification method and storage medium thereof
CN113096675B (en) * 2021-03-31 2024-04-23 厦门大学 Audio style unification method based on generation type countermeasure network
CN113380264A (en) * 2021-05-21 2021-09-10 杭州电子科技大学 Voice conversion method for asymmetric corpus
CN114283824B (en) * 2022-03-02 2022-07-08 清华大学 Voice conversion method and device based on cyclic loss
CN114511475B (en) * 2022-04-21 2022-08-02 天津大学 Image generation method based on improved Cycle GAN
CN114882888A (en) * 2022-05-20 2022-08-09 武汉博特智能科技有限公司 Voiceprint recognition method and system based on variational self-coding and countermeasure generation network
CN116778937B (en) * 2023-03-28 2024-01-23 南京工程学院 Speech conversion method based on speaker versus antigen network
CN118097442B (en) * 2024-04-29 2024-07-05 大连理工大学 Method for recognizing generalization of few-sample remote sensing target through diversity prompt learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017151230A (en) * 2016-02-23 2017-08-31 国立大学法人豊橋技術科学大学 Voice conversion device, voice conversion method, and computer program
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN109671442A (en) * 2019-01-14 2019-04-23 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu x vector
CN110060690A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN and ResNet
CN110060657A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on SN
CN110136686A (en) * 2019-05-14 2019-08-16 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu i vector
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method
CN110600046A (en) * 2019-09-17 2019-12-20 南京邮电大学 Many-to-many speaker conversion method based on improved STARGAN and x vectors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9183830B2 (en) * 2013-11-01 2015-11-10 Google Inc. Method and system for non-parametric voice conversion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Voice conversion method using the STRAIGHT model and deep belief networks; Wang Min; Su Libo; Wang Zhihui; Yao Chenhong; Computer Engineering and Science (09); full text *


Similar Documents

Publication Publication Date Title
CN111816156B (en) Multi-to-multi voice conversion method and system based on speaker style feature modeling
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
CN109599091B (en) Star-WAN-GP and x-vector based many-to-many speaker conversion method
CN111462768B (en) Multi-scale StarGAN voice conversion method based on shared training
Nakkiran et al. Compressing deep neural networks using a rank-constrained topology.
CN107680611B (en) Single-channel sound separation method based on convolutional neural network
CN111833855B (en) Multi-to-multi speaker conversion method based on DenseNet STARGAN
Mohammadi et al. Voice conversion using deep neural networks with speaker-independent pre-training
Zhang et al. Study on CNN in the recognition of emotion in audio and images
CN111429894A (en) Many-to-many speaker conversion method based on SE-ResNet STARGAN
CN110060657B (en) SN-based many-to-many speaker conversion method
CN111429893A (en) Many-to-many speaker conversion method based on Transitive STARGAN
CN109242090B (en) Video description and description consistency judgment method based on GAN network
CN114141238A (en) Voice enhancement method fusing Transformer and U-net network
Tóth Convolutional deep rectifier neural nets for phone recognition.
KR20210042696A (en) Apparatus and method for learning model
CN114495957A (en) Method, system and device for speech enhancement based on Transformer improvement
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN114550057A (en) Video emotion recognition method based on multi-modal representation learning
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
Zhang et al. Semi-supervised learning based on reference model for low-resource tts
CN113593588A (en) Multi-singer singing voice synthesis method and system based on generation countermeasure network
Mohammadi et al. Semi-supervised training of a voice conversion mapping function using a joint-autoencoder.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant