CN108777140B - Voice conversion method based on VAE under non-parallel corpus training - Google Patents
- Publication number: CN108777140B (application CN201810393556.XA)
- Authority: CN (China)
- Prior art keywords: characteristic, frame, bottleneck, training, network
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G10L13/02: Methods for producing synthetic speech; speech synthesisers
- G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L19/02: Speech or audio analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
The invention discloses a voice conversion method based on VAE under non-parallel corpus training. Under the condition of non-parallel texts, Bottleneck features are extracted through a deep neural network; the learning and modeling of the conversion function are then realized based on a variational auto-encoding model, and many-to-many speaker conversion can be realized in the conversion stage. The invention has three advantages: 1) the dependence on parallel texts is removed, and no alignment operation is required in the training process; 2) the conversion systems of multiple source-target speaker pairs can be integrated in one conversion model, realizing many-to-many conversion; 3) the many-to-many conversion system under non-parallel texts provides technical support for applying this technology to practical voice interaction.
Description
Technical Field
The invention belongs to the field of speech signal processing, and particularly relates to a voice conversion method based on a variational auto-encoder (VAE) model under non-parallel corpus training.
Background
Speech conversion technology is a research branch of speech signal processing that overlaps with fields such as speaker recognition and speech synthesis. It aims to change the personalized information of speech while keeping the original semantic information unchanged, so that the speech of one specific speaker (the source speaker) sounds like the speech of another specific speaker (the target speaker). The main tasks of voice conversion are extracting the characteristic parameters of the two speakers' voices, mapping and converting them, and then decoding and reconstructing the converted parameters into converted speech. Throughout this process, both the auditory quality of the converted speech and the accuracy of the converted personality characteristics must be ensured. Voice conversion has been researched for many years and various methods have emerged in the field, among which statistical conversion methods represented by Gaussian mixture models have become the classic approach.
However, such algorithms still have drawbacks. The classical Gaussian-mixture-model approach to voice conversion is mostly designed for one-to-one conversion tasks: the source and target speakers must record training sentences with the same content, Dynamic Time Warping (DTW) must be applied to align the spectral features frame by frame, and only then can the mapping between spectral features be learned through model training, which makes the method inflexible in practical applications. Moreover, when the mapping function is trained with a Gaussian mixture model, global variables must be considered, iterating over the training data makes the computational cost surge, and a Gaussian mixture model achieves a good conversion effect only when training data is plentiful, which is unsuitable for limited computing resources and equipment.
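For contrast with the alignment-free approach of the invention, the frame-by-frame DTW alignment that the classical GMM-based method requires can be sketched roughly as follows (a minimal illustration on toy one-dimensional "spectral" sequences, not part of the patent's method):

```python
import numpy as np

def dtw_align(src, tgt):
    """Minimal dynamic time warping: returns an alignment path pairing
    each source frame with a target frame by minimizing cumulative
    Euclidean distance. Classical GMM conversion needs such a path to
    build frame-aligned training pairs; the VAE method described in
    this patent does not."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# toy sequences of different lengths, as parallel utterances would be
src = np.array([[0.0], [1.0], [2.0], [3.0]])
tgt = np.array([[0.0], [0.0], [1.0], [2.0], [3.0]])
path = dtw_align(src, tgt)
```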
In recent years, research in deep learning has increased both the training speed and the effectiveness of deep neural networks, and researchers continue to propose new models and learning methods with strong modeling capability that can learn deeper features from complex data.
The AHOcoder feature parameter extraction model is a speech codec (speech analysis/synthesis system) developed by Daniel Erro at the AHOLAB signal processing laboratory of the University of the Basque Country. AHOcoder decomposes 16 kHz, 16-bit mono WAV speech into three parts: fundamental frequency (F0), spectrum (Mel cepstral coefficients, MFCC), and maximum voiced frequency.
The fundamental frequency is an important parameter influencing the prosodic characteristics of speech; for fundamental frequency conversion, the voice conversion method of the invention adopts the traditional Gaussian normalization method. Assuming that the logarithmic fundamental frequencies of the voiced segments of the source speaker and the target speaker obey Gaussian distributions, the mean and standard deviation of these distributions are first computed for each speaker. The following formula then converts the logarithmic fundamental frequency of the source speaker's voiced segments to that of the target speaker, while unvoiced segments are left unchanged:
logF0conv = μtgt + (σtgt / σsrc) · (logF0src - μsrc)
where the mean and standard deviation of the source speaker's voiced-segment logarithmic fundamental frequency are denoted μsrc and σsrc, those of the target speaker are denoted μtgt and σtgt, F0src is the fundamental frequency of the source speaker, and F0conv is the converted fundamental frequency.
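As a rough numerical sketch of the Gaussian normalization of log-F0 described above (the statistics below are made up for illustration, not values from the patent):

```python
import numpy as np

def convert_lf0(lf0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Gaussian-normalize log-F0 of voiced source frames to the target
    speaker's log-F0 distribution:
        lf0_conv = mu_tgt + (sigma_tgt / sigma_src) * (lf0_src - mu_src)
    Unvoiced frames (marked here with 0.0) are left unchanged."""
    lf0_src = np.asarray(lf0_src, dtype=float)
    voiced = lf0_src > 0
    out = lf0_src.copy()
    out[voiced] = mu_tgt + (sigma_tgt / sigma_src) * (lf0_src[voiced] - mu_src)
    return out

# hypothetical statistics: lower-pitched source, higher-pitched target
lf0 = np.array([4.8, 5.0, 0.0, 5.2])      # 0.0 marks an unvoiced frame
conv = convert_lf0(lf0, mu_src=5.0, sigma_src=0.2, mu_tgt=5.4, sigma_tgt=0.1)
```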
Disclosure of Invention
In order to solve the problems, the invention provides a voice conversion method based on VAE under non-parallel corpus training, which gets rid of the dependence on parallel texts, realizes the conversion of multiple speakers to multiple speakers, improves the flexibility, and solves the technical problem that the voice conversion is difficult to realize under the condition of limited resources and equipment.
The invention adopts the following technical scheme that a voice conversion method based on VAE under non-parallel corpus training comprises the following steps:
training:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
2) performing differential processing on the extracted Mel cepstrum characteristic parameter X of each frame, splicing the difference features with the original characteristic parameter X, and then splicing the result with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameter xn;
3) using the joint characteristic parameter xn and the speaker classification label feature yn to train a deep neural network (DNN), adjusting the DNN weights to reduce the classification error until the network converges, obtaining a DNN based on the speaker recognition task, and extracting the Bottleneck feature bn of each frame;
4) using the joint characteristic parameter xn and the Bottleneck feature bn corresponding to each frame to train the variational auto-encoder (VAE) model until training converges, and extracting the sampling feature zn of each frame from the hidden space z of the VAE model;
5) splicing the sampling feature zn with the speaker label feature yn corresponding to each frame to obtain the training data of a Bottleneck feature mapping network (BP network), using the Bottleneck feature bn of each frame as supervision information to guide the training of the Bottleneck feature mapping network, and minimizing the output error of the network through a stochastic gradient descent algorithm to obtain the trained Bottleneck feature mapping network;
the trained DNN network, VAE network and Bottleneck feature mapping network are combined to form a voice conversion system based on VAE and Bottleneck features;
a voice conversion step:
6) passing the joint characteristic parameter xp of the speech to be converted through the encoder module of the VAE model to obtain the sampling feature zn of each frame of the hidden space z;
7) splicing the sampling feature zn with the label feature yn of the target speaker frame by frame and inputting the result into the Bottleneck feature mapping network to obtain the Bottleneck feature of the target speaker;
8) splicing the obtained target-speaker Bottleneck feature and the sampling feature zn frame by frame, and reconstructing the joint characteristic parameter xp' of the converted speech through the decoder module of the VAE model;
9) The speech signal is reconstructed using an AHOcoder sound codec.
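The conversion steps above hinge on splicing each frame's latent feature with a one-hot target-speaker label before the mapping network; a minimal sketch of that splicing step (latent dimension and speaker count are illustrative assumptions, not the patent's values):

```python
import numpy as np

def splice_with_speaker_label(z, speaker_id, num_speakers):
    """Concatenate each frame's sampling feature z_n with a one-hot
    speaker classification label y_n, producing the per-frame input of
    the Bottleneck feature mapping network (step 7)."""
    z = np.atleast_2d(z)                      # (frames, latent_dim)
    y = np.zeros((z.shape[0], num_speakers))  # one-hot label per frame
    y[:, speaker_id] = 1.0
    return np.hstack([z, y])

z = np.random.randn(10, 32)                   # 10 frames, 32-dim latent (assumed)
joined = splice_with_speaker_label(z, speaker_id=2, num_speakers=5)
```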
Preferably, extracting the Mel cepstrum features of the speech participating in training in step 1) specifically comprises extracting the Mel cepstrum features of each training utterance with the AHOcoder sound codec and reading them into the Matlab platform.
Preferably, obtaining the joint characteristic parameters in step 2) specifically comprises: computing the first-order and second-order differences of the extracted characteristic parameter X of each frame and splicing them with the original feature to obtain Xt = (X, ΔX, Δ²X), then splicing the obtained Xt with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameter xn = (Xt-1, Xt, Xt+1).
Preferably, extracting the Bottleneck feature bn in step 3) comprises the following steps:
31) obtaining, on the MATLAB platform, the joint characteristic parameter xn and the speaker classification label feature yn corresponding to each frame;
32) performing unsupervised pre-training of the DNN with a layer-by-layer greedy pre-training method, where the activation function of the hidden layers is the ReLU function;
33) setting the DNN output layer to softmax classification output, using the speaker classification label feature yn as the supervision information for supervised training of the DNN, adjusting the network weights with the stochastic gradient descent algorithm, and minimizing the error between the DNN classification output and the speaker classification label feature yn until convergence, obtaining a DNN based on the speaker recognition task;
34) inputting the joint characteristic parameter xn into the DNN frame by frame with the feed-forward algorithm, and extracting the activation value of the Bottleneck layer for each frame, i.e. the Bottleneck feature bn corresponding to the Mel cepstrum characteristic parameter of each frame.
Preferably, training the VAE model in step 4) comprises the following steps:
41) using the joint characteristic parameter xn as the training data of the VAE encoder module and the Bottleneck feature bn as the training data of the decoder module during decoding and reconstruction; in the decoder module of the VAE model, the Bottleneck feature bn serves as control information of the speech spectrum reconstruction process, i.e. the Bottleneck feature bn and the sampling feature zn are spliced frame by frame and the speech spectrum features are reconstructed through the training of the VAE decoder module;
42) optimizing the KL divergence and the mean square error in the parameter estimation process of the VAE model with an ADAM optimizer to adjust the network weights of the VAE model, obtaining the VAE speech spectrum conversion model;
43) inputting the joint characteristic parameter xn into the VAE speech spectrum conversion model frame by frame and obtaining the implicit sampling feature zn of the model through the sampling process.
Preferably, obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) splicing the sampling feature zn with the speaker classification label feature yn of each frame as the training data of the Bottleneck feature mapping network, where the network adopts an input layer, hidden layer and output layer structure, the hidden-layer activation function is the sigmoid function, and the output layer is linear;
52) optimizing the weights of the Bottleneck feature mapping network with a back-propagation stochastic gradient descent algorithm according to the mean-square-error minimization criterion, minimizing the error between the Bottleneck feature output by the network and the Bottleneck feature bn corresponding to each frame.
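A minimal numpy sketch of such a one-hidden-layer mapping network (sigmoid hidden layer, linear output, gradient descent on mean squared error); the dimensions, learning rate and toy regression targets below are illustrative assumptions, not the patent's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# toy mapping task standing in for (z_n, y_n) -> b_n regression
X = rng.standard_normal((200, 8))        # spliced input features (toy dims)
W_true = rng.standard_normal((8, 3))
T = np.tanh(X @ W_true)                  # stand-in "Bottleneck" targets

# input -> sigmoid hidden layer -> linear output layer
W1 = 0.1 * rng.standard_normal((8, 16)); b1 = np.zeros(16)
W2 = 0.1 * rng.standard_normal((16, 3)); b2 = np.zeros(3)
lr = 0.1

def mse():
    H = sigmoid(X @ W1 + b1)
    return float(np.mean((H @ W2 + b2 - T) ** 2))

loss_before = mse()
for _ in range(500):                     # plain batch gradient descent
    H = sigmoid(X @ W1 + b1)
    Y = H @ W2 + b2
    dY = 2 * (Y - T) / len(X)            # d(MSE)/dY
    dW2 = H.T @ dY; db2 = dY.sum(0)
    dH = dY @ W2.T * H * (1 - H)         # back-prop through the sigmoid
    dW1 = X.T @ dH; db1 = dH.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
loss_after = mse()
```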
Preferably, obtaining the joint characteristic parameter xp of the speech to be converted in step 6) specifically comprises: extracting the Mel cepstrum characteristic parameters of the speech to be converted with AHOcoder, computing the first-order and second-order differences of each frame's characteristic parameters on the MATLAB platform and splicing them with the original features, then splicing the result with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameter xp of the speech spectrum to be converted.
Preferably, reconstructing the speech signal in step 9) specifically comprises: restoring the converted speech characteristic parameter xp' to the Mel cepstrum feature form, i.e. removing the time-domain splicing terms and the difference terms, and then synthesizing the converted speech with the AHOcoder sound codec.
The invention has the following beneficial effects: the invention relates to a voice conversion method based on VAE under non-parallel corpus training, which gets rid of the dependence on parallel texts, realizes the conversion of multiple speakers to multiple speakers, improves the flexibility and solves the technical problem that the voice conversion is difficult to realize under the condition of limited resources and equipment. The invention has the advantages that:
1) the VAE model can, through modeling and learning, separate in its hidden layer the phoneme information that is unrelated to the speaker's personality from the speech spectrum features; the VAE model can therefore learn voice conversion from non-parallel speech data, eliminating the limitation of traditional voice conversion models that source and target speakers must be trained on parallel corpus data with aligned speech spectrum features, which greatly improves the practicability and flexibility of the voice conversion system and provides convenience for designing cross-language voice conversion systems;
2) the voice conversion network obtained by training the VAE model can handle multiple conversion scenarios; compared with a traditional one-to-one voice conversion system, only one model needs to be trained to complete multiple conversion tasks, greatly improving the efficiency of voice conversion model training;
3) in the decoder module of the VAE model, the Bottleneck feature bn is used to represent the speaker's personality characteristics when reconstructing the converted speech spectrum features; compared with a system that represents the speaker's personality information with the speaker classification label feature yn, the finally obtained converted speech achieves a better conversion effect and sound quality.
Drawings
FIG. 1 is a block diagram of the system training process of the present invention;
FIG. 2 is a block diagram of the system conversion process of the present invention;
FIG. 3 is a block diagram of a DNN network based on speaker recognition tasks in accordance with the present invention;
FIG. 4 is a block diagram of a VAE voice spectral feature conversion network of the present invention;
- FIG. 5 is a block diagram of a Bottleneck feature mapping network of the present invention;
FIG. 6 is a schematic diagram of a VAE model variational Bayesian process parameter estimation;
FIG. 7 is a comparison graph of MCD values of converted speech under different conversion situations based on a VAE model using different features to characterize the personality of a speaker.
Detailed Description
The technical solution of the present invention is further explained with reference to the embodiments according to the drawings.
The invention adopts the following technical scheme: in a voice conversion method based on VAE under non-parallel corpus training, the Mel cepstrum features of the speech are extracted with the AHOcoder speech codec and spliced with their first-order and second-order difference features on the MATLAB platform, and the characteristic parameters of the preceding and following frames are then spliced to form the joint characteristic parameter xn. xn is used as training data to train a DNN based on the speaker recognition task; after the network training converges, xn is input into the DNN frame by frame and the Bottleneck layer output of each frame is obtained, i.e. the Bottleneck characteristic parameter bn containing the speaker's personality characteristics. xn is used as the training data of the VAE encoder module and bn as the training data of the decoder module during decoding and reconstruction, so that through its encoder module the VAE model obtains in the hidden space z the phoneme information zn carrying the semantic features, i.e. the sampling features; through the decoder module, the phoneme information zn containing the semantic features and the Bottleneck feature bn containing the speaker's personality features reconstruct the speech spectrum features. The joint features formed by splicing the phoneme information zn with the speaker classification label feature yn are used as training data of a BP network to train the Bottleneck feature mapping network of the target speaker, such that the error between the network output and the Bottleneck feature bn corresponding to each frame is minimal. During conversion, the spectral features of the speech to be converted are first passed through the encoder module of the VAE model to obtain the corresponding phoneme information zn containing the semantic features, which is spliced frame by frame with the classification label feature yn of the target speaker to form a joint feature; this is input into the BP network to obtain the Bottleneck feature of each frame of the target speaker. Then the joint features formed by splicing, frame by frame, the phoneme information zn with the target speaker's Bottleneck features are reconstructed into the converted speech spectrum features through the decoder module of the VAE model, and finally the speech is synthesized with the AHOdecoder. The method specifically comprises a training step and a voice conversion step:
FIG. 1 is a block diagram of a training process of a system according to the present invention, the training steps being:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
The Mel cepstrum features of each training speaker's speech are extracted with the AHOcoder sound codec and read into the Matlab platform; the invention adopts 19-dimensional Mel cepstrum features, the speech content of each speaker may differ, and no DTW alignment is needed.
2) performing differential processing on the extracted Mel cepstrum characteristic parameter X of each frame, splicing the difference features with the original characteristic parameters, and then splicing the result with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameter xn;
The first-order and second-order differences of each extracted frame's characteristic parameter X are computed and spliced with the original feature to obtain the 57-dimensional difference characteristic parameter Xt = (X, ΔX, Δ²X); the obtained Xt is then spliced with the characteristic parameters of the previous and next frames in the time domain to form the 171-dimensional joint characteristic parameter xn = (Xt-1, Xt, Xt+1).
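Under the assumption of simple frame-difference deltas and border-frame repetition at the edges (the patent does not give the exact delta formula), the 19 → 57 → 171 construction can be sketched as:

```python
import numpy as np

def joint_features(mcep):
    """Build x_n = (X_{t-1}, X_t, X_{t+1}) from per-frame Mel cepstra X,
    where X_t = (X, dX, d2X). A plain central first difference is
    assumed for the deltas; edge frames are handled by repeating the
    border frame."""
    X = np.atleast_2d(mcep)                       # (T, 19)
    pad = np.vstack([X[:1], X, X[-1:]])           # repeat-pad for differencing
    dX = (pad[2:] - pad[:-2]) / 2.0               # first-order difference
    pad_d = np.vstack([dX[:1], dX, dX[-1:]])
    d2X = (pad_d[2:] - pad_d[:-2]) / 2.0          # second-order difference
    Xt = np.hstack([X, dX, d2X])                  # (T, 57)
    pad_t = np.vstack([Xt[:1], Xt, Xt[-1:]])      # context window of +/- 1 frame
    return np.hstack([pad_t[:-2], pad_t[1:-1], pad_t[2:]])  # (T, 171)

frames = np.random.randn(100, 19)                 # 100 frames of 19-dim MCEP
xn = joint_features(frames)
```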
3) using the joint characteristic parameter xn and the speaker classification label feature yn to train the DNN, adjusting the DNN weights to reduce the classification error until the network converges, obtaining the DNN based on the speaker recognition task, and extracting the Bottleneck feature bn of each frame;
The structure of the Bottleneck feature extraction DNN used in the present invention is shown in FIG. 3. The number of input-layer nodes corresponds to the dimension of the speech spectrum features participating in training; the output is the softmax speaker classification output, whose number of nodes is determined by the number of speakers participating in training. Extracting the Bottleneck feature bn comprises the following steps:
31) obtaining, on the MATLAB platform, the joint characteristic parameter xn and the speaker classification label feature yn corresponding to each frame; at this stage the source and target speakers are not distinguished, and the characteristic parameters of each frame are distinguished only by the speaker classification label feature yn;
32) the DNN is a fully connected neural network; a 9-layer DNN model is adopted, with 171 input-layer nodes corresponding to the 171-dimensional features of each frame of xn and 7 hidden layers in the middle, the hidden layers having 1200 nodes each except for the 57-node layer, the hidden layer with the fewest nodes being the Bottleneck layer. The connection weights between the nodes of each DNN layer are pre-trained unsupervised with a layer-by-layer greedy pre-training method, and the hidden-layer activation function is the ReLU function, which is biologically closer to brain neurons, namely:
f(x)=max(0,x)
The ReLU function, with its unilateral inhibition, sparse activation and relatively wide excitation boundary, is considered to have stronger power to express the original features.
The activation value of the (k+1)-th hidden layer is: hk+1 = f(wk·hk + Bk)
where hk+1 and hk are the activation values of the (k+1)-th and k-th hidden layers respectively, wk is the connection weight between the (k+1)-th and k-th layers, and Bk is the bias of the k-th layer.
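The layer recurrence above, with ReLU activations, amounts to the following (the layer sizes here are arbitrary for illustration, not the patent's 1200-node layers):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), the hidden-layer activation used in the DNN
    return np.maximum(0.0, x)

def forward(h0, weights, biases):
    """Apply h_{k+1} = f(w_k h_k + B_k) through the hidden layers."""
    h = h0
    for w, B in zip(weights, biases):
        h = relu(w @ h + B)
    return h

rng = np.random.default_rng(1)
ws = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
bs = [np.zeros(5), np.zeros(3)]
h = forward(rng.standard_normal(4), ws, bs)
```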
33) the DNN output layer is set to softmax classification output; the spectral characteristic parameters of 100 utterances from each of 5 speakers are selected for training, so the output layer has 5 nodes corresponding to the label features of the 5 speakers. The speaker classification label feature yn serves as the supervision information for supervised training of the DNN; the network weights are adjusted with the stochastic gradient descent algorithm, and the error between the DNN classification output and the speaker classification label feature yn is minimized until convergence, obtaining the DNN based on the speaker recognition task, i.e. the Bottleneck feature extraction network;
34) the joint characteristic parameter xn is input into the DNN frame by frame with the feed-forward algorithm, and the activation value of the Bottleneck layer is extracted for each frame, i.e. the Bottleneck feature bn corresponding to each frame's characteristic parameters. In the invention the Bottleneck layer is the fourth hidden layer, namely:
bn = f(w3·h3 + B3)
where h3 is the activation value of the 3rd hidden layer, w3 is the connection weight between layer 3 and layer 4, and B3 is the bias of layer 3.
4) Utilizing joint feature parameters xnAnd a Bottleneck feature b corresponding to each framenTraining the VAE model until the model training converges, and extracting the sampling characteristic z of each frame of the hidden space z of the VAE modeln;
The Variational Auto-encoder (VAE) used in the present invention is a generative learning method; the concrete structure of the model used in the invention is shown in FIG. 4, where x_{s,n} represents the feature parameters of the source speech, the decoder output represents the feature parameters of the speech converted to the target speaker, b_n represents the Bottleneck feature of the corresponding frame of the target speaker, μ and σ are vector representations of the means and covariances of the components of the Gaussian distribution, z represents the latent space of the VAE model obtained through the sampling process, and z_n is the sampled feature. The parameter-estimation process for VAE model training is shown in FIG. 6. VAE model training comprises the following steps:
41) The joint feature parameters x_n serve as training data for the encoder module of the VAE model, and the Bottleneck features b_n serve as training data when the decoder module decodes and reconstructs; in the decoder module of the VAE model the Bottleneck feature b_n acts as control information for the speech-spectrum reconstruction process, i.e. the Bottleneck feature b_n and the sampled feature z_n are spliced frame by frame and, through the training of the decoder module of the VAE model, reconstruct the speech spectrum features;
The encoder input layer of the VAE model has 171 nodes, followed by two hidden layers: the first with 500 nodes and the second with 64 nodes. Within the second layer, the first 32 nodes compute the mean of each component of the Gaussian mixture distribution and the last 32 nodes compute the variance of each component (at this point the neural network computes a Gaussian mixture distribution that better fits the input signal);
42) According to the variational Bayes principle in the VAE model, an ADAM optimizer is used to optimize the KL (Kullback-Leibler) divergence and the mean square error in the parameter-estimation process of the VAE model shown in FIG. 4, adjusting the network weights of the VAE model so as to obtain the VAE speech-spectrum conversion model;
43) The joint feature parameters x_n are fed into the VAE speech-spectrum conversion model frame by frame, and the latent sampled feature z_n is obtained through the sampling process.
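A minimal sketch of steps 41)–43) (an illustration under stated assumptions, not the patent's implementation: the 171-500-64 encoder dimensions follow the description, but the tanh hidden activation and the standard-normal prior in the KL term are assumptions, and the weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical encoder weights: 171 -> 500 -> 64 (32 means + 32 log-variances).
w1 = rng.standard_normal((171, 500)) * 0.01
b1 = np.zeros(500)
w2 = rng.standard_normal((500, 64)) * 0.01
b2 = np.zeros(64)

def encode(x):
    """Map joint features (T, 171) to 32-d latent Gaussian parameters."""
    h = np.tanh(x @ w1 + b1)           # hidden activation is an assumption
    out = h @ w2 + b2
    return out[:, :32], out[:, 32:]    # mu, log-variance

def sample_z(mu, logvar):
    """Reparameterization trick: z_n = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

x_n = rng.standard_normal((10, 171))   # 10 frames of joint features
mu, logvar = encode(x_n)
z_n = sample_z(mu, logvar)             # 32-d sampled feature per frame
print(z_n.shape)                       # (10, 32)

# Per-frame KL divergence to a standard-normal prior, the term that
# step 42) minimizes together with the reconstruction error:
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
```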
More intuitively, the decoder module of the VAE model modulates the phoneme information z_n, which carries the semantic features, with the speaker individuality feature b_n.
5) The sampled feature z_n and the speaker classification label feature y_n corresponding to each frame are spliced to obtain the training data of the Bottleneck feature mapping network (BP network); the Bottleneck feature b_n of each frame serves as supervision information to guide the training of the Bottleneck feature mapping network, and the output error of the network is minimized with the stochastic gradient descent algorithm to obtain the Bottleneck feature mapping network;
The Bottleneck feature mapping network of the target speaker used in the invention adopts a BP network, whose structure is shown in FIG. 5: the input is the concatenation of z_n and y_n, where z_n is the latent feature of the variational auto-encoder and y_n is the label feature of a speaker participating in training; the output is the Bottleneck feature b_n of the target speaker. The Bottleneck feature mapping network is obtained through the following steps:
51) The sampled feature z_n of the VAE latent space is spliced with the classification label feature y_n of the speaker of each frame to form the training data of the Bottleneck feature mapping network. The network is a three-layer feed-forward fully-connected neural network comprising an input layer, a hidden layer, and an output layer. The input layer has 37 nodes: 32 nodes correspond to the sampled feature z_n of the VAE model, and 5 nodes correspond to the 5-dimensional speaker classification label feature y_n formed by the five speakers participating in training. The output layer has 57 nodes, corresponding to the 57-dimensional Bottleneck feature. The single hidden layer in between has 1200 nodes; its activation function is the sigmoid function, which introduces a nonlinearity, and the output layer is linear. The expression of the sigmoid function is:
f(x) = 1/(1 + e^(−x))
52) According to the mean-square-error minimization criterion, the weights of the Bottleneck feature mapping network are optimized with a stochastic gradient descent algorithm using backward error propagation, minimizing the error between the Bottleneck feature output by the network and the Bottleneck feature b_n of each frame; that is, the weights of the whole network are optimized to finally obtain a BP mapping network that maps the sampled feature z_n and the classification label feature y_n of the target speaker to the Bottleneck feature of the target speaker.
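Steps 51)–52) can be sketched as follows (the 37-1200-57 dimensions follow the description; the toy data, learning rate, and iteration count are placeholders, and real training would use the actual z_n, y_n, and b_n features):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Mapping network: 37 inputs (32-d z_n + 5-d speaker label y_n),
# 1200 sigmoid hidden units, 57 linear outputs (the Bottleneck feature).
w1 = rng.standard_normal((37, 1200)) * 0.05
b1 = np.zeros(1200)
w2 = rng.standard_normal((1200, 57)) * 0.05
b2 = np.zeros(57)

z = rng.standard_normal((64, 32))             # sampled latent features
y = np.eye(5)[rng.integers(0, 5, 64)]         # one-hot speaker labels
inp = np.concatenate([z, y], axis=1)          # (64, 37) network input
target = rng.standard_normal((64, 57))        # stand-in for b_n supervision

initial_mse = np.mean((sigmoid(inp @ w1 + b1) @ w2 + b2 - target) ** 2)

lr = 0.01
for _ in range(200):                          # plain gradient descent on MSE
    h = sigmoid(inp @ w1 + b1)
    out = h @ w2 + b2                         # linear output layer
    err = out - target
    dh = (err @ w2.T) * h * (1.0 - h)         # back-propagated through sigmoid
    w2 -= lr * (h.T @ err) / len(inp)
    b2 -= lr * err.mean(axis=0)
    w1 -= lr * (inp.T @ dh) / len(inp)
    b1 -= lr * dh.mean(axis=0)

final_mse = np.mean((sigmoid(inp @ w1 + b1) @ w2 + b2 - target) ** 2)
print(final_mse < initial_mse)                # the output error decreases
```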
The trained DNN network, VAE network, and Bottleneck feature mapping network are combined to form a voice conversion system based on VAE and Bottleneck features.
Voice conversion is performed according to the spectrum-conversion flow shown in FIG. 2; the voice conversion steps are:
6) The joint feature parameters X_p of the source speaker's speech to be converted are passed through the encoder module of the VAE model to obtain the sampled feature z_n of each frame of the latent space z;
The joint feature parameters X_p of the speech to be converted are obtained as follows: the Mel-cepstrum feature parameters of the speech to be converted are extracted with AHOcoder; on the MATLAB platform, first-order and second-order differences of each frame's feature parameters are computed and spliced with the original features; the spliced feature parameters of each frame are then concatenated with those of the preceding and following frames in the time domain to form the joint feature parameters, yielding the feature parameters X_p of the speech spectrum to be converted.
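A sketch of this joint-feature construction (an illustration under stated assumptions: a 19-dimensional Mel cepstrum is assumed so that the dimensions come out to the 171 used in the description, and a simple numerical gradient stands in for whatever delta regression window the MATLAB pipeline actually uses):

```python
import numpy as np

def joint_features(mcep):
    """Append first- and second-order deltas to each frame's Mel-cepstrum
    features, then splice each frame with its neighbours in the time
    domain, x_n = (X_{t-1}, X_t, X_{t+1})."""
    d1 = np.gradient(mcep, axis=0)             # first-order difference
    d2 = np.gradient(d1, axis=0)               # second-order difference
    xt = np.concatenate([mcep, d1, d2], axis=1)         # (T, 3*D) = (T, 57)
    padded = np.pad(xt, ((1, 1), (0, 0)), mode="edge")  # repeat edge frames
    return np.concatenate([padded[:-2], padded[1:-1], padded[2:]], axis=1)

mcep = np.random.default_rng(3).standard_normal((100, 19))  # 19-d MCEPs
x_p = joint_features(mcep)
print(x_p.shape)  # (100, 171): 19 dims x 3 (static+deltas) x 3 context frames
```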
7) The sampled feature z_n and the classification label feature y_n of the target speaker are spliced frame by frame and input into the Bottleneck feature mapping network to obtain the Bottleneck feature of the target speaker;
8) The Bottleneck feature of the target speaker and the sampled feature z_n are spliced frame by frame and passed through the decoder module of the VAE model to reconstruct the joint feature parameters X_p′ of the converted speech;
9) The speech signal is reconstructed using an AHOcoder sound codec.
Reconstructing the speech signal is specifically: the speech feature parameters X_p′ obtained after conversion are restored to Mel-cepstrum form, i.e. the time-domain splicing items and difference items are removed, and then the AHOcoder codec synthesizes the converted speech.
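The restoration in step 9) can be sketched as the inverse of the splicing above (assuming, purely as an illustration, a [previous | centre | next] frame layout with each frame block ordered [static | delta | delta-delta] and a 19-dimensional static Mel cepstrum; the actual layout is whatever the MATLAB feature pipeline produced):

```python
import numpy as np

def to_mcep(x_converted, dim=19):
    """Drop the time-domain context frames and the difference terms from
    the converted joint features X_p' of shape (T, 171), keeping only the
    static Mel-cepstrum block of the centre frame for AHOcoder synthesis."""
    block = 3 * dim                        # one context frame = 57 dims
    centre = x_converted[:, block:2 * block]
    return centre[:, :dim]                 # static coefficients only

x_conv = np.random.default_rng(4).standard_normal((100, 171))
mcep = to_mcep(x_conv)
print(mcep.shape)    # (100, 19)
```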
Mel-Cepstrum Distortion (MCD) is an objective measure of the quality of speech conversion. The smaller the MCD value between the converted speech and the target speech, the better the conversion performance of the corresponding voice conversion system. FIG. 7 compares the MCD values of converted speech under different conversion conditions obtained by the non-parallel-corpus-trained VAE model when different feature parameters characterize the speaker's individuality; it can be seen from the figure that voice conversion using the Bottleneck feature to characterize the speaker's individuality performs better than the conversion system using the speaker label for that purpose.
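For reference, the MCD between aligned converted and target Mel-cepstrum sequences is conventionally computed as MCD = (10/ln 10)·sqrt(2·Σ(c_i − c_i′)²), averaged over frames (a standard formula, not one stated in the patent; excluding the 0th energy coefficient is a common convention assumed here):

```python
import numpy as np

def mcd(c_converted, c_target):
    """Mean Mel-Cepstral Distortion in dB between two time-aligned
    Mel-cepstrum sequences of shape (T, D); the 0th (energy)
    coefficient is excluded.  Smaller values mean better conversion."""
    diff = c_converted[:, 1:] - c_target[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))

rng = np.random.default_rng(5)
c_a = rng.standard_normal((100, 20))
print(mcd(c_a, c_a))           # 0.0 for identical sequences
```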
Compared with other deep learning models such as the deep belief network (DBN) and the convolutional neural network (CNN), the variational auto-encoder (VAE) can, through the variational Bayes principle, learn during training a probability distribution that conforms to the original input signal in its encoder, obtain features of the original signal's latent space through the sampling process, and reconstruct the original signal from the sampled features through the decoder, so that the error between the reconstructed signal and the original signal is as small as possible (or the difference between their probability distributions is small). This property of the VAE model can be applied to style transfer: in voice conversion, the phoneme information that is independent of the speaker's individuality but related to the semantic features can be separated in the latent space by the VAE model, and the speech spectrum signal can be reconstructed by combining the latent-space information with parameters that characterize the speaker's individuality. In the invention, the speaker's individuality is characterized by the Bottleneck feature extracted by a DNN based on the speaker recognition task; the mapping between the joint feature composed of phoneme information and speaker label and the Bottleneck feature is obtained through a mapping network trained as a BP (back-propagation) network, so that the Bottleneck feature of the target speaker is obtained indirectly from the speech spectrum features of the source speaker; finally the phoneme information in the latent space and the Bottleneck feature of the target speaker are reconstructed into the converted speech spectrum features by the decoder module of the VAE.
Addressing the problems that the traditional Gaussian-mixture-model conversion method and other speech-spectrum conversion methods require parallel corpora and require DTW alignment before model training, the invention combines the properties of the VAE model with a BP network to realize voice conversion under non-parallel corpora. The method has three key points: first, a DNN based on the speaker recognition task is used to extract the Bottleneck feature that characterizes the speaker's individuality; second, a BP neural network is used to establish the mapping between the joint feature composed of the sampled feature z_n and the speaker classification label feature y_n and the Bottleneck feature; third, the decoder module of the trained VAE model reconstructs the joint feature composed of the Bottleneck feature and the sampled feature z_n into the converted speech spectrum features.
The innovations of the method are: (1) the properties of the VAE model are used to separate, in the latent space, the phoneme information that is independent of the speaker's individuality but related to the semantic features, so that voice conversion under non-parallel corpus training can be realized, and the method can complete multiple conversion tasks for different speakers with a single model training; (2) the Bottleneck feature extracted from a DNN based on the speaker recognition task serves as the speaker's individuality feature and participates in the reconstruction process of the VAE decoder module, improving voice conversion performance.
For some medical assistance systems, for example for patients who cannot phonate normally because of physiological defects or diseases of the vocal organs, some principles of the method of the invention can be adopted when providing phonation-assistance equipment. The invention has good extensibility and provides a solution to specific problems in voice conversion, including many-to-many (M2M) voice conversion.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications can be made without departing from the spirit of the invention, and such modifications are to be considered as within the scope of the invention.
Claims (8)
1. A voice conversion method based on VAE under non-parallel corpus training, characterized by comprising the following steps:
training:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
2) the extracted Mel-cepstrum feature parameters X of each frame are differenced and spliced with the original feature parameters X, and the spliced feature parameters X_t are concatenated in the time domain with those of adjacent frames to form the joint feature parameters x_n;
3) the joint feature parameters x_n and the speaker classification label features y_n are used to train a DNN, adjusting the DNN's weights to reduce the classification error until the network converges, obtaining a DNN based on the speaker recognition task and extracting the bottleneck feature b_n of each frame;
4) the joint feature parameters x_n and the bottleneck feature b_n corresponding to each frame are used to train a VAE model until the model training converges, and the sampled feature z_n of each frame of the latent space z of the VAE model is extracted;
5) the sampled feature z_n and the speaker classification label feature y_n corresponding to each frame are spliced to obtain the training data of a bottleneck feature mapping network; the bottleneck feature b_n of each frame serves as supervision information to guide the training of the bottleneck feature mapping network, and the output error of the network is minimized with a stochastic gradient descent algorithm to obtain the bottleneck feature mapping network;
a voice conversion step:
6) the joint feature parameters X_p of the speech to be converted are passed through the encoder module of the VAE model to obtain the sampled feature z_n of each frame of the latent space z;
7) the sampled feature z_n and the classification label feature y_n of the target speaker are spliced frame by frame and input into the bottleneck feature mapping network to obtain the bottleneck feature of the target speaker;
8) the bottleneck feature of the target speaker and the sampled feature z_n are spliced frame by frame and passed through the decoder module of the VAE model to reconstruct the joint feature parameters X_p′ of the converted speech;
9) The speech signal is reconstructed using an AHOcoder sound codec.
2. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein extracting the Mel-cepstrum features of the speech of the speakers participating in training in step 1) is specifically: using the AHOcoder sound codec to extract the Mel-cepstrum features of the speech of the speakers participating in training and reading them into the Matlab platform.
3. The method according to claim 1, wherein obtaining the joint feature parameters in step 2) specifically comprises: computing the first-order and second-order differences of each extracted frame's feature parameters X and splicing them with the original feature parameters X to obtain the feature parameters X_t = (X, ΔX, Δ²X); the spliced feature parameters X_t are then concatenated in the time domain with those of adjacent frames to form the joint feature parameters x_n = (X_{t−1}, X_t, X_{t+1}).
4. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein extracting the bottleneck feature b_n in step 3) comprises the following steps:
31) obtaining, on the MATLAB platform, the speaker classification label feature y_n corresponding to each frame of the joint feature parameters x_n;
32) performing unsupervised pre-training of the DNN with the layer-by-layer greedy pre-training method, where the activation function of the hidden layers is the ReLU function;
33) setting the DNN output layer to softmax classification output, and using the speaker classification label features y_n as supervision information for supervised training of the DNN: the network weights are adjusted with the stochastic gradient descent algorithm, minimizing the error between the DNN's classification output and the speaker classification label features y_n until convergence, obtaining a DNN based on the speaker recognition task;
34) feeding the joint feature parameters x_n into the DNN frame by frame with a feed-forward pass, where the DNN is a fully-connected neural network with 9 layers: the input layer has 171 nodes, corresponding to the 171-dimensional features of each frame of x_n; there are 7 hidden layers in between, with node counts of 1200, 57, and 1200, the hidden layer with the fewest nodes being the bottleneck layer; the activation value of the bottleneck layer for each frame is extracted as the bottleneck feature b_n corresponding to that frame's Mel-cepstrum feature parameters.
5. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein the VAE model training in step 4) comprises the following steps:
41) the joint feature parameters x_n serve as training data for the encoder module of the VAE model, and the bottleneck features b_n serve as training data when the decoder module decodes and reconstructs; in the decoder module of the VAE model the bottleneck feature b_n acts as control information for the speech-spectrum reconstruction process, i.e. the bottleneck feature b_n and the sampled feature z_n are spliced frame by frame and, through the training of the decoder module, reconstruct the speech spectrum features;
42) the KL divergence and the mean square error in the parameter-estimation process of the VAE model are optimized with an ADAM optimizer to adjust the network weights of the VAE model, obtaining the VAE speech-spectrum conversion model;
43) the joint feature parameters x_n are fed into the VAE speech-spectrum conversion model frame by frame, and the latent sampled feature z_n is obtained through the sampling process.
6. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein obtaining the bottleneck feature mapping network in step 5) comprises the following steps:
51) the sampled feature z_n of the VAE speech-spectrum conversion model is spliced with the classification label feature y_n of the speaker of each frame to form the training data of the bottleneck feature mapping network; the bottleneck feature mapping network adopts an input-layer/hidden-layer/output-layer structure, the hidden-layer activation function is the sigmoid function, and the output layer is linear;
52) according to the mean-square-error minimization criterion, the weights of the bottleneck feature mapping network are optimized with a stochastic gradient descent algorithm using backward error propagation, minimizing the error between the bottleneck feature output by the network and the bottleneck feature b_n of each frame.
7. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein the joint feature parameters X_p of the speech to be converted in step 6) are obtained as follows: the Mel-cepstrum feature parameters of the speech to be converted are extracted with AHOcoder; on the MATLAB platform, first-order and second-order differences of each frame's feature parameters are computed and spliced with the original features; the spliced feature parameters of each frame are then concatenated with those of the preceding and following frames in the time domain to form the joint feature parameters, yielding the feature parameters X_p of the speech spectrum to be converted.
8. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein reconstructing the speech signal in step 9) specifically comprises: the speech feature parameters X_p′ obtained after conversion are restored to Mel-cepstrum form, i.e. the time-domain splicing items and difference items are removed, and then the AHOcoder sound codec synthesizes the converted speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810393556.XA CN108777140B (en) | 2018-04-27 | 2018-04-27 | Voice conversion method based on VAE under non-parallel corpus training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108777140A CN108777140A (en) | 2018-11-09 |
CN108777140B true CN108777140B (en) | 2020-07-28 |
Family
ID=64026673
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||