CN108777140B - Voice conversion method based on VAE under non-parallel corpus training - Google Patents
- Publication number: CN108777140B (application CN201810393556.XA)
- Authority: CN (China)
- Prior art keywords: characteristic, frame, bottleneck, training, network
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G10L13/02: Methods for producing synthetic speech; speech synthesisers
- G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L19/02: Speech or audio analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
The invention discloses a voice conversion method based on VAE under non-parallel corpus training. Under the condition of non-parallel texts, Bottleneck features are extracted through a deep neural network; the learning and modeling of the conversion function are then realized based on a variational auto-encoding model, and many-to-many speaker conversion can be realized in the conversion stage. The invention has three advantages: 1) the dependence on parallel texts is removed, and no alignment operation is required in the training process; 2) the conversion systems of multiple source-target speaker pairs can be integrated in one conversion model, realizing many-to-many conversion; 3) the many-to-many conversion system under non-parallel texts provides technical support for applying this technology to practical voice interaction.
Description
Technical Field
The invention belongs to the field of speech signal processing, and particularly relates to a voice conversion method based on a variational auto-encoder (VAE) model under non-parallel corpus training.
Background
Speech conversion technology is a research branch of speech signal processing that overlaps with fields such as speaker recognition and speech synthesis. It aims to change the personalized information of speech while keeping the original semantic information unchanged, so that the speech of one specific speaker (the source speaker) sounds like the speech of another specific speaker (the target speaker). The main tasks of voice conversion are extracting the characteristic parameters of the two speakers' voices, mapping and converting them, and then decoding and reconstructing the converted parameters into converted speech. Throughout this process, both the auditory quality of the converted speech and the accuracy of the converted personality characteristics must be ensured. Voice conversion has been researched for many years and various methods have emerged in the field, among which statistical conversion methods represented by Gaussian mixture models have become the classic approach.
However, such algorithms still have drawbacks. The classical Gaussian-mixture-model approach to voice conversion is mostly designed for one-to-one conversion tasks: the source and target speakers must record training sentences with the same content, Dynamic Time Warping (DTW) must be applied to align the spectral features frame by frame, and only then can the mapping between spectral features be learned through model training, which makes the method inflexible in practical applications. Moreover, when the mapping function is trained with a Gaussian mixture model, global variables must be considered, iterating over the training data makes the computational cost surge, and a Gaussian mixture model achieves a good conversion effect only when training data is plentiful, which is unsuitable for limited computing resources and equipment.
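For contrast with the alignment-free approach of the invention, the frame-by-frame DTW alignment that the classical GMM-based method requires can be sketched roughly as follows (a minimal illustration on toy one-dimensional "spectral" sequences, not part of the patent's method):

```python
import numpy as np

def dtw_align(src, tgt):
    """Minimal dynamic time warping: returns an alignment path pairing
    each source frame with a target frame by minimizing cumulative
    Euclidean distance. Classical GMM conversion needs such a path to
    build frame-aligned training pairs; the VAE method described in
    this patent does not."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# toy sequences of different lengths, as parallel utterances would be
src = np.array([[0.0], [1.0], [2.0], [3.0]])
tgt = np.array([[0.0], [0.0], [1.0], [2.0], [3.0]])
path = dtw_align(src, tgt)
```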
In recent years, research in deep learning has increased both the training speed and the effectiveness of deep neural networks, and researchers continue to propose new models and learning methods with strong modeling capability that can learn deeper features from complex data.
The AHOcoder feature parameter extraction model is a speech codec (speech analysis/synthesis system) developed by Daniel Erro at the AHOLAB signal processing laboratory of the University of the Basque Country. AHOcoder decomposes 16 kHz, 16-bit mono WAV speech into three parts: fundamental frequency (F0), spectrum (Mel cepstral coefficients, MFCC), and maximum voiced frequency.
The fundamental frequency is an important parameter influencing the prosodic characteristics of speech; for fundamental frequency conversion, the voice conversion method of the invention adopts the traditional Gaussian normalization method. Assuming that the logarithmic fundamental frequencies of the voiced segments of the source speaker and the target speaker obey Gaussian distributions, the mean and standard deviation of these distributions are first computed for each speaker. The following formula then converts the logarithmic fundamental frequency of the source speaker's voiced segments to that of the target speaker, while unvoiced segments are left unchanged:
logF0conv = μtgt + (σtgt / σsrc) · (logF0src - μsrc)
where the mean and standard deviation of the source speaker's voiced-segment logarithmic fundamental frequency are denoted μsrc and σsrc, those of the target speaker are denoted μtgt and σtgt, F0src is the fundamental frequency of the source speaker, and F0conv is the converted fundamental frequency.
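As a rough numerical sketch of the Gaussian normalization of log-F0 described above (the statistics below are made up for illustration, not values from the patent):

```python
import numpy as np

def convert_lf0(lf0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Gaussian-normalize log-F0 of voiced source frames to the target
    speaker's log-F0 distribution:
        lf0_conv = mu_tgt + (sigma_tgt / sigma_src) * (lf0_src - mu_src)
    Unvoiced frames (marked here with 0.0) are left unchanged."""
    lf0_src = np.asarray(lf0_src, dtype=float)
    voiced = lf0_src > 0
    out = lf0_src.copy()
    out[voiced] = mu_tgt + (sigma_tgt / sigma_src) * (lf0_src[voiced] - mu_src)
    return out

# hypothetical statistics: lower-pitched source, higher-pitched target
lf0 = np.array([4.8, 5.0, 0.0, 5.2])      # 0.0 marks an unvoiced frame
conv = convert_lf0(lf0, mu_src=5.0, sigma_src=0.2, mu_tgt=5.4, sigma_tgt=0.1)
```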
Disclosure of Invention
In order to solve the problems, the invention provides a voice conversion method based on VAE under non-parallel corpus training, which gets rid of the dependence on parallel texts, realizes the conversion of multiple speakers to multiple speakers, improves the flexibility, and solves the technical problem that the voice conversion is difficult to realize under the condition of limited resources and equipment.
The invention adopts the following technical scheme that a voice conversion method based on VAE under non-parallel corpus training comprises the following steps:
training:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
2) performing differential processing on the extracted Mel cepstrum characteristic parameter X of each frame, splicing the difference features with the original characteristic parameter X, and then splicing the result with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameter xn;
3) using the joint characteristic parameter xn and the speaker classification label feature yn to train a deep neural network (DNN), adjusting the DNN weights to reduce the classification error until the network converges, obtaining a DNN based on the speaker recognition task, and extracting the Bottleneck feature bn of each frame;
4) using the joint characteristic parameter xn and the Bottleneck feature bn corresponding to each frame to train the variational auto-encoder (VAE) model until training converges, and extracting the sampling feature zn of each frame from the hidden space z of the VAE model;
5) splicing the sampling feature zn with the speaker label feature yn corresponding to each frame to obtain the training data of a Bottleneck feature mapping network (BP network), using the Bottleneck feature bn of each frame as supervision information to guide the training of the Bottleneck feature mapping network, and minimizing the output error of the network through a stochastic gradient descent algorithm to obtain the trained Bottleneck feature mapping network;
the trained DNN network, VAE network and Bottleneck feature mapping network are combined to form a voice conversion system based on VAE and Bottleneck features;
a voice conversion step:
6) passing the joint characteristic parameter xp of the speech to be converted through the encoder module of the VAE model to obtain the sampling feature zn of each frame of the hidden space z;
7) splicing the sampling feature zn with the label feature yn of the target speaker frame by frame and inputting the result into the Bottleneck feature mapping network to obtain the Bottleneck feature of the target speaker;
8) splicing the obtained target-speaker Bottleneck feature and the sampling feature zn frame by frame, and reconstructing the joint characteristic parameter xp' of the converted speech through the decoder module of the VAE model;
9) The speech signal is reconstructed using an AHOcoder sound codec.
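The conversion steps above hinge on splicing each frame's latent feature with a one-hot target-speaker label before the mapping network; a minimal sketch of that splicing step (latent dimension and speaker count are illustrative assumptions, not the patent's values):

```python
import numpy as np

def splice_with_speaker_label(z, speaker_id, num_speakers):
    """Concatenate each frame's sampling feature z_n with a one-hot
    speaker classification label y_n, producing the per-frame input of
    the Bottleneck feature mapping network (step 7)."""
    z = np.atleast_2d(z)                      # (frames, latent_dim)
    y = np.zeros((z.shape[0], num_speakers))  # one-hot label per frame
    y[:, speaker_id] = 1.0
    return np.hstack([z, y])

z = np.random.randn(10, 32)                   # 10 frames, 32-dim latent (assumed)
joined = splice_with_speaker_label(z, speaker_id=2, num_speakers=5)
```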
Preferably, extracting the Mel cepstrum features of the speech participating in training in step 1) specifically comprises extracting the Mel cepstrum features of each training utterance with the AHOcoder sound codec and reading them into the Matlab platform.
Preferably, obtaining the joint characteristic parameters in step 2) specifically comprises: computing the first-order and second-order differences of the extracted characteristic parameter X of each frame and splicing them with the original feature to obtain Xt = (X, ΔX, Δ²X), then splicing the obtained Xt with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameter xn = (Xt-1, Xt, Xt+1).
Preferably, extracting the Bottleneck feature bn in step 3) comprises the following steps:
31) obtaining, on the MATLAB platform, the joint characteristic parameter xn and the speaker classification label feature yn corresponding to each frame;
32) performing unsupervised pre-training of the DNN with a layer-by-layer greedy pre-training method, where the activation function of the hidden layers is the ReLU function;
33) setting the DNN output layer to softmax classification output, using the speaker classification label feature yn as the supervision information for supervised training of the DNN, adjusting the network weights with the stochastic gradient descent algorithm, and minimizing the error between the DNN classification output and the speaker classification label feature yn until convergence, obtaining a DNN based on the speaker recognition task;
34) inputting the joint characteristic parameter xn into the DNN frame by frame with the feed-forward algorithm, and extracting the activation value of the Bottleneck layer for each frame, i.e. the Bottleneck feature bn corresponding to the Mel cepstrum characteristic parameter of each frame.
Preferably, training the VAE model in step 4) comprises the following steps:
41) using the joint characteristic parameter xn as the training data of the VAE encoder module and the Bottleneck feature bn as the training data of the decoder module during decoding and reconstruction; in the decoder module of the VAE model, the Bottleneck feature bn serves as control information of the speech spectrum reconstruction process, i.e. the Bottleneck feature bn and the sampling feature zn are spliced frame by frame and the speech spectrum features are reconstructed through the training of the VAE decoder module;
42) optimizing the KL divergence and the mean square error in the parameter estimation process of the VAE model with an ADAM optimizer to adjust the network weights of the VAE model, obtaining the VAE speech spectrum conversion model;
43) inputting the joint characteristic parameter xn into the VAE speech spectrum conversion model frame by frame and obtaining the implicit sampling feature zn of the model through the sampling process.
Preferably, obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) splicing the sampling feature zn with the speaker classification label feature yn of each frame as the training data of the Bottleneck feature mapping network, where the network adopts an input layer, hidden layer and output layer structure, the hidden-layer activation function is the sigmoid function, and the output layer is linear;
52) optimizing the weights of the Bottleneck feature mapping network with a back-propagation stochastic gradient descent algorithm according to the mean-square-error minimization criterion, minimizing the error between the Bottleneck feature output by the network and the Bottleneck feature bn corresponding to each frame.
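A minimal numpy sketch of such a one-hidden-layer mapping network (sigmoid hidden layer, linear output, gradient descent on mean squared error); the dimensions, learning rate and toy regression targets below are illustrative assumptions, not the patent's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# toy mapping task standing in for (z_n, y_n) -> b_n regression
X = rng.standard_normal((200, 8))        # spliced input features (toy dims)
W_true = rng.standard_normal((8, 3))
T = np.tanh(X @ W_true)                  # stand-in "Bottleneck" targets

# input -> sigmoid hidden layer -> linear output layer
W1 = 0.1 * rng.standard_normal((8, 16)); b1 = np.zeros(16)
W2 = 0.1 * rng.standard_normal((16, 3)); b2 = np.zeros(3)
lr = 0.1

def mse():
    H = sigmoid(X @ W1 + b1)
    return float(np.mean((H @ W2 + b2 - T) ** 2))

loss_before = mse()
for _ in range(500):                     # plain batch gradient descent
    H = sigmoid(X @ W1 + b1)
    Y = H @ W2 + b2
    dY = 2 * (Y - T) / len(X)            # d(MSE)/dY
    dW2 = H.T @ dY; db2 = dY.sum(0)
    dH = dY @ W2.T * H * (1 - H)         # back-prop through the sigmoid
    dW1 = X.T @ dH; db1 = dH.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
loss_after = mse()
```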
Preferably, obtaining the joint characteristic parameter xp of the speech to be converted in step 6) specifically comprises: extracting the Mel cepstrum characteristic parameters of the speech to be converted with AHOcoder, computing the first-order and second-order differences of each frame's characteristic parameters on the MATLAB platform and splicing them with the original features, then splicing the result with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameter xp of the speech spectrum to be converted.
Preferably, reconstructing the speech signal in step 9) specifically comprises: restoring the converted speech characteristic parameter xp' to the Mel cepstrum feature form, i.e. removing the time-domain splicing terms and the difference terms, and then synthesizing the converted speech with the AHOcoder sound codec.
The invention has the following beneficial effects: the invention relates to a voice conversion method based on VAE under non-parallel corpus training, which gets rid of the dependence on parallel texts, realizes the conversion of multiple speakers to multiple speakers, improves the flexibility and solves the technical problem that the voice conversion is difficult to realize under the condition of limited resources and equipment. The invention has the advantages that:
1) the VAE model can, through modeling and learning, separate in its hidden layer the phoneme information that is unrelated to the speaker's personality from the speech spectrum features; the VAE model can therefore learn voice conversion from non-parallel speech data, eliminating the limitation of traditional voice conversion models that source and target speakers must be trained on parallel corpus data with aligned speech spectrum features, which greatly improves the practicability and flexibility of the voice conversion system and provides convenience for designing cross-language voice conversion systems;
2) the voice conversion network obtained by training the VAE model can handle multiple conversion scenarios; compared with a traditional one-to-one voice conversion system, only one model needs to be trained to complete multiple conversion tasks, greatly improving the efficiency of voice conversion model training;
3) in the decoder module of the VAE model, the Bottleneck feature bn is used to represent the speaker's personality characteristics when reconstructing the converted speech spectrum features; compared with a system that represents the speaker's personality information with the speaker classification label feature yn, the finally obtained converted speech achieves a better conversion effect and sound quality.
Drawings
FIG. 1 is a block diagram of the system training process of the present invention;
FIG. 2 is a block diagram of the system conversion process of the present invention;
FIG. 3 is a block diagram of a DNN network based on speaker recognition tasks in accordance with the present invention;
FIG. 4 is a block diagram of a VAE voice spectral feature conversion network of the present invention;
- FIG. 5 is a block diagram of a Bottleneck feature mapping network of the present invention;
FIG. 6 is a schematic diagram of a VAE model variational Bayesian process parameter estimation;
FIG. 7 is a comparison graph of MCD values of converted speech under different conversion situations based on a VAE model using different features to characterize the personality of a speaker.
Detailed Description
The technical solution of the present invention is further explained with reference to the embodiments according to the drawings.
The invention adopts the following technical scheme: in a voice conversion method based on VAE under non-parallel corpus training, the Mel cepstrum features of the speech are extracted with the AHOcoder speech codec and spliced with their first-order and second-order difference features on the MATLAB platform, and the characteristic parameters of the preceding and following frames are then spliced to form the joint characteristic parameter xn. xn is used as training data to train a DNN based on the speaker recognition task; after the network training converges, xn is input into the DNN frame by frame and the Bottleneck layer output of each frame is obtained, i.e. the Bottleneck characteristic parameter bn containing the speaker's personality characteristics. xn is used as the training data of the VAE encoder module and bn as the training data of the decoder module during decoding and reconstruction, so that through its encoder module the VAE model obtains in the hidden space z the phoneme information zn carrying the semantic features, i.e. the sampling features; through the decoder module, the phoneme information zn containing the semantic features and the Bottleneck feature bn containing the speaker's personality features reconstruct the speech spectrum features. The joint features formed by splicing the phoneme information zn with the speaker classification label feature yn are used as training data of a BP network to train the Bottleneck feature mapping network of the target speaker, such that the error between the network output and the Bottleneck feature bn corresponding to each frame is minimal. During conversion, the spectral features of the speech to be converted are first passed through the encoder module of the VAE model to obtain the corresponding phoneme information zn containing the semantic features, which is spliced frame by frame with the classification label feature yn of the target speaker to form a joint feature; this is input into the BP network to obtain the Bottleneck feature of each frame of the target speaker. Then the joint features formed by splicing, frame by frame, the phoneme information zn with the target speaker's Bottleneck features are reconstructed into the converted speech spectrum features through the decoder module of the VAE model, and finally the speech is synthesized with the AHOdecoder. The method specifically comprises a training step and a voice conversion step:
FIG. 1 is a block diagram of a training process of a system according to the present invention, the training steps being:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
The Mel cepstrum features of each training speaker's speech are extracted with the AHOcoder sound codec and read into the Matlab platform; the invention adopts 19-dimensional Mel cepstrum features, the speech content of each speaker may differ, and no DTW alignment is needed.
2) performing differential processing on the extracted Mel cepstrum characteristic parameter X of each frame, splicing the difference features with the original characteristic parameters, and then splicing the result with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameter xn;
The first-order and second-order differences of each extracted frame's characteristic parameter X are computed and spliced with the original feature to obtain the 57-dimensional difference characteristic parameter Xt = (X, ΔX, Δ²X); the obtained Xt is then spliced with the characteristic parameters of the previous and next frames in the time domain to form the 171-dimensional joint characteristic parameter xn = (Xt-1, Xt, Xt+1).
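Under the assumption of simple frame-difference deltas and border-frame repetition at the edges (the patent does not give the exact delta formula), the 19 → 57 → 171 construction can be sketched as:

```python
import numpy as np

def joint_features(mcep):
    """Build x_n = (X_{t-1}, X_t, X_{t+1}) from per-frame Mel cepstra X,
    where X_t = (X, dX, d2X). A plain central first difference is
    assumed for the deltas; edge frames are handled by repeating the
    border frame."""
    X = np.atleast_2d(mcep)                       # (T, 19)
    pad = np.vstack([X[:1], X, X[-1:]])           # repeat-pad for differencing
    dX = (pad[2:] - pad[:-2]) / 2.0               # first-order difference
    pad_d = np.vstack([dX[:1], dX, dX[-1:]])
    d2X = (pad_d[2:] - pad_d[:-2]) / 2.0          # second-order difference
    Xt = np.hstack([X, dX, d2X])                  # (T, 57)
    pad_t = np.vstack([Xt[:1], Xt, Xt[-1:]])      # context window of +/- 1 frame
    return np.hstack([pad_t[:-2], pad_t[1:-1], pad_t[2:]])  # (T, 171)

frames = np.random.randn(100, 19)                 # 100 frames of 19-dim MCEP
xn = joint_features(frames)
```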
3) using the joint characteristic parameter xn and the speaker classification label feature yn to train the DNN, adjusting the DNN weights to reduce the classification error until the network converges, obtaining the DNN based on the speaker recognition task, and extracting the Bottleneck feature bn of each frame;
The structure of the Bottleneck feature extraction DNN used in the present invention is shown in FIG. 3. The number of input-layer nodes corresponds to the dimension of the speech spectrum features participating in training; the output is the softmax speaker classification output, whose number of nodes is determined by the number of speakers participating in training. Extracting the Bottleneck feature bn comprises the following steps:
31) obtaining, on the MATLAB platform, the joint characteristic parameter xn and the speaker classification label feature yn corresponding to each frame; at this stage the source and target speakers are not distinguished, and the characteristic parameters of each frame are distinguished only by the speaker classification label feature yn;
32) the DNN is a fully connected neural network; a 9-layer DNN model is adopted, with 171 input-layer nodes corresponding to the 171-dimensional features of each frame of xn and 7 hidden layers in the middle, the hidden layers having 1200 nodes each except for the 57-node layer, the hidden layer with the fewest nodes being the Bottleneck layer. The connection weights between the nodes of each DNN layer are pre-trained unsupervised with a layer-by-layer greedy pre-training method, and the hidden-layer activation function is the ReLU function, which is biologically closer to brain neurons, namely:
f(x)=max(0,x)
The ReLU function, with its unilateral inhibition, sparse activation and relatively wide excitation boundary, is considered to have stronger power to express the original features.
The activation value of the (k+1)-th hidden layer is: hk+1 = f(wk·hk + Bk)
where hk+1 and hk are the activation values of the (k+1)-th and k-th hidden layers respectively, wk is the connection weight between the (k+1)-th and k-th layers, and Bk is the bias of the k-th layer.
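The layer recurrence above, with ReLU activations, amounts to the following (the layer sizes here are arbitrary for illustration, not the patent's 1200-node layers):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), the hidden-layer activation used in the DNN
    return np.maximum(0.0, x)

def forward(h0, weights, biases):
    """Apply h_{k+1} = f(w_k h_k + B_k) through the hidden layers."""
    h = h0
    for w, B in zip(weights, biases):
        h = relu(w @ h + B)
    return h

rng = np.random.default_rng(1)
ws = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
bs = [np.zeros(5), np.zeros(3)]
h = forward(rng.standard_normal(4), ws, bs)
```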
33) the DNN output layer is set to softmax classification output; the spectral characteristic parameters of 100 utterances from each of 5 speakers are selected for training, so the output layer has 5 nodes corresponding to the label features of the 5 speakers. The speaker classification label feature yn serves as the supervision information for supervised training of the DNN; the network weights are adjusted with the stochastic gradient descent algorithm, and the error between the DNN classification output and the speaker classification label feature yn is minimized until convergence, obtaining the DNN based on the speaker recognition task, i.e. the Bottleneck feature extraction network;
34) the joint characteristic parameter xn is input into the DNN frame by frame with the feed-forward algorithm, and the activation value of the Bottleneck layer is extracted for each frame, i.e. the Bottleneck feature bn corresponding to each frame's characteristic parameters. In the invention the Bottleneck layer is the fourth hidden layer, namely:
bn = f(w3·h3 + B3)
where h3 is the activation value of the 3rd hidden layer, w3 is the connection weight between layer 3 and layer 4, and B3 is the bias of layer 3.
4) Utilizing joint feature parameters xnAnd a Bottleneck feature b corresponding to each framenTraining the VAE model until the model training converges, and extracting the sampling characteristic z of each frame of the hidden space z of the VAE modeln;
The Variational Auto-encoder (VAE) used in the present invention is a generative learning method; the concrete structure of the model used in the invention is shown in FIG. 4, where x_{s,n} represents the feature parameters of the source speech, the decoder output represents the feature parameters of the speech converted to the target speaker, b_n represents the Bottleneck feature of the corresponding frame of the target speaker, μ and σ are vector representations of the means and covariances of the components of the Gaussian distribution, z represents the latent space of the VAE model obtained through the sampling process, and z_n is the sampled feature. The parameter-estimation process for VAE model training is shown in FIG. 6. VAE model training comprises the following steps:
41) The joint feature parameters x_n serve as training data for the encoder module of the VAE model, and the Bottleneck features b_n serve as training data when the decoder module decodes and reconstructs; in the decoder module of the VAE model the Bottleneck feature b_n acts as control information for the speech-spectrum reconstruction process, i.e. the Bottleneck feature b_n and the sampled feature z_n are spliced frame by frame and, through the training of the decoder module of the VAE model, reconstruct the speech spectrum features;
The encoder input layer of the VAE model has 171 nodes, followed by two hidden layers: the first with 500 nodes and the second with 64 nodes. Within the second layer, the first 32 nodes compute the mean of each component of the Gaussian mixture distribution and the last 32 nodes compute the variance of each component (at this point the neural network computes a Gaussian mixture distribution that better fits the input signal);
42) According to the variational Bayes principle in the VAE model, an ADAM optimizer is used to optimize the KL (Kullback-Leibler) divergence and the mean square error in the parameter-estimation process of the VAE model shown in FIG. 4, adjusting the network weights of the VAE model so as to obtain the VAE speech-spectrum conversion model;
43) The joint feature parameters x_n are fed into the VAE speech-spectrum conversion model frame by frame, and the latent sampled feature z_n is obtained through the sampling process.
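A minimal sketch of steps 41)–43) (an illustration under stated assumptions, not the patent's implementation: the 171-500-64 encoder dimensions follow the description, but the tanh hidden activation and the standard-normal prior in the KL term are assumptions, and the weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical encoder weights: 171 -> 500 -> 64 (32 means + 32 log-variances).
w1 = rng.standard_normal((171, 500)) * 0.01
b1 = np.zeros(500)
w2 = rng.standard_normal((500, 64)) * 0.01
b2 = np.zeros(64)

def encode(x):
    """Map joint features (T, 171) to 32-d latent Gaussian parameters."""
    h = np.tanh(x @ w1 + b1)           # hidden activation is an assumption
    out = h @ w2 + b2
    return out[:, :32], out[:, 32:]    # mu, log-variance

def sample_z(mu, logvar):
    """Reparameterization trick: z_n = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

x_n = rng.standard_normal((10, 171))   # 10 frames of joint features
mu, logvar = encode(x_n)
z_n = sample_z(mu, logvar)             # 32-d sampled feature per frame
print(z_n.shape)                       # (10, 32)

# Per-frame KL divergence to a standard-normal prior, the term that
# step 42) minimizes together with the reconstruction error:
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
```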
More intuitively, the decoder module of the VAE model modulates the phoneme information z_n, which carries the semantic features, with the speaker individuality feature b_n.
5) The sampled feature z_n and the speaker classification label feature y_n corresponding to each frame are spliced to obtain the training data of the Bottleneck feature mapping network (BP network); the Bottleneck feature b_n of each frame serves as supervision information to guide the training of the Bottleneck feature mapping network, and the output error of the network is minimized with the stochastic gradient descent algorithm to obtain the Bottleneck feature mapping network;
The Bottleneck feature mapping network of the target speaker used in the invention adopts a BP network, whose structure is shown in FIG. 5: the input is the concatenation of z_n and y_n, where z_n is the latent feature of the variational auto-encoder and y_n is the label feature of a speaker participating in training; the output is the Bottleneck feature b_n of the target speaker. The Bottleneck feature mapping network is obtained through the following steps:
51) The sampled feature z_n of the VAE latent space is spliced with the classification label feature y_n of the speaker of each frame to form the training data of the Bottleneck feature mapping network. The network is a three-layer feed-forward fully-connected neural network comprising an input layer, a hidden layer, and an output layer. The input layer has 37 nodes: 32 nodes correspond to the sampled feature z_n of the VAE model, and 5 nodes correspond to the 5-dimensional speaker classification label feature y_n formed by the five speakers participating in training. The output layer has 57 nodes, corresponding to the 57-dimensional Bottleneck feature. The single hidden layer in between has 1200 nodes; its activation function is the sigmoid function, which introduces a nonlinearity, and the output layer is linear. The expression of the sigmoid function is:
f(x) = 1/(1 + e^(−x))
52) According to the mean-square-error minimization criterion, the weights of the Bottleneck feature mapping network are optimized with a stochastic gradient descent algorithm using backward error propagation, minimizing the error between the Bottleneck feature output by the network and the Bottleneck feature b_n of each frame; that is, the weights of the whole network are optimized to finally obtain a BP mapping network that maps the sampled feature z_n and the classification label feature y_n of the target speaker to the Bottleneck feature of the target speaker.
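Steps 51)–52) can be sketched as follows (the 37-1200-57 dimensions follow the description; the toy data, learning rate, and iteration count are placeholders, and real training would use the actual z_n, y_n, and b_n features):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Mapping network: 37 inputs (32-d z_n + 5-d speaker label y_n),
# 1200 sigmoid hidden units, 57 linear outputs (the Bottleneck feature).
w1 = rng.standard_normal((37, 1200)) * 0.05
b1 = np.zeros(1200)
w2 = rng.standard_normal((1200, 57)) * 0.05
b2 = np.zeros(57)

z = rng.standard_normal((64, 32))             # sampled latent features
y = np.eye(5)[rng.integers(0, 5, 64)]         # one-hot speaker labels
inp = np.concatenate([z, y], axis=1)          # (64, 37) network input
target = rng.standard_normal((64, 57))        # stand-in for b_n supervision

initial_mse = np.mean((sigmoid(inp @ w1 + b1) @ w2 + b2 - target) ** 2)

lr = 0.01
for _ in range(200):                          # plain gradient descent on MSE
    h = sigmoid(inp @ w1 + b1)
    out = h @ w2 + b2                         # linear output layer
    err = out - target
    dh = (err @ w2.T) * h * (1.0 - h)         # back-propagated through sigmoid
    w2 -= lr * (h.T @ err) / len(inp)
    b2 -= lr * err.mean(axis=0)
    w1 -= lr * (inp.T @ dh) / len(inp)
    b1 -= lr * dh.mean(axis=0)

final_mse = np.mean((sigmoid(inp @ w1 + b1) @ w2 + b2 - target) ** 2)
print(final_mse < initial_mse)                # the output error decreases
```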
The trained DNN network, VAE network, and Bottleneck feature mapping network are combined to form a voice conversion system based on VAE and Bottleneck features.
Voice conversion is performed according to the spectrum-conversion flow shown in FIG. 2; the voice conversion steps are:
6) The joint feature parameters X_p of the source speaker's speech to be converted are passed through the encoder module of the VAE model to obtain the sampled feature z_n of each frame of the latent space z;
The joint feature parameters X_p of the speech to be converted are obtained as follows: the Mel-cepstrum feature parameters of the speech to be converted are extracted with AHOcoder; on the MATLAB platform, first-order and second-order differences of each frame's feature parameters are computed and spliced with the original features; the spliced feature parameters of each frame are then concatenated with those of the preceding and following frames in the time domain to form the joint feature parameters, yielding the feature parameters X_p of the speech spectrum to be converted.
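A sketch of this joint-feature construction (an illustration under stated assumptions: a 19-dimensional Mel cepstrum is assumed so that the dimensions come out to the 171 used in the description, and a simple numerical gradient stands in for whatever delta regression window the MATLAB pipeline actually uses):

```python
import numpy as np

def joint_features(mcep):
    """Append first- and second-order deltas to each frame's Mel-cepstrum
    features, then splice each frame with its neighbours in the time
    domain, x_n = (X_{t-1}, X_t, X_{t+1})."""
    d1 = np.gradient(mcep, axis=0)             # first-order difference
    d2 = np.gradient(d1, axis=0)               # second-order difference
    xt = np.concatenate([mcep, d1, d2], axis=1)         # (T, 3*D) = (T, 57)
    padded = np.pad(xt, ((1, 1), (0, 0)), mode="edge")  # repeat edge frames
    return np.concatenate([padded[:-2], padded[1:-1], padded[2:]], axis=1)

mcep = np.random.default_rng(3).standard_normal((100, 19))  # 19-d MCEPs
x_p = joint_features(mcep)
print(x_p.shape)  # (100, 171): 19 dims x 3 (static+deltas) x 3 context frames
```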
7) The sampled feature z_n and the classification label feature y_n of the target speaker are spliced frame by frame and input into the Bottleneck feature mapping network to obtain the Bottleneck feature of the target speaker;
8) The Bottleneck feature of the target speaker and the sampled feature z_n are spliced frame by frame and passed through the decoder module of the VAE model to reconstruct the joint feature parameters X_p′ of the converted speech;
9) The speech signal is reconstructed using an AHOcoder sound codec.
Reconstructing the speech signal is specifically: the speech feature parameters X_p′ obtained after conversion are restored to Mel-cepstrum form, i.e. the time-domain splicing items and difference items are removed, and then the AHOcoder codec synthesizes the converted speech.
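The restoration in step 9) can be sketched as the inverse of the splicing above (assuming, purely as an illustration, a [previous | centre | next] frame layout with each frame block ordered [static | delta | delta-delta] and a 19-dimensional static Mel cepstrum; the actual layout is whatever the MATLAB feature pipeline produced):

```python
import numpy as np

def to_mcep(x_converted, dim=19):
    """Drop the time-domain context frames and the difference terms from
    the converted joint features X_p' of shape (T, 171), keeping only the
    static Mel-cepstrum block of the centre frame for AHOcoder synthesis."""
    block = 3 * dim                        # one context frame = 57 dims
    centre = x_converted[:, block:2 * block]
    return centre[:, :dim]                 # static coefficients only

x_conv = np.random.default_rng(4).standard_normal((100, 171))
mcep = to_mcep(x_conv)
print(mcep.shape)    # (100, 19)
```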
Mel-Cepstrum Distortion (MCD) is an objective measure of the quality of speech conversion. The smaller the MCD value between the converted speech and the target speech, the better the conversion performance of the corresponding voice conversion system. FIG. 7 compares the MCD values of converted speech under different conversion conditions obtained by the non-parallel-corpus-trained VAE model when different feature parameters characterize the speaker's individuality; it can be seen from the figure that voice conversion using the Bottleneck feature to characterize the speaker's individuality performs better than the conversion system using the speaker label for that purpose.
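For reference, the MCD between aligned converted and target Mel-cepstrum sequences is conventionally computed as MCD = (10/ln 10)·sqrt(2·Σ(c_i − c_i′)²), averaged over frames (a standard formula, not one stated in the patent; excluding the 0th energy coefficient is a common convention assumed here):

```python
import numpy as np

def mcd(c_converted, c_target):
    """Mean Mel-Cepstral Distortion in dB between two time-aligned
    Mel-cepstrum sequences of shape (T, D); the 0th (energy)
    coefficient is excluded.  Smaller values mean better conversion."""
    diff = c_converted[:, 1:] - c_target[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))

rng = np.random.default_rng(5)
c_a = rng.standard_normal((100, 20))
print(mcd(c_a, c_a))           # 0.0 for identical sequences
```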
Compared with other deep learning models such as the deep belief network (DBN) and the convolutional neural network (CNN), the variational auto-encoder (VAE) can, through the variational Bayes principle, learn during training a probability distribution that conforms to the original input signal in its encoder, obtain features of the original signal's latent space through the sampling process, and reconstruct the original signal from the sampled features through the decoder, so that the error between the reconstructed signal and the original signal is as small as possible (or the difference between their probability distributions is small). This property of the VAE model can be applied to style transfer: in voice conversion, the phoneme information that is independent of the speaker's individuality but related to the semantic features can be separated in the latent space by the VAE model, and the speech spectrum signal can be reconstructed by combining the latent-space information with parameters that characterize the speaker's individuality. In the invention, the speaker's individuality is characterized by the Bottleneck feature extracted by a DNN based on the speaker recognition task; the mapping between the joint feature composed of phoneme information and speaker label and the Bottleneck feature is obtained through a mapping network trained as a BP (back-propagation) network, so that the Bottleneck feature of the target speaker is obtained indirectly from the speech spectrum features of the source speaker; finally the phoneme information in the latent space and the Bottleneck feature of the target speaker are reconstructed into the converted speech spectrum features by the decoder module of the VAE.
Addressing the problems that the traditional Gaussian-mixture-model conversion method and other speech-spectrum conversion methods require parallel corpora and require DTW alignment before model training, the invention combines the properties of the VAE model with a BP network to realize voice conversion under non-parallel corpora. The method has three key points: first, a DNN based on the speaker recognition task is used to extract the Bottleneck feature that characterizes the speaker's individuality; second, a BP neural network is used to establish the mapping between the joint feature composed of the sampled feature z_n and the speaker classification label feature y_n and the Bottleneck feature; third, the decoder module of the trained VAE model reconstructs the joint feature composed of the Bottleneck feature and the sampled feature z_n into the converted speech spectrum features.
The innovations of the method are: (1) the properties of the VAE model are used to separate, in the latent space, the phoneme information that is independent of the speaker's individuality but related to the semantic features, so that voice conversion under non-parallel corpus training can be realized, and the method can complete multiple conversion tasks for different speakers with a single model training; (2) the Bottleneck feature extracted from a DNN based on the speaker recognition task serves as the speaker's individuality feature and participates in the reconstruction process of the VAE decoder module, improving voice conversion performance.
For some medical assistance systems, for example for patients who cannot phonate normally because of physiological defects or diseases of the vocal organs, some principles of the method of the invention can be adopted when providing phonation-assistance equipment. The invention has good extensibility and provides a solution to specific problems in voice conversion, including many-to-many (M2M) voice conversion.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications can be made without departing from the spirit of the invention, and such modifications are to be considered as within the scope of the invention.
Claims (8)
1. A voice conversion method based on VAE under non-parallel corpus training, characterized by comprising the following steps:
training:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
2) the extracted Mel-cepstrum feature parameters X of each frame are differenced and spliced with the original feature parameters X, and the spliced feature parameters X_t are concatenated in the time domain with those of adjacent frames to form the joint feature parameters x_n;
3) the joint feature parameters x_n and the speaker classification label features y_n are used to train a DNN, adjusting the DNN's weights to reduce the classification error until the network converges, obtaining a DNN based on the speaker recognition task and extracting the bottleneck feature b_n of each frame;
4) the joint feature parameters x_n and the bottleneck feature b_n corresponding to each frame are used to train a VAE model until the model training converges, and the sampled feature z_n of each frame of the latent space z of the VAE model is extracted;
5) the sampled feature z_n and the speaker classification label feature y_n corresponding to each frame are spliced to obtain the training data of a bottleneck feature mapping network; the bottleneck feature b_n of each frame serves as supervision information to guide the training of the bottleneck feature mapping network, and the output error of the network is minimized with a stochastic gradient descent algorithm to obtain the bottleneck feature mapping network;
a voice conversion step:
6) the joint feature parameters X_p of the speech to be converted are passed through the encoder module of the VAE model to obtain the sampled feature z_n of each frame of the latent space z;
7) the sampled feature z_n and the classification label feature y_n of the target speaker are spliced frame by frame and input into the bottleneck feature mapping network to obtain the bottleneck feature of the target speaker;
8) the bottleneck feature of the target speaker and the sampled feature z_n are spliced frame by frame and passed through the decoder module of the VAE model to reconstruct the joint feature parameters X_p′ of the converted speech;
9) The speech signal is reconstructed using an AHOcoder sound codec.
2. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein extracting the Mel-cepstrum features of the speech of the speakers participating in training in step 1) is specifically: using the AHOcoder sound codec to extract the Mel-cepstrum features of the speech of the speakers participating in training and reading them into the Matlab platform.
3. The method according to claim 1, wherein obtaining the joint feature parameters in step 2) specifically comprises: computing the first-order and second-order differences of each extracted frame's feature parameters X and splicing them with the original feature parameters X to obtain the feature parameters X_t = (X, ΔX, Δ²X); the spliced feature parameters X_t are then concatenated in the time domain with those of adjacent frames to form the joint feature parameters x_n = (X_{t−1}, X_t, X_{t+1}).
4. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein extracting the bottleneck feature b_n in step 3) comprises the following steps:
31) obtaining, on the MATLAB platform, the speaker classification label feature y_n corresponding to each frame of the joint feature parameters x_n;
32) performing unsupervised pre-training of the DNN with the layer-by-layer greedy pre-training method, where the activation function of the hidden layers is the ReLU function;
33) setting the DNN output layer to softmax classification output, and using the speaker classification label features y_n as supervision information for supervised training of the DNN: the network weights are adjusted with the stochastic gradient descent algorithm, minimizing the error between the DNN's classification output and the speaker classification label features y_n until convergence, obtaining a DNN based on the speaker recognition task;
34) feeding the joint feature parameters x_n into the DNN frame by frame with a feed-forward pass, where the DNN is a fully-connected neural network with 9 layers: the input layer has 171 nodes, corresponding to the 171-dimensional features of each frame of x_n; there are 7 hidden layers in between, with node counts of 1200, 57, and 1200, the hidden layer with the fewest nodes being the bottleneck layer; the activation value of the bottleneck layer for each frame is extracted as the bottleneck feature b_n corresponding to that frame's Mel-cepstrum feature parameters.
5. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein the VAE model training in step 4) comprises the following steps:
41) the joint feature parameters x_n serve as training data for the encoder module of the VAE model, and the bottleneck features b_n serve as training data when the decoder module decodes and reconstructs; in the decoder module of the VAE model the bottleneck feature b_n acts as control information for the speech-spectrum reconstruction process, i.e. the bottleneck feature b_n and the sampled feature z_n are spliced frame by frame and, through the training of the decoder module, reconstruct the speech spectrum features;
42) the KL divergence and the mean square error in the parameter-estimation process of the VAE model are optimized with an ADAM optimizer to adjust the network weights of the VAE model, obtaining the VAE speech-spectrum conversion model;
43) the joint feature parameters x_n are fed into the VAE speech-spectrum conversion model frame by frame, and the latent sampled feature z_n is obtained through the sampling process.
6. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein obtaining the bottleneck feature mapping network in step 5) comprises the following steps:
51) the sampled feature z_n of the VAE speech-spectrum conversion model is spliced with the classification label feature y_n of the speaker of each frame to form the training data of the bottleneck feature mapping network; the bottleneck feature mapping network adopts an input-layer/hidden-layer/output-layer structure, the hidden-layer activation function is the sigmoid function, and the output layer is linear;
52) according to the mean-square-error minimization criterion, the weights of the bottleneck feature mapping network are optimized with a stochastic gradient descent algorithm using backward error propagation, minimizing the error between the bottleneck feature output by the network and the bottleneck feature b_n of each frame.
7. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein the joint feature parameters X_p of the speech to be converted in step 6) are obtained as follows: the Mel-cepstrum feature parameters of the speech to be converted are extracted with AHOcoder; on the MATLAB platform, first-order and second-order differences of each frame's feature parameters are computed and spliced with the original features; the spliced feature parameters of each frame are then concatenated with those of the preceding and following frames in the time domain to form the joint feature parameters, yielding the feature parameters X_p of the speech spectrum to be converted.
8. The method for VAE-based voice conversion under non-parallel corpus training according to claim 1, wherein reconstructing the speech signal in step 9) specifically comprises: the speech feature parameters X_p′ obtained after conversion are restored to Mel-cepstrum form, i.e. the time-domain splicing items and difference items are removed, and then the AHOcoder sound codec synthesizes the converted speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810393556.XA CN108777140B (en) | 2018-04-27 | 2018-04-27 | Voice conversion method based on VAE under non-parallel corpus training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108777140A CN108777140A (en) | 2018-11-09 |
CN108777140B true CN108777140B (en) | 2020-07-28 |
Family
ID=64026673
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||