CN113096675B - Audio style unification method based on a generative adversarial network - Google Patents
- Publication number: CN113096675B (application CN202110351514.1A)
- Authority
- CN
- China
- Prior art keywords
- audio
- network
- style
- spectrum
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/18—Vocoders using multiple modes
- G10L19/20—Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Abstract
The invention discloses an audio style unification method based on a generative adversarial network, comprising the following steps. Step 1: acquire an initial data set and a noise data set. Step 2: preprocess the initial data set and the noise data set to generate noise-mixed audio and style template audio, and determine the associated training data set and test data set. Step 3: build a generation network model, in which a generator network G is trained to unify audio styles; it takes the noise-mixed audio and the style template audio as input and outputs target-style audio and a target-style spectrum. Step 4: build a discrimination network model, and train a discriminator network D to measure the degree of similarity between the target-style spectrum output by the generator and the style template spectrum. Step 5: construct a loss function model and train the generative adversarial network. With this scheme, the method can adjust the style of other input audio to match the audio style selected by the user.
Description
Technical Field
The invention relates to the technical field of deep learning, and in particular to an audio style unification method based on a generative adversarial network.
Background
Audio style unification refers to adding the characteristics of a particular speaker, such as timbre and paralinguistic features (emotion and intonation), to synthesized audio; it is also called speech style transfer. Research on speech style transfer not only advances the theory of speech signal processing but also promotes the fusion of theory and application across fields, and therefore occupies an important position.
Speech style transfer technology has been developing for decades, and with the progress of speech conversion technology it has achieved many results. Chu Min et al. proposed a method for male-female voice conversion based on time-domain pitch-synchronous overlap-add (Chu Min, Lü Shinan. A synthesis method combining the PSOLA algorithm with a speech sinusoidal model [C] // Proceedings of the Fifth National Conference on Human-Machine Speech Communication, 1998). Desai et al. proposed using a BP neural network to realize voice conversion (Desai S, Raghavendra E V, Yegnanarayana B, et al. Voice conversion using Artificial Neural Networks [J]. 2009). Benefiting from the development of deep learning, later work revised these earlier models; for example, Sun, Lifa, et al. used long short-term memory networks for voice conversion (Greff K, Srivastava R K, Koutnik J, et al. LSTM: A Search Space Odyssey [J]. IEEE Transactions on Neural Networks & Learning Systems, 2016, 28(10): 2222-2232). To further improve conversion quality, Chris Donahue et al. proposed WaveGAN, based on a deep convolutional generative adversarial network, for audio synthesis (Donahue C, Mcauley J, Puckette M. Adversarial Audio Synthesis [J]. 2018); however, because the speech signals are processed directly and simply into spectrograms, the experimental results are not ideal.
Reference to the literature
[1] Chu Min, Lü Shinan. A synthesis method combining the PSOLA algorithm with a speech sinusoidal model [C] // Proceedings of the Fifth National Conference on Human-Machine Speech Communication, 1998.
[2] Desai S, Raghavendra E V, Yegnanarayana B, et al. Voice conversion using Artificial Neural Networks [J]. 2009.
[3] Greff K, Srivastava R K, Koutnik J, et al. LSTM: A Search Space Odyssey [J]. IEEE Transactions on Neural Networks & Learning Systems, 2016, 28(10): 2222-2232.
[4] Donahue C, Mcauley J, Puckette M. Adversarial Audio Synthesis [J]. 2018.
Disclosure of Invention
In view of the above, the present invention aims to provide an audio style unification method based on a generative adversarial network that requires little manual intervention, is easy to automate, is reliable and convenient to implement, and processes rapidly.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
a method of audio style unification based on a generative countermeasure network, comprising:
S01. Acquiring an initial data set and a noise data set;
S02. Preprocessing the initial data set and the noise data set according to preset conditions to generate noise-mixed audio;
S03. Acquiring style template audio;
S04. Constructing a generation network model and training it to obtain a generator network G for unifying audio styles; after the noise-mixed audio and the style template audio are input into the generator network G, it outputs target-style audio and a target-style spectrum;
S05. Acquiring the style template spectrum corresponding to the style template audio;
S06. Constructing a discrimination network model and training it to obtain a discriminator network D for measuring the degree of similarity between the target-style spectrum output by the generator network G and the style template spectrum; the target-style spectrum and the style template spectrum are input into the discriminator network D, which discriminates between them and outputs a probability score mapped to [0, 1];
S07. Constructing a loss function model connected to the generation network model and the discrimination network model, in which the generator network G computes the degree of information loss and the discriminator network D evaluates the degree of style loss, and training to obtain the generative adversarial network;
S08. Performing unified audio style conversion on the audio to be converted with the generative adversarial network, and outputting the style-converted audio.
As a possible implementation, the initial data set comprises the set of clean audio in the Tsinghua University Chinese speech data set THCHS;
the noise data set comprises the set of 3 types of noise audio in the same THCHS data set.
As a possible implementation, the style template spectrum is the spectrum obtained by applying a forward Fourier transform to the style template audio.
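The 257 × 513 spectrum size used throughout corresponds to the short-time Fourier transform of one 4-second clip at 16.384 kHz. The text does not state the analysis parameters, but a 512-point FFT with a hop of 128 samples (an assumption here) reproduces the stated shape exactly:

```python
# Hypothetical STFT parameters that reproduce the 257 x 513 spectrum size;
# the patent specifies only the sampling rate (16.384 kHz), the clip length
# (4 s) and the resulting spectrum shape.
n_samples = int(16384 * 4)       # 65536 samples per clip
n_fft, hop = 512, 128            # assumed FFT size and hop length

freq_bins = n_fft // 2 + 1       # one-sided spectrum -> 257 frequency bins
frames = n_samples // hop + 1    # centre-padded STFT -> 513 time frames

print(freq_bins, frames)
```

Note that 16.384 kHz (rather than the usual 16 kHz) makes the clip length a power of two in samples, which keeps these dimensions clean.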
As a preferred implementation, in step S02 the initial data set and the noise data set are preprocessed according to the preset conditions to generate the noise-mixed audio as follows:
S021. Resample the initial data set and the noise data set to 16.384 kHz, and segment each into clips of 4 seconds;
S022. Generate the noise-mixed audio according to a preset formula:
Z = C + N * r
where C represents a clip of audio from the initial data set after resampling and segmentation; N represents a clip of audio from the noise data set after resampling and segmentation; r represents a random number in [0.1, 0.3]; and Z represents a clip of the generated noise-mixed audio.
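A minimal NumPy sketch of this mixing step (the function and variable names are illustrative; the text specifies only the formula Z = C + N * r with r drawn from [0.1, 0.3]):

```python
import numpy as np

def mix_noise(clean: np.ndarray, noise: np.ndarray, rng=None) -> np.ndarray:
    """Mix a clean 4-second clip with a noise clip of the same length,
    scaling the noise by a random factor r drawn from [0.1, 0.3]."""
    if rng is None:
        rng = np.random.default_rng()
    assert clean.shape == noise.shape, "clips must be segmented to equal length"
    r = rng.uniform(0.1, 0.3)   # random noise weight
    return clean + noise * r    # Z = C + N * r

# Example: 4 s at 16.384 kHz -> 65536 samples per clip
c = np.zeros(65536)
n = np.ones(65536)
z = mix_noise(c, n)
```

With a zero clean clip and unit noise, every sample of `z` equals the drawn weight r, which makes the scaling easy to inspect.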
Preferably, the style template audio is randomly extracted from the initial data set after resampling and segmentation or extracted from a pre-constructed style template audio library.
As a preferred implementation, 85% of the audio units in the noise-mixed audio and the style template audio are randomly extracted as the training data set, and the remaining 15% serve as the test data set; the training data set and the test data set are used for training and testing the generator network G and/or the discriminator network D.
As a preferred implementation, the generator network G comprises a noise-mixed audio encoder, a style template audio encoder and a decoder.
The generator network G has two inputs and two outputs. One input receives the spectrum of the noise-mixed audio after the forward Fourier transform, of size 257 × 513 × 1, and the other input receives the style template spectrum, also of size 257 × 513 × 1. One output produces the target-style spectrum, of size 257 × 513 × 1, which is fed into the discriminator network D for comparison; the other output produces the audio obtained by applying the inverse Fourier transform to the target-style spectrum, i.e. the target-style audio.
The noise-mixed audio encoder comprises 8 encoder units. Each encoder unit has a convolution kernel of size 3 × 3, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence. The first encoder unit receives the spectrum of the noise-mixed audio after the forward Fourier transform, of size 257 × 513 × 1; each subsequent encoder unit takes as input the output features of the previous encoder unit, and the output scale of the last encoder unit is 2 × 3 × 2048.
The style template audio encoder likewise comprises 8 encoder units with the same configuration: convolution kernels of size 3 × 3, stride 2, ReLU activation, and 16, 32, 64, 128, 256, 512, 1024 and 2048 convolution kernels in sequence. The first encoder unit receives the style template spectrum, of size 257 × 513 × 1; each subsequent encoder unit takes as input the output features of the previous encoder unit, and the output scale of the last encoder unit is 2 × 3 × 2048.
The decoder comprises 8 decoder units. Each decoder unit has a deconvolution kernel of size 3 × 3, a stride of 2 and a ReLU activation function, and the numbers of deconvolution kernels of the decoder units are 1024, 512, 256, 128, 64, 32, 16 and 8 in sequence. The first decoder unit receives the result of tensor concatenation of the output features of the noise-mixed audio encoder and the output features of the style template audio encoder; each subsequent decoder unit takes as input the output features of the previous decoder unit, and the output scale of the last decoder unit is 257 × 513 × 1.
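The stated encoder scales can be checked without a deep-learning framework by propagating the input size through eight stride-2, 3 × 3 convolutions; a padding of 1 is assumed here, since the text does not specify it:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, pad: int = 1) -> int:
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

h, w = 257, 513                     # input spectrum size
channels = [16, 32, 64, 128, 256, 512, 1024, 2048]
for _ in channels:                  # eight encoder units, stride 2 each
    h, w = conv_out(h), conv_out(w)

print(h, w, channels[-1])
```

The result agrees with the stated 2 × 3 × 2048 bottleneck, which suggests padding 1 is indeed the configuration intended.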
As a preferred implementation, the discriminator network D comprises 6 convolutional layers and 5 fully connected layers.
The discriminator network D has two inputs and one output. One input receives the target-style spectrum output by the generator network G, of size 257 × 513 × 1, and the other input receives the style template spectrum, also of size 257 × 513 × 1. The output gives the degree of similarity between the target-style spectrum and the style template spectrum, expressed as a probability score in [0, 1].
Before entering the convolutional layers, the two input spectra are joined by tensor concatenation into a feature of size 257 × 513 × 2, which is then fed to the convolutional layers. Each convolutional layer has a kernel of size 3 × 3 and a stride of 2, applies BatchNorm batch normalization before the convolution, and uses a ReLU activation function; the channel counts of the convolutional layers are 32, 64, 128, 256, 512 and 1024 in sequence. The first convolutional layer receives the result of the tensor concatenation of the target-style spectrum and the style template spectrum; each subsequent convolutional layer takes as input the output features of the previous layer, and the output scale of the last convolutional layer is 5 × 9 × 1024.
The numbers of neurons in the fully connected layers are 46080, 1024, 256, 64 and 1 in sequence; the last layer uses sigmoid as its activation function and the other layers use ReLU. The fully connected stack receives the flattened output of the last convolutional layer and outputs the degree of similarity between the target-style spectrum and the style template spectrum as a probability score in [0, 1].
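The same shape arithmetic confirms the discriminator figures: six stride-2 convolutions reduce the concatenated 257 × 513 × 2 input to 5 × 9 × 1024, and flattening that gives exactly the 46080 inputs of the first fully connected layer (padding 1 is again an assumption):

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, pad: int = 1) -> int:
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

h, w = 257, 513
for _ in range(6):       # six convolutional layers, stride 2 each
    h, w = conv_out(h), conv_out(w)

flat = h * w * 1024      # straightened feature fed to the FC stack
print(h, w, flat)
```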
As a preferred implementation, before the generative adversarial network performs unified audio style conversion on the audio to be converted, its network parameters are further optimized to obtain the parameters giving optimal network performance.
As a preferred implementation, the loss function model is constructed and connected to the generation network model and the discrimination network model; the generator network G computes the degree of information loss, the discriminator network D evaluates the degree of style loss, and the generative adversarial network is trained as follows:
(1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2 (1)
(2) The loss function L_G of the generator network G consists of two parts: one part is L_GD, derived from the output of the discriminator network D, and the other part is the difference between the target-style audio output by the generator network G and the audio of the initial data set, recorded as L_Gm, where
L_GD = D(G(z, x), x) (2)
L_Gm = (1/n) Σ (G(z, x) - c)^2 (3)
L_G = L_GD + k * L_Gm (4)
In formulas (1), (2), (3) and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by applying the forward Fourier transform to a clip of audio from the initial data set; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; and k is a hyperparameter controlling the weight of the two loss parts.
(3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001, and the discriminator network D with the Adam algorithm at a learning rate of 0.0001, so that the parameters giving optimal performance of the generative adversarial network are obtained by optimizing its parameters.
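A minimal NumPy sketch of the two losses. The discriminator outputs are passed in as plain scalars, and the exact form of the reconstruction term is an assumption (a mean-squared per-element difference weighted by the hyperparameter k), since it is not fully legible in the text:

```python
import numpy as np

def d_loss(D_real: float, D_fake: float) -> float:
    """Discriminator loss, eq. (1): (D(c,x) - 1)^2 + D(G(z,x), x)^2."""
    return (D_real - 1) ** 2 + D_fake ** 2

def g_loss(D_fake: float, G_out: np.ndarray, c_spec: np.ndarray,
           k: float = 1.0) -> float:
    """Generator loss: the adversarial term L_GD = D(G(z,x), x) plus
    k times an assumed mean-squared reconstruction term against the
    clean spectrum (n = number of matrix elements)."""
    n = G_out.size
    recon = np.sum((G_out - c_spec) ** 2) / n
    return D_fake + k * recon
```

Note that L_D follows the least-squares GAN convention: it is minimized when the discriminator scores real pairs near 1 and generated pairs near 0.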
By adopting the above technical scheme, compared with the prior art, the invention has the following beneficial effects: a network based on the generative adversarial idea is provided, in which the discriminator network supervises the training of the generator network so that the style of the noise-mixed audio can finally be unified with the style of the style template audio. The generator network model adopts a fully convolutional encoder-decoder structure and can perform unified processing rapidly; training the network reduces manual intervention and makes automation easy, so the style of other input audio can conveniently be adjusted to the audio style selected by the user.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a partial schematic system flow diagram of the scheme of the present invention;
FIG. 2 is a network structure diagram of the generator of the present scheme;
FIG. 3 is a network structure diagram of the discriminator of the present scheme.
Detailed Description
The invention is described in further detail below with reference to the drawings and embodiments. It is specifically noted that the following embodiments are only for illustrating the present invention and do not limit its scope. Likewise, the following embodiments are only some, not all, of the embodiments of the invention, and all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of the present invention.
In this embodiment, the set of clean audio in the Tsinghua University Chinese speech data set THCHS is taken as the initial data set and used as the experimental data set;
the set of 3 types of noise audio in the same THCHS data set is taken as the noise data set.
The style template audio is randomly extracted from the initial data set after resampling and segmentation or extracted from a pre-constructed style template audio library.
In this embodiment, a system block diagram of implementation steps of a method for unifying audio styles based on a generated countermeasure network is shown in fig. 1, and the implementation steps are as follows:
1. An experimental data set and a noise data set are obtained. The experimental data set is the set of clean audio in the Tsinghua University Chinese speech data set (THCHS); the noise data set is the set of 3 types of noise audio in the same THCHS data set.
2. The experimental data set and the noise data set are preprocessed to generate noise mixed audio and style template audio and to determine training data sets and test data sets associated therewith.
The method comprises the following steps:
(2.1) Resample the experimental data set and the noise data set to 16.384 kHz, and segment each into clips of 4 seconds.
(2.2) The noise-mixed audio is generated by the formula Z = C + N * r, where C represents a clip of audio from the experimental data set after resampling and segmentation; N represents a clip of audio from the noise data set after resampling and segmentation; r represents a random number in [0.1, 0.3]; and Z represents a clip of the generated noise-mixed audio. The style template audio is randomly extracted from the experimental data set after resampling and segmentation.
(2.3) Randomly extract 85% of the noise-mixed audio and style template audio as the training data set, with the remaining 15% as the test data set.
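The 85/15 split of step (2.3) can be sketched with the standard library alone (the list of integers stands in for the actual 4-second audio units, an illustrative simplification):

```python
import random

def train_test_split(items, train_frac=0.85, seed=0):
    """Shuffle a list of audio units and split it into training
    and test subsets at the given fraction."""
    rng = random.Random(seed)      # fixed seed for a reproducible split
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

clips = list(range(1000))          # placeholder for 4-second audio units
train, test = train_test_split(clips)
```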
3. Build the generation network model, in which the generator network G is trained to unify audio styles: it takes the noise-mixed audio and the style template audio as input and outputs the target-style audio and the target-style spectrum.
As shown in fig. 2, this step specifically includes:
(3.1) The generator network G consists of a noise-mixed audio encoder, a style template audio encoder and a decoder. It has two inputs and two outputs. One input is the spectrum of the noise-mixed audio after the forward Fourier transform, of size 257 × 513 × 1; the other is the spectrum of the style template audio after the forward Fourier transform, of size 257 × 513 × 1. One output is the spectrum of the target-style audio, of size 257 × 513 × 1, which is fed into the discrimination network for comparison; the other is the audio obtained by applying the inverse Fourier transform to the spectrum of the target-style audio.
(3.2) The noise-mixed audio encoder consists of 8 encoder units. Each has a convolution kernel of size 3 × 3, a stride of 2 and a ReLU activation function, and the numbers of convolution kernels are 16, 32, 64, 128, 256, 512, 1024 and 2048 in sequence. The input of the first encoder unit is the spectrum of the noise-mixed audio after the forward Fourier transform, of size 257 × 513 × 1; the input of each subsequent encoder unit is the output features of the previous one, and the output scale of the last encoder unit is 2 × 3 × 2048.
(3.3) The style template audio encoder consists of 8 encoder units with the same configuration: convolution kernels of size 3 × 3, stride 2, ReLU activation, and 16, 32, 64, 128, 256, 512, 1024 and 2048 convolution kernels in sequence. The input of the first encoder unit is the spectrum of the style template audio after the forward Fourier transform, of size 257 × 513 × 1; the input of each subsequent encoder unit is the output features of the previous one, and the output scale of the last encoder unit is 2 × 3 × 2048.
(3.4) The decoder consists of 8 decoder units. Each has a deconvolution kernel of size 3 × 3, a stride of 2 and a ReLU activation function, and the numbers of deconvolution kernels are 1024, 512, 256, 128, 64, 32, 16 and 8 in sequence. The input of the first decoder unit is the concatenation of the output tensors of the noise-mixed audio encoder and the style template audio encoder; the input of each subsequent decoder unit is the concatenation of the output features of the previous decoder unit with the output tensor of the encoder unit of matching size in the noise-mixed audio encoder (skip connections); the output scale of the last decoder unit is 257 × 513 × 1.
4. Build the discrimination network model, in which the discriminator network D is trained to measure the degree of similarity between the target-style spectrum output by the generator and the style template spectrum: the two spectra are input, discriminated, and a probability score mapped to [0, 1] is output.
As shown in fig. 3, this step specifically includes:
(4.1) The discriminator network D consists of 6 convolutional layers and 5 fully connected layers. It has two inputs and one output. One input is the target-style spectrum output by the generator, of size 257 × 513 × 1; the other is the spectrum of the style template audio after the forward Fourier transform, of size 257 × 513 × 1. The output is a probability score in [0, 1].
(4.2) Before the convolutional layers, the target-style spectrum output by the generator and the style template spectrum are joined by tensor concatenation into an input of size 257 × 513 × 2, which is fed to the convolutional layers. Each convolutional layer has a kernel of size 3 × 3 and a stride of 2, applies BatchNorm batch normalization, and uses a ReLU activation function; the channel counts are 32, 64, 128, 256, 512 and 1024 in sequence. The input of the first convolutional layer is the result of the tensor concatenation; the input of each subsequent layer is the output features of the previous one, and the output scale of the last convolutional layer is 5 × 9 × 1024.
(4.3) The numbers of neurons in the fully connected layers are 46080, 1024, 256, 64 and 1 in sequence; the last layer uses sigmoid as its activation function and the other layers use ReLU. The input of the fully connected stack is the flattened output of the last convolutional layer, and its output is a probability score in [0, 1] measuring the degree of similarity between the target-style spectrum output by the generator and the style template spectrum.
5. Construct the loss function model. The loss function consists of two parts: one is produced via the generator network G and measures the degree of information loss; the other is produced via the discriminator network D and evaluates the degree of style loss. The generative adversarial network is then trained, and the parameters giving optimal network performance are found by optimizing the network parameters.
The method specifically comprises the following steps:
(5.1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2    (1)
(5.2) The loss function L_G of the generator network G consists of two parts: one part is the term L_GD derived from the discriminator output, and the other is the difference L_GC between the generator output and the experimental dataset audio. They are written as
L_GD = D(G(z, x), x)    (2)
L_GC = (1/n) * sum_i (G(z, x)_i - c_i)^2    (3)
L_G = L_GD + k * L_GC    (4)
In formulas (1), (2), (3), and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by applying the forward Fourier transform to a segment of audio in the experimental dataset; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; k is a hyperparameter used to control the weight of the two loss parts.
(5.3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001, and the discriminator network D with the Adam algorithm at a learning rate of 0.0001; the parameters giving optimal network performance are found by optimizing the network parameters.
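A minimal numeric sketch (not part of the patent) of the least-squares losses above: the scalar scores d_real and d_fake stand in for discriminator outputs, and the combination L_GD + k * L_diff is an illustrative assumption, since the description names a generator-output difference term weighted by the hyperparameter k without fully reproducing formulas (3) and (4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for network outputs (illustrative values only):
d_real = 0.9                       # D(c, x): score for a real-style spectrum
d_fake = 0.2                       # D(G(z, x), x): score for the generator output
g_out = rng.random((257, 513))     # generated target-style spectrum (placeholder)
c = rng.random((257, 513))         # reference spectrum from the dataset (placeholder)
k = 0.5                            # loss-weighting hyperparameter (assumed value)

# Equation (1): least-squares discriminator loss
L_D = (d_real - 1) ** 2 + d_fake ** 2

# Equation (2) plus the per-element spectral-difference term the description names
L_GD = d_fake
L_diff = np.mean((g_out - c) ** 2)   # n = g_out.size elements
L_G = L_GD + k * L_diff              # assumed combination of the two parts
print(L_D, L_G)
```

Training alternates between minimizing L_D over the discriminator parameters and L_G over the generator parameters, here with Adam at learning rates 0.0001 and 0.001 respectively.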
6. Using the generative adversarial network with optimal parameters, perform unified audio style conversion on the audio to be converted and output the style-converted audio.
The foregoing description covers only some embodiments of the present invention and is not intended to limit its scope; all equivalent devices or equivalent processes that make use of the description and drawings of the present invention, directly or indirectly, in other related technical fields are likewise included in the scope of the present invention.
Claims (10)
1. A method for unifying audio styles based on a generative adversarial network, comprising:
acquiring an initial dataset and a noise dataset;
preprocessing the initial dataset and the noise dataset according to preset conditions to generate noise-mixed audio;
acquiring style template audio;
constructing a generative network model and training it to obtain a generator network G, wherein the generator network G is used for unifying audio styles, and after the noise-mixed audio and the style template audio are input into the generator network G, it outputs target-style audio and a target-style spectrum;
acquiring a style template spectrum corresponding to the style template audio;
constructing a discrimination network model and training it to obtain a discriminator network D, wherein the discriminator network D is used for measuring the similarity between the target-style spectrum output by the generator network G and the style template spectrum; after the target-style spectrum and the style template spectrum are input into the discriminator network D, it discriminates between them and outputs a probability score mapped into [0, 1];
constructing a loss function model connected to the generative network model and the discrimination network model, calculating the loss of information through the generator network G of the generative network model, evaluating the loss of style through the discriminator network D of the discrimination network model, and training to obtain a generative adversarial network;
and performing unified audio style conversion on the audio to be style-converted through the generative adversarial network, and outputting the style-converted audio.
2. The method of claim 1, wherein the initial dataset comprises a collection of clean audio from the Tsinghua University Chinese speech dataset THCHS;
the noise dataset comprises a collection of 3 types of noise audio from the Tsinghua University Chinese speech dataset THCHS.
3. The method for unifying audio styles based on a generative adversarial network according to claim 1, wherein the style template spectrum is the spectrum of the style template audio after the forward Fourier transform.
4. The method for unifying audio styles based on a generative adversarial network according to claim 3, wherein the initial dataset and the noise dataset are preprocessed according to preset conditions to generate the noise-mixed audio as follows:
resampling the initial dataset and the noise dataset to 16.384 kHz, and segmenting each into clips of 4-second length;
generating noise mixed audio according to a preset formula, wherein the formula for generating the noise mixed audio is as follows:
Z = C + N * r
wherein C represents a segment of audio from the initial dataset after resampling and segmentation; N represents a segment of audio from the noise dataset after resampling and segmentation; r represents a random number in [0.1, 0.3]; and Z represents a segment of the generated noise-mixed audio.
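The mixing rule of claim 4 can be sketched in a few lines (an illustrative sketch, not part of the claims; the variable names are assumptions). Note that 4 seconds at 16.384 kHz gives 65,536 = 2^16 samples per clip, a power of two convenient for FFT-based processing:

```python
import numpy as np

SR = 16_384            # resampling rate in Hz (16.384 kHz)
CLIP_LEN = 4 * SR      # 4-second segments -> 65,536 samples

rng = np.random.default_rng(0)
clean = rng.standard_normal(CLIP_LEN)   # C: one resampled, segmented clean clip
noise = rng.standard_normal(CLIP_LEN)   # N: one noise clip of equal length

r = rng.uniform(0.1, 0.3)               # random mixing weight in [0.1, 0.3]
mixed = clean + noise * r               # Z = C + N * r
print(mixed.shape)
```

Each clean clip is thus paired with a randomly attenuated noise clip to form one noise-mixed training segment Z.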
5. The method of claim 4, wherein the style template audio is randomly extracted from the initial dataset after resampling and segmentation, or extracted from a pre-built style template audio library.
6. The method for unifying audio styles based on a generative adversarial network according to claim 5, wherein 85% of the audio units in the noise-mixed audio and the style template audio are randomly extracted as a training dataset and the remaining 15% serve as a test dataset; the training dataset and the test dataset are used for training or testing the generator network G and/or the discriminator network D.
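The 85/15 random split of claim 6 can be sketched as follows (an illustrative sketch, not part of the claims; the function name and seed are assumptions):

```python
import random

def split_units(units, train_frac=0.85, seed=0):
    """Randomly split audio units into training and test sets (claim 6 ratio)."""
    units = list(units)
    random.Random(seed).shuffle(units)   # reproducible random extraction
    cut = int(len(units) * train_frac)
    return units[:cut], units[cut:]

train, test = split_units(range(1000))
print(len(train), len(test))   # 850 150
```

Every unit lands in exactly one of the two sets, so the training and test data never overlap.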
7. The method for unifying audio styles based on a generative adversarial network according to any of claims 4 to 6, wherein the generator network G comprises a noise-mixed-audio encoder, a style template audio encoder, and a decoder;
the generator network G has two inputs and two outputs: one input receives the spectrum of the noise-mixed audio after the forward Fourier transform, of size 257 x 513 x 1, and the other input receives the style template spectrum, of size 257 x 513 x 1; one output produces the target-style spectrum, of size 257 x 513 x 1, which is input into the discriminator network D for comparison, and the other output produces the audio obtained from the target-style spectrum by the inverse Fourier transform, namely the target-style audio;
in addition, the noise-mixed-audio encoder comprises 8 encoder units; each encoder unit uses 3 x 3 convolution kernels with stride 2 and the ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024, and 2048 in turn; the first encoder unit takes as input the spectrum of the noise-mixed audio after the forward Fourier transform, of size 257 x 513 x 1; each subsequent encoder unit takes the output features of the previous encoder unit as input, and the output of the last encoder unit has size 2 x 3 x 2048;
the style template audio encoder comprises 8 encoder units; each encoder unit uses 3 x 3 convolution kernels with stride 2 and the ReLU activation function, and the numbers of convolution kernels of the encoder units are 16, 32, 64, 128, 256, 512, 1024, and 2048 in turn; the first encoder unit takes the style template spectrum, of size 257 x 513 x 1, as input; each subsequent encoder unit takes the output features of the previous encoder unit as input, and the output of the last encoder unit has size 2 x 3 x 2048;
the decoder comprises 8 decoder units; each decoder unit uses 3 x 3 deconvolution kernels with stride 2 and the ReLU activation function, and the numbers of deconvolution kernels of the decoder units are 1024, 512, 256, 128, 64, 32, 16, and 8 in turn; the first decoder unit takes as input the result of tensor splicing of the output features of the noise-mixed-audio encoder and the output features of the style template audio encoder; each subsequent decoder unit takes the output features of the previous decoder unit as input, and the output of the last decoder unit has size 257 x 513 x 1.
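As with the discriminator, the encoder bottleneck size in claim 7 follows from the stride-2 shape arithmetic; this sketch (illustrative, assuming "same"-style padding) confirms that eight stride-2 convolutions reduce a 257 x 513 spectrum to the stated 2 x 3 spatial grid:

```python
import math

def encode_shape(h, w, n_layers=8, stride=2):
    # each stride-2 "same" convolution halves H and W, rounding up
    for _ in range(n_layers):
        h, w = math.ceil(h / stride), math.ceil(w / stride)
    return h, w

print(encode_shape(257, 513))   # (2, 3): matches the 2 x 3 x 2048 encoder output
```

The decoder then mirrors this path with eight stride-2 deconvolutions, recovering the 257 x 513 x 1 spectrum.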
8. The method for unifying audio styles based on a generative adversarial network of claim 7, wherein the discriminator network D comprises 6 convolutional layers and 5 fully connected layers;
the discriminator network D has two inputs and one output: one input receives the target-style spectrum output by the generator network G, of size 257 x 513 x 1, and the other input receives the style template spectrum, of size 257 x 513 x 1; the output gives the similarity between the target-style spectrum and the style template spectrum as a probability score in [0, 1];
in addition, before entering the convolutional layers, the target-style spectrum and the style template spectrum input to the discriminator network D are concatenated by tensor splicing into a feature of size 257 x 513 x 2, which is fed to the convolutional layers; each convolutional layer uses 3 x 3 kernels with stride 2, BatchNorm batch normalization, and the ReLU activation function, and the channel counts of the convolutional layers are 32, 64, 128, 256, 512, and 1024 in turn; the first convolutional layer takes the tensor-spliced result of the target-style spectrum and the style template spectrum as input, each subsequent convolutional layer takes the output features of the previous layer as input, and the output of the last convolutional layer has size 5 x 9 x 1024;
the fully connected layers have 46080, 1024, 256, 64, and 1 neurons in turn, with the last layer using sigmoid as its activation function and the other layers using ReLU; the input to the fully connected layers is the flattened output of the last convolutional layer, and their output gives the similarity between the target-style spectrum and the style template spectrum as a probability score in [0, 1].
9. The method for unifying audio styles based on a generative adversarial network according to claim 8, wherein, before the unified audio style conversion is performed on the audio to be style-converted through the generative adversarial network, the parameters giving optimal network performance are obtained by optimizing the network parameters of the generative adversarial network.
10. The method for unifying audio styles based on a generative adversarial network according to claim 9, wherein the loss function model is constructed and connected to the generative network model and the discrimination network model, the loss of information is calculated through the generator network G of the generative network model, the loss of style is evaluated through the discriminator network D of the discrimination network model, and the generative adversarial network is then obtained by training, specifically as follows:
(1) The loss function L_D of the discriminator network D is defined as:
L_D = (D(c, x) - 1)^2 + (D(G(z, x), x))^2    (1)
(2) The loss function L_G of the generator network G consists of two parts: one part is the term L_GD output through the discriminator network D, and the other is the difference L_GC between the target-style audio output by the generator network G and the audio of the initial dataset, written as
L_GD = D(G(z, x), x)    (2)
L_GC = (1/n) * sum_i (G(z, x)_i - c_i)^2    (3)
L_G = L_GD + k * L_GC    (4)
in formulas (1), (2), (3), and (4), n is the number of matrix elements in the target-style spectrum output by the generator network G; c is the spectrum obtained by applying the forward Fourier transform to a segment of audio in the initial dataset; z is the spectrum of the noise-mixed audio after the forward Fourier transform; x is the spectrum of the style template audio after the forward Fourier transform; k is a hyperparameter used to control the weight of the two loss parts;
(3) The generator network G is optimized with the Adam algorithm at a learning rate of 0.001, and the discriminator network D with the Adam algorithm at a learning rate of 0.0001, so that the parameters giving optimal performance of the generative adversarial network are obtained by optimizing its parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110351514.1A CN113096675B (en) | 2021-03-31 | 2021-03-31 | Audio style unification method based on generation type countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113096675A CN113096675A (en) | 2021-07-09 |
CN113096675B (en) | 2024-04-23
Family
ID=76672582
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110351514.1A Active CN113096675B (en) | 2021-03-31 | 2021-03-31 | Audio style unification method based on generation type countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113096675B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114299969B (en) * | 2021-08-19 | 2024-06-11 | 腾讯科技(深圳)有限公司 | Audio synthesis method, device, equipment and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110473154A (en) * | 2019-07-31 | 2019-11-19 | 西安理工大学 | A kind of image de-noising method based on generation confrontation network |
CN110992252A (en) * | 2019-11-29 | 2020-04-10 | 北京航空航天大学合肥创新研究院 | Image multi-format conversion method based on latent variable feature generation |
CN111816156A (en) * | 2020-06-02 | 2020-10-23 | 南京邮电大学 | Many-to-many voice conversion method and system based on speaker style feature modeling |
CN112216257A (en) * | 2020-09-29 | 2021-01-12 | 南方科技大学 | Music style migration method, model training method, device and storage medium |
CN112466316A (en) * | 2020-12-10 | 2021-03-09 | 青海民族大学 | Zero-sample voice conversion system based on generation countermeasure network |
CN112562728A (en) * | 2020-11-13 | 2021-03-26 | 百果园技术(新加坡)有限公司 | Training method for generating confrontation network, and audio style migration method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106847294B (en) * | 2017-01-17 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | Audio-frequency processing method and device based on artificial intelligence |
US11854562B2 (en) * | 2019-05-14 | 2023-12-26 | International Business Machines Corporation | High-quality non-parallel many-to-many voice conversion |
- 2021-03-31: CN application CN202110351514.1A, patent CN113096675B (en), status Active
Non-Patent Citations (1)
Title |
---|
Music style transfer method with vocals based on CQT and Mel spectrum; Ye Hongliang; Zhu Wanning; Hong Lei; Computer Science; 2021-12-31 (S1); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113096675A (en) | 2021-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
US11908455B2 (en) | Speech separation model training method and apparatus, storage medium and computer device | |
CN110600047B (en) | Perceptual STARGAN-based multi-to-multi speaker conversion method | |
CN106952649A (en) | Method for distinguishing speek person based on convolutional neural networks and spectrogram | |
US20220253700A1 (en) | Audio signal time sequence processing method, apparatus and system based on neural network, and computer-readable storage medium | |
CN110853656B (en) | Audio tampering identification method based on improved neural network | |
CN110047501B (en) | Many-to-many voice conversion method based on beta-VAE | |
CN115294970B (en) | Voice conversion method, device and storage medium for pathological voice | |
CN110189766B (en) | Voice style transfer method based on neural network | |
CN111724806B (en) | Double-visual-angle single-channel voice separation method based on deep neural network | |
CN113096675B (en) | Audio style unification method based on generation type countermeasure network | |
JP7124373B2 (en) | LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM | |
Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
Yin et al. | An investigation of fusion strategies for boosting pig cough sound recognition | |
CN113077783B (en) | Method and device for amplifying small language speech corpus, electronic equipment and storage medium | |
Ong et al. | Speech emotion recognition with light gradient boosting decision trees machine | |
CN111860246A (en) | Deep convolutional neural network-oriented data expansion method for heart sound signal classification | |
Ge et al. | Explainable deepfake and spoofing detection: an attack analysis using SHapley Additive exPlanations | |
CN113707172B (en) | Single-channel voice separation method, system and computer equipment of sparse orthogonal network | |
Choi et al. | Adversarial speaker-consistency learning using untranscribed speech data for zero-shot multi-speaker text-to-speech | |
CN113112969B (en) | Buddhism music notation method, device, equipment and medium based on neural network | |
Wan et al. | Deep neural network based chinese dialect classification | |
CN113053356B (en) | Voice waveform generation method, device, server and storage medium | |
Li et al. | Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion. | |
Zhipeng et al. | Voiceprint recognition based on BP Neural Network and CNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||