CN113205794B - Virtual bass conversion method based on generation network - Google Patents

Virtual bass conversion method based on generation network

Info

Publication number
CN113205794B
CN113205794B (application number CN202110468881.XA)
Authority
CN
China
Prior art keywords
virtual bass
training
data
network
signal
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110468881.XA
Other languages
Chinese (zh)
Other versions
CN113205794A (en)
Inventor
史创
郭嘉祺
杨浩聪
陶盛奇
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110468881.XA
Publication of CN113205794A
Application granted
Publication of CN113205794B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a virtual bass conversion method based on a generative network and belongs to the technical field of audio processing. An initial virtual bass generation network with two generators and two discriminators is set up on the basis of a cycle-consistent generative network; the initial network is trained on a prepared training set, and when the convergence condition is met the first generator is taken as the virtual bass generation network. The original audio data to be converted are then input into this network, and the conversion result is obtained from its output. The time-domain waveform of the virtual bass generated by the invention is nearly identical in bass contour to that generated by the traditional method. Moreover, once the virtual bass generation network has been trained, the invention requires no complicated parameter setting or adjustment at each generation.

Description

Virtual bass conversion method based on generation network
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a virtual bass conversion method.
Background
Owing to limitations of the manufacturing process, ordinary loudspeakers have a strictly limited working bandwidth, particularly for the low-frequency part of the sound signal, so they cannot reproduce the full frequency band of the original signal without loss. Low-cost loudspeakers still dominate today's diverse loudspeaker systems, so a general solution, or at least a suitable substitute, is needed for the frequency components that fall outside the bandwidth limit. Against this background, a technique known as virtual bass has been developed.
Virtual bass, also known as the "missing fundamental", was originally proposed by J. C. R. Licklider in the 1951 paper "A Duplex Theory of Pitch Perception". This psychoacoustic study shows that the human auditory system can perceive a bass fundamental frequency from the higher harmonics of the fundamental component of a sound signal. For example, if a person listens to a harmonic series with frequencies of 200 Hz, 300 Hz and 400 Hz, the brain effectively perceives the 100 Hz common difference between them, i.e. the desired fundamental frequency component.
Virtual bass technology was first implemented with nonlinear device (NLD) systems, of which the most widely used is MaxxBass, proposed by D. Ben-Tzur in the 1999 paper "The Effect of the MaxxBass Psychoacoustic Bass Enhancement System on Loudspeaker Design". Referring to fig. 1, the input signal is first passed through a low-pass filter (LPF) to obtain the desired low-frequency component, and this low-pass signal is then processed by a nonlinear device (NLD) to generate harmonic components. The successive harmonic components are passed through a band-pass filter (BPF) to retain the appropriate frequency band and are amplified by a gain G. The processed signals are then superimposed on the delay-compensated original signal and output.
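For illustration, the signal chain of fig. 1 can be sketched in a few lines of Python. This is only a minimal sketch of the generic NLD approach: the filter orders, cutoff and band values, gain and the arctangent nonlinearity are assumptions made for illustration, not the parameters of MaxxBass or of the method claimed in this patent.

```python
import numpy as np
from scipy.signal import butter, lfilter

def nld_virtual_bass(x, fs, f_cut=120.0, band=(120.0, 480.0), gain=2.0):
    # 1. low-pass filter: extract the low-frequency component below the speaker cutoff
    b_lp, a_lp = butter(4, f_cut / (fs / 2), btype="lowpass")
    low = lfilter(b_lp, a_lp, x)
    # 2. nonlinear device: creates harmonics of the low-frequency content
    harmonics = np.arctan(4.0 * low) / np.arctan(4.0)
    # 3. band-pass filter: keep only harmonics the speaker can reproduce, then apply gain G
    b_bp, a_bp = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    harmonics = gain * lfilter(b_bp, a_bp, harmonics)
    # 4. superimpose on the original signal (delay compensation omitted in this sketch)
    return x + harmonics
```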
Later, M. R. Bai proposed in the 2006 paper "Synthesis and Implementation of Virtual Bass System with a Phase-Vocoder Approach" to generate the harmonics with a phase vocoder instead of a nonlinear device. The phase vocoder samples the input signal with a short time window and performs the relevant processing after a fast Fourier transform (FFT), which effectively preserves the phase consistency of the signal. Compared with a nonlinear device system, the phase vocoder acts almost entirely in the frequency domain of the signal, so intermodulation distortion in the system output can be effectively avoided. Its drawback, however, is that it is difficult to balance the resolution trade-off between the time domain and the frequency domain: f_res = 1/t_w, where f_res is the frequency-domain resolution (Hz) and t_w is the selected window length. Since virtual bass conversion places high demands on frequency-domain resolution (for example, a 32 ms window gives only 31.25 Hz resolution, while a 100 ms window gives 10 Hz resolution at the cost of time resolution), the choice of window length is in practice a difficult matter.
Finally, A. J. Hill of the University of Essex, England, proposed a hybrid virtual bass approach that mixes the two previous signal-processing methods. The method combines the high sensitivity of the nonlinear device system to time-domain signal changes with the good performance of the phase vocoder on non-transient signals by means of a transient content detector. In other words, the so-called hybrid system essentially assigns weights to the outputs of the two subsystems within the same time window. Although the hybrid virtual bass system effectively combines the advantages of the nonlinear device system and the phase vocoder, it is time-inefficient and requires a large number of parameters to be set, which has directly prevented the technique from being widely used.
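The weighting idea can be sketched as follows. The energy-flux transient measure and the clipping of the weight are illustrative assumptions and are not Hill's actual transient content detector; nld_out and pv_out stand for the outputs of the two subsystems for the current window.

```python
import numpy as np

def hybrid_mix(frame, prev_frame, nld_out, pv_out):
    # crude transient measure: relative energy change between consecutive windows
    e_prev = np.sum(prev_frame ** 2) + 1e-12
    flux = abs(np.sum(frame ** 2) - e_prev) / e_prev
    w = float(np.clip(flux, 0.0, 1.0))   # transient-heavy windows lean on the NLD branch
    return w * nld_out + (1.0 - w) * pv_out
```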
Disclosure of Invention
The invention aims, in view of the above problems, to provide a method that implements virtual bass conversion with a generative network, thereby simplifying the set-up of virtual bass and reducing its processing time.
The virtual bass conversion method based on the generation network comprises the following steps:
step 1: Setting the network structure of an initial virtual bass generation network on the basis of a cycle generation network:
the initial virtual bass generation network comprises a generator G_{X→Y}, a generator G_{Y→X}, a discriminator D_X and a discriminator D_Y; the generator G_{X→Y} is connected to the generator G_{Y→X} and to the discriminator D_Y, and the generator G_{Y→X} is connected to the discriminator D_X and to the discriminator D_Y, where X denotes the feature space of the input data and Y denotes the feature space of the output data;
step 2: Deep learning training is carried out on the initial virtual bass generation network:
step 201: Setting a first training data set:
collecting an original audio signal set, wherein the original audio signal set comprises a plurality of frames of original audio signals;
performing a fast Fourier transform on the original audio signal of the current frame to obtain a frequency-domain signal, and low-pass filtering the frequency-domain signal with a preset cut-off frequency to obtain the original low-frequency signal;
according to a preset first virtual bass processing mode (a traditional, hardware-oriented virtual bass processing mode), performing first virtual bass processing on the original audio signal of the current frame to obtain a first virtual bass signal;
adding the original low-frequency signal of the current frame and the first virtual bass signal to obtain a first reconstructed virtual bass signal of the current frame;
taking the original audio signal of the current frame as training sample data, and taking the first reconstructed virtual bass signal of the current frame as the target data of the training sample, to obtain a first training data set;
step 202: Performing first network parameter training on the initial virtual bass generation network based on the first training data set:
the current training sample data x_i are input into the generator G_{X→Y} and the discriminator D_X respectively;
the training sample data pass through the generator G_{X→Y} to give the generated audio G_{X→Y}(x_i), which is then input into the generator G_{Y→X} and the discriminator D_Y respectively;
the generated audio G_{X→Y}(x_i) passes through the generator G_{Y→X} to give the generated audio G_{Y→X}(G_{X→Y}(x_i));
the target data y_i of the current training sample are input into the discriminator D_Y and the generator G_{Y→X} respectively, and pass through the generator G_{Y→X} to give the generated audio G_{Y→X}(y_i);
the generated audio G_{Y→X}(y_i) is input into the generator G_{X→Y} and the discriminator D_X respectively, and passes through the generator G_{X→Y} to give the generated audio G_{X→Y}(G_{Y→X}(y_i));
where the discriminator D_X is used to judge whether there is a difference between the generated audio G_{Y→X}(y_i) and the training sample data x_i, and the discriminator D_Y is used to judge whether there is a difference between G_{X→Y}(x_i) and the target data y_i;
during training, the loss function adopted is L_full:
L_full = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})
where λ_cyc and λ_id denote the weights of the loss functions L_cyc(G_{X→Y}, G_{Y→X}) and L_id(G_{X→Y}, G_{Y→X}) respectively;
loss function L_adv(G_{X→Y}, D_Y) = E_{y_i~P_Data(y)}[log D_Y(y_i)] + E_{x_i~P_Data(x)}[log(1 - D_Y(G_{X→Y}(x_i)))];
loss function L_adv(G_{Y→X}, D_X) = E_{x_i~P_Data(x)}[log D_X(x_i)] + E_{y_i~P_Data(y)}[log(1 - D_X(G_{Y→X}(y_i)))];
loss function L_cyc(G_{X→Y}, G_{Y→X}) = E_{x_i~P_Data(x)}[||G_{Y→X}(G_{X→Y}(x_i)) - x_i||_1] + E_{y_i~P_Data(y)}[||G_{X→Y}(G_{Y→X}(y_i)) - y_i||_1];
loss function L_id(G_{X→Y}, G_{Y→X}) = E_{y_i~P_Data(y)}[||G_{X→Y}(y_i) - y_i||_1] + E_{x_i~P_Data(x)}[||G_{Y→X}(x_i) - x_i||_1];
where E[·] denotes the mathematical expectation, P_Data(·) denotes the distribution of the object in parentheses, D_Y(y_i) denotes the discriminator's score for a real target sample, D_Y(G_{X→Y}(x_i)) its score for a generated target sample, D_X(x_i) the discriminator's score for a real original sample, D_X(G_{Y→X}(y_i)) its score for a generated original sample, and ||·||_1 denotes the L1 norm;
when the preset convergence condition of the first network parameter training is met, the generator G_{X→Y} is taken as the virtual bass generation network;
step 3: Framing the original audio signal to be converted, and then performing a fast Fourier transform on each single frame, so that the resulting single-frame data match the input of the virtual bass generation network obtained by training in step 2;
inputting the data of each frame into the virtual bass generation network to obtain the network output signal of the current frame;
performing high-pass filtering on the network output signal of each frame to obtain the virtual bass data of each frame, applying an inverse fast Fourier transform to each frame, and splicing the single-frame virtual bass data in time order to obtain the virtual bass signal corresponding to the original audio signal to be converted.
Further, step 2 of the present invention further comprises:
step 201 further includes dividing the original audio signal set into two parts of unequal data volume, marking the part with the larger data volume as the first original audio signal subset and the part with the smaller data volume as the second original audio signal subset;
in step 202, the first virtual bass processing is performed only on each original audio signal in the first original audio signal subset to obtain the first virtual bass signals; and when the preset convergence condition of the first network parameter training is met, step 203 is executed;
the step 203 comprises:
setting a second training data set:
performing second virtual bass processing on the second original audio signal subset according to a preset second virtual bass processing mode (implemented with hybrid virtual bass parameters, e.g. manually adjusted versions of the existing hybrid virtual bass parameters) to obtain the second virtual bass signal of the current frame; adding the original low-frequency signal of the current frame and the second virtual bass signal to obtain a second reconstructed virtual bass signal of the current frame;
taking the original audio signal of the current frame as training sample data, and taking the second reconstructed virtual bass signal of the current frame as the target data of the training sample, to obtain a second training data set;
using the second training data set to perform second network parameter training (i.e. transfer learning) on the initial virtual bass generation network trained in step 202, where during training the data processing procedure is the same as in step 202 and only the training sample data are changed;
that is, the loss function used during training is still L_full; when the preset convergence condition of the second network parameter training is met, the generator G_{X→Y} is taken as the virtual bass generation network.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
The time-domain waveform of the virtual bass generated by the invention is nearly identical in bass contour to that generated by the traditional method. The invention greatly shortens the time required for signal processing, overcomes the poor real-time performance of existing virtual bass techniques and thus widens their range of application. In addition, whereas the traditional virtual bass techniques require tedious parameter setting and adjustment, the invention only needs the trained virtual bass generation network and requires no parameter setting or adjustment at each generation: an original audio signal is input into the trained network, and the corresponding virtual bass signal is obtained from its output.
Drawings
FIG. 1 is a conventional virtual bass process flow;
FIG. 2 is a schematic diagram of a processing procedure for processing audio data by using a CycleGan network according to an embodiment;
FIG. 3 is a schematic diagram of a forward-reverse network cycle consistency loss in an embodiment;
FIG. 4 is a flowchart of the virtual bass generation method of the present invention;
FIG. 5 is a time domain waveform of an original signal in an embodiment;
FIG. 6 is a time domain waveform of the virtual bass signal generated by the generative adversarial network in the embodiment;
FIG. 7 is a time domain waveform of the virtual bass signal generated by the conventional processing method in the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
In the present invention, the generation network is used to learn a mapping relationship from input x to output y. For each input audio x_i ∈ X, the invention finds a corresponding sample y_i ∈ Y, where i denotes the index within a set of parallel data, X denotes the feature space of the input data and Y denotes the feature space of the output data.
Referring to fig. 2, in this embodiment the processing of audio data with the CycleGAN network during training is as follows:
The original audio x is input into the generator G_{X→Y} to obtain the generated audio G(x), and G(x) is input into the generator G_{Y→X} to obtain the generated audio Cyclic x; the audio y generated by the conventional method is input into the generator G_{Y→X} to obtain the generated audio G(y), and G(y) is input into the generator G_{X→Y} to obtain the generated audio Cyclic y. At the same time, the original audio x and the generated audio G(y) are both input into the discriminator D_X, which is used to distinguish them and output their difference; and the generated audio G(x) and the audio y are both input into the discriminator D_Y, which is used to judge the difference between them and output it.
The loss function adopted mainly comprises two parts: the adversarial loss and the cycle-consistency loss.
As a measure of the distinction between the converted data G_{X→Y}(x_i) and the target data y_i, the adversarial loss is an important quantity and can be expressed as:
L_adv(G_{X→Y}, D_Y) = E_{y_i~P_Data(y)}[log D_Y(y_i)] + E_{x_i~P_Data(x)}[log(1 - D_Y(G_{X→Y}(x_i)))]   (1)
where G_{X→Y} denotes the mapping function from X to Y, i.e. the generator from X to Y, whose output (the generated audio) is written G_{X→Y}(·); D_Y denotes the discriminator for Y, whose output is written D_Y(·); E[·] denotes the mathematical expectation, and P_Data(·) denotes the distribution of the object in parentheses. That is, the generator G_{X→Y} attempts to generate a new audio signal G(x_i) that sounds like the corresponding audio y_i pre-processed by the virtual bass technique, while the discriminator D_Y attempts to distinguish the generated audio signal G(x_i) from the real signal y_i; in other words, the generator is trained to minimise L_adv(G_{X→Y}, D_Y) while the discriminator is trained to maximise it.
In the same way, the adversarial loss L_adv(G_{Y→X}, D_X) of the reverse mapping is obtained:
L_adv(G_{Y→X}, D_X) = E_{x_i~P_Data(x)}[log D_X(x_i)] + E_{y_i~P_Data(y)}[log(1 - D_X(G_{Y→X}(y_i)))]   (2)
where G_{Y→X} denotes the generator from Y to X, whose output is written G_{Y→X}(·), and D_X denotes the discriminator for X, whose output is written D_X(·).
Because of the limitations of the adversarial loss alone, there is no guarantee that the network maps an individual input x_i onto the desired output y_i. Therefore, to further narrow the space of possible mappings, the CycleGAN network introduces a cycle-consistency loss, which requires that the mapping functions G_{X→Y} and G_{Y→X} be cycle-consistent, as shown in FIG. 3.
The present invention expresses this relationship as:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x_i~P_Data(x)}[||G_{Y→X}(G_{X→Y}(x_i)) - x_i||_1] + E_{y_i~P_Data(y)}[||G_{X→Y}(G_{Y→X}(y_i)) - y_i||_1]   (3)
where ||·||_1 denotes the L1 norm.
In addition, in order to preserve linguistic information, the identity-mapping loss L_id used in the existing CycleGAN-VC is introduced:
L_id(G_{X→Y}, G_{Y→X}) = E_{y_i~P_Data(y)}[||G_{X→Y}(y_i) - y_i||_1] + E_{x_i~P_Data(x)}[||G_{Y→X}(x_i) - x_i||_1]   (4)
In summary, combining equations (1)-(4), the invention sets the total loss function L_full of the network as:
L_full = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})   (5)
where λ_cyc and λ_id are two weight parameters that regulate the relative importance of the cycle-consistency loss L_cyc and the identity-mapping loss L_id within the total network loss L_full.
When the network converges (the number of training iterations reaches the preset maximum, or the total network loss reaches the specified value), the audio data to be processed can be input into the generator G_{X→Y}, and the converted virtual bass data are obtained from its output.
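For concreteness, the total loss of equation (5) could be evaluated along the lines of the following PyTorch sketch. It is only an illustration under stated assumptions: the discriminators are assumed to output probabilities in (0, 1), the default weights follow the embodiment described below (λ_cyc = 10, λ_id = 5), and in an actual training loop the generators and discriminators would optimise the adversarial terms in opposite directions rather than share one scalar objective.

```python
import torch

def full_loss(G_xy, G_yx, D_x, D_y, x, y, lambda_cyc=10.0, lambda_id=5.0, eps=1e-8):
    fake_y = G_xy(x)   # G_{X->Y}(x_i)
    fake_x = G_yx(y)   # G_{Y->X}(y_i)
    # adversarial terms, eq. (1) and (2) (discriminator outputs assumed to lie in (0, 1))
    l_adv_xy = torch.mean(torch.log(D_y(y) + eps)) + torch.mean(torch.log(1 - D_y(fake_y) + eps))
    l_adv_yx = torch.mean(torch.log(D_x(x) + eps)) + torch.mean(torch.log(1 - D_x(fake_x) + eps))
    # cycle-consistency loss, eq. (3): X->Y->X and Y->X->Y must return to the input (L1 norm)
    l_cyc = torch.mean(torch.abs(G_yx(fake_y) - x)) + torch.mean(torch.abs(G_xy(fake_x) - y))
    # identity-mapping loss, eq. (4)
    l_id = torch.mean(torch.abs(G_xy(y) - y)) + torch.mean(torch.abs(G_yx(x) - x))
    return l_adv_xy + l_adv_yx + lambda_cyc * l_cyc + lambda_id * l_id
```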
Referring to fig. 4, on the basis of the network configuration described above, the specific process of generating virtual bass with the configured generative adversarial network is as follows:
preprocessing training set data:
in order to improve the quality of the generated audio and to shorten the time required for training as much as possible.
The invention first randomly extracts 100 samples from the original data (original audio signal X [ n ]) as test data, which will not participate in the training.
The remaining data are then divided in a ratio of 7:1 into two data sets, one large and one small. Virtual bass is generated from the large data set, i.e. the major part of the data, in the manner shown in fig. 1 with default parameter settings, giving the output x_di[n]; virtual bass is generated from the remaining small data set with adjusted parameters, giving the output x_ai[n].
The adjusted parameters used to generate virtual bass may be: the nonlinear equation adopted by the harmonic generator is changed from an exponential function to an arctangent square root, and the highest harmonic component (start harm) of the phase vocoder (PV) is raised to ensure that satisfactory high-frequency harmonics can be generated.
In other words, in the embodiment of the invention, in order to improve the accuracy of the network, two training data sets are obtained, one with the nonlinear-device-based method and one with the parameter-adjusted hybrid virtual bass method; they correspond to the first and second network parameter training of the generation network, where the first network parameter training is conventional training and the second is transfer-learning training. Of these two ways of obtaining virtual bass, the hybrid method is more accurate than the purely nonlinear-device-based one, so the virtual bass obtained with the more accurate method is used as the target audio y during transfer learning, and the virtual bass obtained with the less accurate method is used as the target audio y during the first network parameter training.
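The data organisation just described can be sketched as follows. The 100-sample test split and the 7:1 ratio come from this embodiment, while process_default and process_adjusted are hypothetical stand-ins for the default-parameter processing of fig. 1 and the parameter-adjusted hybrid processing, which are not implemented here.

```python
import random

def split_and_process(samples, process_default, process_adjusted, n_test=100, ratio=7):
    random.shuffle(samples)                               # shuffles the sample list in place
    test, rest = samples[:n_test], samples[n_test:]       # 100 test samples, excluded from training
    n_large = len(rest) * ratio // (ratio + 1)            # 7:1 split of the remaining data
    large, small = rest[:n_large], rest[n_large:]
    x_d = [process_default(s) for s in large]             # x_di[n]: default-parameter processing
    x_a = [process_adjusted(s) for s in small]            # x_ai[n]: adjusted-parameter processing
    return test, (large, x_d), (small, x_a)
```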
Then x_di[n] and x_ai[n] are fast-Fourier-transformed (FFT) to obtain X_di[k] and X_ai[k]. In this embodiment the cut-off frequency is f_cutoff = 120 Hz, the sampling rate of all audio is adjusted to 8000 Hz, and the single-frame length is 32 ms. It should be noted that in this example the frequency components of an audio sample below the 120 Hz cut-off frequency constitute the low-frequency component X_l.
Then, a single frame x[n] of an audio sample is fast-Fourier-transformed to obtain the frequency-domain signal X[k]; the corresponding low-frequency component X_li[k] (the band below the cut-off frequency) is extracted and added to X_ai[k] and to X_di[k] respectively, giving the target generation signals of the two final data sets. This supplements the low-frequency part missing from X_ai[k] and X_di[k] and eases the convergence of the network: since the low-frequency component is zero everywhere above the cut-off frequency, the element-wise addition simply restores the low frequencies that the audio loses during virtual bass processing, which keeps the network numerically stable during training. That is, in this embodiment X[k] is used as the first original input signal x fed to the generator G_{X→Y} to obtain the generated audio G(x) and is also fed to the discriminator D_X, while the sum of the low-frequency component X_li[k] and X_di[k] is used as the corresponding target virtual bass signal, i.e. the generated audio y obtained by the traditional method, thereby realising the first network parameter training of the generation network. Likewise, X[k] is used as the original input signal in the second training stage, in which the target audio y is passed through the generator G_{Y→X} to obtain the generated audio G(y), and the sum of X_li[k] and X_ai[k] is used as the corresponding real audio, i.e. the generated audio y obtained with the parameter-adjusted hybrid virtual bass method, thereby realising the second network parameter training of the generation network.
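A minimal sketch of this low-frequency supplementation, using the concrete values of the embodiment (8000 Hz sampling rate, 256-point/32 ms frames, 120 Hz cutoff), might look as follows; x_frame is one frame of the original audio, vb_frame the corresponding virtual-bass-processed frame, and the complex one-sided spectrum is an assumed representation rather than one fixed by the text.

```python
import numpy as np

FS, FRAME_LEN, F_CUTOFF = 8000, 256, 120.0   # 8000 Hz, 32 ms frames (256 points), 120 Hz cutoff

def make_target(x_frame, vb_frame):
    freqs = np.fft.rfftfreq(FRAME_LEN, 1.0 / FS)      # bin spacing 8000/256 = 31.25 Hz
    X = np.fft.rfft(x_frame)                           # frequency-domain signal X[k]
    X_low = np.where(freqs < F_CUTOFF, X, 0.0)         # low-frequency component X_li[k]
    # element-wise addition: supplement the low frequencies lost in virtual bass processing
    return X_low + np.fft.rfft(vb_frame)
```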
Training: first, the large data set is used as the training set and 3 rounds of training are performed, each round comprising 300 epochs. In this example the entire audio signal is not used directly; instead, data segments of fixed frame length (256 points) are extracted from the paired data.
In addition, in this embodiment λ_cyc and λ_id are set to 10 and 5 respectively, and λ_id is used only during the first 10^4 iterations to guide the network training, after which it is set to zero to further reduce the amount of computation. During training, the optimizer is Adam and the batch size is set to 1; the learning rate of the generators is 0.0005 with a learning-rate decay of 2.5×10^-9, and the learning rate of the discriminators is 0.0001 with a learning-rate decay of 5×10^-10. During the first 2×10^5 iterations of each round the learning rate is kept at its initial value, after which it decays linearly to 0, which gives a good convergence effect. On the basis of the model so obtained, transfer learning is carried out with the small data set (generated with the traditional, manually tuned parameters) to obtain the final network model, and the trained generator G_{X→Y} is taken as the final virtual bass generation network.
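The optimiser settings and schedules described above could be organised roughly as in the sketch below. Interpreting the stated decay values as per-iteration decrements applied after the 2×10^5-iteration hold, and the exact handling of the λ_id cutoff, are assumptions of this sketch rather than details fixed by the text.

```python
import torch

def make_optimizers(generator_params, discriminator_params):
    # Adam, batch size 1; generator lr 0.0005, discriminator lr 0.0001 as stated above
    opt_g = torch.optim.Adam(generator_params, lr=5e-4)
    opt_d = torch.optim.Adam(discriminator_params, lr=1e-4)
    return opt_g, opt_d

def lr_at(step, base_lr, decay_per_step, hold_steps=200_000):
    # constant for the first 2e5 iterations of a round, then decreased linearly towards 0
    if step < hold_steps:
        return base_lr
    return max(0.0, base_lr - decay_per_step * (step - hold_steps))

def lambda_id_at(step, lambda_id=5.0, cutoff=10_000):
    # the identity-mapping weight is used only during the first 1e4 iterations
    return lambda_id if step < cutoff else 0.0
```

For example, under these assumptions lr_at(250_000, 5e-4, 2.5e-9) returns 3.75×10^-4 for the generators.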
Finally, the test data are input into the trained virtual bass generation network to obtain the corresponding network output signals, the network output signals are high-pass filtered (removing the low-frequency part), and the result is restored from the frequency domain to the time domain to obtain the virtual bass signal of the test data. This high-pass filtering of the network output is essential because virtual bass is perceived as low frequency even though its energy actually lies in higher-frequency harmonics.
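The conversion stage can be sketched end to end as follows; generator stands for the trained virtual bass generation network applied to single-frame frequency-domain data, and the non-overlapping 256-point frames and the complex-spectrum interface are simplifying assumptions of this sketch.

```python
import numpy as np

def convert(x, fs, generator, frame_len=256, f_cutoff=120.0):
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    out = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        X = np.fft.rfft(x[start:start + frame_len])   # single-frame frequency-domain data
        Y = generator(X)                               # trained virtual bass generation network
        Y = np.where(freqs > f_cutoff, Y, 0.0)         # high-pass: keep only the virtual bass part
        out.append(np.fft.irfft(Y, n=frame_len))       # back to the time domain
    return np.concatenate(out)                         # splice the frames in time order
```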
To verify the generation performance of the invention, the pre-separated verification data are processed both by the conventional method (hybrid virtual bass) and by the generation network of the invention; after removing the low-frequency components from the resulting virtual bass data, audio signals for evaluating the virtual bass conversion performance are obtained. The time-domain waveforms of the generated audio are shown in fig. 7 and fig. 6 respectively. Comparing them with the original signal in fig. 5 and with the conventional method shows that the time-domain waveform of the virtual bass generated by the invention is nearly identical in bass contour to that generated by the conventional method. Although the sound quality can still be improved, processing a 10-second audio signal takes about 40 seconds with the conventional method but only about 3 seconds with the invention; the invention therefore greatly shortens the signal-processing time, overcomes the poor real-time performance of existing virtual bass techniques and widens their range of application. In addition, the conventional virtual bass techniques require tedious parameter setting and adjustment and demand a certain level of expertise from the user. In summary, the audio signal processing method provided by the invention has considerable practical value.
What is described above is merely an embodiment of the invention. Unless stated otherwise, any feature disclosed in this specification may be replaced by an alternative feature serving an equivalent or similar purpose; all of the disclosed features, or all of the method or process steps, may be combined in any way, except for mutually exclusive features and/or steps.

Claims (7)

1. The virtual bass conversion method based on the generation network is characterized by comprising the following steps:
step 1: Setting the network structure of an initial virtual bass generation network on the basis of a cycle generation network:
the virtual bass generation network comprises a generator G_{X→Y}, a generator G_{Y→X}, a discriminator D_X and a discriminator D_Y; the generator G_{X→Y} is connected to the generator G_{Y→X} and to the discriminator D_Y, and the generator G_{Y→X} is connected to the discriminator D_X and to the discriminator D_Y, where X denotes the feature space of the input data and Y denotes the feature space of the output data;
step 2: Carrying out deep learning training on the initial virtual bass generation network:
step 201: Setting a first training data set:
collecting an original audio signal set, wherein the original audio signal set comprises a plurality of frames of original audio signals;
performing a fast Fourier transform on the original audio signal of the current frame to obtain a frequency-domain signal, and low-pass filtering the frequency-domain signal with a preset cut-off frequency to obtain the original low-frequency signal;
dividing the original audio signal set into two parts of unequal data volume, marking the part with the larger data volume as a first original audio signal subset and the part with the smaller data volume as a second original audio signal subset;
according to a preset first virtual bass processing mode, performing first virtual bass processing on the original audio signals of the current frame in the first original audio signal subset to obtain first virtual bass signals;
adding the original low-frequency signal of the current frame and the first virtual bass signal to obtain a first reconstructed virtual bass signal of the current frame;
taking the original audio signal of the current frame as training sample data, and taking the first reconstructed virtual bass signal of the current frame as the target data of the training sample, to obtain a first training data set;
wherein the first virtual bass processing mode is: generating virtual bass with default parameter settings based on the nonlinear device;
step 202: Performing first network parameter training on the initial virtual bass generation network based on the first training data set:
the current training sample data x_i are input into the generator G_{X→Y} and the discriminator D_X respectively;
the training sample data pass through the generator G_{X→Y} to give the generated audio G_{X→Y}(x_i), which is then input into the generator G_{Y→X} and the discriminator D_Y respectively;
the generated audio G_{X→Y}(x_i) passes through the generator G_{Y→X} to give the generated audio G_{Y→X}(G_{X→Y}(x_i));
the target data y_i of the current training sample are input into the discriminator D_Y and the generator G_{Y→X} respectively, and pass through the generator G_{Y→X} to give the generated audio G_{Y→X}(y_i);
the generated audio G_{Y→X}(y_i) is input into the generator G_{X→Y} and the discriminator D_X respectively, and passes through the generator G_{X→Y} to give the generated audio G_{X→Y}(G_{Y→X}(y_i));
during training, the loss function adopted is L_full:
L_full = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})
where λ_cyc and λ_id denote the weights of the loss functions L_cyc(G_{X→Y}, G_{Y→X}) and L_id(G_{X→Y}, G_{Y→X}) respectively;
loss function L_adv(G_{X→Y}, D_Y) = E_{y_i~P_Data(y)}[log D_Y(y_i)] + E_{x_i~P_Data(x)}[log(1 - D_Y(G_{X→Y}(x_i)))];
loss function L_adv(G_{Y→X}, D_X) = E_{x_i~P_Data(x)}[log D_X(x_i)] + E_{y_i~P_Data(y)}[log(1 - D_X(G_{Y→X}(y_i)))];
loss function L_cyc(G_{X→Y}, G_{Y→X}) = E_{x_i~P_Data(x)}[||G_{Y→X}(G_{X→Y}(x_i)) - x_i||_1] + E_{y_i~P_Data(y)}[||G_{X→Y}(G_{Y→X}(y_i)) - y_i||_1];
loss function L_id(G_{X→Y}, G_{Y→X}) = E_{y_i~P_Data(y)}[||G_{X→Y}(y_i) - y_i||_1] + E_{x_i~P_Data(x)}[||G_{Y→X}(x_i) - x_i||_1];
where E[·] denotes the mathematical expectation, P_Data(·) denotes the distribution of the object in parentheses, D_Y(y_i) denotes the score of the discriminator D_Y for a real target sample, D_Y(G_{X→Y}(x_i)) its score for a generated target sample, D_X(x_i) the score of the discriminator D_X for a real original sample, D_X(G_{Y→X}(y_i)) its score for a generated original sample, and ||·||_1 denotes the L1 norm;
when the preset convergence condition of the first network parameter training is met, the generator G_{X→Y} is taken as the virtual bass generation network;
step 3: Framing the original audio signal to be converted, and then performing a fast Fourier transform on each single frame, so that the resulting single-frame data match the input of the virtual bass generation network obtained by training in step 2;
inputting the data of each frame into the virtual bass generation network to obtain the network output signal of the current frame;
performing high-pass filtering on the network output signal of each frame to obtain the virtual bass data of each frame, applying an inverse fast Fourier transform to each frame, and splicing the single-frame virtual bass data in time order to obtain the virtual bass signal corresponding to the original audio signal to be converted.
2. The method of claim 1, wherein step 2 further comprises:
in step 202, when the preset convergence condition of the first network parameter training is met, step 203 is executed;
the step 203 comprises:
setting a second training data set:
performing second virtual bass processing on the single-frame original audio signals in the second original audio signal subset according to a preset second virtual bass processing mode to obtain the second virtual bass signal of the current frame; adding the original low-frequency signal of the current frame and the second virtual bass signal to obtain a second reconstructed virtual bass signal of the current frame;
taking the original audio signal of the current frame as training sample data, and taking the second reconstructed virtual bass signal of the current frame as the target data of the training sample, to obtain a second training data set;
performing second network parameter training on the initial virtual bass generation network trained in step 202 based on the second training data set, the loss function adopted during training being L_full; when the preset convergence condition of the second network parameter training is met, the generator G_{X→Y} is taken as the virtual bass generation network;
wherein the second virtual bass processing mode is: generating virtual bass with adjusted parameters on the basis of the nonlinear device.
3. The method of claim 2, wherein, during the first or second network parameter training, the value of the weight λ_id is set to 0 once the number of training iterations reaches a specified number.
4. The method of claim 3, wherein the specified number of training iterations is of the order of 10^4.
5. The method of claim 3 or 4, wherein, while the number of training iterations has not reached the specified number, the weights λ_cyc and λ_id are set to 10 and 5 respectively.
6. The method of claim 1 or 2, characterized in that in step 201 the length of a single frame is 32 ms and the cut-off frequency of the low-pass filtering is 120 Hz.
7. The method of claim 1, wherein the data volume ratio of the first original audio signal subset to the second original audio signal subset is 7:1.
CN202110468881.XA 2021-04-28 2021-04-28 Virtual bass conversion method based on generation network Expired - Fee Related CN113205794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110468881.XA CN113205794B (en) 2021-04-28 2021-04-28 Virtual bass conversion method based on generation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110468881.XA CN113205794B (en) 2021-04-28 2021-04-28 Virtual bass conversion method based on generation network

Publications (2)

Publication Number Publication Date
CN113205794A CN113205794A (en) 2021-08-03
CN113205794B (en) 2022-10-14

Family

ID=77029771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110468881.XA Expired - Fee Related CN113205794B (en) 2021-04-28 2021-04-28 Virtual bass conversion method based on generation network

Country Status (1)

Country Link
CN (1) CN113205794B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964190A (en) * 2009-07-24 2011-02-02 敦泰科技(深圳)有限公司 Method and device for restoring signal under speaker cut-off frequency to original sound
CN102354500A (en) * 2011-08-03 2012-02-15 华南理工大学 Virtual bass boosting method based on harmonic control
CN105632509A (en) * 2014-11-07 2016-06-01 Tcl集团股份有限公司 Audio processing method and audio processing device
CN106653049A (en) * 2015-10-30 2017-05-10 国光电器股份有限公司 Addition of virtual bass in time domain
CN108877832A (en) * 2018-05-29 2018-11-23 东华大学 A kind of audio sound quality also original system based on GAN
CN110459232A (en) * 2019-07-24 2019-11-15 浙江工业大学 A kind of phonetics transfer method generating confrontation network based on circulation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8971551B2 (en) * 2009-09-18 2015-03-03 Dolby International Ab Virtual bass synthesis using harmonic transposition
CN110832881B (en) * 2017-07-23 2021-05-28 波音频有限公司 Stereo virtual bass enhancement
DE102018121309A1 (en) * 2018-08-31 2020-03-05 Sennheiser Electronic Gmbh & Co. Kg Method and device for audio signal processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964190A (en) * 2009-07-24 2011-02-02 敦泰科技(深圳)有限公司 Method and device for restoring signal under speaker cut-off frequency to original sound
CN102354500A (en) * 2011-08-03 2012-02-15 华南理工大学 Virtual bass boosting method based on harmonic control
CN105632509A (en) * 2014-11-07 2016-06-01 Tcl集团股份有限公司 Audio processing method and audio processing device
CN106653049A (en) * 2015-10-30 2017-05-10 国光电器股份有限公司 Addition of virtual bass in time domain
CN108877832A (en) * 2018-05-29 2018-11-23 东华大学 A kind of audio sound quality also original system based on GAN
CN110459232A (en) * 2019-07-24 2019-11-15 浙江工业大学 A kind of phonetics transfer method generating confrontation network based on circulation

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
"A hybrid virtual bass system for optimized steady-state and transient performance";Hill A J;《Computer Science & Electronic Engineering Conference》;20101231;全文 *
"A hybrid virtual bass system with improved phase vocoder and high efficiency";Zhang S;《International Symposium on Chinese Spoken Language Processing》;20141231;全文 *
"Analytical and Perceptual Evaluation of Nonlinear Devices for Virtual Bass System";Oo N;《audio engineering society convention》;20101231;全文 *
"Synthesis and Implementation of Virtual Bass System with a Phase-Vocoder Approach";Bai M;《Journal of the Audio Engineering Society》;20061231;全文 *
"The Effect of MaxxBass Psychoacoustic Bass Enhancement on Loudspeaker Design";Ben-Tzur D;《 Preprint of Aes Convention Munic. audio Eng.soc》;19991231;全文 *
"Virtual bass system based on a multiband harmonic generation";Lee T;《IEEE International Conference on Consumer Electronics》;20131231;全文 *
"基于谐波控制的虚拟低音算法";吴东海;《中国优秀硕士学位论文全文数据库信息科技辑》;20130115;全文 *
"虚拟低音的研究与实现";郑荣辉;《中国优秀硕士学位论文全文数据库信息科技辑》;20160915;全文 *
"虚拟低音算法的设计与实现";王红梅;《电声技术》;20141231;全文 *

Also Published As

Publication number Publication date
CN113205794A (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN110619885B (en) Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN107358966B (en) No-reference speech quality objective assessment method based on deep learning speech enhancement
CN105741849B (en) The sound enhancement method of phase estimation and human hearing characteristic is merged in digital deaf-aid
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN104485114B (en) A kind of method of the voice quality objective evaluation based on auditory perception property
Kaya et al. A temporal saliency map for modeling auditory attention
CN108490349A (en) Motor abnormal sound detection method based on Mel frequency cepstral coefficients
CN107767859A (en) The speaker's property understood detection method of artificial cochlea's signal under noise circumstance
CN108615533A (en) A kind of high-performance sound enhancement method based on deep learning
CN112185410B (en) Audio processing method and device
CN109920439A (en) The variable-speed motor that subtracts based on tone energy and human ear frequency selectivity is uttered long and high-pitched sounds evaluation method
CN109473091A (en) A kind of speech samples generation method and device
CN103413557A (en) Voice signal bandwidth expansion method and device thereof
EP0997003A2 (en) A method of noise reduction in speech signals and an apparatus for performing the method
CN113205794B (en) Virtual bass conversion method based on generation network
Shifas et al. A non-causal FFTNet architecture for speech enhancement
US6453253B1 (en) Impulse response measuring method
CN112837670B (en) Speech synthesis method and device and electronic equipment
CN113066466A (en) Audio injection regulation sound design method based on band-limited noise
CN111816208B (en) Voice separation quality assessment method, device and computer storage medium
CN103971697B (en) Sound enhancement method based on non-local mean filtering
Sabin et al. A method for rapid personalization of audio equalization parameters
Lei et al. A low-latency hybrid multi-channel speech enhancement system for hearing aids
Marolt Adaptive oscillator networks for partial tracking and piano music transcription
Stahl et al. SIDIQ: Computational Quality Assessment of Enhanced Speech Based on Auditory Figure-Ground Segregation, Similarity, and Disturbance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221014

CF01 Termination of patent right due to non-payment of annual fee