CN113205794B - Virtual bass conversion method based on generation network - Google Patents

Virtual bass conversion method based on generation network

Info

Publication number
CN113205794B
CN113205794B (application number CN202110468881.XA)
Authority
CN
China
Prior art keywords
virtual bass
training
data
network
signal
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110468881.XA
Other languages
Chinese (zh)
Other versions
CN113205794A (en)
Inventor
史创
郭嘉祺
杨浩聪
陶盛奇
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110468881.XA
Publication of CN113205794A
Application granted
Publication of CN113205794B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a virtual bass conversion method based on a generative network and belongs to the technical field of audio processing. An initial virtual bass generation network with two generators and two discriminators is set up on the basis of a cycle-consistent generative network; the initial network is trained on a prepared training set, and when the convergence condition is met the first generator is taken as the virtual bass generation network. The original audio data to be converted are then input into this network, and the conversion result is obtained from its output. The time-domain waveform of the virtual bass generated by the invention is nearly identical in bass contour to that generated by the traditional method. Moreover, once the virtual bass generation network has been trained, the invention requires no complicated parameter setting or adjustment at each generation.

Description

Virtual bass conversion method based on generation network
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a virtual bass conversion method.
Background
Owing to limitations of the manufacturing process, ordinary loudspeakers have a strictly limited working bandwidth, particularly for the low-frequency part of the sound signal, so they cannot reproduce the full frequency band of the original signal without loss. Low-cost loudspeakers still dominate today's diverse loudspeaker systems, so a general solution, or at least a suitable substitute, is needed for the frequency components that fall outside the bandwidth limit. Against this background, a technique known as virtual bass has been developed.
Virtual bass, also known as the "missing fundamental", was originally proposed by J. C. R. Licklider in the 1951 paper "A Duplex Theory of Pitch Perception". This psychoacoustic study shows that the human auditory system can perceive a bass fundamental frequency from the higher harmonics of the fundamental component of a sound signal. For example, if a person listens to a harmonic series with frequencies of 200 Hz, 300 Hz and 400 Hz, the brain effectively perceives the 100 Hz common difference between them, i.e. the desired fundamental frequency component.
Virtual bass technology was first implemented with nonlinear device (NLD) systems, of which the most widely used is MaxxBass, proposed by D. Ben-Tzur in the 1999 paper "The Effect of the MaxxBass Psychoacoustic Bass Enhancement System on Loudspeaker Design". Referring to fig. 1, the input signal is first passed through a low-pass filter (LPF) to obtain the desired low-frequency component, and this low-pass signal is then processed by a nonlinear device (NLD) to generate harmonic components. The successive harmonic components are passed through a band-pass filter (BPF) to retain the appropriate frequency band and are amplified by a gain G. The processed signals are then superimposed on the delay-compensated original signal and output.
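For illustration, the signal chain of fig. 1 can be sketched in a few lines of Python. This is only a minimal sketch of the generic NLD approach: the filter orders, cutoff and band values, gain and the arctangent nonlinearity are assumptions made for illustration, not the parameters of MaxxBass or of the method claimed in this patent.

```python
import numpy as np
from scipy.signal import butter, lfilter

def nld_virtual_bass(x, fs, f_cut=120.0, band=(120.0, 480.0), gain=2.0):
    # 1. low-pass filter: extract the low-frequency component below the speaker cutoff
    b_lp, a_lp = butter(4, f_cut / (fs / 2), btype="lowpass")
    low = lfilter(b_lp, a_lp, x)
    # 2. nonlinear device: creates harmonics of the low-frequency content
    harmonics = np.arctan(4.0 * low) / np.arctan(4.0)
    # 3. band-pass filter: keep only harmonics the speaker can reproduce, then apply gain G
    b_bp, a_bp = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    harmonics = gain * lfilter(b_bp, a_bp, harmonics)
    # 4. superimpose on the original signal (delay compensation omitted in this sketch)
    return x + harmonics
```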
Later, M. R. Bai proposed in the 2006 paper "Synthesis and Implementation of Virtual Bass System with a Phase-Vocoder Approach" to generate the harmonics with a phase vocoder instead of a nonlinear device. The phase vocoder samples the input signal with a short time window and performs the relevant processing after a fast Fourier transform (FFT), which effectively preserves the phase consistency of the signal. Compared with a nonlinear device system, the phase vocoder acts almost entirely in the frequency domain of the signal, so intermodulation distortion in the system output can be effectively avoided. Its drawback, however, is that it is difficult to balance the resolution trade-off between the time domain and the frequency domain: f_res = 1/t_w, where f_res is the frequency-domain resolution (Hz) and t_w is the selected window length. Since virtual bass conversion places high demands on frequency-domain resolution (for example, a 32 ms window gives only 31.25 Hz resolution, while a 100 ms window gives 10 Hz resolution at the cost of time resolution), the choice of window length is in practice a difficult matter.
Finally, A. J. Hill of the University of Essex, England, proposed a hybrid virtual bass approach that mixes the two previous signal-processing methods. The method combines the high sensitivity of the nonlinear device system to time-domain signal changes with the good performance of the phase vocoder on non-transient signals by means of a transient content detector. In other words, the so-called hybrid system essentially assigns weights to the outputs of the two subsystems within the same time window. Although the hybrid virtual bass system effectively combines the advantages of the nonlinear device system and the phase vocoder, it is time-inefficient and requires a large number of parameters to be set, which has directly prevented the technique from being widely used.
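The weighting idea can be sketched as follows. The energy-flux transient measure and the clipping of the weight are illustrative assumptions and are not Hill's actual transient content detector; nld_out and pv_out stand for the outputs of the two subsystems for the current window.

```python
import numpy as np

def hybrid_mix(frame, prev_frame, nld_out, pv_out):
    # crude transient measure: relative energy change between consecutive windows
    e_prev = np.sum(prev_frame ** 2) + 1e-12
    flux = abs(np.sum(frame ** 2) - e_prev) / e_prev
    w = float(np.clip(flux, 0.0, 1.0))   # transient-heavy windows lean on the NLD branch
    return w * nld_out + (1.0 - w) * pv_out
```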
Disclosure of Invention
The invention aims, in view of the above problems, to provide a method that implements virtual bass conversion with a generative network, thereby simplifying the set-up of virtual bass and reducing its processing time.
The virtual bass conversion method based on the generation network comprises the following steps:
step 1: Setting the network structure of an initial virtual bass generation network on the basis of a cycle generation network:
the initial virtual bass generation network comprises a generator G_{X→Y}, a generator G_{Y→X}, a discriminator D_X and a discriminator D_Y; the generator G_{X→Y} is connected to the generator G_{Y→X} and to the discriminator D_Y, and the generator G_{Y→X} is connected to the discriminator D_X and to the discriminator D_Y, where X denotes the feature space of the input data and Y denotes the feature space of the output data;
step 2: Deep learning training is carried out on the initial virtual bass generation network:
step 201: Setting a first training data set:
collecting an original audio signal set, wherein the original audio signal set comprises a plurality of frames of original audio signals;
performing a fast Fourier transform on the original audio signal of the current frame to obtain a frequency-domain signal, and low-pass filtering the frequency-domain signal with a preset cut-off frequency to obtain the original low-frequency signal;
according to a preset first virtual bass processing mode (a traditional, hardware-oriented virtual bass processing mode), performing first virtual bass processing on the original audio signal of the current frame to obtain a first virtual bass signal;
adding the original low-frequency signal of the current frame and the first virtual bass signal to obtain a first reconstructed virtual bass signal of the current frame;
taking the original audio signal of the current frame as training sample data, and taking the first reconstructed virtual bass signal of the current frame as the target data of the training sample, to obtain a first training data set;
step 202: Performing first network parameter training on the initial virtual bass generation network based on the first training data set:
the current training sample data x_i are input into the generator G_{X→Y} and the discriminator D_X respectively;
the training sample data pass through the generator G_{X→Y} to give the generated audio G_{X→Y}(x_i), which is then input into the generator G_{Y→X} and the discriminator D_Y respectively;
the generated audio G_{X→Y}(x_i) passes through the generator G_{Y→X} to give the generated audio G_{Y→X}(G_{X→Y}(x_i));
the target data y_i of the current training sample are input into the discriminator D_Y and the generator G_{Y→X} respectively, and pass through the generator G_{Y→X} to give the generated audio G_{Y→X}(y_i);
the generated audio G_{Y→X}(y_i) is input into the generator G_{X→Y} and the discriminator D_X respectively, and passes through the generator G_{X→Y} to give the generated audio G_{X→Y}(G_{Y→X}(y_i));
where the discriminator D_X is used to judge whether there is a difference between the generated audio G_{Y→X}(y_i) and the training sample data x_i, and the discriminator D_Y is used to judge whether there is a difference between G_{X→Y}(x_i) and the target data y_i;
during training, the loss function adopted is L_full:
L_full = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})
where λ_cyc and λ_id denote the weights of the loss functions L_cyc(G_{X→Y}, G_{Y→X}) and L_id(G_{X→Y}, G_{Y→X}) respectively;
loss function L_adv(G_{X→Y}, D_Y) = E_{y_i~P_Data(y)}[log D_Y(y_i)] + E_{x_i~P_Data(x)}[log(1 - D_Y(G_{X→Y}(x_i)))];
loss function L_adv(G_{Y→X}, D_X) = E_{x_i~P_Data(x)}[log D_X(x_i)] + E_{y_i~P_Data(y)}[log(1 - D_X(G_{Y→X}(y_i)))];
loss function L_cyc(G_{X→Y}, G_{Y→X}) = E_{x_i~P_Data(x)}[||G_{Y→X}(G_{X→Y}(x_i)) - x_i||_1] + E_{y_i~P_Data(y)}[||G_{X→Y}(G_{Y→X}(y_i)) - y_i||_1];
loss function L_id(G_{X→Y}, G_{Y→X}) = E_{y_i~P_Data(y)}[||G_{X→Y}(y_i) - y_i||_1] + E_{x_i~P_Data(x)}[||G_{Y→X}(x_i) - x_i||_1];
where E[·] denotes the mathematical expectation, P_Data(·) denotes the distribution of the object in parentheses, D_Y(y_i) denotes the discriminator's score for a real target sample, D_Y(G_{X→Y}(x_i)) its score for a generated target sample, D_X(x_i) the discriminator's score for a real original sample, D_X(G_{Y→X}(y_i)) its score for a generated original sample, and ||·||_1 denotes the L1 norm;
when the preset convergence condition of the first network parameter training is met, the generator G_{X→Y} is taken as the virtual bass generation network;
step 3: Framing the original audio signal to be converted, and then performing a fast Fourier transform on each single frame, so that the resulting single-frame data match the input of the virtual bass generation network obtained by training in step 2;
inputting the data of each frame into the virtual bass generation network to obtain the network output signal of the current frame;
performing high-pass filtering on the network output signal of each frame to obtain the virtual bass data of each frame, applying an inverse fast Fourier transform to each frame, and splicing the single-frame virtual bass data in time order to obtain the virtual bass signal corresponding to the original audio signal to be converted.
Further, step 2 of the present invention further comprises:
step 201 further includes dividing the original audio signal set into two parts of unequal data volume, marking the part with the larger data volume as the first original audio signal subset and the part with the smaller data volume as the second original audio signal subset;
in step 202, the first virtual bass processing is performed only on each original audio signal in the first original audio signal subset to obtain the first virtual bass signals; and when the preset convergence condition of the first network parameter training is met, step 203 is executed;
the step 203 comprises:
setting a second training data set:
performing second virtual bass processing on the second original audio signal subset according to a preset second virtual bass processing mode (implemented with hybrid virtual bass parameters, e.g. manually adjusted versions of the existing hybrid virtual bass parameters) to obtain the second virtual bass signal of the current frame; adding the original low-frequency signal of the current frame and the second virtual bass signal to obtain a second reconstructed virtual bass signal of the current frame;
taking the original audio signal of the current frame as training sample data, and taking the second reconstructed virtual bass signal of the current frame as the target data of the training sample, to obtain a second training data set;
using the second training data set to perform second network parameter training (i.e. transfer learning) on the initial virtual bass generation network trained in step 202, where during training the data processing procedure is the same as in step 202 and only the training sample data are changed;
that is, the loss function used during training is still L_full; when the preset convergence condition of the second network parameter training is met, the generator G_{X→Y} is taken as the virtual bass generation network.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
The time-domain waveform of the virtual bass generated by the invention is nearly identical in bass contour to that generated by the traditional method. The invention greatly shortens the time required for signal processing, overcomes the poor real-time performance of existing virtual bass techniques and thus widens their range of application. In addition, whereas the traditional virtual bass techniques require tedious parameter setting and adjustment, the invention only needs the trained virtual bass generation network and requires no parameter setting or adjustment at each generation: an original audio signal is input into the trained network, and the corresponding virtual bass signal is obtained from its output.
Drawings
FIG. 1 is a conventional virtual bass process flow;
FIG. 2 is a schematic diagram of a processing procedure for processing audio data by using a CycleGan network according to an embodiment;
FIG. 3 is a schematic diagram of a forward-reverse network cycle consistency loss in an embodiment;
FIG. 4 is a flowchart of the virtual bass generation method of the present invention;
FIG. 5 is a time domain waveform of an original signal in an embodiment;
FIG. 6 is a time domain waveform of the virtual bass signal generated by the generative adversarial network in the embodiment;
FIG. 7 is a time domain waveform of the virtual bass signal generated by the conventional processing method in the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
In the present invention, the generation network is used to learn a mapping relationship from input x to output y. For each input audio x_i ∈ X, the invention finds a corresponding sample y_i ∈ Y, where i denotes the index within a set of parallel data, X denotes the feature space of the input data and Y denotes the feature space of the output data.
Referring to fig. 2, in this embodiment the processing of audio data with the CycleGAN network during training is as follows:
The original audio x is input into the generator G_{X→Y} to obtain the generated audio G(x), and G(x) is input into the generator G_{Y→X} to obtain the generated audio Cyclic x; the audio y generated by the conventional method is input into the generator G_{Y→X} to obtain the generated audio G(y), and G(y) is input into the generator G_{X→Y} to obtain the generated audio Cyclic y. At the same time, the original audio x and the generated audio G(y) are both input into the discriminator D_X, which is used to distinguish them and output their difference; and the generated audio G(x) and the audio y are both input into the discriminator D_Y, which is used to judge the difference between them and output it.
The loss function adopted mainly comprises two parts: the adversarial loss and the cycle-consistency loss.
As a measure of the distinction between the converted data G_{X→Y}(x_i) and the target data y_i, the adversarial loss is an important quantity and can be expressed as:
L_adv(G_{X→Y}, D_Y) = E_{y_i~P_Data(y)}[log D_Y(y_i)] + E_{x_i~P_Data(x)}[log(1 - D_Y(G_{X→Y}(x_i)))]   (1)
where G_{X→Y} denotes the mapping function from X to Y, i.e. the generator from X to Y, whose output (the generated audio) is written G_{X→Y}(·); D_Y denotes the discriminator for Y, whose output is written D_Y(·); E[·] denotes the mathematical expectation, and P_Data(·) denotes the distribution of the object in parentheses. That is, the generator G_{X→Y} attempts to generate a new audio signal G(x_i) that sounds like the corresponding audio y_i pre-processed by the virtual bass technique, while the discriminator D_Y attempts to distinguish the generated audio signal G(x_i) from the real signal y_i; in other words, the generator is trained to minimise L_adv(G_{X→Y}, D_Y) while the discriminator is trained to maximise it.
In the same way, the adversarial loss L_adv(G_{Y→X}, D_X) of the reverse mapping is obtained:
L_adv(G_{Y→X}, D_X) = E_{x_i~P_Data(x)}[log D_X(x_i)] + E_{y_i~P_Data(y)}[log(1 - D_X(G_{Y→X}(y_i)))]   (2)
where G_{Y→X} denotes the generator from Y to X, whose output is written G_{Y→X}(·), and D_X denotes the discriminator for X, whose output is written D_X(·).
Because of the limitations of the adversarial loss alone, there is no guarantee that the network maps an individual input x_i onto the desired output y_i. Therefore, to further narrow the space of possible mappings, the CycleGAN network introduces a cycle-consistency loss, which requires that the mapping functions G_{X→Y} and G_{Y→X} be cycle-consistent, as shown in FIG. 3.
The present invention expresses this relationship as:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x_i~P_Data(x)}[||G_{Y→X}(G_{X→Y}(x_i)) - x_i||_1] + E_{y_i~P_Data(y)}[||G_{X→Y}(G_{Y→X}(y_i)) - y_i||_1]   (3)
where ||·||_1 denotes the L1 norm.
In addition, in order to preserve linguistic information, the identity-mapping loss L_id used in the existing CycleGAN-VC is introduced:
L_id(G_{X→Y}, G_{Y→X}) = E_{y_i~P_Data(y)}[||G_{X→Y}(y_i) - y_i||_1] + E_{x_i~P_Data(x)}[||G_{Y→X}(x_i) - x_i||_1]   (4)
In summary, combining equations (1)-(4), the invention sets the total loss function L_full of the network as:
L_full = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})   (5)
where λ_cyc and λ_id are two weight parameters that regulate the relative importance of the cycle-consistency loss L_cyc and the identity-mapping loss L_id within the total network loss L_full.
When the network converges (the number of training iterations reaches the preset maximum, or the total network loss reaches the specified value), the audio data to be processed can be input into the generator G_{X→Y}, and the converted virtual bass data are obtained from its output.
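For concreteness, the total loss of equation (5) could be evaluated along the lines of the following PyTorch sketch. It is only an illustration under stated assumptions: the discriminators are assumed to output probabilities in (0, 1), the default weights follow the embodiment described below (λ_cyc = 10, λ_id = 5), and in an actual training loop the generators and discriminators would optimise the adversarial terms in opposite directions rather than share one scalar objective.

```python
import torch

def full_loss(G_xy, G_yx, D_x, D_y, x, y, lambda_cyc=10.0, lambda_id=5.0, eps=1e-8):
    fake_y = G_xy(x)   # G_{X->Y}(x_i)
    fake_x = G_yx(y)   # G_{Y->X}(y_i)
    # adversarial terms, eq. (1) and (2) (discriminator outputs assumed to lie in (0, 1))
    l_adv_xy = torch.mean(torch.log(D_y(y) + eps)) + torch.mean(torch.log(1 - D_y(fake_y) + eps))
    l_adv_yx = torch.mean(torch.log(D_x(x) + eps)) + torch.mean(torch.log(1 - D_x(fake_x) + eps))
    # cycle-consistency loss, eq. (3): X->Y->X and Y->X->Y must return to the input (L1 norm)
    l_cyc = torch.mean(torch.abs(G_yx(fake_y) - x)) + torch.mean(torch.abs(G_xy(fake_x) - y))
    # identity-mapping loss, eq. (4)
    l_id = torch.mean(torch.abs(G_xy(y) - y)) + torch.mean(torch.abs(G_yx(x) - x))
    return l_adv_xy + l_adv_yx + lambda_cyc * l_cyc + lambda_id * l_id
```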
Referring to fig. 4, on the basis of the network configuration described above, the specific process of generating virtual bass with the configured generative adversarial network is as follows:
preprocessing training set data:
in order to improve the quality of the generated audio and to shorten the time required for training as much as possible.
The invention first randomly extracts 100 samples from the original data (original audio signal X [ n ]) as test data, which will not participate in the training.
The remaining data are then divided in a ratio of 7:1 into two data sets, one large and one small. Virtual bass is generated from the large data set, i.e. the major part of the data, in the manner shown in fig. 1 with default parameter settings, giving the output x_di[n]; virtual bass is generated from the remaining small data set with adjusted parameters, giving the output x_ai[n].
The adjusted parameters used to generate virtual bass may be: the nonlinear equation adopted by the harmonic generator is changed from an exponential function to an arctangent square root, and the highest harmonic component (start harm) of the phase vocoder (PV) is raised to ensure that satisfactory high-frequency harmonics can be generated.
In other words, in the embodiment of the invention, in order to improve the accuracy of the network, two training data sets are obtained, one with the nonlinear-device-based method and one with the parameter-adjusted hybrid virtual bass method; they correspond to the first and second network parameter training of the generation network, where the first network parameter training is conventional training and the second is transfer-learning training. Of these two ways of obtaining virtual bass, the hybrid method is more accurate than the purely nonlinear-device-based one, so the virtual bass obtained with the more accurate method is used as the target audio y during transfer learning, and the virtual bass obtained with the less accurate method is used as the target audio y during the first network parameter training.
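The data organisation just described can be sketched as follows. The 100-sample test split and the 7:1 ratio come from this embodiment, while process_default and process_adjusted are hypothetical stand-ins for the default-parameter processing of fig. 1 and the parameter-adjusted hybrid processing, which are not implemented here.

```python
import random

def split_and_process(samples, process_default, process_adjusted, n_test=100, ratio=7):
    random.shuffle(samples)                               # shuffles the sample list in place
    test, rest = samples[:n_test], samples[n_test:]       # 100 test samples, excluded from training
    n_large = len(rest) * ratio // (ratio + 1)            # 7:1 split of the remaining data
    large, small = rest[:n_large], rest[n_large:]
    x_d = [process_default(s) for s in large]             # x_di[n]: default-parameter processing
    x_a = [process_adjusted(s) for s in small]            # x_ai[n]: adjusted-parameter processing
    return test, (large, x_d), (small, x_a)
```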
Then x_di[n] and x_ai[n] are fast-Fourier-transformed (FFT) to obtain X_di[k] and X_ai[k]. In this embodiment the cut-off frequency is f_cutoff = 120 Hz, the sampling rate of all audio is adjusted to 8000 Hz, and the single-frame length is 32 ms. It should be noted that in this example the frequency components of an audio sample below the 120 Hz cut-off frequency constitute the low-frequency component X_l.
Then, a single frame x[n] of an audio sample is fast-Fourier-transformed to obtain the frequency-domain signal X[k]; the corresponding low-frequency component X_li[k] (the band below the cut-off frequency) is extracted and added to X_ai[k] and to X_di[k] respectively, giving the target generation signals of the two final data sets. This supplements the low-frequency part missing from X_ai[k] and X_di[k] and eases the convergence of the network: since the low-frequency component is zero everywhere above the cut-off frequency, the element-wise addition simply restores the low frequencies that the audio loses during virtual bass processing, which keeps the network numerically stable during training. That is, in this embodiment X[k] is used as the first original input signal x fed to the generator G_{X→Y} to obtain the generated audio G(x) and is also fed to the discriminator D_X, while the sum of the low-frequency component X_li[k] and X_di[k] is used as the corresponding target virtual bass signal, i.e. the generated audio y obtained by the traditional method, thereby realising the first network parameter training of the generation network. Likewise, X[k] is used as the original input signal in the second training stage, in which the target audio y is passed through the generator G_{Y→X} to obtain the generated audio G(y), and the sum of X_li[k] and X_ai[k] is used as the corresponding real audio, i.e. the generated audio y obtained with the parameter-adjusted hybrid virtual bass method, thereby realising the second network parameter training of the generation network.
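A minimal sketch of this low-frequency supplementation, using the concrete values of the embodiment (8000 Hz sampling rate, 256-point/32 ms frames, 120 Hz cutoff), might look as follows; x_frame is one frame of the original audio, vb_frame the corresponding virtual-bass-processed frame, and the complex one-sided spectrum is an assumed representation rather than one fixed by the text.

```python
import numpy as np

FS, FRAME_LEN, F_CUTOFF = 8000, 256, 120.0   # 8000 Hz, 32 ms frames (256 points), 120 Hz cutoff

def make_target(x_frame, vb_frame):
    freqs = np.fft.rfftfreq(FRAME_LEN, 1.0 / FS)      # bin spacing 8000/256 = 31.25 Hz
    X = np.fft.rfft(x_frame)                           # frequency-domain signal X[k]
    X_low = np.where(freqs < F_CUTOFF, X, 0.0)         # low-frequency component X_li[k]
    # element-wise addition: supplement the low frequencies lost in virtual bass processing
    return X_low + np.fft.rfft(vb_frame)
```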
Training: first, the large data set is used as the training set and 3 rounds of training are performed, each round comprising 300 epochs. In this example the entire audio signal is not used directly; instead, data segments of fixed frame length (256 points) are extracted from the paired data.
In addition, in this embodiment λ_cyc and λ_id are set to 10 and 5 respectively, and λ_id is used only during the first 10^4 iterations to guide the network training, after which it is set to zero to further reduce the amount of computation. During training, the optimizer is Adam and the batch size is set to 1; the learning rate of the generators is 0.0005 with a learning-rate decay of 2.5×10^-9, and the learning rate of the discriminators is 0.0001 with a learning-rate decay of 5×10^-10. During the first 2×10^5 iterations of each round the learning rate is kept at its initial value, after which it decays linearly to 0, which gives a good convergence effect. On the basis of the model so obtained, transfer learning is carried out with the small data set (generated with the traditional, manually tuned parameters) to obtain the final network model, and the trained generator G_{X→Y} is taken as the final virtual bass generation network.
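The optimiser settings and schedules described above could be organised roughly as in the sketch below. Interpreting the stated decay values as per-iteration decrements applied after the 2×10^5-iteration hold, and the exact handling of the λ_id cutoff, are assumptions of this sketch rather than details fixed by the text.

```python
import torch

def make_optimizers(generator_params, discriminator_params):
    # Adam, batch size 1; generator lr 0.0005, discriminator lr 0.0001 as stated above
    opt_g = torch.optim.Adam(generator_params, lr=5e-4)
    opt_d = torch.optim.Adam(discriminator_params, lr=1e-4)
    return opt_g, opt_d

def lr_at(step, base_lr, decay_per_step, hold_steps=200_000):
    # constant for the first 2e5 iterations of a round, then decreased linearly towards 0
    if step < hold_steps:
        return base_lr
    return max(0.0, base_lr - decay_per_step * (step - hold_steps))

def lambda_id_at(step, lambda_id=5.0, cutoff=10_000):
    # the identity-mapping weight is used only during the first 1e4 iterations
    return lambda_id if step < cutoff else 0.0
```

For example, under these assumptions lr_at(250_000, 5e-4, 2.5e-9) returns 3.75×10^-4 for the generators.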
Finally, the test data are input into the trained virtual bass generation network to obtain the corresponding network output signals, the network output signals are high-pass filtered (removing the low-frequency part), and the result is restored from the frequency domain to the time domain to obtain the virtual bass signal of the test data. This high-pass filtering of the network output is essential because virtual bass is perceived as low frequency even though its energy actually lies in higher-frequency harmonics.
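The conversion stage can be sketched end to end as follows; generator stands for the trained virtual bass generation network applied to single-frame frequency-domain data, and the non-overlapping 256-point frames and the complex-spectrum interface are simplifying assumptions of this sketch.

```python
import numpy as np

def convert(x, fs, generator, frame_len=256, f_cutoff=120.0):
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    out = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        X = np.fft.rfft(x[start:start + frame_len])   # single-frame frequency-domain data
        Y = generator(X)                               # trained virtual bass generation network
        Y = np.where(freqs > f_cutoff, Y, 0.0)         # high-pass: keep only the virtual bass part
        out.append(np.fft.irfft(Y, n=frame_len))       # back to the time domain
    return np.concatenate(out)                         # splice the frames in time order
```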
To verify the generation performance of the invention, the pre-separated verification data are processed both by the conventional method (hybrid virtual bass) and by the generation network of the invention; after removing the low-frequency components from the resulting virtual bass data, audio signals for evaluating the virtual bass conversion performance are obtained. The time-domain waveforms of the generated audio are shown in fig. 7 and fig. 6 respectively. Comparing them with the original signal in fig. 5 and with the conventional method shows that the time-domain waveform of the virtual bass generated by the invention is nearly identical in bass contour to that generated by the conventional method. Although the sound quality can still be improved, processing a 10-second audio signal takes about 40 seconds with the conventional method but only about 3 seconds with the invention; the invention therefore greatly shortens the signal-processing time, overcomes the poor real-time performance of existing virtual bass techniques and widens their range of application. In addition, the conventional virtual bass techniques require tedious parameter setting and adjustment and demand a certain level of expertise from the user. In summary, the audio signal processing method provided by the invention has considerable practical value.
What is described above is merely an embodiment of the invention. Unless stated otherwise, any feature disclosed in this specification may be replaced by an alternative feature serving an equivalent or similar purpose; all of the disclosed features, or all of the method or process steps, may be combined in any way, except for mutually exclusive features and/or steps.

Claims (7)

1. The virtual bass conversion method based on the generation network is characterized by comprising the following steps:
step 1: Setting the network structure of an initial virtual bass generation network on the basis of a cycle generation network:
the virtual bass generation network comprises a generator G_{X→Y}, a generator G_{Y→X}, a discriminator D_X and a discriminator D_Y; the generator G_{X→Y} is connected to the generator G_{Y→X} and to the discriminator D_Y, and the generator G_{Y→X} is connected to the discriminator D_X and to the discriminator D_Y, where X denotes the feature space of the input data and Y denotes the feature space of the output data;
step 2: Carrying out deep learning training on the initial virtual bass generation network:
step 201: Setting a first training data set:
collecting an original audio signal set, wherein the original audio signal set comprises a plurality of frames of original audio signals;
performing a fast Fourier transform on the original audio signal of the current frame to obtain a frequency-domain signal, and low-pass filtering the frequency-domain signal with a preset cut-off frequency to obtain the original low-frequency signal;
dividing the original audio signal set into two parts of unequal data volume, marking the part with the larger data volume as a first original audio signal subset and the part with the smaller data volume as a second original audio signal subset;
according to a preset first virtual bass processing mode, performing first virtual bass processing on the original audio signals of the current frame in the first original audio signal subset to obtain first virtual bass signals;
adding the original low-frequency signal of the current frame and the first virtual bass signal to obtain a first reconstructed virtual bass signal of the current frame;
taking the original audio signal of the current frame as training sample data, and taking the first reconstructed virtual bass signal of the current frame as the target data of the training sample, to obtain a first training data set;
wherein the first virtual bass processing mode is: generating virtual bass with default parameter settings based on the nonlinear device;
step 202: Performing first network parameter training on the initial virtual bass generation network based on the first training data set:
the current training sample data x_i are input into the generator G_{X→Y} and the discriminator D_X respectively;
the training sample data pass through the generator G_{X→Y} to give the generated audio G_{X→Y}(x_i), which is then input into the generator G_{Y→X} and the discriminator D_Y respectively;
the generated audio G_{X→Y}(x_i) passes through the generator G_{Y→X} to give the generated audio G_{Y→X}(G_{X→Y}(x_i));
the target data y_i of the current training sample are input into the discriminator D_Y and the generator G_{Y→X} respectively, and pass through the generator G_{Y→X} to give the generated audio G_{Y→X}(y_i);
the generated audio G_{Y→X}(y_i) is input into the generator G_{X→Y} and the discriminator D_X respectively, and passes through the generator G_{X→Y} to give the generated audio G_{X→Y}(G_{Y→X}(y_i));
during training, the loss function adopted is L_full:
L_full = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X) + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})
where λ_cyc and λ_id denote the weights of the loss functions L_cyc(G_{X→Y}, G_{Y→X}) and L_id(G_{X→Y}, G_{Y→X}) respectively;
loss function L_adv(G_{X→Y}, D_Y) = E_{y_i~P_Data(y)}[log D_Y(y_i)] + E_{x_i~P_Data(x)}[log(1 - D_Y(G_{X→Y}(x_i)))];
loss function L_adv(G_{Y→X}, D_X) = E_{x_i~P_Data(x)}[log D_X(x_i)] + E_{y_i~P_Data(y)}[log(1 - D_X(G_{Y→X}(y_i)))];
loss function L_cyc(G_{X→Y}, G_{Y→X}) = E_{x_i~P_Data(x)}[||G_{Y→X}(G_{X→Y}(x_i)) - x_i||_1] + E_{y_i~P_Data(y)}[||G_{X→Y}(G_{Y→X}(y_i)) - y_i||_1];
loss function L_id(G_{X→Y}, G_{Y→X}) = E_{y_i~P_Data(y)}[||G_{X→Y}(y_i) - y_i||_1] + E_{x_i~P_Data(x)}[||G_{Y→X}(x_i) - x_i||_1];
where E[·] denotes the mathematical expectation, P_Data(·) denotes the distribution of the object in parentheses, D_Y(y_i) denotes the score of the discriminator D_Y for a real target sample, D_Y(G_{X→Y}(x_i)) its score for a generated target sample, D_X(x_i) the score of the discriminator D_X for a real original sample, D_X(G_{Y→X}(y_i)) its score for a generated original sample, and ||·||_1 denotes the L1 norm;
when the preset convergence condition of the first network parameter training is met, the generator G_{X→Y} is taken as the virtual bass generation network;
step 3: Framing the original audio signal to be converted, and then performing a fast Fourier transform on each single frame, so that the resulting single-frame data match the input of the virtual bass generation network obtained by training in step 2;
inputting the data of each frame into the virtual bass generation network to obtain the network output signal of the current frame;
performing high-pass filtering on the network output signal of each frame to obtain the virtual bass data of each frame, applying an inverse fast Fourier transform to each frame, and splicing the single-frame virtual bass data in time order to obtain the virtual bass signal corresponding to the original audio signal to be converted.
2. The method of claim 1, wherein step 2 further comprises:
in step 202, when the preset convergence condition of the first network parameter training is met, step 203 is executed;
the step 203 comprises:
setting a second training data set:
performing second virtual bass processing on the single-frame original audio signals in the second original audio signal subset according to a preset second virtual bass processing mode to obtain the second virtual bass signal of the current frame; adding the original low-frequency signal of the current frame and the second virtual bass signal to obtain a second reconstructed virtual bass signal of the current frame;
taking the original audio signal of the current frame as training sample data, and taking the second reconstructed virtual bass signal of the current frame as the target data of the training sample, to obtain a second training data set;
performing second network parameter training on the initial virtual bass generation network trained in step 202 based on the second training data set, the loss function adopted during training being L_full; when the preset convergence condition of the second network parameter training is met, the generator G_{X→Y} is taken as the virtual bass generation network;
wherein the second virtual bass processing mode is: generating virtual bass with adjusted parameters on the basis of the nonlinear device.
3. The method of claim 2, wherein, during the first or second network parameter training, the value of the weight λ_id is set to 0 once the number of training iterations reaches a specified number.
4. The method of claim 3, wherein the specified number of training iterations is of the order of 10^4.
5. The method of claim 3 or 4, wherein, while the number of training iterations has not reached the specified number, the weights λ_cyc and λ_id are set to 10 and 5 respectively.
6. The method of claim 1 or 2, characterized in that in step 201 the length of a single frame is 32 ms and the cut-off frequency of the low-pass filtering is 120 Hz.
7. The method of claim 1, wherein the data volume ratio of the first original audio signal subset to the second original audio signal subset is 7:1.
CN202110468881.XA 2021-04-28 2021-04-28 Virtual bass conversion method based on generation network Expired - Fee Related CN113205794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110468881.XA CN113205794B (en) 2021-04-28 2021-04-28 Virtual bass conversion method based on generation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110468881.XA CN113205794B (en) 2021-04-28 2021-04-28 Virtual bass conversion method based on generation network

Publications (2)

Publication Number Publication Date
CN113205794A CN113205794A (en) 2021-08-03
CN113205794B (en) 2022-10-14

Family

ID=77029771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110468881.XA Expired - Fee Related CN113205794B (en) 2021-04-28 2021-04-28 Virtual bass conversion method based on generation network

Country Status (1)

Country Link
CN (1) CN113205794B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964190A (en) * 2009-07-24 2011-02-02 敦泰科技(深圳)有限公司 Method and device for restoring signal under speaker cut-off frequency to original sound
CN102354500A (en) * 2011-08-03 2012-02-15 华南理工大学 Virtual bass boosting method based on harmonic control
CN105632509A (en) * 2014-11-07 2016-06-01 Tcl集团股份有限公司 Audio processing method and audio processing device
CN106653049A (en) * 2015-10-30 2017-05-10 国光电器股份有限公司 Addition of virtual bass in time domain
CN108877832A (en) * 2018-05-29 2018-11-23 东华大学 A kind of audio sound quality also original system based on GAN
CN110459232A (en) * 2019-07-24 2019-11-15 浙江工业大学 A kind of phonetics transfer method generating confrontation network based on circulation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8971551B2 (en) * 2009-09-18 2015-03-03 Dolby International Ab Virtual bass synthesis using harmonic transposition
CN110832881B (en) * 2017-07-23 2021-05-28 波音频有限公司 Stereo virtual bass enhancement
DE102018121309A1 (en) * 2018-08-31 2020-03-05 Sennheiser Electronic Gmbh & Co. Kg Method and device for audio signal processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964190A (en) * 2009-07-24 2011-02-02 敦泰科技(深圳)有限公司 Method and device for restoring signal under speaker cut-off frequency to original sound
CN102354500A (en) * 2011-08-03 2012-02-15 华南理工大学 Virtual bass boosting method based on harmonic control
CN105632509A (en) * 2014-11-07 2016-06-01 Tcl集团股份有限公司 Audio processing method and audio processing device
CN106653049A (en) * 2015-10-30 2017-05-10 国光电器股份有限公司 Addition of virtual bass in time domain
CN108877832A (en) * 2018-05-29 2018-11-23 东华大学 A kind of audio sound quality also original system based on GAN
CN110459232A (en) * 2019-07-24 2019-11-15 浙江工业大学 A kind of phonetics transfer method generating confrontation network based on circulation

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
"A hybrid virtual bass system for optimized steady-state and transient performance";Hill A J;《Computer Science & Electronic Engineering Conference》;20101231;全文 *
"A hybrid virtual bass system with improved phase vocoder and high efficiency";Zhang S;《International Symposium on Chinese Spoken Language Processing》;20141231;全文 *
"Analytical and Perceptual Evaluation of Nonlinear Devices for Virtual Bass System";Oo N;《audio engineering society convention》;20101231;全文 *
"Synthesis and Implementation of Virtual Bass System with a Phase-Vocoder Approach";Bai M;《Journal of the Audio Engineering Society》;20061231;全文 *
"The Effect of MaxxBass Psychoacoustic Bass Enhancement on Loudspeaker Design";Ben-Tzur D;《 Preprint of Aes Convention Munic. audio Eng.soc》;19991231;全文 *
"Virtual bass system based on a multiband harmonic generation";Lee T;《IEEE International Conference on Consumer Electronics》;20131231;全文 *
"基于谐波控制的虚拟低音算法";吴东海;《中国优秀硕士学位论文全文数据库信息科技辑》;20130115;全文 *
"虚拟低音的研究与实现";郑荣辉;《中国优秀硕士学位论文全文数据库信息科技辑》;20160915;全文 *
"虚拟低音算法的设计与实现";王红梅;《电声技术》;20141231;全文 *

Also Published As

Publication number Publication date
CN113205794A (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN110619885B (en) Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN107358966B (en) No-reference speech quality objective assessment method based on deep learning speech enhancement
CN105741849B (en) The sound enhancement method of phase estimation and human hearing characteristic is merged in digital deaf-aid
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN104485114B (en) A kind of method of the voice quality objective evaluation based on auditory perception property
Kaya et al. A temporal saliency map for modeling auditory attention
CN108490349A (en) Motor abnormal sound detection method based on Mel frequency cepstral coefficients
CN107767859A (en) The speaker's property understood detection method of artificial cochlea's signal under noise circumstance
CN108615533A (en) A kind of high-performance sound enhancement method based on deep learning
CN112185410B (en) Audio processing method and device
CN109920439A (en) The variable-speed motor that subtracts based on tone energy and human ear frequency selectivity is uttered long and high-pitched sounds evaluation method
CN109473091A (en) A kind of speech samples generation method and device
CN103413557A (en) Voice signal bandwidth expansion method and device thereof
EP0997003A2 (en) A method of noise reduction in speech signals and an apparatus for performing the method
CN113205794B (en) Virtual bass conversion method based on generation network
Shifas et al. A non-causal FFTNet architecture for speech enhancement
US6453253B1 (en) Impulse response measuring method
CN112837670B (en) Speech synthesis method and device and electronic equipment
CN113066466A (en) Audio injection regulation sound design method based on band-limited noise
CN111816208B (en) Voice separation quality assessment method, device and computer storage medium
CN103971697B (en) Sound enhancement method based on non-local mean filtering
Sabin et al. A method for rapid personalization of audio equalization parameters
Lei et al. A low-latency hybrid multi-channel speech enhancement system for hearing aids
Marolt Adaptive oscillator networks for partial tracking and piano music transcription
Stahl et al. SIDIQ: Computational Quality Assessment of Enhanced Speech Based on Auditory Figure-Ground Segregation, Similarity, and Disturbance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221014

CF01 Termination of patent right due to non-payment of annual fee