CN114724574A - Double-microphone noise reduction method with adjustable expected sound source direction

Info

Publication number: CN114724574A
Application number: CN202210157383.8A
Authority: CN
Inventors: 赵清颖; 陈喆; 殷福亮
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2022-02-21
Filing date: 2022-02-21
Publication date: 2022-07-08

Abstract

The invention discloses a double-microphone noise reduction method with adjustable expected sound source direction, comprising the following steps. Preprocessing: the noisy signals x₁(t) and x₂(t) received by the two microphones are sampled, pre-emphasized, framed and windowed, and a short-time Fourier transform yields the frequency-domain signals X₁(ω) and X₂(ω). Beamforming: a virtual microphone is introduced at the midpoint of the line connecting the two microphones, and X₁(ω) and X₂(ω) are differenced according to a central difference scheme to construct the differential signals Y₁(ω) and Y₂(ω). The statistical averages of the power spectra of Y₁(ω) and Y₂(ω) are calculated and their ratio is recorded as the directivity function Γ(ω, θ); based on the properties of Γ(ω, θ), it is mapped directly to the noise masking value λ(ω) by a normalization function. Multiplying X₁(ω) by λ(ω) gives the signal R₁(ω), in which noise from competing directions has been eliminated. Post-Wiener filtering: the signal energy and noise energy in R₁(ω) are estimated to obtain the channel signal-to-noise ratio, and a gain function is calculated to further eliminate the residual noise in R₁(ω).

Description

Double-microphone noise reduction method with adjustable expected sound source direction

Technical Field

The invention relates to the technical field of voice signal noise reduction, in particular to a double-microphone noise reduction method with adjustable expected sound source direction.

Background

Portable devices such as Bluetooth headsets have become useful tools for improving efficiency in daily life. However, when a user makes or receives a call on such a device, call quality degrades rapidly if the call is disturbed by background noise, speech from non-target directions, and the like. In this case it is desirable to preserve, without distortion, the speech arriving from close to the user's speaking direction, while suppressing background noise and speech from non-target directions as much as possible.

Existing generalized sidelobe cancellers (GSCs) and delay beamformers perform spatial filtering on signals recorded by multiple microphones. For portable devices such as Bluetooth headsets, GSCs are too complex for the processing capability of such miniature devices. Delay beamforming techniques such as the first-order differential microphone (FDM) and adaptive null-forming (ANF) require only two microphones, which suits devices with size limits and real-time processing requirements. However, such a fixed beamformer has its maximum gain at 0° and a null at 180°, and cannot eliminate noise from directions other than the null. Algorithms based on the coherence function between the input signals exploit the properties of the real and imaginary parts of the coherence function to derive different noise masks. Coherence-based methods do not rely on noise statistics, but their target direction is not adjustable. A competing-direction noise cancellation method for hearing aids combines spectral estimation with array beamforming to suppress noise; its directivity coefficients are estimated in noise-only intervals and updated to track moving noise. This method, likewise, can only set the desired direction within a limited range around 0°. Because the position of the sound source is not always fixed, a noise reduction algorithm with an adjustable sound source direction is important in practical applications.

To solve the problems that sound from non-target directions cannot be accurately eliminated when beamforming is applied directly to a closely spaced dual-microphone system, and that the desired sound source direction cannot be set as required, a two-step denoising method based on beamforming and Wiener filtering is provided. Test results show that, at low signal-to-noise ratios and with multiple types of noise sources present, the method effectively recovers the energy distribution characteristics of the original signal, reduces background noise and non-target-direction speech, and significantly improves the signal-to-noise ratio.

Disclosure of Invention

In view of the problems in the prior art, the invention discloses a double-microphone noise reduction method with adjustable expected sound source direction, which specifically comprises the following steps:

Preprocessing: the noisy signals x₁(t) and x₂(t) received by the two microphones are sampled, pre-emphasized, framed and windowed, and a short-time Fourier transform yields the frequency-domain signals X₁(ω) and X₂(ω);

Beamforming: a virtual microphone is introduced at the midpoint of the line connecting the two microphones, and the frequency-domain signals X₁(ω) and X₂(ω) are differenced according to a central difference scheme to construct the differential signals Y₁(ω) and Y₂(ω); the statistical averages of the power spectra of Y₁(ω) and Y₂(ω) are calculated and their ratio is recorded as the directivity function Γ(ω, θ); based on the properties of Γ(ω, θ), it is mapped directly to a noise masking value λ(ω) by a normalization function; the frequency-domain signal X₁(ω) is multiplied by the noise masking value λ(ω) to obtain the signal R₁(ω), in which noise from competing directions has been eliminated;

Post-Wiener filtering: the signal energy and noise energy in R₁(ω) are estimated to obtain the channel signal-to-noise ratio, and a gain function is calculated, thereby eliminating the residual noise in R₁(ω).

Further, the preprocessing comprises:

sampling the noisy signals x₁(t) and x₂(t), and then applying pre-emphasis to the high-frequency part of the speech;

dividing the sampled signals x₁(n) and x₂(n) into frames of length 10 ms, applying a Hamming window w(n) of equal length, passing the windowed signals into a buffer for processing, obtaining the frequency-domain signal of the current frame by short-time Fourier transform, and, by the conjugate symmetry of the Fourier transform of a real sequence, outputting only the first half of the frequency bins for beamforming.

Further, the beamforming process comprises amplitude alignment, power spectrum calculation, directivity function value calculation, threshold calculation, and normalized mapping;

amplitude alignment: the frequency-domain signals X₁(ω) and X₂(ω) are each multiplied by a scaling factor so that their amplitudes are aligned;

power spectrum calculation: the desired beam is assumed to be S(ω) with its direction preset to α; a virtual microphone is introduced at the midpoint between the two microphones to receive the desired beam S(ω); the differential signals Y₁(ω) and Y₂(ω) are constructed from the central difference scheme and the spatial relationship between the frequency-domain signals X₁(ω), X₂(ω) and the desired beam S(ω), and the power spectra of Y₁(ω) and Y₂(ω) are calculated;

directivity function value calculation: the ratio of the statistical averages of the power spectra of the differential signals Y₁(ω) and Y₂(ω) is the value of the directivity function Γ(ω, θ); Γ(ω, θ) tends to infinity when the actual sound source incidence direction θ equals the given desired incidence direction α, and the function is monotonic on each side of, and approximately symmetric about, the axis θ = α;

threshold calculation and normalized mapping: because Γ(ω, θ) tends to infinity, a threshold Ω is calculated from a preset main-lobe width Θ; Γ(ω, θ) is then mapped directly, by normalized mapping through a sigmoid function, to the noise masking value λ(ω) of the corresponding frequency bin, and X₁(ω) is multiplied by λ(ω) to obtain the signal R₁(ω), in which noise from competing directions has been eliminated.

Further, the post-Wiener filtering process comprises calculating a signal-to-noise ratio index, calculating a log spectrum deviation, modifying or resetting a noise flag, and calculating a gain function value;

signal-to-noise ratio index: the signal R₁(ω) is divided into a plurality of channels according to a critical-bandwidth criterion, the energy of each channel is estimated, the channel noise energy estimate is initialized to the channel energy of the first four frames, and the channel signal-to-noise ratio index is calculated from the channel noise energy estimate;

log spectrum deviation: a nonlinear data table is designed as a voice metric table, which maps the signal-to-noise ratio index to a set of numbers measuring speech quality; the sum of the voice metrics over a certain frequency range is taken as the speech quality evaluation of the current channel; the logarithm of the signal energy of the current channel is taken, and the deviation between the long-term and short-term log spectral energies is calculated;

modifying or resetting the noise flag: whether the current frame is a speech frame or a noise frame is judged from the calculated voice metric sum, signal-to-noise ratio index and log spectrum deviation, and the noise update flag is reset accordingly; the update flags of the preceding frames are then checked, and if the noise estimate has not been updated for a long time, the result is considered unreliable and the signal-to-noise ratio index is forcibly updated;

gain function value: the channel gain is calculated from the channel signal-to-noise ratio index to remove residual background noise, and the noise energy estimate for the next frame is updated according to the noise update flag.

With the above technical scheme, the double-microphone noise reduction method with adjustable expected sound source direction provided by the invention first preprocesses the two microphone signals, then calculates the ratio of the statistical averages of the power spectra of the constructed differential signals as the value of the directivity function, and obtains the noise masking value by mapping it through a normalization function. A Wiener filter is applied in the following step, reducing residual noise by estimating the signal-to-noise ratio index and calculating a gain function. The proposed algorithm is simple and efficient, and after enhancement the signal-to-noise ratio and quality of speech disturbed by non-target sound are significantly improved across different noise scenes.

Drawings

To illustrate the embodiments of the present application and the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic overall view of the process of the present invention;

FIG. 2 is a schematic diagram of the pretreatment process of the present invention;

FIG. 3 is a schematic diagram of a beamforming process in the present invention;

FIG. 4 is a schematic view of the sound source propagation in the present invention;

FIG. 5 is a diagram illustrating directional functions in accordance with the present invention;

FIG. 6 is a diagram illustrating a post-wiener filtering process in accordance with the present invention;

FIG. 7 is a diagram showing the PESQ comparison results of the present invention with other noise reduction methods for a single noise source with different SNR;

FIG. 8 is a diagram showing the PESQ comparison result between the present invention and other noise reduction methods when multiple noise sources have different signal-to-noise ratios;

FIG. 9 is a graph of SegSNR comparison results for a single noise source with different SNR for the present invention and other noise reduction methods;

FIG. 10 is a graph of SegSNR comparison results for multiple noise sources with different SNR;

FIG. 11 is a diagram illustrating the results of the present invention for different expected sound source directions.

Detailed Description

To make the technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings:

FIG. 1 shows a double-microphone noise reduction method with adjustable expected sound source direction. In a specific implementation, the method comprises a preprocessing process, a beamforming process and a post-Wiener filtering process. The specific steps of the method are as follows:

S1: Preprocessing, as shown in FIG. 2. The noisy signals x₁(t) and x₂(t) received by the two microphones are sampled, pre-emphasized, framed and windowed, and a short-time Fourier transform then yields the frequency-domain signals X₁(ω) and X₂(ω), specifically as follows:

S11: The continuous noisy signals x₁(t) and x₂(t) are first sampled at a sampling frequency of 16 kHz. Pre-emphasis of the high-frequency part of the speech is implemented with a first-order FIR high-pass digital filter, where EMP_FAC is the pre-emphasis coefficient. Let x(n) be the sample of the speech signal at time n; the result after pre-emphasis is:

z(n) = x(n) - EMP_FAC*x(n-1) (1)

EMP_FAC is set to 0.8. After noise reduction, the time-domain signal is synthesized by the inverse short-time Fourier transform, and a de-emphasis operation is applied to it to restore the high-frequency part.
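As an illustration of formula (1), the following is a minimal Python sketch of pre-emphasis and the matching de-emphasis. The function names, the zero initial condition x(-1) = 0, and the use of NumPy are assumptions made for this example; only the coefficient 0.8 and formula (1) come from the text.

```python
import numpy as np

EMP_FAC = 0.8  # pre-emphasis coefficient from the text


def pre_emphasis(x, emp_fac=EMP_FAC):
    """z(n) = x(n) - emp_fac * x(n-1), with x(-1) assumed to be 0."""
    x = np.asarray(x, dtype=float)
    z = x.copy()
    z[1:] -= emp_fac * x[:-1]
    return z


def de_emphasis(z, emp_fac=EMP_FAC):
    """Inverse filter: y(n) = z(n) + emp_fac * y(n-1), with y(-1) = 0."""
    z = np.asarray(z, dtype=float)
    y = np.empty_like(z)
    prev = 0.0
    for n, value in enumerate(z):
        prev = value + emp_fac * prev  # first-order IIR recursion
        y[n] = prev
    return y
```

Applying `de_emphasis` to the output of `pre_emphasis` returns the original samples, which is the property the de-emphasis step relies on after synthesis.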

S12: The sampled signals x₁(n) and x₂(n) are divided into frames of length 10 ms, and a Hamming window of equal length is applied to each frame; the window function is given by formula (2). The windowed signals are passed into a buffer whose length is 5 times the number of FFT points.

S13: The frequency-domain signal of the current frame is obtained by fast Fourier transform. By the conjugate symmetry of the fast Fourier transform of a real sequence, only the first half of the frequency bins is output for subsequent processing.
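Steps S12 and S13 can be sketched as follows. This is an illustrative sketch only: the FFT size of 256, the use of non-overlapping frames, and the helper name `frame_spectra` are assumptions not stated in the text, and the buffering scheme is omitted. `np.hamming` implements the standard Hamming window 0.54 - 0.46 cos(2πn/(N-1)).

```python
import numpy as np

FS = 16000             # sampling rate from the text
FRAME_LEN = FS // 100  # 10 ms -> 160 samples
N_FFT = 256            # assumed FFT size (not stated in the text)


def frame_spectra(x, frame_len=FRAME_LEN, n_fft=N_FFT):
    """Split x into non-overlapping frames, apply a Hamming window, and
    return the non-redundant half of each frame's FFT."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hamming(frame_len)
    # rfft zero-pads each frame to n_fft and keeps bins 0..n_fft/2,
    # dropping the conjugate-symmetric upper half.
    return np.fft.rfft(frames * window, n=n_fft, axis=1)
```

Note that `np.fft.rfft` returns n_fft/2 + 1 bins (it includes the Nyquist bin), which is one more than a strict "first half"; either choice carries the same information for a real input.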

S2: Beamforming, as shown in FIG. 3. A virtual microphone is introduced at the midpoint of the line connecting the two microphones, and the frequency-domain signals X₁(ω) and X₂(ω) are differenced according to the central difference scheme to construct the differential signals Y₁(ω) and Y₂(ω). The statistical averages of the power spectra of Y₁(ω) and Y₂(ω) are calculated, and their ratio is taken as the directivity function Γ(ω, θ). Based on the properties of Γ(ω, θ), it is mapped directly to a noise masking value by a normalization function, specifically as follows:

S21: Amplitude alignment. Although the sound field is assumed to be far-field, the signals received by the two microphones differ slightly in amplitude. To better satisfy the far-field assumption, the two frequency-domain signals X₁(ω) and X₂(ω) are first each multiplied by a scaling factor so that their amplitudes are aligned.

S22: The desired beam is assumed to be S(ω) with its direction preset to α, and a virtual microphone is introduced at the midpoint between the two microphones to receive it; the sound source propagation is shown in FIG. 4. The spatial relationship between X₁(ω), X₂(ω) and S(ω) is:

where d is the microphone spacing, v is the speed of sound, and θ is the actual sound source incidence direction. The differential signals Y₁(ω) and Y₂(ω) are constructed from the central difference scheme and the spatial relationship between X₁(ω), X₂(ω) and S(ω), and the power spectra of Y₁(ω) and Y₂(ω) are calculated.
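For orientation, a common far-field (plane-wave) model of two microphones placed symmetrically about a virtual midpoint microphone can be sketched as below. This is an assumed textbook convention, not necessarily the exact relationship used by the method: the sign of the phase terms, the value of the speed of sound, and the function name `far_field_pair` are all assumptions, and the construction of Y₁(ω) and Y₂(ω) is not covered.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s, assumed value for v


def far_field_pair(S, omega, d, theta, v=SOUND_SPEED):
    """Model X1, X2 from the virtual-microphone spectrum S for a plane wave
    arriving from angle theta (radians), with the microphones at +/- d/2
    from the midpoint."""
    tau = d * np.cos(theta) / (2.0 * v)  # delay between midpoint and each mic
    X1 = S * np.exp(+1j * omega * tau)   # assumed sign: microphone 1 leads
    X2 = S * np.exp(-1j * omega * tau)   # microphone 2 lags by the same amount
    return X1, X2
```

Under this model the two signals are identical for broadside incidence (θ = 90°) and differ by a phase of ωd/v for end-fire incidence (θ = 0°), which is the kind of direction-dependent difference that a differential construction can exploit.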

S23: The ratio of the statistical averages of the power spectra of Y₁(ω) and Y₂(ω) is the value of the directivity function Γ(ω, θ).

The graph of Γ(ω, θ) is shown in FIG. 5. It shows that Γ(ω, θ) tends to infinity when the actual incidence direction θ of the sound source equals the given desired incidence direction α, and that the function is monotonic on each side of, and approximately symmetric about, the axis θ = α.

S24: Because Γ(ω, θ) tends to infinity as θ approaches α, a value that cannot be reached in actual computation, a threshold Ω must be calculated from the preset main-lobe width Θ.

The sigmoid function is an S-shaped function with range (0, 1). With a suitable deformation, it maps Γ(ω, θ) directly, in normalized form, to the noise masking value λ(ω) of the corresponding frequency bin.
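The normalized mapping can be illustrated with the sketch below. The specific deformation used here (a sigmoid of the logarithmic ratio between Γ and the threshold Ω, with a free `slope` parameter) is a hypothetical choice for illustration; the text only states that a deformed sigmoid is used. The function name `masking_value` is likewise an assumption.

```python
import numpy as np


def masking_value(gamma, threshold, slope=1.0):
    """Map directivity values gamma > 0 to masks in the range 0..1.

    Hypothetical deformation: a sigmoid of the log-ratio to the threshold,
    so gamma == threshold maps to 0.5 and very large gamma maps close to 1.
    """
    gamma = np.maximum(np.asarray(gamma, dtype=float), 1e-12)
    z = slope * (np.log(gamma) - np.log(threshold))
    return 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable sigmoid(z)
```

With such a mask, bins whose directivity value is well above the threshold (sound from near the desired direction) pass almost unchanged, while bins well below it are strongly attenuated when the mask multiplies X₁(ω).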

S25: The signal after masking the competing-direction noise is:

R₁(ω) = λ(ω)X₁(ω) (8)

S3: Post-Wiener filtering, as shown in FIG. 6. The signal energy and noise energy in R₁(ω) are estimated to obtain the channel signal-to-noise ratio, and a gain function is calculated to further eliminate the residual noise in R₁(ω), specifically as follows:

S31: R₁(ω) is divided into NUM_CHAN channels according to a critical-bandwidth criterion. Because speech energy is concentrated mainly in 0.3-3.4 kHz, the channels are narrower at low frequencies and wider at high frequencies. The energy of each channel is then estimated, where β is a smoothing factor, M is the number of frequency bins in the current channel, m is the index of the channel, i is the index of the current frame, and k is the index of the frequency bin within the current channel.
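A smoothed channel energy estimate of the kind described in S31 might look like the following sketch. The recursive form (previous energy weighted by β, mean bin power of the current frame weighted by 1 - β), the value of β, and the `band_edges` layout are assumptions chosen to match the symbols defined above, not a reproduction of the method's own formula.

```python
import numpy as np


def update_channel_energy(prev_enrg, R1_frame, band_edges, beta=0.55):
    """Smoothed per-channel energy for one frame.

    prev_enrg  : energies from the previous frame, one per channel
    R1_frame   : complex spectrum R1 of the current frame
    band_edges : (lo, hi) inclusive bin indices per channel (assumed layout)
    beta       : smoothing factor (value assumed)
    """
    power = np.abs(np.asarray(R1_frame)) ** 2
    new_enrg = np.empty(len(band_edges))
    for m, (lo, hi) in enumerate(band_edges):
        mean_power = power[lo : hi + 1].mean()  # (1/M) * sum over bins k
        new_enrg[m] = beta * prev_enrg[m] + (1.0 - beta) * mean_power
    return new_enrg
```

In a critical-band layout, `band_edges` would contain narrow ranges at low frequencies and progressively wider ranges at high frequencies.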

S32: The channel noise estimate is initialized to the channel energy of the first four frames, and the signal-to-noise ratio is then calculated by formula (10).
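As a rough illustration of S32, the channel signal-to-noise ratio can be computed in dB from the channel energy and the channel noise estimate and then quantized to an integer index. The quantization step, the index cap, the energy floor, and both function names are hypothetical values chosen for this sketch.

```python
import numpy as np


def channel_snr_db(ch_enrg, ch_noise, floor=1e-12):
    """Per-channel SNR in dB from channel energy and noise-energy estimates."""
    ch_enrg = np.maximum(np.asarray(ch_enrg, dtype=float), floor)
    ch_noise = np.maximum(np.asarray(ch_noise, dtype=float), floor)
    return 10.0 * np.log10(ch_enrg / ch_noise)


def snr_index(snr_db, step_db=0.5, max_index=80):
    """Quantize SNR to a non-negative integer index (step and cap assumed)."""
    idx = np.round(np.asarray(snr_db) / step_db).astype(int)
    return np.clip(idx, 0, max_index)
```

Negative SNR values clip to index 0, so channels dominated by noise all share the lowest index.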

S33: A nonlinear data table is designed as a voice metric table, which maps the signal-to-noise ratio index (the quantized signal-to-noise ratio value) to a set of numbers measuring speech quality; a high signal-to-noise ratio is taken to indicate high speech quality. The sum of the voice metrics over the 0.3-3.4 kHz frequency range is calculated.

S34: The total noise energy estimate (tne) and the total channel energy estimate (tce) of the first HI_CHAN channels are calculated, i.e., the sum of the noise energies and the sum of the channel energies over the 0.3-3.4 kHz frequency range.

S35: The log spectrum of the current channel energy is calculated, and the deviation between the long-term and short-term log spectral energies is recorded as ch_enrg_dev.

ch_enrg_db(i,m) = 10lg(ch_enrg(i,m)) (13)

S36: A long-term integration constant alpha is calculated as a function of the total channel energy (tce): for high tce (-40 dB), smoothing is slow (alpha = 0.99); for low tce (-60 dB), smoothing is fast (alpha = 0.50).

S37: and calculating and updating long-term log spectral energy.
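Steps S35 to S37 can be sketched together as below. Only formula (13) and the two end points of alpha (0.50 at -60 dB, 0.99 at -40 dB) come from the text; the linear interpolation between them, the use of a sum of absolute differences as the deviation, and the function names are assumptions.

```python
import numpy as np


def long_term_alpha(tce_db, lo_db=-60.0, hi_db=-40.0):
    """Integration constant: 0.50 at low tce, 0.99 at high tce.
    Linear interpolation in between is an assumption."""
    return float(np.interp(tce_db, [lo_db, hi_db], [0.50, 0.99]))


def log_spectrum_step(ch_enrg, long_db, tce_db):
    """Return (ch_enrg_dev, updated long-term log spectrum) for one frame."""
    ch_enrg_db = 10.0 * np.log10(np.maximum(ch_enrg, 1e-12))   # formula (13)
    ch_enrg_dev = float(np.sum(np.abs(ch_enrg_db - long_db)))  # assumed metric
    alpha = long_term_alpha(tce_db)
    new_long_db = alpha * long_db + (1.0 - alpha) * ch_enrg_db
    return ch_enrg_dev, new_long_db
```

A large alpha makes the long-term spectrum change slowly when the total energy is high, so a sudden onset of speech produces a large deviation rather than being absorbed into the long-term estimate.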

S38: The noise update flag Update_flag is reset by comparing the calculated parameters, such as the voice metric sum, the signal-to-noise ratio and the log spectrum deviation. Update_flag = TRUE indicates that the current frame is a noise frame, and Update_flag = FALSE indicates that the current frame is a speech frame. The noise update flags of the preceding frames are then checked; if the noise estimate has not been updated for a long time, the current result is considered unreliable and the signal-to-noise ratio index is forcibly updated.

S39: The channel gain ftmp2 is calculated from the obtained channel signal-to-noise ratio index.

If the noise update flag Update_flag is TRUE, the current frame is judged to be a noise frame, and the noise energy estimate is updated.
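The gain calculation and the flag-controlled noise update could be sketched as follows. The gain rule (linear in dB with respect to the SNR index and capped at 0 dB), every constant, and both function names are hypothetical; the text states only that the gain is derived from the channel signal-to-noise ratio index and that the noise estimate is refreshed when Update_flag is TRUE.

```python
import numpy as np


def channel_gain(snr_idx, slope_db=0.4, idx_floor=6, min_gain_db=-13.0):
    """Linear-in-dB gain rule capped at 0 dB (all constants hypothetical)."""
    idx = np.maximum(np.asarray(snr_idx, dtype=float), idx_floor)
    gain_db = np.minimum(slope_db * (idx - idx_floor) + min_gain_db, 0.0)
    return 10.0 ** (gain_db / 20.0)


def update_noise(ch_noise, ch_enrg, update_flag, smooth=0.9, floor=1e-6):
    """Refresh the noise estimate only when the frame was flagged as noise."""
    if not update_flag:
        return ch_noise
    return np.maximum(smooth * ch_noise + (1.0 - smooth) * ch_enrg, floor)
```

Each channel gain would then multiply all frequency bins of R₁(ω) belonging to that channel before the inverse transform.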

To verify the effectiveness of the present invention, several tests were performed. To verify that the method applies to various types of sound, the speech data used for evaluation were taken from the TIMIT database, and the noise included Babble noise and competing-direction speech. The experimental results reported here are averages over 10 segments of speech data.

The present invention is compared with the Coherence and SNR-Coherence methods, with α first set to 0°. FIGS. 7 and 8 show the PESQ scores of the different methods after various noises (including competing speech and Babble noise) are added. The present invention is clearly superior to the Coherence method and comparable to SNR-Coherence. In general, the PESQ results of the present invention are at least 0.5 higher than those of the unprocessed signal, and the effect is maintained under multiple-noise-source conditions.

FIGS. 9 and 10 show that, when the interference is non-target speech and Babble noise, the SegSNR of the present invention is at least 5 dB higher than the unprocessed value at low signal-to-noise ratios (-5 dB and 0 dB). The SegSNR result of the present invention is higher than that of the SNR-Coherence method and almost equal to that of the Coherence method. Furthermore, the invention maintains optimal results in the presence of multiple noise sources.

The evaluation results when the desired direction is set to other angles are shown in FIG. 11. Compared with the sound before processing, the invention still maintains good noise suppression capability.

These comparative tests demonstrate the good noise reduction performance and stable operation of the invention.

The above describes only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A method for reducing noise of a dual microphone with adjustable expected sound source direction is characterized by comprising the following steps:

Beamforming: a virtual microphone is introduced at the midpoint of the line connecting the two microphones, and the frequency-domain signals X₁(ω) and X₂(ω) are differenced according to a central difference scheme to construct the differential signals Y₁(ω) and Y₂(ω); the statistical averages of the power spectra of Y₁(ω) and Y₂(ω) are calculated and their ratio is recorded as the directivity function Γ(ω, θ); based on the properties of Γ(ω, θ), it is mapped directly to a noise masking value λ(ω) by a normalization function; the frequency-domain signal X₁(ω) is multiplied by the noise masking value λ(ω) to obtain the signal R₁(ω), in which noise from competing directions has been eliminated;

2. The method of claim 1, wherein: the preprocessing comprises the following steps:

3. The method of claim 1, wherein: the beam forming process comprises amplitude alignment, power spectrum calculation, directivity function value calculation, threshold calculation and normalization mapping;

power spectrum calculation: the desired beam is assumed to be S(ω) with its direction preset to α; a virtual microphone is introduced at the midpoint between the two microphones to receive the desired beam S(ω); the differential signals Y₁(ω) and Y₂(ω) are constructed from the central difference scheme and the spatial relationship between the frequency-domain signals X₁(ω), X₂(ω) and the desired beam S(ω), and the power spectra of Y₁(ω) and Y₂(ω) are calculated;

threshold calculation and normalized mapping: because Γ(ω, θ) tends to infinity, a threshold Ω is calculated from a preset main-lobe width Θ; Γ(ω, θ) is mapped directly, by normalized mapping through a sigmoid function, to the noise masking value λ(ω) of the corresponding frequency bin, and X₁(ω) is multiplied by λ(ω) to obtain the signal R₁(ω), in which noise from competing directions has been eliminated.

4. The method of claim 1, wherein: the post-Wiener filtering process comprises calculating a signal-to-noise ratio index, calculating a log spectrum deviation, modifying or resetting a noise flag, and calculating a gain function value;

signal-to-noise ratio index: the signal R₁(ω) is divided into a plurality of channels according to a critical-bandwidth criterion, the energy of each channel is estimated, the channel noise energy estimate is initialized to the channel energy of the first four frames, and the channel signal-to-noise ratio index is calculated from the channel noise energy estimate;