CN103440872B - Denoising method for transient noise - Google Patents

Denoising method for transient noise

Info

Publication number: CN103440872B (grant publication of CN103440872A)
Application number: CN201310357211.6A
Authority: CN (China)
Priority / filing date: 2013-08-15
Grant publication date: 2016-06-01
Inventors: 陈喆, 殷福亮, 周文颖
Assignee: Dalian University of Technology
Legal status: Active (granted)
Prior art keywords: frame, buf, data, pitch, signal


Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The present invention discloses a denoising method for transient noise, belonging to the field of signal processing. The method first computes the Mel-frequency cepstral coefficients (MFCC) of the current frame and, at the same time, predicts the frame's pitch period; the MFCC are then used to detect whether the frame contains transient noise, and if noise is present, the frame is reconstructed using the predicted pitch period.

Description

Denoising method for transient noise
Technical field
The present invention relates to a denoising method for transient noise and belongs to the field of signal processing.
Background art
Transient additive noise in an audio signal is also called transient noise or impulse noise. Transient noise is typically discontinuous, intermittent and pulse-like in the time domain; its energy is concentrated in a short time interval, and within that interval the energy of the transient noise is much larger than the energy of the clean signal. Typical transient noises include desk knocks, door slams, clattering, keyboard keystrokes, mouse clicks and hammer blows, and they occur in many application scenarios such as hearing aids, mobile phones and video-conferencing equipment. The presence of transient noise seriously degrades audio quality, so measures must be taken to suppress it and enhance the audio. Most current noise-suppression algorithms target stationary or continuous noise; speech enhancement is usually performed with the methods described in the document "Research on Speech Enhancement and Related Techniques", such as spectral subtraction and adaptive filtering. These algorithms, however, are of little use against the transient noise described above and provide essentially no suppression.
Summary of the invention
In view of the above problems, the present invention provides a denoising method for transient noise.
The technical scheme adopted by the present invention is as follows: first compute the Mel-frequency cepstral coefficients (MFCC) of the current frame and, at the same time, predict the frame's pitch period; then use the MFCC to detect whether the frame contains noise, i.e. perform noise detection; if noise is present, reconstruct the waveform using the predicted pitch period.
Beneficial effects of the present invention: tests were run with 20 clean speech recordings (adult male, adult female and child voices) and 4 types of noise: mouse clicks, knocking, metronome ticks and keyboard strokes. The durations of the four noises are 10 ms for mouse clicks, 20 ms for knocking and metronome ticks, and 30 ms for keyboard strokes. Each clean recording was mixed with each of the 4 noise types, giving 80 noisy recordings; 30 noise bursts, equally spaced, were added to each recording. All audio is sampled at fs = 48 kHz with a frame length of N = 480. In the MFCC computation stage an NFFT = 1024-point FFT is used, the Mel filter bank contains M = 24 filters, and L = 12 MFCC are computed. In the transient-noise detection stage the adaptive threshold is set to Thres = const·ener; so that the threshold suits all noise types, the constant const is set to 10, and ener is the energy of each input frame with a minimum value of 60.0. When the threshold is updated, the forgetting factor b is set to 0.4. In the pitch period estimation stage the pitch period is searched over (2 ms, 12 ms), corresponding to (96, 576) samples. In the waveform reconstruction stage the cross-fade lengths N_1 and N_2 are both 32 samples, and the buffer buf(n) has length 2240. After denoising noisy speech with the present invention, speech intelligibility improves substantially and listener fatigue is reduced. The denoising performance was assessed with two measures, the segmental signal-to-noise ratio SNR_seg and PEAQ; the results are shown in Fig. 12 and Fig. 13 described in the brief description of the drawings.
Brief description of the drawings
Fig. 1: relation between Mel frequency and linear frequency.
Fig. 2: flow of the technical scheme of prior art 1.
Fig. 3: flow of the technical scheme of prior art 2.
Fig. 4: block diagram of the present technical scheme.
Fig. 5: block diagram of MFCC feature extraction.
Fig. 6: Mel-frequency filter bank.
Fig. 7: block diagram of pitch period estimation.
Fig. 8: linear interpolation between two points.
Fig. 9(a): signal before the current frame is repaired.
Fig. 9(b): new pitch-period waveform pw^{(p)}(n).
Fig. 9(c): signal after the current frame is repaired.
Fig. 10(a): signal before the current frame is repaired.
Fig. 10(b): current frame signal.
Fig. 10(c): signal after repair.
Fig. 11(a): signal before denoising.
Fig. 11(b): signal after denoising.
Fig. 12: denoising performance evaluation (SNR).
Fig. 13: denoising performance evaluation (PEAQ).
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings:
Mel-frequency cepstral coefficients:
Research on the human auditory mechanism has found that the ear has different sensitivities to sound waves of different frequencies, and that speech intelligibility is affected most by components between 200 Hz and 5 kHz. The ear also exhibits masking: high-energy components mask weaker ones to some extent. In general, a low-frequency sound masks a higher-frequency sound more easily than the reverse; in other words, the critical bandwidth of masking is narrower at low frequencies. Accordingly, a group of band-pass filters is arranged, from dense to sparse, according to the size of the critical bandwidth, and the input signal is filtered by them. If the output energy of each band-pass filter is taken as a basic feature of the signal, this feature, after further processing, can serve as a speech feature; this is the Mel-frequency cepstral coefficient (MFCC). This kind of feature does not depend on the nature of the signal, i.e. no assumptions or restrictions are placed on the input, while it still exploits the auditory perception properties of the ear. Compared with linear prediction cepstral coefficients (LPCC), which are based on a vocal-tract model, MFCC are therefore more robust and retain good speech recognition performance at low signal-to-noise ratios.
MFCC are cepstral parameters extracted in the Mel-scale frequency domain. The Mel scale describes the nonlinear character of the ear's frequency perception, and its relation to linear frequency can be approximated as

f_{mel} = 2595\,\log_{10}\!\left(1 + \frac{f_{linear}}{700}\right)    (18)

where f_{linear} is the linear frequency in Hz and f_{mel} is the corresponding Mel frequency. Fig. 1 shows the relation between Mel frequency and linear frequency: as f_{linear} increases linearly, f_{mel} increases logarithmically.
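For reference, the mapping of equation (18) and its inverse can be written as the short numerical sketch below; the function names are illustrative only and are not part of the patented method.

```python
import numpy as np

def hz_to_mel(f_linear):
    # Equation (18): Mel value of a linear frequency given in Hz
    return 2595.0 * np.log10(1.0 + f_linear / 700.0)

def mel_to_hz(f_mel):
    # Inverse mapping, used later when placing the filter centre frequencies
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

# Below about 1 kHz the Mel value grows almost linearly with frequency,
# above it roughly logarithmically (cf. Fig. 1).
print(hz_to_mel(np.array([200.0, 1000.0, 5000.0, 24000.0])))
```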
Packet loss concealment:
In voice communication systems based on the IP protocol, such as Voice over IP (VoIP), network congestion or delay jitter during transmission can cause packet loss, i.e. some packets do not arrive at the receiver in time, seriously degrading the speech quality at the receiving end. Measures must therefore be taken at the receiver to reduce the speech distortion caused by packet loss. Such techniques are usually called packet loss concealment (PLC) algorithms.
PLC algorithms fall into two classes: sender-based and receiver-based. Sender-based PLC requires the sending and receiving ends to cooperate; receiver-based PLC recovers the original speech as well as possible using only the packets received normally, the number of lost packets, and prior knowledge of the coding scheme. Because receiver-based PLC needs no data from the sender, it adds neither network traffic nor delay. Common receiver-based PLC methods include silence substitution, repetition of the previous packet, template matching, pitch waveform replication and linear prediction.
The pitch waveform replication (PWR) method used in this document is a receiver-based PLC method.
Prior art 1 related to the present invention
Technical scheme of prior art 1
The paper "Speech enhancement based on Kalman filtering in an impulsive noise environment" proposes a speech enhancement method for transient noise conditions. The flow of the method is shown in Fig. 2: it first finds the frequency band in which the ratio of the transient-noise sample energy to the noisy-speech sample energy is largest, then uses the energy distribution in that band to decide, frame by frame, whether the speech signal is corrupted by transient noise. On this basis, the method applies a Kalman filtering algorithm to denoise the frames corrupted by transient noise; in addition, it improves the estimation of the autoregressive (AR) model parameters.
Shortcomings of prior art 1
(1) For noise with a long tail, the trailing portion may not be detected.
(2) The Kalman filter used for denoising is suited to stationary noise, not to non-stationary transient noise, so the denoising effect is limited, considerable noise residue remains, and speech quality suffers.
Prior art 2 related to the present invention
Technical scheme of prior art 2
Hetherington et al., in the invention patent "Repetitive transient noise removal", proposed a transient noise suppression method. The flow of the Hetherington method is shown in Fig. 3. The method first models the noise behaviour, then uses the correlation coefficient between the modelled signal and the signal under test to decide whether the data under test contain noise; if noise is present, the noise component is removed from the signal under test according to the model.
Shortcomings of prior art 2
The Hetherington method can effectively remove repetitive noise, but because transient noise is highly varied, the model becomes inaccurate when several different types of transient noise occur within a short time, and the denoising performance of the Hetherington method then degrades.
Detailed description of the technical scheme of the present invention
Technical problem to be solved by the present invention
To perform speech enhancement on audio corrupted by transient noise, suppress the transient noise, improve speech quality, and increase the intelligibility of the audio.
Complete technical scheme provided by the invention:
The block diagram of the technical scheme is shown in Fig. 4: MFCC parameters are extracted from the input audio signal; the MFCC parameters are then used to detect whether the audio signal contains noise; if noise is detected, the noisy frame data are replaced using the PWR method and the waveform is reconstructed; if no noise is detected, the audio signal is output unchanged. A per-frame sketch of this flow is given below.
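As an informal illustration of the flow in Fig. 4, the per-frame control logic might look as follows. The routine names passed in (compute_mfcc, detect_noise, reconstruct_frame) are placeholders for the operations of sections (1) to (4) below and are assumptions of this sketch, not part of the patented method; detect_noise is assumed to return a noise flag together with the updated feature of equation (27).

```python
import numpy as np

FS = 48000   # sampling rate (Hz)
N = 480      # frame length (10 ms)

def denoise(x, compute_mfcc, detect_noise, reconstruct_frame):
    """Frame-by-frame transient-noise removal following Fig. 4 (sketch)."""
    out = []
    prev_mfcc = None
    for p in range(len(x) // N):
        frame = x[p * N:(p + 1) * N]
        mfcc = compute_mfcc(frame)
        if prev_mfcc is None:
            is_noise, prev_mfcc = False, mfcc           # first frame: no reference yet
        else:
            is_noise, prev_mfcc = detect_noise(mfcc, prev_mfcc, frame)
        # noisy frame: rebuild with PWR; clean frame: pass through unchanged
        out.append(reconstruct_frame(frame) if is_noise else frame)
    return np.concatenate(out) if out else x[:0]
```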
Implementation steps of the technical scheme of the present invention:
The input monophonic audio signal is sampled at fs = 48 kHz. The noisy input signal x(n) can be written as x(n) = s(n) + d(n), where s(n) is the clean speech signal and d(n) is the transient noise signal.
(1) Extraction of the MFCC features of the audio signal
The MFCC extraction process is shown in Fig. 5 (a grey-scale figure provided to illustrate the technical effect of the present invention). First the time-domain audio signal is transformed to the frequency domain and its energy spectrum is computed; the energy spectrum is then multiplied by the Mel-scale triangular filter bank, and a discrete cosine transform (DCT) is applied to the logarithm of the resulting energies. The first L coefficients obtained in this way are the MFCC. The concrete steps for computing the MFCC are as follows (a code sketch covering steps 1) to 6) is given after step 6)):
1) Frame the input signal. The frame length is set to 10 ms; since the sampling frequency is fs = 48 kHz, one frame contains N = 480 samples. The data are then normalised: for a quantisation depth of 16 bit, the samples are divided by 2^15, which maps the data into the range (-1, 1). If the current frame is the p-th frame, then

x^{(p)}(n) = x[p \cdot (N-1) + n], \quad n = 0, 1, \ldots, N-1    (19)
2) Pre-processing. The current frame is pre-emphasised and windowed:

y^{(p)}(n) = x^{(p)}(n) - \beta\, x^{(p)}(n-1)    (20)

y_w^{(p)}(n) = y^{(p)}(n)\, w(n)    (21)

where the pre-emphasis factor is \beta = 0.938 and w(n) is a Hamming window, w(n) = 0.54 - 0.46\cos(2\pi n / N).
3) An NFFT = 1024-point FFT of the pre-processed signal gives the frequency-domain signal Y^{(p)}(k).
4) The energy spectrum |Y^{(p)}(k)|^2 of the frequency-domain signal Y^{(p)}(k) is computed.
5) The energy spectrum of the frequency-domain signal is passed through a bank H of Mel-scale triangular filters for frequency-domain filtering.
The bank contains M filters; each filter is triangular and the filters overlap one another, as shown in Fig. 6. The centre frequency of each filter is f(m), m = 1, 2, ..., M, and the present invention uses M = 24. Filter design method: the upper band-edge frequency of the input signal, fs/2 = 24 kHz, is converted to the Mel-scale frequency domain by formula (18), giving Fs_mel; the interval (0, Fs_mel) is divided into 25 equal parts, the two end points 0 and Fs_mel are discarded, and the remaining 24 division points are used as the centre frequencies of the 24 filters. The division points f(m) are thus equally spaced on the Mel scale and are then converted back to the linear frequency scale (the inverse of formula (18)). After the conversion, the spacing between the f(m) shrinks as m decreases and widens as m increases. From the division points f(m), the frequency response of the triangular filter bank H(m, k) is

H(m,k) = \begin{cases} 0, & f(k) < f(m-1) \\ \dfrac{2\,[f(k)-f(m-1)]}{[f(m+1)-f(m-1)]\,[f(m)-f(m-1)]}, & f(m-1) \le f(k) < f(m) \\ \dfrac{2\,[f(m+1)-f(k)]}{[f(m+1)-f(m-1)]\,[f(m+1)-f(m)]}, & f(m) \le f(k) \le f(m+1) \\ 0, & f(k) > f(m+1) \end{cases}    (22)
6) The energy output by each filter H(m, k) and its logarithm are computed, giving E(m):

E(m) = \log_{10}\!\left[\sum_k H(m,k)\,|Y^{(p)}(k)|^2\right], \quad m = 1, 2, \ldots, M    (23)

A discrete cosine transform of E(m) then yields the L = 12 MFCC, denoted C(l):

C^{(p)}(0) = \sqrt{2/L}\,\sum_{m=0}^{M-1} E(m), \quad l = 0
C^{(p)}(l) = \sqrt{2/L}\,\sum_{m=0}^{M-1} E(m)\cos\!\left(\frac{\pi l (2m+1)}{2M}\right), \quad 1 \le l \le L-1    (24)
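A minimal per-frame sketch of steps 1) to 6) is given below (Python/NumPy). It assumes a 16-bit input frame of N = 480 samples whose pre-emphasis uses the last sample of the previous frame; the triangular filters are normalised to unit peak rather than with the area factor of equation (22), and all names are illustrative.

```python
import numpy as np

FS, N, NFFT, M, L, BETA = 48000, 480, 1024, 24, 12, 0.938

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank():
    # Step 5): split (0, Fs_mel) into 25 equal parts, drop the end points,
    # and use the 24 interior points as centre frequencies f(m).
    pts_mel = np.linspace(0.0, hz_to_mel(FS / 2.0), M + 2)
    pts_hz = mel_to_hz(pts_mel)
    bins = np.floor((NFFT // 2) * pts_hz / (FS / 2.0)).astype(int)
    H = np.zeros((M, NFFT // 2 + 1))
    for m in range(1, M + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)   # rising edge
        H[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)   # falling edge
    return H

def mfcc_frame(frame, H, prev_last=0.0):
    frame = frame / 2.0 ** 15                                        # step 1): normalise
    y = frame - BETA * np.concatenate(([prev_last], frame[:-1]))     # step 2): pre-emphasis
    w = 0.54 - 0.46 * np.cos(2.0 * np.pi * np.arange(N) / N)         # Hamming window
    Y = np.fft.rfft(y * w, NFFT)                                     # step 3): 1024-point FFT
    E = np.log10(np.maximum(H @ np.abs(Y) ** 2, 1e-12))              # steps 4)-6): log filter energies
    C = np.array([np.sqrt(2.0 / L) *
                  np.sum(E * np.cos(np.pi * l * (2 * np.arange(M) + 1) / (2 * M)))
                  for l in range(L)])                                # DCT, equation (24)
    return C
```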
(2) Noise detection:
The Euclidean distance dist between the MFCC of the current frame and the MFCC of the previous frame is computed:

dist = \sqrt{\sum_{l=0}^{L}\left[C^{(p)}(l) - C^{(p-1)}(l)\right]^2}    (25)
Whether the current frame contains noise is judged by comparing the distance with a threshold Thres, which is determined adaptively by

Thres = 10 \cdot ener    (26)

where ener is the energy of the normalised frame signal, with a minimum value of 60.0.
After detection, the stored MFCC feature of the current frame is updated:

C^{(p)}(l) = b \cdot C^{(p-1)}(l) + (1-b) \cdot C^{(p)}(l)    (27)

where the forgetting factor is b = 0.4. When the frame following a noise frame is a speech frame, this update prevents false detection.
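The detection and feature-update step of section (2) can be sketched as follows. Reading equation (27) as a blend of the previous (smoothed) feature and the current one is an assumption of this sketch, and the constants simply mirror the values given above.

```python
import numpy as np

CONST = 10.0      # threshold constant in Thres = const * ener
ENER_MIN = 60.0   # floor on the frame energy used in the threshold
B = 0.4           # forgetting factor b

def detect_transient(mfcc_cur, mfcc_prev, frame):
    """Return (is_noise, updated_feature) for one frame, per section (2)."""
    dist = np.sqrt(np.sum((mfcc_cur - mfcc_prev) ** 2))   # equation (25)
    ener = max(np.sum(frame ** 2), ENER_MIN)              # energy of the normalised frame
    is_noise = dist > CONST * ener                        # equation (26)
    # Equation (27): smooth the stored feature so that the clean frame right
    # after a noise frame is not flagged again.
    mfcc_updated = B * mfcc_prev + (1.0 - B) * mfcc_cur
    return is_noise, mfcc_updated
```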
(3) Pitch period prediction:
The pitch period is estimated for every speech frame. If the current frame is a noise frame, its pitch period is predicted from the pitch periods of the two preceding frames. The pitch estimation block diagram is shown in Fig. 7. For different speakers the pitch period generally lies between 2 ms and 12 ms, so the pitch period is searched within 2-12 ms here. Let PMAX be the number of samples corresponding to 12 ms, i.e. PMAX = 576, and PMIN the number of samples corresponding to 2 ms, i.e. PMIN = 96. A buffer buf(n) of length 3·PMAX + N = 2208 is used to estimate the pitch period; buf(n) stores the data that have already been output.
The pitch period is estimated as follows:
1) Low-pass filter buf(n) to obtain buf_d(n); the cut-off frequency of the low-pass filter (LPF) is 900 Hz.
2) Centre-clip buf_d(n) to obtain buf_c(n):

buf_c(n) = \begin{cases} buf_d(n) - C_L, & buf_d(n) > C_L \\ buf_d(n) + C_L, & buf_d(n) < -C_L \\ 0, & |buf_d(n)| \le C_L \end{cases}    (28)

where the clipping level C_L is usually set to 68 % of the maximum of the normalised data.
3) Compute the autocorrelation of buf_c(n) and search for the position of its maximum within the range (96, 576); that position is taken as the pitch period estimate Pitch:

r_{buf_c}(n) = \sum_{m=0}^{2\,PMAX-1} buf_c(m)\, buf_c(m+n), \quad PMIN \le n \le PMAX    (29)

Pitch = \arg\max_{PMIN \le n \le PMAX} r_{buf_c}(n)    (30)
4) To prevent pitch doubling, the pitch estimates Pitch^{(p-1)} and Pitch^{(p-2)} of the two preceding frames are first smoothed, and the pitch period Pitch^{(p)} of the current frame is then predicted from the two smoothed values:

Pitch^{(p)} = Pitch^{(p-1)} + \left(Pitch^{(p-1)} - Pitch^{(p-2)}\right)    (32)
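A sketch of the pitch search and prediction of section (3) is given below. The buffer buf is assumed to hold at least the 3·PMAX + N most recent output samples; the 4th-order Butterworth filter merely stands in for the unspecified 900 Hz LPF, and the smoothing step of the two previous estimates is omitted here because its formula is not reproduced above.

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 48000
PMIN, PMAX = 96, 576      # 2 ms .. 12 ms at 48 kHz
CLIP = 0.68               # centre-clipping level: 68 % of the peak

def estimate_pitch(buf):
    """Autocorrelation pitch estimate over buf(n), steps 1)-3)."""
    b, a = butter(4, 900.0 / (FS / 2.0))          # step 1): 900 Hz low-pass (assumed order)
    buf_d = lfilter(b, a, buf)
    cl = CLIP * np.max(np.abs(buf_d))             # step 2): centre clipping, equation (28)
    buf_c = np.where(buf_d > cl, buf_d - cl,
                     np.where(buf_d < -cl, buf_d + cl, 0.0))
    # step 3): autocorrelation over lags PMIN..PMAX, equations (29)-(30)
    r = [np.dot(buf_c[:2 * PMAX], buf_c[n:n + 2 * PMAX])
         for n in range(PMIN, PMAX + 1)]
    return PMIN + int(np.argmax(r))

def predict_pitch(pitch_prev, pitch_prev2):
    """Equation (32): linear extrapolation for a frame flagged as noise."""
    return pitch_prev + (pitch_prev - pitch_prev2)
```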
(4) Waveform reconstruction:
The last pitch-period waveform of the previous frame is extracted and linearly interpolated to obtain a new pitch-period waveform.
1) Since buf(n) stores the output frame data, the pitch-period waveform of the previous frame can be extracted from buf(n), namely the last Pitch^{(p-1)} samples of the previous output frame; this waveform is denoted pw^{(p-1)}(n). Linear interpolation of pw^{(p-1)}(n) yields a new waveform of length Pitch^{(p)}, denoted pw^{(p)}(n). The linear interpolation between two points is illustrated in Fig. 8; the interpolation formula is

pw^{(p)}(n') = \left(\frac{Pitch^{(p-1)}}{Pitch^{(p)}}\, n' - n + 1\right)\left[pw^{(p-1)}(n) - pw^{(p-1)}(n-1)\right] + pw^{(p-1)}(n-1), \quad n-1 \le \frac{Pitch^{(p-1)}}{Pitch^{(p)}}\, n' < n    (33)
2) The new waveform is used for pitch-waveform replication:
a. The principle of waveform replication is shown in Fig. 9(a) to Fig. 9(c). If the current frame is a noise frame (regardless of whether the previous frame is a noise frame or a clean speech frame), the processing is: according to formula (34), the AB segment of buf(n) is overlap-added with the CD segment and cross-faded, so that the data on both sides of point D are continuous:

buf_{CD}(n) = \alpha \cdot buf_{CD}(n) + (1-\alpha) \cdot buf_{AB}(n) = \alpha \cdot buf_{CD}(n) + (1-\alpha) \cdot buf_{CD}(n - Pitch), \quad 0 \le n < N_1    (34)

\alpha = \frac{N_1 - i}{N_1}, \quad i = 0, 1, \ldots, N_1 - 1    (35)

where \alpha is a decay factor that falls linearly from 1 to 0, and the AB and CD segments both have length N_1 = 32.
b. The new waveform pw^{(p)}(n) is replicated repeatedly into the DF region with period Pitch^{(p)}. The DE segment is the repaired current frame; the EF segment, of length N_2 = 32, is used for cross-fading when the next frame is a speech frame, ensuring continuity on both sides of point E and between frames.
c. One frame of data starting at point C in buf(n) is output. The method therefore introduces an output delay equal to the length of the CD segment. All data in buf(n) are then shifted forward by N samples (one frame length).
Fig. 10(a) to Fig. 10(c) illustrate the case in which the current frame is a frame to be repaired: Fig. 10(a) shows the signal with the current frame discarded; Fig. 10(b) shows the current frame reconstructed with the method of this patent; Fig. 10(c) shows the signal after repair. If the current frame is a clean speech frame and the previous frame is a noise frame, the processing is as follows:
a. The DG segment of buf(n) now holds the EF segment of the previous frame. The DG segment is fused with the first N_2 samples of the current input frame (the computation is analogous to formula (34)) and the result is stored back in DG.
b. The remaining samples of the current frame are copied unchanged into buf(n) after point G.
c. One frame of data starting at point C is output, and all data in buf(n) are shifted forward by the length of one frame, i.e. N samples.
If both the current frame and the previous frame are clean speech frames, the current input frame is copied unchanged into the region of buf(n) to be repaired, i.e. the DE region in Fig. 8, and one frame of data starting at point C is output.
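The repair of a noise frame can be sketched as below. The bookkeeping of points A to F follows Fig. 8 only loosely: d_start marks point D (the first sample to rebuild), the previous pitch period is taken from the samples just before D, and the exact segment geometry of the patented method may differ; this is an illustrative sketch, not a definitive implementation.

```python
import numpy as np

N = 480    # frame length
N1 = 32    # AB/CD cross-fade length
N2 = 32    # EF fade length kept after the repaired frame

def stretch_pitch_waveform(pw_prev, pitch_new):
    """Equation (33): linearly interpolate the previous pitch waveform
    to the predicted length Pitch^(p)."""
    pitch_old = len(pw_prev)
    pos = np.arange(pitch_new) * pitch_old / pitch_new   # fractional source positions
    idx = np.minimum(pos.astype(int), pitch_old - 2)
    frac = pos - idx
    return (1.0 - frac) * pw_prev[idx] + frac * pw_prev[idx + 1]

def repair_noise_frame(buf, d_start, pitch_prev, pitch_new):
    """Replace the noisy frame in buf by replicated pitch waveforms (sketch)."""
    # a.: cross-fade the CD segment (the N1 samples before D) with the AB
    #     segment one pitch period earlier, equations (34)-(35)
    alpha = (N1 - np.arange(N1)) / N1
    cd = np.arange(d_start - N1, d_start)
    buf[cd] = alpha * buf[cd] + (1.0 - alpha) * buf[cd - pitch_prev]
    # b.: stretch the previous pitch waveform and replicate it over the
    #     frame plus the N2-sample EF fade region
    pw_prev = buf[d_start - pitch_prev:d_start]
    pw_new = stretch_pitch_waveform(pw_prev, pitch_new)
    need = N + N2
    reps = int(np.ceil(need / pitch_new))
    buf[d_start:d_start + need] = np.tile(pw_new, reps)[:need]
    # c.: the caller then outputs one frame starting at point C (= D - N1)
    #     and shifts buf forward by N samples
    return buf
```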
Beneficial effects of the technical scheme of the present invention:
Tests were run with 20 clean speech recordings (adult male, adult female and child voices) and 4 types of noise: mouse clicks, knocking, metronome ticks and keyboard strokes. The durations of the four noises are 10 ms for mouse clicks, 20 ms for knocking and metronome ticks, and 30 ms for keyboard strokes. Each clean recording was mixed with each of the 4 noise types, giving 80 noisy recordings; 30 noise bursts, equally spaced, were added to each recording.
All audio is sampled at fs = 48 kHz with a frame length of N = 480. In the MFCC computation stage an NFFT = 1024-point FFT is used, the Mel filter bank contains M = 24 filters, and L = 12 MFCC are computed. In the transient-noise detection stage the adaptive threshold is set to Thres = const·ener; so that the threshold suits all noise types, the constant const is set to 10, and ener is the energy of each input frame with a minimum value of 60.0. When the threshold is updated, the forgetting factor b is set to 0.4. In the pitch period estimation stage the pitch period is searched over (2 ms, 12 ms), corresponding to (96, 576) samples. In the waveform reconstruction stage the cross-fade lengths N_1 and N_2 are both 32 samples, and the buffer buf(n) has length 2240.
After denoising noisy speech with the present invention, speech intelligibility improves substantially and listener fatigue is reduced. The denoising performance is assessed with two measures, the segmental signal-to-noise ratio SNR_seg and PEAQ, where the segmental SNR is computed as

SNR_{seg}^{in} = \frac{1}{R}\sum_{i=1}^{R} 10\log_{10}\frac{\sum_{n \in frame_i}|s(n)|^2}{\sum_{n \in frame_i}|x(n)-s(n)|^2}    (36)

SNR_{seg}^{out} = \frac{1}{R}\sum_{i=1}^{R} 10\log_{10}\frac{\sum_{n \in frame_i}|s(n)|^2}{\sum_{n \in frame_i}|\hat{s}(n)-s(n)|^2}    (37)
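The segmental SNR of equations (36) and (37) can be computed with a short routine such as the one below; the small floors guarding against division by zero are an implementation assumption.

```python
import numpy as np

def snr_seg(clean, test, frame_len=480):
    """Average per-frame SNR in dB; `test` is x(n) for equation (36)
    or the denoised estimate for equation (37)."""
    n_frames = len(clean) // frame_len
    vals = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = test[i * frame_len:(i + 1) * frame_len] - s
        vals.append(10.0 * np.log10(max(np.sum(s ** 2), 1e-12) /
                                    max(np.sum(e ** 2), 1e-12)))
    return float(np.mean(vals))
```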
The denoising performance assessed with these two measures is shown in Fig. 12 and Fig. 13: Fig. 12 compares the objective audio quality of the noisy signal before and after denoising in terms of SNR; Fig. 13 compares it in terms of PEAQ.
The spectrograms of the noisy signal and of the signal denoised with this scheme are shown in Fig. 11(a) and Fig. 11(b) (grey-scale figures provided to illustrate the technical effect of the present invention). Fig. 11(a) is the spectrogram of audio corrupted by mouse-click noise; Fig. 11(b) is the spectrogram of the noisy audio of Fig. 11(a) after denoising.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical scheme and inventive concept of the present invention, shall fall within the scope of protection of the present invention.
Abbreviations and key terms used in the present invention
AR: Autoregressive Model.
DCT: Discrete Cosine Transform.
FFT: Fast Fourier Transform.
LPF: Low-Pass Filter.
LPCC: Linear Prediction Cepstral Coefficient.
MFCC: Mel-Frequency Cepstral Coefficient.
VoIP: Voice over IP.
PLC: Packet Loss Concealment.
PWR: Pitch Waveform Replication.
SNR: Signal-to-Noise Ratio.
PEAQ: Perceptual Evaluation of Audio Quality, an objective standard for perceived audio quality recommended in ITU-R BS.1387.

Claims (3)

1. A denoising method for transient noise, characterised in that: the Mel-frequency cepstral coefficients (MFCC) of the current frame are first computed and, at the same time, the pitch period of the frame is predicted; the MFCC are then used to detect whether the frame contains noise, i.e. noise detection is performed; if noise is present, the waveform is reconstructed using the predicted pitch period;
The pitch period prediction method is as follows:
1) low-pass filter buf(n) to obtain buf_d(n), the cut-off frequency of the low-pass filter (LPF) being 900 Hz;
2) centre-clip buf_d(n) to obtain buf_c(n), namely

buf_c(n) = \begin{cases} buf_d(n) - C_L, & buf_d(n) > C_L \\ buf_d(n) + C_L, & buf_d(n) < -C_L \\ 0, & |buf_d(n)| \le C_L \end{cases}    (1)

where the clipping level C_L is usually set to 68 % of the maximum of the normalised data;
3) compute the autocorrelation of buf_c(n) and search for the position of its maximum within the range (96, 576), taking it as the pitch period estimate Pitch;

r_{buf_c}(n) = \sum_{m=0}^{2\,PMAX-1} buf_c(m)\, buf_c(m+n), \quad PMIN \le n \le PMAX    (2)

Pitch = \arg\max_{PMIN \le n \le PMAX} r_{buf_c}(n)    (3)

4) to prevent pitch doubling, smooth the pitch estimates Pitch^{(p-1)} and Pitch^{(p-2)} of the two preceding frames by formula (4), and predict the pitch period Pitch^{(p)} of the current frame from the two smoothed values, namely

Pitch^{(p)} = Pitch^{(p-1)} + \left(Pitch^{(p-1)} - Pitch^{(p-2)}\right)    (5)
The waveform reconstruction method is:
1) since buf(n) stores the output frame data, the pitch-period waveform of the previous frame can be extracted from buf(n), namely the last Pitch^{(p-1)} samples of the previous output frame, denoted pw^{(p-1)}(n); linear interpolation of pw^{(p-1)}(n) yields a new waveform of length Pitch^{(p)}, denoted pw^{(p)}(n); the interpolation formula is

pw^{(p)}(n') = \left(\frac{Pitch^{(p-1)}}{Pitch^{(p)}}\, n' - n + 1\right)\left[pw^{(p-1)}(n) - pw^{(p-1)}(n-1)\right] + pw^{(p-1)}(n-1), \quad n-1 \le \frac{Pitch^{(p-1)}}{Pitch^{(p)}}\, n' < n    (6)

2) the new waveform is used for pitch-waveform replication; the replication method is as follows:
a. if the current frame is a noise frame (regardless of whether the previous frame is a noise frame or a clean speech frame), the processing is: according to formula (7), the AB segment of buf(n) is overlap-added with the CD segment and cross-faded so that the data on both sides of point D are continuous, namely

buf_{CD}(n) = \alpha \cdot buf_{CD}(n) + (1-\alpha) \cdot buf_{AB}(n) = \alpha \cdot buf_{CD}(n) + (1-\alpha) \cdot buf_{CD}(n - Pitch), \quad 0 \le n < N_1    (7)

\alpha = \frac{N_1 - i}{N_1}, \quad i = 0, 1, \ldots, N_1 - 1    (8)

where \alpha is a decay factor that falls linearly from 1 to 0, and the AB and CD segments both have length N_1 = 32;
b. the new waveform pw^{(p)}(n) is replicated repeatedly into the DF region with period Pitch^{(p)}; the DE segment is the repaired current frame; the EF segment, of length N_2 = 32, is used for cross-fading when the next frame is a speech frame, ensuring continuity on both sides of point E and between frames;
c. one frame of data starting at point C in buf(n) is output; this introduces an output delay equal to the length of the CD segment; all data in buf(n) are then shifted forward by N samples;
if the current frame is a clean speech frame and the previous frame is a noise frame, the processing is as follows:
a. the DG segment of buf(n) now holds the EF segment of the previous frame; the DG segment is fused with the first N_2 samples of the current input frame, the computation being the same as the overlap-addition of the AB and CD segments in formula (7), and the result is stored back in DG;
b. the remaining samples of the current frame are copied unchanged into buf(n) after point G;
c. one frame of data starting at point C is output, and all data in buf(n) are shifted forward by N samples, i.e. one frame length; if both the current frame and the previous frame are clean speech frames, the current input frame is copied unchanged into the region of buf(n) to be repaired, and one frame of data starting at point C is output.
2. The denoising method for transient noise according to claim 1, characterised in that the Mel-frequency cepstral coefficients are computed as follows:
1) frame the input signal with a frame length of N = 480 samples, i.e. 10 ms of data, and normalise the data; if the current frame is the p-th frame, then

x^{(p)}(n) = x[p \cdot (N-1) + n], \quad n = 0, 1, \ldots, N-1;    (9)

2) pre-process the current frame by pre-emphasis and windowing, namely

y^{(p)}(n) = x^{(p)}(n) - \beta\, x^{(p)}(n-1);    (10)

where the pre-emphasis factor is \beta = 0.938 and w(n) is a Hamming window, w(n) = 0.54 - 0.46\cos(2\pi n / N);
3) take an NFFT = 1024-point FFT of the pre-processed signal, obtaining the frequency-domain signal Y^{(p)}(k);
4) compute the energy spectrum |Y^{(p)}(k)|^2 of the frequency-domain signal Y^{(p)}(k);
5) pass the energy spectrum of the frequency-domain signal through a bank H of Mel-scale triangular filters for frequency-domain filtering;
the bank contains M filters; each filter is triangular, the filters overlap one another, and the centre frequency of each filter is f(m), m = 1, 2, ..., M, with M = 24;
filter design method: the upper band-edge frequency of the input signal, fs/2 = 24 kHz, is converted by the formula

f_{mel} = 2595\,\log_{10}\!\left(1 + \frac{f_{linear}}{700}\right),    (12)

where f_{linear} is the linear frequency in Hz, into the Mel-scale frequency domain, giving Fs_mel; the interval (0, Fs_mel) is divided into 25 equal parts, the two end points 0 and Fs_mel are discarded, and the remaining 24 division points are used as the centre frequencies of the 24 filters; the division points f(m) are equally spaced on the Mel scale and are then converted back to the linear frequency scale by formula (12); after the conversion, the spacing between the f(m) shrinks as m decreases and widens as m increases; from the division points f(m), the frequency response of the triangular filter bank H(m, k) is

H(m,k) = \begin{cases} 0, & f(k) < f(m-1) \\ \dfrac{2\,[f(k)-f(m-1)]}{[f(m+1)-f(m-1)]\,[f(m)-f(m-1)]}, & f(m-1) \le f(k) < f(m) \\ \dfrac{2\,[f(m+1)-f(k)]}{[f(m+1)-f(m-1)]\,[f(m+1)-f(m)]}, & f(m) \le f(k) \le f(m+1) \\ 0, & f(k) > f(m+1) \end{cases}    (13)

6) compute the energy output by each filter H(m, k) and its logarithm, obtaining E(m), namely

E(m) = \log_{10}\!\left[\sum_k H(m,k)\,|Y^{(p)}(k)|^2\right], \quad m = 1, 2, \ldots, M    (14)

a discrete cosine transform of E(m) yields the L = 12 MFCC, denoted C(l):

C^{(p)}(0) = \sqrt{2/L}\,\sum_{m=0}^{M-1} E(m), \quad l = 0; \qquad C^{(p)}(l) = \sqrt{2/L}\,\sum_{m=0}^{M-1} E(m)\cos\!\left(\frac{\pi l (2m+1)}{2M}\right), \quad 1 \le l \le L-1.    (15)
3. The denoising method for transient noise according to claim 1, characterised in that the noise detection process is as follows:
compute the Euclidean distance dist between the MFCC of the current frame and the MFCC of the previous frame,

dist = \sqrt{\sum_{l=0}^{L}\left[C^{(p)}(l) - C^{(p-1)}(l)\right]^2},    (16)

judge whether the current frame contains noise by comparing the distance with a threshold Thres, which is determined adaptively by

Thres = 10 \cdot ener,    (17)

where ener is the energy of the normalised frame signal, with a minimum value of 60.0; after detection, update the MFCC feature of the current frame, namely

C^{(p)}(l) = b \cdot C^{(p-1)}(l) + (1-b) \cdot C^{(p)}(l),    (18)

where the forgetting factor is b = 0.4; when the frame following a noise frame is a speech frame, this update prevents false detection.
CN201310357211.6A (filed 2013-08-15, priority 2013-08-15): Denoising method for transient noise. Status: Active. Granted as CN103440872B.

Priority Applications (1)

Application Number: CN201310357211.6A
Priority Date / Filing Date: 2013-08-15
Title: Denoising method for transient noise

Publications (2)

CN103440872A, published 2013-12-11
CN103440872B, granted 2016-06-01

Family ID: 49694563


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1956058A (en) * 2005-10-17 2007-05-02 哈曼贝克自动***-威美科公司 Minimization of transient noises in a voice signal

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4448464B2 (en) * 2005-03-07 2010-04-07 日本電信電話株式会社 Noise reduction method, apparatus, program, and recording medium
US7869994B2 (en) * 2007-01-30 2011-01-11 Qnx Software Systems Co. Transient noise removal system using wavelets


Non-Patent Citations (2)

基于MFCC的语音情感识别 (Speech emotion recognition based on MFCC); 韩丨 et al.; Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition); October 2008; vol. 20, no. 5; pp. 597-602 *
语音中瞬态噪声抑制算法研究 (Research on transient noise suppression algorithms in speech); 张兆伟; CNKI (China National Knowledge Infrastructure); 2013-05-01; pp. 14-19, 31-37 *

Also Published As

CN103440872A, published 2013-12-11


Legal Events

  • PB01 / C06: Publication
  • SE01 / C10: Entry into force of request for substantive examination
  • GR01 / C14: Patent grant